Large Scale Text Classification with Multi-label Naive Bayes

 

(Dept. of Computer Science, University of Waikato, Hamilton, New Zealand)

 

Abstract-Large scale multi-label text classification tasks are becoming common with online databases such as Wikipedia and the DMOZ web directory. Current machine learning classifiers lack the scaling capabilities required to manage these databases. This paper proposes an efficient multi-label mixture model called Multi-label Naive Bayes (MLNB). MLNB can be trained extremely efficiently via closed-form, linear-time parameter estimation. An approximate inference algorithm called Extended Greedy Iterative Addition (EGIA) is proposed for sparse generative classifiers; it uses pruning techniques and exploits data sparsity to reduce the practical time complexity of classification. The model and the inference algorithm are evaluated on the LSHTC2 datasets for large scale multi-label text classification, achieving accuracy comparable to K-Nearest Neighbour, with sub-second classification times for multi-label classification over 325,000 labels and 1.6 million features.
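As a rough illustration of the ideas summarized above, the Python sketch below shows a per-label multinomial model estimated in a single closed-form pass, followed by a simplified greedy label-addition loop that only scores labels sharing at least one term with the test document. The class and method names (MultiLabelNB, fit, predict) and all implementation details are hypothetical; this is not the authors' MLNB/EGIA implementation, only a minimal reading of the approach described in the abstract.

# Minimal, illustrative sketch of a multi-label naive-Bayes-style mixture
# classifier with greedy label addition. Names and details are hypothetical,
# not the authors' MLNB/EGIA code.
import math
from collections import defaultdict

class MultiLabelNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                                    # Laplace smoothing
        self.word_counts = defaultdict(lambda: defaultdict(float))
        self.label_totals = defaultdict(float)
        self.term_to_labels = defaultdict(set)                # inverted index (sparsity)
        self.vocab = set()

    def fit(self, docs, labelsets):
        """Single-pass, closed-form estimation from (term->count dict, label set) pairs."""
        for doc, labels in zip(docs, labelsets):
            for label in labels:
                for term, count in doc.items():
                    self.word_counts[label][term] += count
                    self.label_totals[label] += count
                    self.term_to_labels[term].add(label)
                    self.vocab.add(term)

    def _log_prob(self, term, label):
        """Smoothed log P(term | label)."""
        num = self.word_counts[label].get(term, 0.0) + self.alpha
        den = self.label_totals[label] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def _doc_loglik(self, doc, labels):
        """Document log-likelihood under a uniform mixture of the chosen labels."""
        ll = 0.0
        for term, count in doc.items():
            mix = sum(math.exp(self._log_prob(term, l)) for l in labels) / len(labels)
            ll += count * math.log(mix)
        return ll

    def predict(self, doc, max_labels=5):
        """Greedy iterative addition: start with an empty label set and repeatedly
        add the label that most improves the mixture likelihood, stopping when no
        addition helps. Only labels that share at least one term with the test
        document are ever scored."""
        candidates = set().union(*(self.term_to_labels[t] for t in doc if t in self.term_to_labels))
        chosen, best_ll = [], float("-inf")
        while candidates and len(chosen) < max_labels:
            ll, label = max((self._doc_loglik(doc, chosen + [l]), l) for l in candidates)
            if ll <= best_ll:
                break                                         # no further improvement
            chosen.append(label)
            candidates.remove(label)
            best_ll = ll
        return chosen

Restricting candidate labels to those linked to the document's non-zero terms through an inverted index keeps the cost of prediction proportional to the terms in the test document rather than to the full label set, which is the kind of sparsity exploitation the abstract attributes to EGIA.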

 

Key words-multi-label; text classification; mixture model; Wikipedia; naive Bayes; DMOZ; LSHTC2; MLNB; EGIA

 

Manuscript Number: 1674-8042(2011)supp1.-0038-08

 

doi: 10.3969/j.issn.1674-8042.2011.supp1.009

 

