Classifying the content of web pages is essential to many information retrieval tasks. In this paper, we propose a new methodology for multilayer soft classification. Our approach couples semi-supervised Latent Dirichlet Allocation (LDA) with a Random Forest classifier. We use LDA to compute the distribution of topics in each document and use the resulting distributions to train the Random Forest classifier. The trained classifier can then categorize each web document at different layers of the category hierarchy. We applied our methodology to a data set collected from DMOZ and obtained satisfactory results.
@inproceedings{i4cs15,
Author = {Karim Sayadi and Quang Vu Bui and Marc Bui},
Booktitle = {15th International Conference on Innovations for Community Services, {I4CS} 2015, Nuremberg, Germany, July 8-10, 2015},
Pages = {1--7},
Title = {Multilayer classification of web pages using {Random Forest} and semi-supervised {Latent Dirichlet Allocation}},
Year = {2015}}