CL IR LG

Jun 27, 2012

Cross Language Text Classification via Subspace Co-Regularized Multi-View Learning

arXiv:1206.6481v1

64 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of reducing labeling costs in multilingual text classification, though it appears incremental because it builds on existing multi-view learning and domain adaptation techniques.

The paper tackles cross-language text classification with a subspace co-regularized multi-view learning method that uses parallel corpora produced by machine translation to transfer label knowledge between languages. In the empirical study, the method consistently outperforms inductive, domain adaptation, and multi-view learning baselines.

In many multilingual text classification problems, the documents in different languages often share the same set of categories. To reduce the labeling cost of training a classification model for each individual language, it is important to transfer the label knowledge gained from one language to another language by conducting cross language classification. In this paper we develop a novel subspace co-regularized multi-view learning method for cross language text classification. This method is built on parallel corpora produced by machine translation. It jointly minimizes the training error of each classifier in each language while penalizing the distance between the subspace representations of parallel documents. Our empirical study on a large set of cross language text classification tasks shows the proposed method consistently outperforms a number of inductive methods, domain adaptation methods, and multi-view learning methods.
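
The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the squared loss, the ridge term, the linear projections `P1` and `P2`, and every name and parameter below (`co_regularized_objective`, `gamma`, `lam`) are assumptions chosen for illustration. It only shows the general shape of the idea: one classifier per language view, each with its own training error, plus a penalty on the distance between the subspace representations of parallel documents.

```python
import numpy as np

def co_regularized_objective(X1, X2, y, labeled_idx, P1, P2, w1, w2,
                             gamma=1.0, lam=0.0):
    """Illustrative two-view objective (assumed form, not the paper's).

    X1, X2      : (n, d1) and (n, d2) features of n parallel documents,
                  row i of X1 and row i of X2 are translations of each other.
    y           : labels for the documents listed in labeled_idx.
    P1, P2      : (d1, k) and (d2, k) projections into a shared k-dim subspace.
    w1, w2      : (k,) classifier weights, one per language view.
    gamma       : weight of the subspace co-regularization penalty.
    lam         : weight of the ridge penalty on the classifiers.
    """
    # Subspace representation of every document in each view.
    Z1 = X1 @ P1
    Z2 = X2 @ P2

    # Training error of each view's classifier on the labeled documents
    # (squared loss is an assumption made for simplicity).
    err1 = Z1[labeled_idx] @ w1 - y
    err2 = Z2[labeled_idx] @ w2 - y
    training_loss = np.sum(err1 ** 2) + np.sum(err2 ** 2)

    # Distance between the subspace representations of parallel documents.
    subspace_gap = np.sum((Z1 - Z2) ** 2)

    # Ridge penalty on the classifier weights.
    ridge = lam * (np.sum(w1 ** 2) + np.sum(w2 ** 2))

    return training_loss + gamma * subspace_gap + ridge
```

In this sketch, increasing `gamma` pushes the two views toward a shared representation, which is what would let labels observed in one language inform the classifier for the other. Minimizing the objective (for example by alternating or gradient-based updates) is left out.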
