ML LGSep 25, 2012

Optimal Weighting of Multi-View Data with Low Dimensional Hidden States

arXiv:1209.5477v2

AI Analysis

This work addresses the challenge of handling noisy multi-view data in NLP tasks, offering a method to enhance feature weighting for better performance in supervised learning with limited labeled data.

The paper tackles the problem of multi-view data in NLP where some views are noisier than others, proposing an unsupervised algorithm that optimally weights features from different views when they originate from low-dimensional hidden states, achieving improved feature selection for supervised learning.

In Natural Language Processing (NLP) tasks, data often has the following two properties: First, data can be chopped into multi-views which has been successfully used for dimension reduction purposes. For example, in topic classification, every paper can be chopped into the title, the main text and the references. However, it is common that some of the views are less noisier than other views for supervised learning problems. Second, unlabeled data are easy to obtain while labeled data are relatively rare. For example, articles occurred on New York Times in recent 10 years are easy to grab but having them classified as 'Politics', 'Finance' or 'Sports' need human labor. Hence less noisy features are preferred before running supervised learning methods. In this paper we propose an unsupervised algorithm which optimally weights features from different views when these views are generated from a low dimensional hidden state, which occurs in widely used models like Mixture Gaussian Model, Hidden Markov Model (HMM) and Latent Dirichlet Allocation (LDA).

View on arXiv PDF

Similar