CL LG Dec 14, 2013

Domain adaptation for sequence labeling using hidden Markov models

Edouard Grave, Guillaume Obozinski, Francis Bach

arXiv:1312.4092v1, 1 citation

Originality: Synthesis-oriented

AI Analysis

This work addresses domain adaptation for sequence labeling, an incremental step toward NLP systems that can handle data from domains other than the one they were trained on, such as web text.

The paper tackles domain shift in natural language processing by using hidden Markov models to learn word representations for part-of-speech tagging. It analyzes how learning the representation from source-domain data, target-domain data, or both affects performance.

Most natural language processing systems based on machine learning are not robust to domain shift. For example, a state-of-the-art syntactic dependency parser trained on Wall Street Journal sentences has an absolute drop in performance of more than ten points when tested on textual data from the Web. An efficient solution to make these methods more robust to domain shift is to first learn a word representation using large amounts of unlabeled data from both domains, and then use this representation as features in a supervised learning algorithm. In this paper, we propose to use hidden Markov models to learn word representations for part-of-speech tagging. In particular, we study the influence of using data from the source, the target or both domains to learn the representation and the different ways to represent words using an HMM.
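As a rough illustration of the idea in the abstract, the sketch below computes, for each token of a sentence, the posterior distribution over the latent states of an HMM using the forward-backward algorithm. Such a vector could then serve as a feature for a supervised tagger. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which HMM-based representation is used, so the choice of posteriors is an assumption, as are the function name `forward_backward`, the two-state model, and the hand-set toy parameters. In practice the parameters would be estimated from unlabeled source and/or target data.

```python
def forward_backward(obs, init, trans, emit, floor=1e-6):
    """Return, for each token, the posterior over latent HMM states.

    obs   : list of words (one sentence)
    init  : init[s] = P(first state = s)
    trans : trans[r][s] = P(next state = s | current state = r)
    emit  : emit[s] is a dict word -> P(word | state = s)
    floor : hypothetical small probability used for unseen words
    """
    n, k = len(obs), len(init)

    def e(s, w):
        return emit[s].get(w, floor)

    # Forward pass, normalized at each step to avoid numerical underflow.
    alpha = []
    prev = None
    for t, w in enumerate(obs):
        if t == 0:
            a = [init[s] * e(s, w) for s in range(k)]
        else:
            a = [e(s, w) * sum(prev[r] * trans[r][s] for r in range(k))
                 for s in range(k)]
        z = sum(a)
        prev = [x / z for x in a]
        alpha.append(prev)

    # Backward pass, also normalized at each step.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        w = obs[t + 1]
        b = [sum(trans[s][r] * e(r, w) * beta[t + 1][r] for r in range(k))
             for s in range(k)]
        z = sum(b)
        beta[t] = [x / z for x in b]

    # Posterior at position t is proportional to alpha[t][s] * beta[t][s].
    posteriors = []
    for t in range(n):
        g = [alpha[t][s] * beta[t][s] for s in range(k)]
        z = sum(g)
        posteriors.append([x / z for x in g])
    return posteriors


# Toy two-state model with hand-set parameters (illustrative only).
init = [0.6, 0.4]
trans = [[0.3, 0.7], [0.6, 0.4]]
emit = [
    {"the": 0.8, "dog": 0.1, "barks": 0.1},
    {"the": 0.1, "dog": 0.5, "barks": 0.4},
]

sentence = ["the", "dog", "barks"]
features = forward_backward(sentence, init, trans, emit)
for word, vec in zip(sentence, features):
    print(word, [round(p, 3) for p in vec])
```

Each row of `features` is a probability vector of length equal to the number of latent states. Because it depends on the whole sentence, the same word can receive different vectors in different contexts.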
