CL · Nov 1, 2016

Recurrent Neural Network Language Model Adaptation Derived Document Vector

arXiv:1611.00196v1

Originality: Incremental advance

AI Analysis

This work addresses a limitation of bag-of-words document representations, which ignore word order, in tasks like genre classification. It offers an incremental improvement over existing methods.

The paper proposes two document vectors, DV-RNN and DV-LSTM, which are obtained by adapting the parameters of an RNN language model to each document so that the representation captures sequential information. In genre classification on three corpora, DV-LSTM significantly outperforms TF-IDF and PV-DM in most cases.

In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document and can be important in some NLP tasks such as genre classification. This paper proposes a novel distributed vector representation of a document: a simple recurrent-neural-network language model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is first trained on all documents in a task; some of the LM parameters are then adapted to each document, and the adapted parameters are vectorized to represent the document. The new document vectors are labeled DV-RNN and DV-LSTM, respectively. We believe that our new document vectors can capture some high-level sequential information in the documents, which other current document representations fail to capture. The new document vectors were evaluated on genre classification of documents in three corpora: the Brown Corpus, the BNC Baby Corpus, and an artificially created Penn Treebank dataset. Their classification performance is compared with that of the TF-IDF vector and the state-of-the-art distributed memory model of paragraph vectors (PV-DM). The results show that DV-LSTM significantly outperforms TF-IDF and PV-DM in most cases, and that combining the proposed document vectors with TF-IDF or PV-DM may further improve performance.
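To make the word-order limitation concrete, here is a minimal Python sketch. It is not the paper's method: the two toy sentences are invented for illustration, and bigram counts are used only as a simple stand-in for an order-sensitive representation (the paper uses adapted RNN-LM or LSTM-LM parameters instead). Because the two documents share the same unigram counts, any TF-IDF weighting computed over the same corpus would give them identical vectors.

```python
from collections import Counter


def bag_of_words(tokens):
    """Unigram counts (the term-frequency part of TF-IDF); word order is discarded."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Counts of adjacent word pairs: a simple order-sensitive stand-in (assumption)."""
    return Counter(zip(tokens, tokens[1:]))


# Two hypothetical documents with the same words in a different order.
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

same_unigrams = bag_of_words(doc_a) == bag_of_words(doc_b)
same_bigrams = bigram_counts(doc_a) == bigram_counts(doc_b)

print(same_unigrams, same_bigrams)  # prints "True False"
```

The bag-of-words view cannot tell the two documents apart, while even a crude order-aware feature can. That gap is the one the proposed DV-RNN and DV-LSTM vectors are meant to close.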
