NMT-based Cross-lingual Document Embeddings
This is an incremental improvement for cross-lingual document processing, making the method lighter and more flexible by eliminating the need for a translator during testing.
This paper tackles the problem of cross-lingual document embeddings by adding a distance constraint to an existing neural machine translation-based document vector (NV) method. The resulting constrained NV (cNV) performs as well as NV in a cross-lingual document classification task and outperforms other methods that require forward-pass decoding.
This paper investigates a cross-lingual document embedding method that improves on the existing neural machine translation framework-based document vector (NTDV, or simply NV). NV is built with a self-attention mechanism under the neural machine translation (NMT) framework. In NV, each pair of parallel documents in different languages is projected onto the same shared layer of the model. However, the two NV embeddings of a pair are not guaranteed to be similar. This paper adds a distance constraint to the NV training objective so that the two embeddings of a parallel document pair are pushed to be as close as possible. The new method is called constrained NV (cNV). In a cross-lingual document classification task, cNV performs as well as NV and outperforms other published methods that require forward-pass decoding. Unlike NV, cNV does not need a translator at test time, so the method is lighter and more flexible.
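The distance constraint described above can be illustrated with a small sketch. The summary does not state which distance the paper uses or how it is weighted, so the squared Euclidean distance, the `weight` parameter, and the function names below are assumptions chosen only for illustration, and the translation loss is treated as a number computed elsewhere.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two embeddings (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_penalty(src_vecs: Sequence[Vector], tgt_vecs: Sequence[Vector]) -> float:
    """Average distance over a batch of parallel document pairs."""
    if not src_vecs or len(src_vecs) != len(tgt_vecs):
        raise ValueError("need a non-empty batch of paired embeddings")
    total = sum(squared_distance(u, v) for u, v in zip(src_vecs, tgt_vecs))
    return total / len(src_vecs)


def constrained_loss(
    translation_loss: float,
    src_vecs: Sequence[Vector],
    tgt_vecs: Sequence[Vector],
    weight: float = 1.0,  # hypothetical trade-off hyperparameter
) -> float:
    """Translation loss plus a weighted penalty on embedding distance."""
    return translation_loss + weight * distance_penalty(src_vecs, tgt_vecs)


# Two parallel document pairs; the second pair's embeddings disagree.
src = [[1.0, 0.0], [0.0, 2.0]]
tgt = [[1.0, 0.0], [0.0, 0.0]]
loss = constrained_loss(3.0, src, tgt, weight=0.5)
```

When the two embeddings of every pair coincide, the penalty vanishes and the objective reduces to the translation loss alone. Under this reading, embeddings that already agree across languages are what would let a classifier trained on one language be applied to the other without translating first.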