CL, Mar 8, 2013

Mining and Exploiting Domain-Specific Corpora in the PANACEA Platform

Núria Bel, Vassilis Papavasiliou, Prokopis Prokopidis, Antonio Toral, Victoria Arranz

arXiv:1303.1932v1, 6 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for efficient domain-specific language resources in machine translation, but it is incremental, as it builds on existing web crawling and alignment techniques.

The paper tackles the problem of automating the acquisition of large language resources for machine translation. It develops a Corpus Acquisition Component (CAC) that crawls web documents in specific languages and domains, and it reports that the crawled parallel corpora were successfully used for domain adaptation of MT systems.

The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web is one of the most innovative building blocks of PANACEA. The CAC, which is the first stage in the PANACEA pipeline for building Language Resources, adopts an efficient and distributed methodology to crawl for web documents with rich textual content in specific languages and predefined domains. The CAC includes modules that can acquire parallel data from sites with in-domain content available in more than one language. In order to extrinsically evaluate the CAC methodology, we have conducted several experiments that used crawled parallel corpora for the identification and extraction of parallel sentences using sentence alignment. The corpora were then successfully used for domain adaptation of Machine Translation Systems.
