CL · May 15, 2024

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

arXiv:2405.09017v11.03 citations · h-index: 14

Originality: Synthesis-oriented

AI Analysis

The paper provides a feasible method for building parallel corpora for machine translation, particularly for language pairs such as Japanese-Chinese. The contribution is incremental, since it adapts existing web mining techniques by adding crowdsourcing.

The researchers built a Japanese-Chinese parallel corpus by using crowdsourcing to collect more than 10,000 URL pairs of bilingual websites, from which they extracted 4.6M sentence pairs. A translation model trained on this corpus achieved accuracy comparable to one trained on the larger CCMatrix corpus (12.4M pairs), even though the new corpus is only about one-third the size.

Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.
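
To make the filtering step more concrete, here is a minimal sketch of how a score based on word translation probabilities might be used to keep or drop a candidate sentence pair. The abstract does not say how the authors compute or combine their scores, so everything here is an assumption for illustration: the IBM Model 1 style averaging, the names `translation_score` and `keep_pair`, the toy lexicon with made-up probabilities, the probability floor, and the threshold. The statistical language model component mentioned in the abstract is not covered.

```python
import math

# Hypothetical toy lexicon mapping (Japanese word, Chinese word) to an
# assumed translation probability. The values are invented for illustration.
TOY_LEXICON = {
    ("私", "我"): 0.9,
    ("猫", "猫"): 0.9,
    ("好き", "喜欢"): 0.8,
}


def translation_score(src_tokens, tgt_tokens, lexicon, floor=1e-6):
    """Length-normalized log-probability of the target tokens given the source.

    Each target token's probability is the average of its lexicon
    probabilities over all source tokens (an IBM Model 1 style assumption).
    Probabilities below `floor` are clipped so the logarithm stays finite.
    """
    if not src_tokens or not tgt_tokens:
        return float("-inf")
    total = 0.0
    for tgt in tgt_tokens:
        prob = sum(lexicon.get((src, tgt), 0.0) for src in src_tokens)
        prob /= len(src_tokens)
        total += math.log(max(prob, floor))
    return total / len(tgt_tokens)


def keep_pair(src_tokens, tgt_tokens, lexicon, threshold=-5.0):
    """Keep a sentence pair if its score reaches the (assumed) threshold."""
    return translation_score(src_tokens, tgt_tokens, lexicon) >= threshold


if __name__ == "__main__":
    ja = ["私", "猫", "好き"]
    print(keep_pair(ja, ["我", "喜欢", "猫"], TOY_LEXICON))  # plausible translation
    print(keep_pair(ja, ["今天", "天气"], TOY_LEXICON))       # unrelated sentence
```

In practice the lexicon would be estimated from trusted parallel data (the abstract mentions 1.2M high-quality pairs for training the filter), and the threshold would be tuned on held-out examples rather than fixed by hand.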
