Syntax and Domain Aware Model for Unsupervised Program Translation
This work addresses the costly and error-prone task of software migration for developers, though the contribution is incremental because it builds on existing unsupervised methods.
The paper tackles the problem of automatic program translation without parallel data by proposing SDA-Trans, a syntax and domain-aware model that uses unsupervised training on a smaller corpus; it outperforms large-scale pre-trained models, especially for translating unseen languages like C++.
Interest in software migration is growing as software and society develop. Manually migrating projects between languages is error-prone and expensive. In recent years, researchers have begun to explore automatic program translation using supervised deep learning techniques that learn from large-scale parallel code corpora. However, parallel resources are scarce in the programming language domain, and it is costly to collect bilingual data manually. To address this issue, several unsupervised program translation systems have been proposed. However, these systems still rely on huge amounts of monolingual source code for training, which is very expensive. Moreover, these models do not perform well when translating languages that were not seen during pre-training. In this paper, we propose SDA-Trans, a syntax and domain-aware model for program translation, which leverages syntax structure and domain knowledge to enhance cross-lingual transfer ability. SDA-Trans adopts unsupervised training on a smaller-scale corpus, including Python and Java monolingual programs. The experimental results on function translation tasks between Python, Java, and C++ show that SDA-Trans outperforms many large-scale pre-trained models, especially for unseen language translation.