Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language
This work addresses software maintenance challenges for legacy codebases by automating pseudo-code generation, though the contribution is incremental: it builds on existing neural models and back-translation methods.
The paper tackles pseudo-code generation for legacy programming languages that have no parallel training data. It transfers knowledge from a model trained on a high-resource language (C++) to a low-resource one (C) using an Iterative Back Translation approach with test-case filtration, and reports a 23.27% improvement in the success rate of the back-translated C code over successive iterations.
Writing pseudo-code descriptions of legacy source code for software maintenance is a labor-intensive manual task. Recent encoder-decoder language models have shown promise for automating pseudo-code generation for high-resource programming languages such as C++, but they rely heavily on the availability of a large code-pseudocode corpus. Soliciting such pseudocode annotations for code written in legacy programming languages (PL) is time-consuming and costly, and requires a thorough understanding of the source PL. In this paper, we transfer the knowledge acquired by a code-to-pseudocode neural model trained on a high-resource PL (C++) with parallel code-pseudocode data to a legacy PL (C) for which no PL-pseudocode parallel data is available for training. To achieve this, we use an Iterative Back Translation (IBT) approach with a novel test-case-based filtration strategy to adapt the trained C++-to-pseudocode model into a C-to-pseudocode model. We observe an improvement of 23.27% in the success rate of the C code generated through back translation over successive IBT iterations, illustrating the efficacy of our approach.
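To make the adaptation loop concrete, the following is a minimal Python sketch of one way Iterative Back Translation with test-case filtration could be organized. It is an illustration under stated assumptions, not the authors' implementation: the names `ibt_round`, `iterative_back_translation`, `to_pseudo`, `to_code`, `passes_tests`, and `fine_tune` are hypothetical, the models are represented as plain callables, and the assumption that only pairs whose regenerated code passes the tests are kept for the next round is our reading of "filtration".

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (source code, generated pseudo-code)
Model = Callable[[str], str]      # stand-in for a trained seq2seq model


def ibt_round(
    programs: List[str],
    to_pseudo: Model,
    to_code: Model,
    passes_tests: Callable[[str, str], bool],
) -> Tuple[List[Pair], float]:
    """Run one back-translation round with test-case filtration.

    Returns the retained (code, pseudo-code) pairs and the success rate,
    i.e. the fraction of programs whose regenerated code passes the tests.
    """
    kept: List[Pair] = []
    for code in programs:
        pseudo = to_pseudo(code)        # forward: code -> pseudo-code
        regenerated = to_code(pseudo)   # backward: pseudo-code -> code
        # Hypothetical check: run the test cases associated with `code`
        # against the regenerated program.
        if passes_tests(code, regenerated):
            kept.append((code, pseudo))  # keep only verified synthetic pairs
    rate = len(kept) / len(programs) if programs else 0.0
    return kept, rate


def iterative_back_translation(
    programs: List[str],
    to_pseudo: Model,
    to_code: Model,
    passes_tests: Callable[[str, str], bool],
    fine_tune: Callable[[Model, Model, List[Pair]], Tuple[Model, Model]],
    rounds: int = 2,
) -> Tuple[Model, Model, List[float]]:
    """Repeat filtering and fine-tuning, recording the success rate per round."""
    rates: List[float] = []
    for _ in range(rounds):
        kept, rate = ibt_round(programs, to_pseudo, to_code, passes_tests)
        rates.append(rate)
        # Assumed step: adapt both models on the filtered synthetic pairs.
        to_pseudo, to_code = fine_tune(to_pseudo, to_code, kept)
    return to_pseudo, to_code, rates
```

In this sketch, the per-round success rate collected in `rates` corresponds to the kind of quantity the abstract reports as improving over successive IBT iterations. In a real setup, `passes_tests` would compile and execute the regenerated C program against the test cases, and `fine_tune` would update the model weights.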