CL, SE · Nov 27, 2023

Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model

Yu-Chen Lin, Akhilesh Kumar, Norman Chang, Wenliang Zhang, Muhammad Zakir, Rucha Apte, Haiyang He, Chao Wang, Jyh-Shing Roger Jang

arXiv:2311.16267v21.37 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses code generation for engineering domains, but it appears incremental as it builds on existing methods like Retrieval-Augmented Generation.

The paper tackles the problem of improving domain-specific code generation with Large Language Models by introducing preprocessing techniques for data embedding, and reports a 73.33% 'Percentage of Correct Lines' on code generation problems in MapReduce applications.

We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique; and (iv) effectively refactoring existing scripts to generate new, high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques enhance the Retrieval-Augmented Generation (RAG) method in retrieving more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" for code generation problems in MapReduce applications.
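The abstract reports a "Percentage of Correct Lines" score without defining it here. The following is a minimal sketch of one plausible reading, assuming the metric is the share of lines in a generated script that match a reference script at the same position, ignoring blank lines and surrounding whitespace. The function name `percentage_of_correct_lines`, the matching rule, and the toy scripts are all illustrative assumptions, not the authors' evaluation procedure.

```python
def percentage_of_correct_lines(generated: str, reference: str) -> float:
    """Percent of lines that match the reference at the same position.

    Assumption (not from the paper): a line counts as correct when it equals
    the reference line at the same index after stripping whitespace; blank
    lines are ignored, and the longer script sets the denominator.
    """
    gen = [ln.strip() for ln in generated.splitlines() if ln.strip()]
    ref = [ln.strip() for ln in reference.splitlines() if ln.strip()]
    if not gen or not ref:
        return 0.0
    correct = sum(1 for g, r in zip(gen, ref) if g == r)
    return 100.0 * correct / max(len(gen), len(ref))


# Hypothetical toy scripts: three of the four lines agree.
reference = "a = load()\nb = map(a)\nc = reduce(b)\nsave(c)"
generated = "a = load()\nb = map(a)\nc = reduce(a)\nsave(c)"
score = percentage_of_correct_lines(generated, reference)
print(f"{score:.2f}%")
```

An exact-match rule like this is strict: a semantically equivalent line written differently would count as wrong, so a human-judged or execution-based notion of correctness could give a different score.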
