C-ing Clearly: Enhanced Binary Code Explanations using C code
This work addresses the limitations of LLMs in low-level programming, which matter for applications such as security analysis, though it appears incremental because it builds on existing synthetic data generation approaches.
The authors address the poor performance of LLMs on assembly language tasks by proposing a synthetic data generation method that uses the corresponding C code to improve an LLM's understanding of assembly. Fine-tuning on the generated data improves performance on binary code summarization and vulnerability detection, with consistent gains across LLM families and model sizes.
Large Language Models (LLMs) typically excel at coding tasks in high-level programming languages but struggle with lower-level languages such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. The gains are consistent across different LLM families and model sizes.
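
The abstract does not spell out how the C code enters the data generation step, so the following Python sketch shows only one plausible arrangement, stated here as an assumption: a teacher model sees both the C source and the assembly when writing an explanation, while the stored fine-tuning record pairs that explanation with the assembly alone. All names (`TrainingRecord`, `build_teacher_prompt`, `make_record`, `stub_teacher`) and the prompt wording are hypothetical and are not taken from the paper. The teacher is passed in as a plain callable, so no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainingRecord:
    """One synthetic fine-tuning example (hypothetical layout)."""
    prompt: str      # what the fine-tuned model sees: assembly only
    completion: str  # explanation written with the C source as a hint


def build_teacher_prompt(c_source: str, assembly: str) -> str:
    """Prompt for the teacher model, which gets the C code as a reference."""
    return (
        "Explain what the following assembly does. "
        "The C source it was compiled from is given as a reference.\n\n"
        f"C source:\n{c_source}\n\nAssembly:\n{assembly}\n"
    )


def make_record(
    c_source: str,
    assembly: str,
    teacher: Callable[[str], str],
) -> TrainingRecord:
    """Generate an explanation with C as a hint, then drop C from the record."""
    explanation = teacher(build_teacher_prompt(c_source, assembly))
    student_prompt = (
        "Explain what the following assembly does.\n\n"
        f"Assembly:\n{assembly}\n"
    )
    return TrainingRecord(prompt=student_prompt, completion=explanation)


# Illustrative inputs; the assembly is a hand-written x86-64 example.
C_SOURCE = "int add(int a, int b) { return a + b; }"
ASSEMBLY = "add:\n    lea eax, [rdi+rsi]\n    ret"


def stub_teacher(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return "Returns the sum of its two integer arguments."


record = make_record(C_SOURCE, ASSEMBLY, stub_teacher)
```

Under this assumed design the C source never reaches the fine-tuned model at inference time. It only shapes the explanations used as training targets, which matches the binary analysis setting where source code is unavailable.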