SE · AI · Nov 28, 2025

Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation

arXiv:2512.00134v1
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical problem in reverse engineering, cybersecurity, and software maintenance by providing a benchmark for future research, though it is incremental as it evaluates existing models without introducing new methods.

The paper addresses the lack of systematic benchmarks for evaluating large language models on assembly-to-source code translation. It conducts the first comprehensive evaluation of five state-of-the-art models and reports trade-offs across BLEU, ROUGE, METEOR, BERTScore, perplexity, and inference time.

Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on this problem remain scarce. In this work, we present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation. We assess model performance using a diverse set of metrics capturing lexical similarity (BLEU, ROUGE, and METEOR), semantic alignment (BERTScore), fluency (perplexity), and efficiency (prediction time). Our results reveal clear trade-offs: while certain models excel in text similarity metrics, others demonstrate lower perplexity or faster inference times. We further provide qualitative analyses of typical model successes and failure cases, highlighting challenges such as control flow recovery and identifier reconstruction. Taken together, our benchmark offers actionable insights into the strengths and limitations of current large language models for program translation, establishing a foundation for future research in combining accuracy with efficiency for real-world applications.
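To make two of the metric families named in the abstract more concrete, here is a minimal sketch, not the paper's evaluation code. It assumes whitespace tokenization and uses two hypothetical helpers: `clipped_precision`, the clipped n-gram precision that BLEU is built from, and `perplexity`, computed from per-token natural-log probabilities. The C snippets and the log-probabilities are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision, the building block of BLEU.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative data only: the same function with different identifier names.
reference = "int add ( int a , int b ) { return a + b ; }".split()
candidate = "int add ( int x , int y ) { return x + y ; }".split()

if __name__ == "__main__":
    print(clipped_precision(candidate, reference, 1))
    print(clipped_precision(candidate, reference, 2))
    print(perplexity([math.log(0.5)] * 4))
```

In this example the candidate is functionally identical to the reference, yet its unigram precision is only 0.75 because the four renamed identifier tokens do not match. This is one way the identifier reconstruction challenge mentioned in the abstract can depress lexical scores, and it suggests why a semantic metric such as BERTScore is reported alongside them.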
