SE, AI, CL, PL | Oct 19, 2025

When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation

arXiv:2510.16809v1 | h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the problem of choosing prompting strategies for software engineers who use LLMs. It shows that the best strategy is task-dependent, and it is an incremental refinement of existing ICL methods.

The study tests the assumption that many-shot prompting improves LLM performance on code translation. It finds that functional correctness peaks with 5-25 examples and often degrades with substantially more, challenging the 'more is better' approach.

Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.
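To make the experimental setup concrete, here is a minimal sketch of how a k-shot translation prompt could be assembled and swept across shot counts. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_prompt`, the prompt template, the Java-to-Python language pair, and the intermediate shot counts in `SHOT_COUNTS` are all hypothetical. The abstract confirms only the zero-shot baseline, the 5-25 peak, and the 625-example maximum.

```python
from typing import List, Tuple

# Hypothetical sweep: only 0, the 5-25 range, and the 625 maximum come from the abstract.
SHOT_COUNTS = [0, 5, 25, 125, 625]


def build_prompt(
    examples: List[Tuple[str, str]],
    k: int,
    source_code: str,
    src_lang: str = "Java",    # assumed language pair, for illustration only
    tgt_lang: str = "Python",
) -> str:
    """Assemble a k-shot prompt from the first k (source, target) pairs."""
    if k < 0 or k > len(examples):
        raise ValueError(f"k={k} is outside the pool size {len(examples)}")

    # Instruction first, then k worked examples, then the query to translate.
    parts = [f"Translate the following {src_lang} code to {tgt_lang}."]
    for src, tgt in examples[:k]:
        parts.append(f"{src_lang}:\n{src}\n{tgt_lang}:\n{tgt}")
    parts.append(f"{src_lang}:\n{source_code}\n{tgt_lang}:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Toy example pool; a real study would draw from a parallel code corpus.
    pool = [
        (f"int f{i}() {{ return {i}; }}", f"def f{i}():\n    return {i}")
        for i in range(625)
    ]
    query = "int g() { return 42; }"
    for k in SHOT_COUNTS:
        prompt = build_prompt(pool, k, query)
        print(k, len(prompt))  # prompt length grows with the shot count
```

Under this framing, the abstract's finding is that the model's output for the query should be scored by functional correctness (for example, whether the translation passes its tests) at each value of k, since static similarity to a reference can keep rising modestly while correctness falls.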

