CL, AI, HC, LG | Dec 2, 2025

Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

arXiv:2512.02841v1 | 2 citations | h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses the need for robust cross-lingual performance in LLMs and offers a scalable solution for real-world deployments, though the advance is incremental because it builds on existing prompt engineering methods.

The paper tackled the problem of making system prompts work reliably across languages for large language models, and found that optimizing prompts can improve all metrics by 5-10% in multilingual settings.

System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

