AI · Jun 2

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

arXiv:2606.03618 · 6.7 · h-index: 1

Predicted impact: top 89% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For developers using AI coding agents, this work reduces inference costs without sacrificing coding quality, offering a proactive alternative to reactive compression methods.

The paper addresses input-token cost bottlenecks in AI-assisted coding agents by introducing a pre-flight, edge-side prompt-rewriting middleware that performs cross-lingual translation and structural rewriting. The method reduces prompt tokens by 34-47% and total tokens by up to 18.8% while preserving or improving task accuracy across three commercial LLM backends.

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model translates the prompt into English and rewrites it into a compact task-oriented format, while regex-validated rewrite-with-fallback safeguards ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.
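The rewrite-with-fallback safeguard described in the abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: `safe_rewrite`, `extract_identifiers`, `approx_tokens`, and the identifier regex are all hypothetical names and choices, the `rewrite` callable stands in for the local Llama 3.2 (3B) call, and the byte-length token proxy is a crude placeholder for a real tokenizer.

```python
import re
from typing import Callable, Set

# Hypothetical pattern for code identifiers that a rewrite must preserve:
# backtick-quoted spans, call-like names such as foo(), and snake_case names.
IDENTIFIER_RE = re.compile(
    r"`([^`]+)`"                               # `parse_date`
    r"|\b([A-Za-z_]\w*)\(\)"                   # parse_date()
    r"|\b([a-z][a-z0-9]*(?:_[a-z0-9]+)+)\b"    # parse_date
)


def extract_identifiers(text: str) -> Set[str]:
    """Collect the identifiers matched by IDENTIFIER_RE."""
    found = set()
    for match in IDENTIFIER_RE.finditer(text):
        # Exactly one alternative matched; take its captured group.
        found.add(next(g for g in match.groups() if g))
    return found


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer (assumption): UTF-8 byte length."""
    return len(text.encode("utf-8"))


def safe_rewrite(
    original: str,
    rewrite: Callable[[str], str],
    count_tokens: Callable[[str], int] = approx_tokens,
) -> str:
    """Return the rewritten prompt only if it passes validation.

    Falls back to the original when the rewriter fails, returns nothing,
    drops an identifier, or produces a larger prompt.
    """
    try:
        candidate = rewrite(original).strip()
    except Exception:
        return original  # local model error: send the prompt unchanged
    if not candidate:
        return original
    if not extract_identifiers(original) <= extract_identifiers(candidate):
        return original  # an identifier was lost in translation/rewriting
    if count_tokens(candidate) > count_tokens(original):
        return original  # never send a prompt larger than the original
    return candidate
```

In practice `count_tokens` would wrap the tokenizer of the target backend, since the savings the abstract reports are measured in that backend's tokens rather than in bytes.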
