66.0CLMay 27Code
HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive DomainsZheng Li, Mao Zheng, Mingyang Song et al.
General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84{,}566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at https://github.com/jasonNLP/HardMTBench.
91.5CLMay 21Code
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the WildMao Zheng, Zheng Li, Tao Chen et al.
Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. For on-device deployment, with AngelSlim 1.25-bit extreme quantization, the 1.8B model requires only 440 MB of storage and improves inference speed by 1.5x. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall.
CLJan 21
CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action AgentsTianxiang Fei, Cheng Chen, Yue Pan et al.
Recent advances in large language models (LLMs) allow agents to represent actions as executable code, offering greater expressivity than traditional tool-calling. However, real-world tasks often demand both strategic planning and detailed implementation. Using a single agent for both leads to context pollution from debugging traces and intermediate failures, impairing long-horizon performance. We propose CodeDelegator, a multi-agent framework that separates planning from implementation via role specialization. A persistent Delegator maintains strategic oversight by decomposing tasks, writing specifications, and monitoring progress without executing code. For each sub-task, a new Coder agent is instantiated with a clean context containing only its specification, shielding it from prior failures. To coordinate between agents, we introduce Ephemeral-Persistent State Separation (EPSS), which isolates each Coder's execution state while preserving global coherence, preventing debugging traces from polluting the Delegator's context. Experiments on various benchmarks demonstrate the effectiveness of CodeDelegator across diverse scenarios.