IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following
For researchers and practitioners in machine translation, this benchmark addresses the lack of evaluation for constraint adherence in translation, revealing systematic gaps missed by prior metrics.
IFMTBench is a benchmark for multilingual translation instruction following, covering 7 languages with 4,506 single-constraint and 2,838 multi-constraint items across 6 constraint dimensions. Evaluation of 15 models reveals that instruction following scales more sharply with model size than translation quality, and general instruction following rankings correlate weakly with translation behavior.
Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, and with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, and the two are combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior. Our benchmark is available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
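The multiplicative combination of gating and continuous constraints can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the names `ItemResult` and `combined_score` are hypothetical, rubric scores are taken to lie in [0, 1], the gate is taken to be the conjunction of all deterministic checks, and the continuous part is aggregated by a plain mean.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ItemResult:
    """Hypothetical per-item record; field names are illustrative only."""
    gate_passed: Sequence[bool]     # outcomes of deterministic checkers
    rubric_scores: Sequence[float]  # LLM-judge rubric scores, assumed in [0, 1]


def combined_score(result: ItemResult) -> float:
    """Multiply a binary gate by the mean continuous score.

    Assumption: a single failed gating check zeroes the item, so a high
    judge score cannot compensate for a violated hard constraint.
    """
    gate = 1.0 if all(result.gate_passed) else 0.0
    if not result.rubric_scores:
        # No continuous constraints on this item: the gate alone decides.
        return gate
    quality = sum(result.rubric_scores) / len(result.rubric_scores)
    return gate * quality


if __name__ == "__main__":
    ok = ItemResult(gate_passed=[True, True], rubric_scores=[0.8, 0.6])
    broken = ItemResult(gate_passed=[True, False], rubric_scores=[1.0, 1.0])
    print(combined_score(ok))
    print(combined_score(broken))
```

Under these assumptions, an output that pleases the judge but breaks a checked constraint (for example, an invalid JSON schema) scores zero, which is one way a multiplicative rule can resist reward hacking; an additive rule would still award it partial credit.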