CL, AI | Jan 1

JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation

arXiv:2601.00223v1 | h-index: 4

Originality: Incremental advance

AI Analysis

JP-TL-Bench offers a finer-grained evaluation method for Japanese-English translation systems, one that is sensitive to subtle linguistic aspects such as politeness and implicature.

The authors address the challenge of evaluating Japanese-English translation quality by introducing JP-TL-Bench, a benchmark that compares a candidate against a fixed anchor set using pairwise LLM judgments and reports win rates alongside a normalized 0-10 score.

We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often "which of these two good translations is better?" rather than "is this translation acceptable?" This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 "LT" score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.
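The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes that anchor log-strengths are already fitted and frozen, that the candidate's log-strength is estimated by maximum likelihood under the Bradley-Terry model, and that the 0-10 score is ten times the logistic function of that log-strength. The abstract does not give the exact centering or scaling of the LT transform, so that formula, the function names, and the sample numbers below are all hypothetical, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_candidate_log_strength(anchor_log_strengths, wins, games,
                               lo=-20.0, hi=20.0, iters=100):
    """Bradley-Terry MLE for one candidate with anchor strengths held fixed.

    Under Bradley-Terry, P(candidate beats anchor j) = sigmoid(theta - a_j).
    The derivative of the log-likelihood with respect to theta is
    sum_j (wins_j - games_j * sigmoid(theta - a_j)), which decreases
    monotonically in theta, so bisection finds its root. If the candidate
    wins (or loses) every comparison, the estimate is clamped near hi (or lo).
    """
    def gradient(theta):
        return sum(w - n * sigmoid(theta - a)
                   for a, w, n in zip(anchor_log_strengths, wins, games))

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if gradient(mid) > 0:
            lo = mid  # candidate is stronger than mid implies
        else:
            hi = mid
    return (lo + hi) / 2.0


def lt_score(theta):
    """Assumed normalization: map a log-strength onto a 0-10 scale."""
    return 10.0 * sigmoid(theta)


# Illustrative data only: three frozen anchors and a candidate's record
# against each (wins out of games played).
anchors = [-1.0, 0.0, 1.0]
wins = [7, 5, 3]
games = [10, 10, 10]

theta = fit_candidate_log_strength(anchors, wins, games)
win_rate = sum(wins) / sum(games)
print(f"win rate = {win_rate:.2f}, LT = {lt_score(theta):.2f}")
```

Because the anchors stay fixed in this sketch, rescoring the same comparison record always yields the same value, which mirrors the stability property the abstract attributes to the frozen anchor set.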
