When Mean CE Fails: Median CE Can Better Track Language Model Quality
For practitioners training language models, this work identifies a failure mode of the standard validation metric (mean CE) and proposes a simple, low-cost diagnostic using median CE to better track model quality.
Mean cross-entropy (CE) fails to track language model quality during training in two scenarios: Qwen2.5-1.5B SFT on synthetic fact-learning and top-K distillation on TinyStories. Median CE correlates much more closely with task performance than mean CE, and the authors recommend reporting percentile CE summaries alongside the mean.
Mean cross-entropy (CE) is the standard validation metric for language models, but it can fail to track model quality during training. We examine this failure in two common scenarios. First, in supervised fine-tuning (SFT) of Qwen2.5-1.5B on synthetic fact-learning, mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than mean CE does. Analyzing how bulk and tail percentiles of CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, which lowers the median and raises the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both scenarios, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them both to track distribution reshaping and as a low-cost diagnostic for cases where mean and median CE disagree on model selection.
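The recommended reporting can be illustrated with a minimal sketch, assuming per-token CE values are already available as a flat array for each checkpoint. The function names (`ce_summary`, `best_checkpoint`, `mean_median_disagree`), the percentile choices (50, 90, 99), and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ce_summary(per_token_ce, percentiles=(50, 90, 99)):
    """Summarise per-token cross-entropy with the mean and a few percentiles.

    The percentile set is an assumption for illustration; the abstract only
    calls for "a small set of percentile CE summaries".
    """
    ce = np.asarray(per_token_ce, dtype=np.float64).ravel()
    if ce.size == 0:
        raise ValueError("need at least one per-token CE value")
    summary = {"mean": float(ce.mean())}
    for p, v in zip(percentiles, np.percentile(ce, percentiles)):
        summary[f"p{p}"] = float(v)
    return summary

def best_checkpoint(summaries, key):
    """Index of the checkpoint with the lowest value of the given summary key."""
    return min(range(len(summaries)), key=lambda i: summaries[i][key])

def mean_median_disagree(summaries):
    """True when mean CE and median CE (p50) select different checkpoints."""
    return best_checkpoint(summaries, "mean") != best_checkpoint(summaries, "p50")

# Toy data (made up): checkpoint B has a lower bulk but one heavy-tail token.
checkpoint_a = ce_summary([1.0] * 10)
checkpoint_b = ce_summary([0.5] * 9 + [11.0])
print(checkpoint_a)
print(checkpoint_b)
print("mean and median disagree:", mean_median_disagree([checkpoint_a, checkpoint_b]))
```

In the toy data, a single high-loss token is enough to make the mean prefer checkpoint A while the median prefers checkpoint B, which is the kind of disagreement the diagnostic is meant to flag.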