LGMay 25
Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot AuditZexin Zhuang, Yanhang Li, Zhichao Fan
This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $δ^{*} \le (z_{1-α/2}+z_{1-β})\sqrt{ρ_d/m}$ in the paired item count $m$ and the FP16-NF4 disagreement rate $ρ_d$. The bound turns "how reliable is my quantization claim?" into a one-line budget a benchmark designer can commit to before running. We illustrate the bound on four models and four benchmarks ($k=5$ splits of $n=100$), and add a parallel MMLU prompt-template study to put the bound's quantization-noise scale alongside the prompt-noise scale. Assuming $ρ_d=0.10$ (an unmeasured planning value), all observed NF4-FP16 deltas fall below the implied MDE, and most cross-split SDs lie within $\pm 1.5$ pp of the binomial reference $\sqrt{p(1-p)/n}$, so much of the variance reported as "benchmark unreliability" on $n=100$ subsamples is binomial sampling noise. The single borderline cell (OPT-WinoGrande, $|Δ|=3.2$ pp) is below the implied MDE at $ρ_d=0.10$ but above it at $ρ_d=0.05$, illustrating the planning trade-off the bound makes explicit. On MMLU, prompt-template ranges of 2-10 pp meet or exceed the largest observed quantization delta (3.2 pp), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor. We complement the bound with a five-line pre-registration template.
LGMay 26
Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete ChoiceYingshuo Wang, Xian Sun, Yanhang Li et al.
Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price sometimes increases predicted demand, and implied willingness-to-pay estimates are frequently negative or implausible. We propose a two-stage adapter that embeds foundation model predictions within a utility-maximization framework. In the first stage, we estimate a standard choice model whose parameters are constrained to obey economic theory. In the second stage, we freeze those parameters and train a correction term that incorporates the foundation model's predictions as additional information. The result is a model that inherits the foundation model's accuracy gains while guaranteeing monotonic price-demand relationships under policy perturbation and producing analytically computable trade-off measures. On two transportation datasets, the adapter recovers up to 13 percentage points of accuracy over a standard logit model while maintaining perfect economic consistency, something neither the raw foundation models nor conventional distillation achieve.
LGMay 25
SafetyRepro: Configuration-Conditional Rank Instability on Alignment BenchmarksYanhang Li, Zhichao Fan, Zexin Zhuang
Pairwise model comparisons drawn from foundation-model benchmarks ("A is safer than B") are read as quantitative verdicts but hinge on harness choices benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.
LGMay 17
Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned CanariesYanhang Li, Zhichao Fan, Zexin Zhuang
Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.