Maxime Griot

CL
4 papers
15 citations
Novelty 43%
AI Score 49

4 Papers

55.4 | LG | Jun 4
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x, with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA, compressed traces narrow the raw-vs-compressed gap but do not exceed raw traces.
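
The length-matched truncation baseline described in this abstract can be illustrated with a minimal sketch. It assumes that lengths are measured in characters and that truncation keeps a prefix of the raw trace; the abstract does not specify either detail, and the function names are hypothetical.

```python
def compression_ratio(raw: str, compressed: str) -> float:
    """Compressed length as a fraction of the raw trace's character length."""
    if not raw:
        raise ValueError("raw trace must be non-empty")
    return len(compressed) / len(raw)


def length_matched_truncation(raw: str, compressed: str) -> str:
    """Naive baseline (assumed form): keep only as many leading characters
    of the raw trace as the compressed trace contains."""
    return raw[: len(compressed)]


# Toy traces standing in for teacher output; not data from the paper.
raw_trace = "step " * 20            # 100 characters
compressed_trace = "s" * 15         # 15 characters
ratio = compression_ratio(raw_trace, compressed_trace)
baseline = length_matched_truncation(raw_trace, compressed_trace)
```

Comparing a student trained on `baseline` with one trained on `compressed_trace` holds the length budget fixed, so any accuracy difference reflects what the trace contains rather than how long it is.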

93.4 | CL | May 2 | Code
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer et al.

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites are either saturated, depend heavily on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance across both benchmarks, that most frontier proprietary models are significantly more token-efficient than open-weight alternatives, that medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks
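
One common way to probe answer-order bias is to shuffle the options of a multiple-choice question and check whether accuracy survives the reordering. The sketch below illustrates that idea only; it is not the Medmarks implementation, and `answer_fn` is a hypothetical stand-in for a model call that returns the index of the chosen option.

```python
import random
from typing import Callable, List, Sequence, Tuple


def permute_options(
    options: Sequence[str], correct_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Shuffle the options and return them with the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)


def accuracy_under_shuffling(
    question: str,
    options: Sequence[str],
    correct_index: int,
    answer_fn: Callable[[str, Sequence[str]], int],  # hypothetical model wrapper
    n_perm: int = 10,
    seed: int = 0,
) -> float:
    """Fraction of random option orderings on which answer_fn picks the correct option."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled, new_correct = permute_options(options, correct_index, rng)
        if answer_fn(question, shuffled) == new_correct:
            hits += 1
    return hits / n_perm


# Toy stand-ins for a model: one tracks the content, one always picks the first slot.
opts = ["Paris", "Rome", "Oslo", "Bern"]
content_based = accuracy_under_shuffling(
    "Capital of France?", opts, 0, lambda q, o: list(o).index("Paris")
)
position_based = accuracy_under_shuffling(
    "Capital of France?", opts, 0, lambda q, o: 0
)
```

A responder that tracks option content scores 1.0 under every ordering, while one that always picks the first slot is correct only when the right answer happens to land there. A gap between the two patterns is the signature of order bias.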

63.3 | LG | May 20
torchtune: PyTorch native post-training library

Mark Obozov, Maxime Griot, Joseph Cummings et al.

Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLM post-training research.

CL | Jun 4, 2024 | Code
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine

Maxime Griot, Jean Vanderdonckt, Demet Yuksel et al.

Large Language Models (LLMs) such as ChatGPT demonstrate significant potential in the medical domain and are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. However, such benchmarks may overestimate true clinical understanding by rewarding pattern recognition and test-taking heuristics. To investigate this, we created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability. We generated textbooks and MCQs in English and French using leading LLMs, then evaluated proprietary, open-source, and domain-specific models in a zero-shot setting. Despite the fictional content, models achieved an average score of 64%, while physicians scored only 27%. Fine-tuned medical models outperformed base models in English but not in French. Ablation and interpretability analyses revealed that models frequently relied on shallow cues, test-taking strategies, and hallucinated reasoning to identify the correct choice. These results suggest that standard MCQ-based evaluations may not effectively measure clinical reasoning and highlight the need for more robust, clinically meaningful assessment methods for LLMs.