Richmond Sin Jing Xuan

2papers

2 Papers

97.7AIMay 7
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

Richmond Sin Jing Xuan, Rishabh Bhardwaj, Soujanya Poria

As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-world tasks require little to no explicit reasoning, with additional reasoning sometimes even degrading performance. In this work, we propose \textbf{Post-Reasoning}, a simple yet effective approach that improves instruction-tuned models by conditioning them to justify their answers after generating the final response. By design, it enables the final answer to be obtained without additional latency or token cost, while still improving performance through simple instruction augmentation. We evaluate Post-Reasoning across \(117\) model--benchmark settings spanning \(13\) open and proprietary models, \(4\) model families, and \(9\) diverse reasoning and knowledge-intensive benchmarks, including AMC, HMMT, GSM8K, GPQA, MMLU-Pro, and BIG-Bench Hard. Post-Reasoning improves performance in over \(88.19\%\) of evaluated settings, achieving a mean relative improvements of \(17.37\%\). Furthermore, we propose supervised post-reason tuning, which further improves performance in over \(91.11\%\) of evaluated settings, and exceeds the prompt-based post-reasoning baseline by an average of \(8.01\%\), demonstrating that post-reasoning can be effectively internalized through training. Ultimately, Post-Reasoning establishes a new performance ceiling for direct-answer capabilities.

CLJul 25, 2025
Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Richmond Sin Jing Xuan, Jalil Huseynov, Yang Zhang

Multilingual large language models (LLMs) exhibit strong cross-linguistic generalization, yet medium to low resource languages underperform on common benchmarks such as ARC-Challenge, MMLU, and HellaSwag. We analyze activation patterns in Gemma-2-2B across all 26 residual layers and 10 languages: Chinese (zh), Russian (ru), Spanish (es), Italian (it), medium to low resource languages including Indonesian (id), Catalan (ca), Marathi (mr), Malayalam (ml), and Hindi (hi), with English (en) as the reference. Using Sparse Autoencoders (SAEs), we reveal systematic disparities in activation patterns. Medium to low resource languages receive up to 26.27 percent lower activations in early layers, with a persistent gap of 19.89 percent in deeper layers. To address this, we apply activation-aware fine-tuning via Low-Rank Adaptation (LoRA), leading to substantial activation gains, such as 87.69 percent for Malayalam and 86.32 percent for Hindi, while maintaining English retention at approximately 91 percent. After fine-tuning, benchmark results show modest but consistent improvements, highlighting activation alignment as a key factor in enhancing multilingual LLM performance.