CL, Mar 6
SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
Yunlong Chu, Minglai Shao, Yuhang Liu et al.
Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
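The abstract above names a Sinkhorn optimal-transport objective for matching pause tokens to reasoning segments. The following is a minimal sketch of generic entropy-regularized Sinkhorn iterations, not the SPOT training objective itself. The cost matrix, the marginals `a` and `b`, and the regularization strength `eps` are all illustrative assumptions; the abstract does not specify them.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Return an entropy-regularized transport plan between histograms a and b.

    cost[i][j] is a hypothetical distance between pause token i and
    segment token j; a and b are the row and column marginals.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low cost gives a high affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two pause tokens softly matched against three segment tokens (toy costs).
toy_cost = [[0.0, 0.1, 1.0],
            [1.0, 0.9, 0.0]]
plan = sinkhorn(toy_cost, a=[0.5, 0.5], b=[1 / 3, 1 / 3, 1 / 3])
```

Each row of `plan` spreads one pause token's mass over the whole segment rather than a single position, which is the soft, span-level flavor of matching the abstract contrasts with point-to-point alignment.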
CL, Mar 6
RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Yuhang Liu, Ruijie Wang, Yunlong Chu et al.
Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average accuracy improvement of 8.1 percentage points and a 79.1% reduction in output tokens compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
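The routing idea described above (strong models for planning and synthesis, lighter models for easy leaf subtasks, all under a token budget) can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not RouteGoT's actual algorithm: the `Node` type, the per-call token costs, the difficulty threshold, and the greedy hardest-first policy are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str          # "plan", "leaf", or "synthesis"
    difficulty: float  # hypothetical predicted difficulty in [0, 1]

STRONG_COST = 800  # assumed tokens per strong-model call
LIGHT_COST = 150   # assumed tokens per light-model call

def route(nodes, budget, threshold=0.6):
    """Assign a model tier to each node; return (assignments, tokens_spent)."""
    assignments = {}
    spent = 0
    # Planning and synthesis always go to the strong model in this sketch,
    # even if that alone exceeds the budget.
    for node in nodes:
        if node.role in ("plan", "synthesis"):
            assignments[node.name] = "strong"
            spent += STRONG_COST
    # Visit leaves from hardest to easiest so hard ones get first claim
    # on whatever budget remains.
    leaves = sorted((n for n in nodes if n.role == "leaf"),
                    key=lambda n: -n.difficulty)
    for node in leaves:
        remaining = budget - spent
        if node.difficulty >= threshold and remaining >= STRONG_COST:
            choice, cost = "strong", STRONG_COST
        elif remaining >= LIGHT_COST:
            choice, cost = "light", LIGHT_COST
        else:
            choice, cost = "skip", 0  # budget exhausted: do not expand
        assignments[node.name] = choice
        spent += cost
    return assignments, spent

graph = [Node("plan", "plan", 1.0), Node("a", "leaf", 0.9),
         Node("b", "leaf", 0.2), Node("c", "leaf", 0.3),
         Node("final", "synthesis", 1.0)]
assignments, spent = route(graph, budget=2700)
```

Lowering `budget` in this toy setup first demotes hard leaves to the light model and then skips the easiest ones, which gives a rough picture of the performance-cost trade-off the abstract refers to.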
CV, Aug 31, 2025
Face4FairShifts: A Large Image Benchmark for Fairness and Robust Learning across Visual Domains
Yumeng Lin, Dong Li, Xintao Wu et al.
Ensuring fairness and robustness in machine learning models remains a challenge, particularly under domain shifts. We present Face4FairShifts, a large-scale facial image benchmark designed to systematically evaluate fairness-aware learning and domain generalization. The dataset includes 100,000 images across four visually distinct domains with 39 annotations within 14 attributes covering demographic and facial features. Through extensive experiments, we analyze model performance under distribution shifts and identify significant gaps. Our findings emphasize the limitations of existing related datasets and the need for more effective fairness-aware domain adaptation techniques. Face4FairShifts provides a comprehensive testbed for advancing equitable and reliable AI systems. The dataset is available online at https://meviuslab.github.io/Face4FairShifts/.
CL, Jun 29, 2025
Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family
Yumeng Lin, Xufeng Duan, David Haslett et al.
Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations, and assess translation quality using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.