CLSep 30, 2022
Out-of-Distribution Detection and Selective Generation for Conditional Language ModelsJie Ren, Jiaming Luo, Yao Zhao et al. · mit
Machine learning algorithms typically assume independent and identically distributed samples in training and at test time. Much work has shown that high-performing ML classifiers can degrade significantly and provide overly-confident, wrong classification predictions, particularly for out-of-distribution (OOD) inputs. Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence, and may suffer even worse degradation on OOD inputs as the prediction is done auto-regressively over many steps. Furthermore, the space of potential low-quality outputs is larger as arbitrary text can be generated and it is important to know when to trust the generated output. We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. We also show how our method can be used under the common and realistic setting of distribution shift for selective generation (analogous to selective prediction for classification) of high-quality outputs, while automatically abstaining from low-quality ones, enabling safer deployment of generative language models.
CLDec 20, 2022
Improving the Robustness of Summarization Models by Detecting and Removing Input NoiseKundan Krishna, Yao Zhao, Jie Ren et al. · mit
The evaluation of abstractive summarization models typically uses test data that is identically distributed as training data. In real-world practice, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under distribution shift caused by such noise is relatively under-studied. We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) from different types of input noise for a range of datasets and model sizes. We then propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any extra training, auxiliary models, or even prior knowledge of the type of noise. Our proposed approach effectively mitigates the loss in performance, recovering a large fraction of the performance drop, sometimes as large as 11 ROUGE-1 points.
LGSep 4, 2024
Building Math Agents with Multi-Turn Iterative Preference LearningWei Xiong, Chengshuai Shi, Jiaming Shen et al.
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
CLSep 20, 2024
RRM: Robust Reward Model Training Mitigates Reward HackingTianqi Liu, Wei Xiong, Jie Ren et al.
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on RewardBench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.
CLSep 13, 2023
Statistical Rejection Sampling Improves Preference OptimizationTianqi Liu, Yao Zhao, Rishabh Joshi et al.
Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal Policy Optimization (PPO). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. DPO's lack of a reward model constrains its ability to sample preference pairs from the optimal policy, and SLiC is restricted to sampling preference pairs only from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) that aims to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across three diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO on evaluations from both Large Language Model (LLM) and human raters.
CLSep 30, 2022
Calibrating Sequence likelihood Improves Conditional Language GenerationYao Zhao, Misha Khalman, Rishabh Joshi et al.
Conditional language models are predominantly trained with maximum likelihood estimation (MLE), giving probability mass to sparsely observed target sequences. While MLE trained models assign high probability to plausible sequences given the context, the model probabilities often do not accurately rank-order generated sequences by quality. This has been empirically observed in beam search decoding as output quality degrading with large beam sizes, and decoding strategies benefiting from heuristics such as length normalization and repetition-blocking. In this work, we introduce sequence likelihood calibration (SLiC) where the likelihood of model generated sequences are calibrated to better align with reference sequences in the model's latent space. With SLiC, decoding heuristics become unnecessary and decoding candidates' quality significantly improves regardless of the decoding method. Furthermore, SLiC shows no sign of diminishing returns with model scale, and presents alternative ways to improve quality with limited training and inference budgets. With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models.
CLMar 8, 2024
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of contextGemini Team, Petko Georgiev, Ving Ian Lei et al. · deepmind, mila
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic CapabilitiesGheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
CLFeb 2, 2024
LiPO: Listwise Preference Optimization through Learning-to-RankTianqi Liu, Zhen Qin, Junru Wu et al.
Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt. Multiple responses can also be ranked by reward models or AI feedback. There lacks such a thorough study on directly fitting upon a list of responses. In this work, we formulate the LM alignment as a \textit{listwise} ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment with DPO and SLiC as special cases when list size is two. In particular, we highlight a specific method, LiPO-$λ$, which leverages a state-of-the-art \textit{listwise} ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-$λ$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.
13.1CVApr 14
Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity RecognitionMohammad Saleh, Azadeh Tabatabaei
Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation, indicating enhanced explanation reliability. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments
CRApr 14, 2025
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language TasksMohammad Saleh, Azadeh Tabatabaei
Objective: This review explores the trustworthiness of multimodal artificial intelligence (AI) systems, specifically focusing on vision-language tasks. It addresses critical challenges related to fairness, transparency, and ethical implications in these systems, providing a comparative analysis of key tasks such as Visual Question Answering (VQA), image captioning, and visual dialogue. Background: Multimodal models, particularly vision-language models, enhance artificial intelligence (AI) capabilities by integrating visual and textual data, mimicking human learning processes. Despite significant advancements, the trustworthiness of these models remains a crucial concern, particularly as AI systems increasingly confront issues regarding fairness, transparency, and ethics. Methods: This review examines research conducted from 2017 to 2024 focusing on forenamed core vision-language tasks. It employs a comparative approach to analyze these tasks through the lens of trustworthiness, underlining fairness, explainability, and ethics. This study synthesizes findings from recent literature to identify trends, challenges, and state-of-the-art solutions. Results: Several key findings were highlighted. Transparency: Explainability of vision language tasks is important for user trust. Techniques, such as attention maps and gradient-based methods, have successfully addressed this issue. Fairness: Bias mitigation in VQA and visual dialogue systems is essential for ensuring unbiased outcomes across diverse demographic groups. Ethical Implications: Addressing biases in multilingual models and ensuring ethical data handling is critical for the responsible deployment of vision-language systems. Conclusion: This study underscores the importance of integrating fairness, transparency, and ethical considerations in developing vision-language models within a unified framework.
CLDec 19, 2023
Gemini: A Family of Highly Capable Multimodal ModelsGemini Team, Rohan Anil, Sebastian Borgeaud et al.
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
CLMay 17, 2023
SLiC-HF: Sequence Likelihood Calibration with Human FeedbackYao Zhao, Rishabh Joshi, Tianqi Liu et al.
Learning from human feedback has been shown to be effective at aligning language models with human preferences. Past work has often relied on Reinforcement Learning from Human Feedback (RLHF), which optimizes the language model using reward scores assigned from a reward model trained on human preference data. In this work we show how the recently introduced Sequence Likelihood Calibration (SLiC), can also be used to effectively learn from human preferences (SLiC-HF). Furthermore, we demonstrate this can be done with human feedback data collected for a different model, similar to off-policy, offline RL data. Automatic and human evaluation experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines. Furthermore, SLiC-HF presents a competitive alternative to the PPO RLHF implementation used in past work while being much simpler to implement, easier to tune and more computationally efficient in practice.
CLFeb 18, 2021
WebRED: Effective Pretraining And Finetuning For Relation Extraction On The WebRobert Ormandi, Mohammad Saleh, Erin Winter et al.
Relation extraction is used to populate knowledge bases that are important to many applications. Prior datasets used to train relation extraction models either suffer from noisy labels due to distant supervision, are limited to certain domains or are too small to train high-capacity models. This constrains downstream applications of relation extraction. We therefore introduce: WebRED (Web Relation Extraction Dataset), a strongly-supervised human annotated dataset for extracting relationships from a variety of text found on the World Wide Web, consisting of ~110K examples. We also describe the methods we used to collect ~200M examples as pre-training data for this task. We show that combining pre-training on a large weakly supervised dataset with fine-tuning on a small strongly-supervised dataset leads to better relation extraction performance. We provide baselines for this new dataset and present a case for the importance of human annotation in improving the performance of relation extraction from text found on the web.
CLJun 18, 2020
SEAL: Segment-wise Extractive-Abstractive Long-form Text SummarizationYao Zhao, Mohammad Saleh, Peter J. Liu
Most prior work in the sequence-to-sequence paradigm focused on datasets with input sequence lengths in the hundreds of tokens due to the computational constraints of common RNN and Transformer architectures. In this paper, we study long-form abstractive text summarization, a sequence-to-sequence setting with input sequence lengths up to 100,000 tokens and output sequence lengths up to 768 tokens. We propose SEAL, a Transformer-based model, featuring a new encoder-decoder attention that dynamically extracts/selects input snippets to sparsely attend to for each output segment. Using only the original documents and summaries, we derive proxy labels that provide weak supervision for extractive layers simultaneously with regular supervision from abstractive summaries. The SEAL model achieves state-of-the-art results on existing long-form summarization tasks, and outperforms strong baseline models on a new dataset/task we introduce, Search2Wiki, with much longer input text. Since content selection is explicit in the SEAL model, a desirable side effect is that the selection can be inspected for enhanced interpretability.
CLDec 18, 2019
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive SummarizationJingqing Zhang, Yao Zhao, Mohammad Saleh et al.
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.
CLMay 30, 2019
Assessing The Factual Accuracy of Generated TextBen Goodrich, Vinay Rao, Mohammad Saleh et al.
We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.
CLJan 30, 2018
Generating Wikipedia by Summarizing Long SequencesPeter J. Liu, Mohammad Saleh, Etienne Pot et al.
We show that generating English Wikipedia articles can be approached as a multi- document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder- decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.