CLOct 17, 2024Code
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMsForrest Sheng Bao, Miaoran Li, Renyi Qu et al.
Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models both suffer from a lack of diversity and recency in the LLM and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. ``Challenging'' here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed on. Our results show GPT-4o and GPT-3.5-Turbo produce the least hallucinations. However, even the best hallucination detection models have near 50\% accuracies on FaithBench, indicating lots of room for future improvement. The repo is https://github.com/vectara/FaithBench
CLMay 7, 2025Code
Benchmarking LLM Faithfulness in RAG with Evolving LeaderboardsManveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu et al.
Retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding responses in external context, yet large language models (LLMs) still frequently introduce unsupported information or contradictions even when provided with relevant context. This paper presents two complementary efforts at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe our original hallucination leaderboard, which has tracked hallucination rates for LLMs since 2023 using our HHEM hallucination detection model. Motivated by limitations observed in current hallucination detection methods, we introduce FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse human-annotated hallucination examples to substantially improve the automated hallucination evaluation of LLMs. We introduce an enhanced hallucination leaderboard centered on FaithJudge that benchmarks LLMs on RAG faithfulness in summarization, question-answering, and data-to-text generation tasks. FaithJudge enables a more reliable benchmarking of LLM hallucinations in RAG and supports the development of more trustworthy generative AI systems: https://github.com/vectara/FaithJudge.
AIDec 20, 2022
DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-FreelyForrest Sheng Bao, Ruixuan Tu, Ge Luo et al.
Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. After being repurposed reference-freely, the zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
AIJun 24, 2020
Circuit Routing Using Monte Carlo Tree Search and Deep Neural NetworksYoubiao He, Forrest Sheng Bao
Circuit routing is a fundamental problem in designing electronic systems such as integrated circuits (ICs) and printed circuit boards (PCBs) which form the hardware of electronics and computers. Like finding paths between pairs of locations, circuit routing generates traces of wires to connect contacts or leads of circuit components. It is challenging because finding paths between dense and massive electronic components involves a very large search space. Existing solutions are either manually designed with domain knowledge or tailored to specific design rules, hence, difficult to adapt to new problems or design needs. Therefore, a general routing approach is highly desired. In this paper, we model the circuit routing as a sequential decision-making problem, and solve it by Monte Carlo tree search (MCTS) with deep neural network (DNN) guided rollout. It could be easily extended to routing cases with more routing constraints and optimization goals. Experiments on randomly generated single-layer circuits show the potential to route complex circuits. The proposed approach can solve the problems that benchmark methods such as sequential A* method and Lee's algorithm cannot solve, and can also outperform the vanilla MCTS approach.
LGMay 13, 2020
Triaging moderate COVID-19 and other viral pneumonias from routine blood testsForrest Sheng Bao, Youbiao He, Jie Liu et al.
The COVID-19 is sweeping the world with deadly consequences. Its contagious nature and clinical similarity to other pneumonias make separating subjects contracted with COVID-19 and non-COVID-19 viral pneumonia a priority and a challenge. However, COVID-19 testing has been greatly limited by the availability and cost of existing methods, even in developed countries like the US. Intrigued by the wide availability of routine blood tests, we propose to leverage them for COVID-19 testing using the power of machine learning. Two proven-robust machine learning model families, random forests (RFs) and support vector machines (SVMs), are employed to tackle the challenge. Trained on blood data from 208 moderate COVID-19 subjects and 86 subjects with non-COVID-19 moderate viral pneumonia, the best result is obtained in an SVM-based classifier with an accuracy of 84%, a sensitivity of 88%, a specificity of 80%, and a precision of 92%. The results are found explainable from both machine learning and medical perspectives. A privacy-protected web portal is set up to help medical personnel in their practice and the trained models are released for developers to further build other applications. We hope our results can help the world fight this pandemic and welcome clinical verification of our approach on larger populations.
CLMay 13, 2020
SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative SamplingForrest Sheng Bao, Hebi Li, Ge Luo et al.
Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity which cannot well capture semantics nor linguistic quality and require a reference summary which is costly to obtain. Recently, there have been a growing number of efforts to alleviate either or both of the two drawbacks. In this paper, we present a proof-of-concept study to a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries. In cross-domain tests, our strategy outperforms baselines with promising improvements, and show a great advantage in gauging linguistic qualities over all metrics.
DCOct 20, 2019
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement LearningDi Zhang, Dong Dai, Youbiao He et al.
Today high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But, once configured and deployed by the experts, such priority functions can hardly adapt to the changes of job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual interventions or expert knowledge, but can learn high-quality scheduling policies via its own continuous 'trial and error'. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies towards various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.
CLFeb 26, 2017
Detecting (Un)Important Content for Single-Document News SummarizationYinfei Yang, Forrest Sheng Bao, Ani Nenkova
We present a robust approach for detecting intrinsic sentence importance in news, by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the "beginning of document" heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because in the absence of cross-document repetition, single document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline.