CLAug 2, 2024Code
RAGEval: Scenario Specific RAG Evaluation Dataset Generation FrameworkKunlun Zhu, Yifan Luo, Dingling Xu et al.
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance to evaluate LLM generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
IRAug 27, 2023Code
Text Matching Improves Sequential Recommendation by Reducing Popularity BiasesZhenghao Liu, Sen Mei, Chenyan Xiong et al.
This paper proposes Text mAtching based SequenTial rEcommendation model (TASTE), which maps items and users in an embedding space and recommends items by matching their text representations. TASTE verbalizes items and user-item interactions using identifiers and attributes of items. To better characterize user behaviors, TASTE additionally proposes an attention sparsity method, which enables TASTE to model longer user-item interactions by reducing the self-attention computations during encoding. Our experiments show that TASTE outperforms the state-of-the-art methods on widely used sequential recommendation datasets. TASTE alleviates the cold start problem by representing long-tail items using full-text modeling and bringing the benefits of pretrained language models to recommendation systems. Our further analyses illustrate that TASTE significantly improves the recommendation accuracy by reducing the popularity bias of previous item id based recommendation models and returning more appropriate and text-relevant items to satisfy users. All codes are available at https://github.com/OpenMatch/TASTE.
IRMay 4, 2022Code
P^3 Ranker: Mitigating the Gaps between Pre-training and Ranking Fine-tuning with Prompt-based Learning and Pre-finetuningXiaomeng Hu, Shi Yu, Chenyan Xiong et al.
Compared to other language tasks, applying pre-trained language models (PLMs) for search ranking often requires more nuances and training signals. In this paper, we identify and study the two mismatches between pre-training and ranking fine-tuning: the training schema gap regarding the differences in training objectives and model architectures, and the task knowledge gap considering the discrepancy between the knowledge needed in ranking and that learned during pre-training. To mitigate these gaps, we propose Pre-trained, Prompt-learned and Pre-finetuned Neural Ranker (P^3 Ranker). P^3 Ranker leverages prompt-based learning to convert the ranking task into a pre-training like schema and uses pre-finetuning to initialize the model on intermediate supervised tasks. Experiments on MS MARCO and Robust04 show the superior performances of P^3 Ranker in few-shot ranking. Analyses reveal that P^3 Ranker is able to better accustom to the ranking task through prompt-based learning and retrieve necessary ranking-oriented knowledge gleaned in pre-finetuning, resulting in data-efficient PLM adaptation. Our code is available at https://github.com/NEUIR/P3Ranker.
CLApr 12, 2023Code
Rethinking Dense Retrieval's Few-Shot AbilitySi Sun, Yida Lu, Shi Yu et al. · tsinghua
Few-shot dense retrieval (DR) aims to effectively generalize to novel search scenarios by learning a few samples. Despite its importance, there is little study on specialized datasets and standardized evaluation protocols. As a result, current methods often resort to random sampling from supervised datasets to create "few-data" setups and employ inconsistent training strategies during evaluations, which poses a challenge in accurately comparing recent progress. In this paper, we propose a customized FewDR dataset and a unified evaluation benchmark. Specifically, FewDR employs class-wise sampling to establish a standardized "few-shot" setting with finely-defined classes, reducing variability in multiple sampling rounds. Moreover, the dataset is disjointed into base and novel classes, allowing DR models to be continuously trained on ample data from base classes and a few samples in novel classes. This benchmark eliminates the risk of novel class leakage, providing a reliable estimation of the DR model's few-shot ability. Our extensive empirical results reveal that current state-of-the-art DR models still face challenges in the standard few-shot scene. Our code and data will be open-sourced at https://github.com/OpenMatch/ANCE-Tele.
CLSep 2, 2024
Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip ScriptsYingfa Chen, Chenlong Hu, Cong Feng et al. · tsinghua
This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.
IROct 14, 2024Code
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality DocumentsShi Yu, Chaoyue Tang, Bokai Xu et al.
Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20--40% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
CLOct 17, 2024Code
RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data RewardsXinze Li, Sen Mei, Zhenghao Liu et al.
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for the RAG systems, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to handle diverse RAG tasks using different instructions. However, it trains RAG modules to overfit training signals and overlooks the varying data preferences among agents within the RAG system. In this paper, we propose a Differentiable Data Rewards (DDR) method, which end-to-end trains RAG systems by aligning data preferences between different RAG modules. DDR works by collecting the rewards to optimize each agent in the RAG system with the rollout method, which prompts agents to sample some potential responses as perturbations, evaluates the impact of these perturbations on the whole RAG system, and subsequently optimizes the agent to produce outputs that improve the performance of the RAG system. Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes the generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. All codes are available at https://github.com/OpenMatch/RAG-DDR.
CLFeb 25, 2025Code
RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-ThoughtsMingyan Wu, Zhenghao Liu, Yukun Yan et al.
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals in generating CoT-based summarization for knowledge refinement based on given query and all retrieval documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires LLM to filter out irrelevant documents during generating CoT-style summarization. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
CLFeb 19, 2025Code
Craw4LLM: Efficient Web Crawling for LLM PretrainingShi Yu, Zhiyuan Liu, Chenyan Xiong
Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.
CLFeb 21, 2024Code
ThinkNote: Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist Cognition ModelingZhipeng Xu, Zhenghao Liu, Yukun Yan et al.
Large Language Models (LLMs) have demonstrated strong performance across a wide range of NLP tasks. However, they often exhibit suboptimal behaviors and inconsistencies when exposed to unfamiliar external information, underscoring their limitations in effectively leveraging such knowledge. Inspired by constructivist learning theory, we propose ThinkNote, a novel framework that enhances the external knowledge utilization of LLMs through a two-stage constructivist cognitive modeling process. Specifically, ThinkNote performs knowledge assimilation to align new information with the model's parametric memory, forming a coherent internal representation. It then applies thought accommodation to adapt internal reasoning, thereby promoting more consistent and reliable outputs. Extensive experimental results demonstrate that ThinkNote achieves a 10% improvement over strong baseline methods on various question-answering benchmarks. Further analysis indicates that ThinkNote effectively integrates and utilizes external knowledge to help LLMs generate accurate responses and improves their self-consistency. All data and codes are available at https://github.com/OpenMatch/ThinkNote.
CLFeb 25, 2024Code
Say More with Less: Understanding Prompt Learning Behaviors through Gist CompressionXinze Li, Zhenghao Liu, Chenyan Xiong et al.
Large language models (LLMs) require lengthy prompts as the input context to produce output aligned with user intentions, a process that incurs extra costs during inference. In this paper, we propose the Gist COnditioned deCOding (Gist-COCO) model, introducing a novel method for compressing prompts which also can assist the prompt interpretation and engineering. Gist-COCO employs an encoder-decoder based language model and then incorporates an additional encoder as a plugin module to compress prompts with inputs using gist tokens. It finetunes the compression plugin module and uses the representations of gist tokens to emulate the raw prompts in the vanilla language model. By verbalizing the representations of gist tokens into gist prompts, the compression ability of Gist-COCO can be generalized to different LLMs with high compression rates. Our experiments demonstrate that Gist-COCO outperforms previous prompt compression models in both passage and instruction compression tasks. Further analysis on gist verbalization results suggests that our gist prompts serve different functions in aiding language models. They may directly provide potential answers, generate the chain-of-thought, or simply repeat the inputs. All data and codes are available at https://github.com/OpenMatch/Gist-COCO .
IRFeb 24, 2025Code
ExpandR: Teaching Dense Retrievers Beyond Queries with LLM GuidanceSijia Yao, Pengcheng Huang, Zhenghao Liu et al.
Large language models (LLMs) have demonstrated significant potential in enhancing dense retrieval through query augmentation. However, most existing methods treat the LLM and the retriever as separate modules, overlooking the alignment between generation and ranking objectives. In this work, we propose ExpandR, a unified LLM-augmented dense retrieval framework that jointly optimizes both the LLM and the retriever. ExpandR employs the LLM to generate semantically rich query expansions, which are leveraged to enhance the retriever's training. Simultaneously, the LLM is trained using Direct Preference Optimization (DPO), guided by a carefully designed reward function that balances retrieval effectiveness and generation consistency. This joint optimization paradigm enables mutual adaptation between the LLM and the retriever, resulting in query expansions that are both informative and well-suited for retrieval. Experimental results on multiple benchmarks show that ExpandR consistently outperforms strong baselines, achieving more than a 5% improvement in retrieval performance. All codes are available at https://github.com/NEUIR/ExpandR.
AIFeb 4
WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement LearningZelai Xu, Zhexuan Xu, Ruize Zhang et al.
Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.
CLNov 22, 2024Code
KBAlign: Efficient Self Adaptation on Specific Knowledge BasesZheni Zeng, Yuxuan Chen, Shi Yu et al.
Although retrieval-augmented generation (RAG) remains essential for knowledge-based question answering (KBQA), current paradigms face critical challenges under specific domains. Existing methods struggle with targeted adaptation on small-scale KBs: vanilla unsupervised training exhibits poor effectiveness, while fine-tuning incurs prohibitive costs of external signals. We present KBAlign, a self-supervised framework that enhances RAG systems through efficient model adaptation. Our key insight is to leverage the model's intrinsic capabilities for knowledge alignment through two innovative mechanisms: multi-grained self-annotation that captures global knowledge for data construction, and iterative tuning that accelerates convergence through self verification. This framework enables cost-effective model adaptation to specific textual KBs, without human supervision or external model assistance. Experiments demonstrate that KBAlign can achieve 90\% of the performance gain obtained through GPT-4-supervised adaptation, while relying entirely on self-annotation of much smaller models. KBAlign significantly improves downstream QA accuracy across multiple domains with tiny costs, particularly benefiting scenarios requiring deep knowledge integration from specialized corpora. We release our experimental data, models, and process analyses to the community for further exploration (https://github.com/thunlp/KBAlign).
CLOct 11, 2024Code
DeepNote: Note-Centric Deep Retrieval-Augmented GenerationRuobing Wang, Qingfei Zhao, Yukun Yan et al.
Retrieval-Augmented Generation (RAG) mitigates factual errors and hallucinations in Large Language Models (LLMs) for question-answering (QA) by incorporating external knowledge. However, existing adaptive RAG methods rely on LLMs to predict retrieval timing and directly use retrieved information for generation, often failing to reflect real information needs and fully leverage retrieved knowledge. We develop DeepNote, an adaptive RAG framework that achieves in-depth and robust exploration of knowledge sources through note-centric adaptive retrieval. DeepNote employs notes as carriers for refining and accumulating knowledge. During in-depth exploration, it uses these notes to determine retrieval timing, formulate retrieval queries, and iteratively assess knowledge growth, ultimately leveraging the best note for answer generation. Extensive experiments and analyses demonstrate that DeepNote significantly outperforms all baselines (+10.2% to +20.1%) and exhibits the ability to gather knowledge with both high density and quality. Additionally, DPO further improves the performance of DeepNote. The code and data are available at https://github.com/thunlp/DeepNote.
CLMay 27, 2023Code
Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-InZichun Yu, Chenyan Xiong, Shi Yu et al.
Retrieval augmentation can aid language models (LMs) in knowledge-intensive tasks by supplying them with external information. Prior works on retrieval augmentation usually jointly fine-tune the retriever and the LM, making them closely coupled. In this paper, we explore the scheme of generic retrieval plug-in: the retriever is to assist target LMs that may not be known beforehand or are unable to be fine-tuned together. To retrieve useful documents for unseen target LMs, we propose augmentation-adapted retriever (AAR), which learns LM's preferences obtained from a known source LM. Experiments on the MMLU and PopQA datasets demonstrate that our AAR trained with a small source LM is able to significantly improve the zero-shot generalization of larger target LMs ranging from 250M Flan-T5 to 175B InstructGPT. Further analysis indicates that the preferences of different LMs overlap, enabling AAR trained with a single source LM to serve as a generic plug-in for various target LMs. Our code is open-sourced at https://github.com/OpenMatch/Augmentation-Adapted-Retriever.
IRMay 10, 2021Code
Few-Shot Conversational Dense RetrievalShi Yu, Zhenghao Liu, Chenyan Xiong et al.
Dense retrieval (DR) has the potential to resolve the query understanding challenge in conversational search by matching in the learned embedding space. However, this adaptation is challenging due to DR models' extra needs for supervision signals and the long-tail nature of conversational search. In this paper, we present a Conversational Dense Retrieval system, ConvDR, that learns contextualized embeddings for multi-turn conversational queries and retrieves documents solely using embedding dot products. In addition, we grant ConvDR few-shot ability using a teacher-student framework, where we employ an ad hoc dense retriever as the teacher, inherit its document encodings, and learn a student query encoder to mimic the teacher embeddings on oracle reformulated queries. Our experiments on TREC CAsT and OR-QuAC demonstrate ConvDR's effectiveness in both few-shot and fully-supervised settings. It outperforms previous systems that operate in the sparse word space, matches the retrieval accuracy of oracle query reformulations, and is also more efficient thanks to its simplicity. Our analyses reveal that the advantages of ConvDR come from its ability to capture informative context while ignoring the unrelated context in previous conversation rounds. This makes ConvDR more effective as conversations evolve while previous systems may get confused by the increased noise from previous turns. Our code is publicly available at https://github.com/thunlp/ConvDR.
IRNov 3, 2020Code
CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain SearchChenyan Xiong, Zhenghao Liu, Si Sun et al.
Neural rankers based on deep pretrained language models (LMs) have been shown to improve many information retrieval benchmarks. However, these methods are affected by their the correlation between pretraining domain and target domain and rely on massive fine-tuning relevance labels. Directly applying pretraining methods to specific domains may result in suboptimal search quality because specific domains may have domain adaption problems, such as the COVID domain. This paper presents a search system to alleviate the special domain adaption problem. The system utilizes the domain-adaptive pretraining and few-shot learning technologies to help neural rankers mitigate the domain discrepancy and label scarcity problems. Besides, we also integrate dense retrieval to alleviate traditional sparse retrieval's vocabulary mismatch obstacle. Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task, which aims to retrieve useful information from scientific literature related to COVID-19. Our code is publicly available at https://github.com/thunlp/OpenMatch.
CLFeb 17, 2020Code
A Financial Service Chatbot based on Deep Bidirectional TransformersShi Yu, Yuxin Chen, Hussain Zaidi
We develop a chatbot using Deep Bidirectional Transformer models (BERT) to handle client questions in financial investment customer service. The bot can recognize 381 intents, and decides when to say "I don't know" and escalates irrelevant/uncertain questions to human operators. Our main novel contribution is the discussion about uncertainty measure for BERT, where three different approaches are systematically compared on real problems. We investigated two uncertainty metrics, information entropy and variance of dropout sampling in BERT, followed by mixed-integer programming to optimize decision thresholds. Another novel contribution is the usage of BERT as a language model in automatic spelling correction. Inputs with accidental spelling errors can significantly decrease intent classification performance. The proposed approach combines probabilities from masked language model and word edit distances to find the best corrections for misspelled words. The chatbot and the entire conversational AI system are developed using open-source tools, and deployed within our company's intranet. The proposed approach can be useful for industries seeking similar in-house solutions in their specific business domains. We share all our code and a sample chatbot built on a public dataset on Github.
CLOct 21, 2024
Building A Coding Assistant via the Retrieval-Augmented Language ModelXinze Li, Hanbin Wang, Zhenghao Liu et al.
Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion tasks. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and learn effective representations for code snippets and documentation. Then CONAN-G designs a dual-view code representation mechanism for implementing a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs and capturing structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both program language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. It shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.
LGNov 18, 2025
Extending Test-Time Scaling: A 3D Perspective with Context, Batch, and TurnChao Yu, Qixin Tan, Jiaxuan Gao et al.
Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally limited by the limited context length of base models, which remains orders of magnitude smaller than the amount of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effect and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves the reasoning performance of challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.
CLOct 10, 2025
VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented GenerationYubo Sun, Chunyi Peng, Yukun Yan et al.
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over backbone VLM with 27\% improvements on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.
CLJun 3, 2025
KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAGYongjian Li, HaoCheng Chu, Yukun Yan et al.
Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents-even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO)-a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on Github.
PMMay 19, 2021
Robo-Advising: Enhancing Investment with Inverse Optimization and Deep Reinforcement LearningHaoran Wang, Shi Yu
Machine Learning (ML) has been embraced as a powerful tool by the financial industry, with notable applications spreading in various domains including investment management. In this work, we propose a full-cycle data-driven investment robo-advising framework, consisting of two ML agents. The first agent, an inverse portfolio optimization agent, infers an investor's risk preference and expected return directly from historical allocation data using online inverse optimization. The second agent, a deep reinforcement learning (RL) agent, aggregates the inferred sequence of expected returns to formulate a new multi-period mean-variance portfolio optimization problem that can be solved using deep RL approaches. The proposed investment pipeline is applied on real market data from April 1, 2016 to February 1, 2021 and has shown to consistently outperform the S&P 500 benchmark portfolio that represents the aggregate market optimal allocation. The outperformance may be attributed to the the multi-period planning (versus single-period planning) and the data-driven RL approach (versus classical estimation approach).
CLDec 18, 2020
Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement LearningJerry Zikun Chen, Shi Yu, Haoran Wang
Query reformulation aims to alter noisy or ambiguous text sequences into coherent ones closer to natural language questions. This is to prevent errors from propagating in a client-facing pipeline and promote better communication with users. Besides, it is crucial to maintain performance in downstream environments like question answering when rephrased queries are given as input. We show that under the previous framework (AQA), attempts to alter RL algorithms do not bring significant benefits to either reward acquisition or sequence fluency. Instead, we leverage a query-reformulating text-to-text transformer (QRT5) and apply policy-based RL algorithms to further nudge this reformulator and obtain better answers downstream by generating reward-acquiring query trajectories. QRT5 shows better sample efficiency in RL to achieve the same level of QA performance as the previous approach. It can generate reformulations with more readability based on query well-formedness evaluations and can generalize to out-of-sample data. Our framework is demonstrated to be flexible, allowing reward signals to be sourced from different downstream environments such as intent classification.
PMOct 4, 2020
Learning Risk Preferences from Investment Portfolios Using Inverse OptimizationShi Yu, Haoran Wang, Chaosheng Dong
The fundamental principle in Modern Portfolio Theory (MPT) is based on the quantification of the portfolio's risk related to performance. Although MPT has made huge impacts on the investment world and prompted the success and prevalence of passive investing, it still has shortcomings in real-world applications. One of the main challenges is that the level of risk an investor can endure, known as \emph{risk-preference}, is a subjective choice that is tightly related to psychology and behavioral science in decision making. This paper presents a novel approach of measuring risk preference from existing portfolios using inverse optimization on the mean-variance portfolio allocation framework. Our approach allows the learner to continuously estimate real-time risk preferences using concurrent observed portfolios and market price data. We demonstrate our methods on real market data that consists of 20 years of asset pricing and 10 years of mutual fund portfolio holdings. Moreover, the quantified risk preference parameters are validated with two well-known risk measurements currently applied in the field. The proposed methods could lead to practical and fruitful innovations in automated/personalized portfolio management, such as Robo-advising, to augment financial advisors' decision intelligence in a long-term investment horizon.
IRJun 9, 2020
Few-Shot Generative Conversational Query RewritingShi Yu, Jiahua Liu, Jingqin Yang et al.
Conversational query rewriting aims to reformulate a concise conversational query to a fully specified, context-independent query that can be effectively handled by existing information retrieval systems. This paper presents a few-shot generative approach to conversational query rewriting. We develop two methods, based on rules and self-supervised learning, to generate weak supervision data using large amounts of ad hoc search sessions, and to fine-tune GPT-2 to rewrite conversational queries. On the TREC Conversational Assistance Track, our weakly supervised GPT-2 rewriter improves the state-of-the-art ranking accuracy by 12%, only using very limited amounts of manual query rewrites. In the zero-shot learning setting, the rewriter still gives a comparable result to previous state-of-the-art systems. Our analyses reveal that GPT-2 effectively picks up the task syntax and learns to capture context dependencies, even for hard cases that involve group references and long-turn dependencies.
OCApr 22, 2020
Eigendecomposition of Q in Equally Constrained Quadratic ProgrammingShi Yu
When applying eigenvalue decomposition on the quadratic term matrix in a type of linear equally constrained quadratic programming (EQP), there exists a linear mapping to project optimal solutions between the new EQP formulation where $Q$ is diagonalized and the original formulation. Although such a mapping requires a particular type of equality constraints, it is generalizable to some real problems such as efficient frontier for portfolio allocation and classification of Least Square Support Vector Machines (LSSVM). The established mapping could be potentially useful to explore optimal solutions in subspace, but it is not very clear to the author. This work was inspired by similar work proved on unconstrained formulation discussed earlier in \cite{Tan}, but its current proof is much improved and generalized. To the author's knowledge, very few similar discussion appears in literature.