CLMay 18, 2025
Not All Documents Are What You Need for Extracting Instruction Tuning DataChi Zhang, Huaping Zhong, Hongtao Li et al.
Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B
IRFeb 23, 2022
A Semi-Synthetic Dataset Generation Framework for Causal Inference in Recommender SystemsYan Lyu, Sunhao Dai, Peng Wu et al.
Accurate recommendation and reliable explanation are two key issues for modern recommender systems. However, most recommendation benchmarks only concern the prediction of user-item ratings while omitting the underlying causes behind the ratings. For example, the widely-used Yahoo!R3 dataset contains little information on the causes of the user-movie ratings. A solution could be to conduct surveys and require the users to provide such information. In practice, the user surveys can hardly avoid compliance issues and sparse user responses, which greatly hinders the exploration of causality-based recommendation. To better support the studies of causal inference and further explanations in recommender systems, we propose a novel semi-synthetic data generation framework for recommender systems where causal graphical models with missingness are employed to describe the causal mechanism of practical recommendation scenarios. To illustrate the use of our framework, we construct a semi-synthetic dataset with Causal Tags And Ratings (CTAR), based on the movies as well as their descriptive tags and rating information collected from a famous movie rating website. Using the collected data and the causal graph, the user-item-ratings and their corresponding user-item-tags are automatically generated, which provides the reasons (selected tags) why the user rates the items. Descriptive statistics and baseline results regarding the CTAR dataset are also reported. The proposed data generation framework is not limited to recommendation, and the released APIs can be used to generate customized datasets for other research tasks.
IRJan 18, 2022
On the Opportunity of Causal Learning in Recommendation Systems: Foundation, Estimation, Prediction and ChallengesPeng Wu, Haoxuan Li, Yuhao Deng et al.
Recently, recommender system (RS) based on causal inference has gained much attention in the industrial community, as well as the states of the art performance in many prediction and debiasing tasks. Nevertheless, a unified causal analysis framework has not been established yet. Many causal-based prediction and debiasing studies rarely discuss the causal interpretation of various biases and the rationality of the corresponding causal assumptions. In this paper, we first provide a formal causal analysis framework to survey and unify the existing causal-inspired recommendation methods, which can accommodate different scenarios in RS. Then we propose a new taxonomy and give formal causal definitions of various biases in RS from the perspective of violating the assumptions adopted in causal analysis. Finally, we formalize many debiasing and prediction tasks in RS, and summarize the statistical and machine learning-based causal estimation methods, expecting to provide new research opportunities and perspectives to the causal RS community.