CL | Dec 20, 2023 | Code
WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning
Zhaojian Yu, Xin Zhang, Ning Shang et al.
Recent work demonstrates that, after instruction tuning, Code Large Language Models (Code LLMs) can obtain impressive capabilities to address a wide range of code-related tasks. However, current instruction tuning methods for Code LLMs mainly focus on the traditional code generation task, resulting in poor performance in complex multi-task scenarios. In this paper, we concentrate on multiple code-related tasks and present WaveCoder, a series of Code LLMs trained with Widespread And Versatile Enhanced instruction data. To enable the models to tackle complex code-related tasks, we propose a method to stably generate diverse, high-quality instruction data from open-source code datasets in multi-task scenarios and obtain CodeSeaXDataset, a dataset comprising 19,915 instruction instances across 4 code-related tasks, which is aimed at improving the generalization ability of Code LLMs. Our experiments demonstrate that WaveCoder models significantly outperform other open-source models in terms of generalization ability across different code-related tasks. Moreover, WaveCoder-Ultra-6.7B achieves state-of-the-art generalization on a wide range of code-related tasks.
CL | Dec 19, 2024 | Code
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark
Qihao Zhao, Yangyu Huang, Tengchao Lv et al.
Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves a 5-shot score of only 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at https://github.com/microsoft/MMLU-CF and the dataset is available at https://huggingface.co/datasets/microsoft/MMLU-CF.
CL | Dec 4, 2024
RedStone: Curating General, Code, Math, and QA Data for Large Language Models
Yaoyao Chang, Lei Cui, Li Dong et al.
Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields, including general language understanding, code, mathematics, and question-answering tasks. The flexibility of RedStone allows for easy adaptation to other specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. Our findings demonstrate that Common Crawl, when harnessed through effective pipelines like RedStone, can serve as a rich, renewable source of pre-training data, unlocking new avenues for domain adaptation and knowledge discovery in LLMs. This work also underscores the importance of innovative data acquisition strategies and highlights the role of web-scale data as a powerful resource in the continued evolution of LLMs. RedStone code and data samples will be publicly available at https://aka.ms/redstone.
CL | Mar 8
Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
Zongqian Li, Tengchao Lv, Shaohan Huang et al.
Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. The MicroCoder dataset delivers obvious improvements on medium and hard problems across different model sizes, achieving up to 17.2% relative gains in overall performance where model capabilities are most stretched. These results validate that difficulty-aware data curation improves model performance on challenging tasks, providing multiple insights for dataset creation in code generation.
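As a rough illustration of the select step in difficulty filtering, the sketch below combines five per-dimension difficulty scores into one weighted score and keeps only problems above a threshold. The abstract does not name the five dimensions, their weights, or the threshold, so `WEIGHTS`, `weighted_difficulty`, and `select_hard` are hypothetical, and the LLM-based predict and calibrate steps are omitted.

```python
from typing import Dict, List, Sequence

# Hypothetical weights for five difficulty dimensions. The abstract does not
# name the dimensions or give their weights; these values are placeholders.
WEIGHTS = (0.3, 0.25, 0.2, 0.15, 0.1)


def weighted_difficulty(scores: Sequence[float],
                        weights: Sequence[float] = WEIGHTS) -> float:
    """Return the weighted mean of per-dimension difficulty scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def select_hard(problems: Dict[str, Sequence[float]],
                threshold: float) -> List[str]:
    """Keep the names of problems whose weighted difficulty meets the threshold."""
    return [name for name, scores in problems.items()
            if weighted_difficulty(scores) >= threshold]


if __name__ == "__main__":
    # Scores are assumed to be in [0, 1]; in practice they would come from
    # an upstream prediction step rather than being written by hand.
    candidates = {
        "easy_problem": [0.1, 0.2, 0.1, 0.1, 0.2],
        "hard_problem": [0.9, 0.8, 0.9, 0.7, 0.8],
    }
    print(select_hard(candidates, threshold=0.5))
```

A weighted mean keeps the combined score on the same scale as the inputs, which makes a single threshold easy to interpret; other aggregation rules would work equally well for this illustration.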
CR | Jul 13, 2021
Argus: A Fully Transparent Incentive System for Anti-Piracy Campaigns (Extended Version)
Xian Zhang, Xiaobing Guo, Zixuan Zeng et al.
Anti-piracy is fundamentally a procedure that relies on collecting data from the open anonymous population, so how to incentivize credible reporting is a question at the center of the problem. Industrial alliances and companies are running anti-piracy incentive campaigns, but their effectiveness is publicly questioned due to the lack of transparency. We believe that full transparency of a campaign is necessary to truly incentivize people. It means that every role, e.g., content owner, licensee of the content, or every person in the open population, can understand the mechanism and be assured about its execution without trusting any single role. We see this as a distributed system problem. In this paper, we present Argus, a fully transparent incentive system for anti-piracy campaigns. The groundwork of Argus is to formulate the objectives for fully transparent incentive mechanisms, which securely and comprehensively consolidate the different interests of all roles. These objectives form the core of the Argus design, highlighted by our innovations in a Sybil-proof incentive function, a commit-and-reveal scheme, and an oblivious transfer scheme. In the implementation, we overcome a set of unavoidable obstacles to ensure security despite full transparency. Moreover, we effectively optimize several cryptographic operations so that the cost of a piracy report is reduced to the equivalent of sending about 14 ETH-transfer transactions on the public Ethereum network, where it would otherwise correspond to thousands of transactions. With the security and practicality of Argus, we hope real-world anti-piracy campaigns will be truly effective by shifting to a fully transparent incentive mechanism.
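For readers unfamiliar with the commit-and-reveal idea mentioned above, here is a minimal generic sketch using a salted hash. It is not the Argus construction, whose details are not given in the abstract; the function names `commit` and `verify` and the use of SHA-256 are assumptions made for illustration only.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def commit(report: bytes) -> Tuple[bytes, bytes]:
    """Commit phase: return (commitment, nonce).

    Only the commitment is published; the nonce and report stay private
    until the reveal phase. The random nonce stops others from guessing
    the report by hashing candidate values.
    """
    nonce = secrets.token_bytes(32)
    commitment = hashlib.sha256(nonce + report).digest()
    return commitment, nonce


def verify(commitment: bytes, nonce: bytes, report: bytes) -> bool:
    """Reveal phase: check that (nonce, report) opens the published commitment."""
    expected = hashlib.sha256(nonce + report).digest()
    return hmac.compare_digest(commitment, expected)


if __name__ == "__main__":
    c, n = commit(b"report: infringing copy found")
    print(verify(c, n, b"report: infringing copy found"))  # honest reveal
    print(verify(c, n, b"report: something else"))         # altered reveal
```

The point of the two phases is ordering: a reporter fixes the content of a report before anyone else can see or copy it, and can later prove what was committed.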