CL · Sep 7, 2023 · Code
FLM-101B: An Open LLM and How to Train It with $100K Budget
Xiang Li, Yiqun Yao, Xin Jiang et al. · tencent-ai, tsinghua
Large language models (LLMs) are considered important approaches towards foundational machine intelligence, achieving remarkable success in Natural Language Processing and multimodal tasks, among others. However, the carbon footprints and financial costs originating from heavy pre-training computation are a non-negligible issue. Progressive training methods, inspired by the neurogenesis process that grows neural structures, have shown potential to accelerate LLM pre-training. However, the algorithms, implementation, and practices for progressively training LLMs beyond 100B parameters remain underexplored. In this paper, we show that our model, namely FLM-101B, trained with our growth strategy under a budget of $100K, reaches 80% of the baselines' performance with only 10% of their floating-point operations. We believe that further studies on progressive training will benefit the community by cutting down the costs and promoting green AI. The checkpoint of FLM-101B is released at https://huggingface.co/CofeAI/FLM-101B.
CV · Nov 29, 2022
NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
Yijiang Liu, Huanrui Yang, Zhen Dong et al. · berkeley
The complicated architecture and high training cost of vision transformers urge the exploration of post-training quantization. However, the heavy-tailed distribution of vision transformer activations hinders the effectiveness of previous post-training quantization methods, even with advanced quantizer designs. Instead of tuning the quantizer to better fit the complicated activation distribution, this paper proposes NoisyQuant, a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers. We make a surprising theoretical discovery that for a given quantizer, adding a fixed Uniform noisy bias to the values being quantized can significantly reduce the quantization error under provable conditions. Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution with additive noisy bias to fit a given quantizer. Extensive experiments show NoisyQuant largely improves the post-training quantization performance of vision transformer with minimal computation overhead. For instance, on linear uniform 6-bit activation quantization, NoisyQuant improves SOTA top-1 accuracy on ImageNet by up to 1.7%, 1.1% and 0.5% for ViT, DeiT, and Swin Transformer respectively, achieving on-par or even higher performance than previous nonlinear, mixed-precision quantization.
LG · Jun 2 · Code
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
Liulu He, XuanAng Liu, Juntao Liu et al.
Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2, 3-bit), resulting in a "deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional "lifted" space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously, as the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). As an advantage over VQ, LiftQuant's decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining a hardware-friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at https://github.com/Heliulu/LiftQuant.
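The memory claim in the abstract above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not code from the LiftQuant repository: it assumes decimal gigabytes, counts weight storage only (no activations, KV cache, or projection-matrix overhead), and uses hypothetical function names and hypothetical dimensions (12 and 5) chosen only to produce a 2.4-bit ratio.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_bits(lifted_dim: int, original_dim: int) -> float:
    """Effective bit-width as the ratio of lifted to original dimension.

    Follows the abstract's description: each lifted coordinate is stored
    with a 1-bit quantizer, so bits per original weight = lifted / original.
    """
    return lifted_dim / original_dim


if __name__ == "__main__":
    bits = effective_bits(lifted_dim=12, original_dim=5)  # hypothetical dims
    print(f"effective bit-width: {bits:.2f}")
    print(f"70B weights at that width: {weight_footprint_gb(70e9, bits):.1f} GB")
    print(f"70B weights at 3 bits:     {weight_footprint_gb(70e9, 3):.2f} GB")
```

Under these assumptions a 70B model at 2.4 bits needs about 21 GB for weights, which leaves some headroom on a 24GB device, whereas an integer 3-bit width would not fit.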
CV · Aug 21, 2023
QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection
Yifan Zhang, Zhen Dong, Huanrui Yang et al. · berkeley
Multi-view 3D detection based on BEV (bird-eye-view) has recently achieved significant improvements. However, the huge memory consumption of state-of-the-art models makes it hard to deploy them on vehicles, and the non-trivial latency will affect the real-time perception of streaming applications. Despite the wide application of quantization to lighten models, we show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation. To solve these issues, our method QD-BEV enables a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance by leveraging both image features and BEV features. Our experiments show that QD-BEV achieves similar or even better accuracy than previous methods with significant efficiency gains. On the nuScenes datasets, the 4-bit weight and 6-bit activation quantized QD-BEV-Tiny model achieves 37.2% NDS with only 15.8 MB model size, outperforming BevFormer-Tiny by 1.8% with an 8x model compression. On the Small and Base variants, QD-BEV models also perform superbly and achieve 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively.
CV · Dec 6, 2022
CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification
Lirui Xiao, Huanrui Yang, Zhen Dong et al. · berkeley
Mixed-precision quantization has been widely applied to deep neural networks (DNNs) as it leads to significantly better efficiency-accuracy tradeoffs compared to uniform quantization. Meanwhile, determining the exact precision of each layer remains challenging. Previous attempts at bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. In this work, we propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability. CSQ stabilizes the bit-level mixed-precision training process with a bi-level gradual continuous sparsification on both the bit values of the quantized weights and the bit selection in determining the quantization precision of each layer. The continuous sparsification scheme enables fully-differentiable training without gradient approximation while achieving an exact quantized model in the end. A budget-aware regularization of total model size enables the dynamic growth and pruning of each layer's precision towards a mixed-precision quantization scheme of the desired size. Extensive experiments show CSQ achieves a better efficiency-accuracy tradeoff than previous methods on multiple models and datasets.
CL · Dec 20, 2022
A Measure-Theoretic Characterization of Tight Language Models
Li Du, Lucas Torroba Hennigen, Tiago Pimentel et al. · cambridge, microsoft-research
Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can "leak" onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
CL · Aug 30, 2023 · Code
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
Qingyuan Li, Yifan Zhang, Liang Li et al.
In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. Being a prevalent compression technique, quantization has emerged as the mainstream practice to tackle this issue, which is mainly centered on two recipes, W8A8 and W4A16 (i.e., weights and activations in such bit widths). In this study, we propose a novel W4A8 post-training quantization method for the available open-sourced LLMs, which combines the advantages of both recipes. Therefore, we can leverage the benefit in the I/O utilization of 4-bit weight quantization and the acceleration due to 8-bit matrix computation. Nevertheless, W4A8 faces notorious performance degradation. As a remedy, we involve layerwise activation quantization strategies which feature a novel logarithmic equalization for the most intractable layers, and we combine them with fine-grained weight quantization. Without bells and whistles, we eliminate the necessity for further fine-tuning and obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks. We confirm that W4A8 quantization is achievable for the deployment of large language models, fostering their widespread real-world applications.
CL · Jun 29, 2023
Tokenization and the Noiseless Channel
Vilém Zouhar, Clara Meister, Juan Luis Gastaldi et al. · eth-zurich
Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with $\alpha = 2.5$ has a very strong correlation with BLEU: $0.78$ in comparison to just $-0.32$ for compressed length.
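The efficiency ratio described above can be illustrated on a toy token stream. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the standard Rényi entropy $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ (Shannon entropy at $\alpha = 1$), and it normalizes by the log of the number of observed token types, which is an assumption standing in for the vocabulary size. The function names are hypothetical.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha):
    """Rényi entropy in nats; alpha == 1 gives the Shannon entropy."""
    probs = [p for p in probs if p > 0]
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1 - alpha)


def efficiency(tokens, alpha=2.5):
    """Entropy of the unigram token distribution divided by its maximum.

    Assumption: the maximum is log(number of observed types); a real
    tokenizer evaluation would use the tokenizer's vocabulary size.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        raise ValueError("need at least two distinct token types")
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return renyi_entropy(probs, alpha) / math.log(len(probs))


if __name__ == "__main__":
    balanced = ["a", "b", "c"] * 10
    skewed = ["a"] * 8 + ["b", "c"]
    for name, toks in [("balanced", balanced), ("skewed", skewed)]:
        print(name, efficiency(toks, alpha=1), efficiency(toks, alpha=2.5))
```

A perfectly balanced distribution scores 1 under both definitions, while for a skewed distribution the $\alpha = 2.5$ score falls below the Shannon score, since Rényi entropy is non-increasing in $\alpha$.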
CL · Jun 29, 2023
A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar, Clara Meister, Juan Luis Gastaldi et al. · eth-zurich
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(N M)$ to $\mathcal{O}(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
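For readers unfamiliar with the greedy procedure the abstract refers to, the following is a minimal sketch of the textbook iterative version: count adjacent pairs, merge the most frequent one, repeat. It is the naive variant that rescans the sequence on every merge, not the faster implementation the paper proposes, and the function names are illustrative only.

```python
from collections import Counter


def merge_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Apply up to `num_merges` greedy merges; return (sequence, merges)."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        # Note: overlapping pairs (e.g. "aaa") are counted twice here.
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break  # fewer than two symbols left
        # Ties are broken by first occurrence (Counter.most_common order).
        pair, _count = pair_counts.most_common(1)[0]
        merges.append(pair)
        seq = merge_pair(seq, pair, pair[0] + pair[1])
    return seq, merges


if __name__ == "__main__":
    seq, merges = greedy_bpe("abababcab", num_merges=2)
    print(merges)
    print(seq)
```

Each iteration costs a full pass over the sequence, which is where the $\mathcal{O}(NM)$ runtime of the naive version comes from.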
AI · Sep 11, 2023
Quantifying and Attributing the Hallucination of Large Language Models via Association Analysis
Li Du, Yequan Wang, Xingrun Xing et al. · tencent-ai, tsinghua
Although demonstrating superb performance on various NLP tasks, large language models (LLMs) still suffer from the hallucination problem, which threatens the reliability of LLMs. To measure the level of hallucination of LLMs, previous works first categorize the hallucination according to the phenomenon similarity, then quantify the proportion that model outputs contain hallucinatory contents. However, such hallucination rates could easily be distorted by confounders. Moreover, such hallucination rates could not reflect the reasons for the hallucination, as similar hallucinatory phenomena may originate from different sources. To address these issues, we propose to combine the hallucination level quantification and hallucination reason investigation through an association analysis, which builds the relationship between the hallucination rate of LLMs with a set of risk factors. In this way, we are able to observe the hallucination level under each value of each risk factor, examining the contribution and statistical significance of each risk factor, meanwhile excluding the confounding effect of other factors. Additionally, by recognizing the risk factors according to a taxonomy of model capability, we reveal a set of potential deficiencies in commonsense memorization, relational reasoning, and instruction following, which may further provide guidance for the pretraining and supervised fine-tuning process of LLMs to mitigate the hallucination.
AI · May 12, 2022
e-CARE: a New Dataset for Exploring Explainable Causal Reasoning
Li Du, Xiao Ding, Kai Xiong et al.
Understanding causality has vital importance for various Natural Language Processing (NLP) applications. Beyond the labeled instances, conceptual explanations of the causality can provide deep understanding of the causal facts to facilitate the causal reasoning process. However, such explanation information still remains absent in existing causal reasoning resources. In this paper, we fill this gap by presenting a human-annotated explainable CAusal REasoning dataset (e-CARE), which contains over 21K causal reasoning questions, together with natural language formed explanations of the causal questions. Experimental results show that generating valid explanations for causal facts still remains especially challenging for the state-of-the-art models, and the explanation information can be helpful for promoting the accuracy and stability of causal reasoning models.
CV · Dec 1, 2022
BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for BEV 3D Object Detection
Jianing Li, Ming Lu, Jiaming Liu et al. · pku
Recently, Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection, which has demonstrated promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information makes current approaches adopt large models for good performance. Therefore, it is essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques to train efficient yet accurate models. However, BEV KD is still under-explored to the best of our knowledge. Different from image classification tasks, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer the knowledge in the teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfying results due to heavy background information in RGB cameras. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points to BEV space and generate the foreground mask and view-dependent mask for the teacher-student paradigm. It is to be noted that our method only uses LiDAR points to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation to our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released.
CV · Aug 26, 2022
Uncertainty Guided Depth Fusion for Spike Camera
Jianing Li, Jiaming Liu, Xiaobao Wei et al. · pku
Depth estimation is essential for various important real-world applications such as autonomous driving. However, it suffers from severe performance degradation in high-velocity scenario since traditional cameras can only capture blurred images. To deal with this problem, the spike camera is designed to capture the pixel-wise luminance intensity at high frame rate. However, depth estimation with spike camera remains very challenging using traditional monocular or stereo depth estimation algorithms, which are based on the photometric consistency. In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo depth estimation networks for spike camera. Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range while monocular spike depth estimation obtains better results at long range. Therefore, we introduce a dual-task depth estimation architecture with a joint training strategy and estimate the distributed uncertainty to fuse the monocular and stereo results. In order to demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K, which contains 20K paired samples, for spike depth estimation. UGDF achieves state-of-the-art results on CitySpike20K, surpassing all monocular or stereo spike depth estimation baselines. We conduct extensive experiments to evaluate the effectiveness and generalization of our method on CitySpike20K. To the best of our knowledge, our framework is the first dual-task fusion framework for spike camera depth estimation. Code and dataset will be released.
CL · Nov 7, 2023
Formal Aspects of Language Modeling
Ryan Cotterell, Anej Svete, Clara Meister et al. · allen-ai, eth-zurich
Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. Consequently, it is important for both developers and researchers alike to understand the mathematical foundations of large language models, as well as how to implement them. These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models, covering what constitutes a language model from a formal, theoretical perspective.
CL · May 22, 2022
A Graph Enhanced BERT Model for Event Prediction
Li Du, Xiao Ding, Yue Zhang et al.
Predicting the subsequent event for an existing event context is an important but challenging task, as it requires understanding the underlying relationship between events. Previous methods propose to retrieve relational features from an event graph to enhance the modeling of event correlation. However, the sparsity of the event graph may restrict the acquisition of relevant graph information, and hence influence the model performance. To address this issue, we consider automatically building the event graph using a BERT model. To this end, we incorporate an additional structured variable into BERT to learn to predict the event connections in the training process. Hence, in the test process, the connection relationship for unseen events can be predicted by the structured variable. Results on two event prediction tasks, script event prediction and story ending prediction, show that our approach can outperform state-of-the-art baseline methods.
LG · Aug 21, 2022
DiscrimLoss: A Universal Loss for Hard Samples and Incorrect Samples Discrimination
Tingting Wu, Xiao Ding, Hao Zhang et al.
Given data with label noise (i.e., incorrect data), deep neural networks would gradually memorize the label noise and impair model performance. To relieve this issue, curriculum learning is proposed to improve model performance and generalization by ordering training samples in a meaningful (e.g., easy to hard) sequence. Previous work takes incorrect samples as generic hard ones without discriminating between hard samples (i.e., hard samples in correct data) and incorrect samples. Indeed, a model should learn from hard samples to promote generalization rather than overfit to incorrect ones. In this paper, we address this problem by appending a novel loss function DiscrimLoss, on top of the existing task loss. Its main effect is to automatically and stably estimate the importance of easy samples and difficult samples (including hard and incorrect samples) at the early stages of training to improve the model performance. Then, during the following stages, DiscrimLoss is dedicated to discriminating between hard and incorrect samples to improve the model generalization. Such a training strategy can be formulated dynamically in a self-supervised manner, effectively mimicking the main principle of curriculum learning. Experiments on image classification, image regression, text sequence regression, and event relation reasoning demonstrate the versatility and effectiveness of our method, particularly in the presence of diversified noise levels.
AI · Dec 16, 2022
ReCo: Reliable Causal Chain Reasoning via Structural Causal Recurrent Neural Networks
Kai Xiong, Xiao Ding, Zhongyang Li et al.
Causal chain reasoning (CCR) is an essential ability for many decision-making AI systems, which requires the model to build reliable causal chains by connecting causal pairs. However, CCR suffers from two main transitive problems: threshold effect and scene drift. In other words, the causal pairs to be spliced may have a conflicting threshold boundary or scenario. To address these issues, we propose a novel Reliable Causal chain reasoning framework (ReCo), which introduces exogenous variables to represent the threshold and scene factors of each causal pair within the causal chain, and estimates the threshold and scene contradictions across exogenous variables via structural causal recurrent neural networks (SRNN). Experiments show that ReCo outperforms a series of strong baselines on both Chinese and English CCR datasets. Moreover, by injecting reliable causal chain knowledge distilled by ReCo, BERT can achieve better performance on four downstream causal-related tasks than BERT models enhanced by other kinds of knowledge.
CL · Mar 20 · Code
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Xiang Long, Li Du, Yilong Xu et al.
LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.
CL · Oct 18, 2022
Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning
Shuo Xie, Jiahao Qiu, Ankita Pasad et al.
While transferring a pretrained language model, common approaches conventionally attach their task-specific classifiers to the top layer and adapt all the pretrained layers. We investigate whether one could make a task-specific selection on which subset of the layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g. fine-tuning or adapter-tuning) without sacrificing its performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute and doesn't need any training or hyperparameter tuning. It is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric can yield significantly stronger performance than using the same number of top layers and often match the performance of fine-tuning or adapter-tuning the entire language model.
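The idea of comparing within-class to between-class variability of a layer's hidden states can be sketched as follows. This is a deliberately simplified scalar ratio, assumed here for illustration only; the paper's actual metric may differ, and the function names and toy vectors are hypothetical. Lower values indicate that examples of the same class cluster tightly relative to the spread between class means.

```python
def mean_vec(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def variability_ratio(hidden_states, labels):
    """Within-class variance divided by between-class variance (simplified)."""
    by_class = {}
    for h, y in zip(hidden_states, labels):
        by_class.setdefault(y, []).append(h)
    class_means = {y: mean_vec(hs) for y, hs in by_class.items()}
    global_mean = mean_vec(hidden_states)
    within = sum(
        sq_dist(h, class_means[y]) for h, y in zip(hidden_states, labels)
    ) / len(hidden_states)
    between = sum(
        sq_dist(m, global_mean) for m in class_means.values()
    ) / len(class_means)
    if between == 0:
        raise ValueError("class means coincide; ratio is undefined")
    return within / between


if __name__ == "__main__":
    states = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
    print(variability_ratio(states, [0, 0, 1, 1]))  # classes well separated
    print(variability_ratio(states, [0, 1, 0, 1]))  # classes interleaved
```

In a layer-selection setting, one would compute such a score per layer on a task-specific corpus and prefer layers with low ratios; no training or hyperparameter tuning is involved, matching the abstract's description of the metric as cheap to compute.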
LG · Aug 27, 2024 · Code
PAT: Pruning-Aware Tuning for Large Language Models
Yijiang Liu, Huanrui Yang, Youxin Chen et al.
Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning into fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving model performance to the maximum extent. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves a 1.33x speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost. Code: https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning
CL · Aug 23, 2024
Causal-Guided Active Learning for Debiasing Large Language Models
Li Du, Zhouhao Sun, Xiao Ding et al.
Although achieving promising performance, recent analyses show that current generative large language models (LLMs) may still capture dataset biases and utilize them for generation, leading to poor generalizability and harmfulness of LLMs. However, due to the diversity of dataset biases and the over-optimization problem, previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. To address this issue, we explore combining active learning with causal mechanisms and propose a causal-guided active learning (CAL) framework, which utilizes the LLMs themselves to automatically and autonomously identify informative biased samples and induce the bias patterns. Then a cost-effective and efficient in-context learning based method is employed to prevent LLMs from utilizing dataset biases during generation. Experimental results show that CAL can effectively recognize typical biased instances and induce various bias patterns for debiasing LLMs.
CL · Oct 19, 2023
On the Representational Capacity of Recurrent Neural Language Models
Franz Nowak, Anej Svete, Li Du et al. · allen-ai, eth-zurich
This work investigates the computational expressivity of language models (LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) famously showed that RNNs with rational weights and hidden states and unbounded computation time are Turing complete. However, LMs define weightings over strings in addition to just (unweighted) language membership and the analysis of the computational power of RNN LMs (RLMs) should reflect this. We extend the Turing completeness result to the probabilistic case, showing how a rationally weighted RLM with unbounded computation time can simulate any deterministic probabilistic Turing machine (PTM) with rationally weighted transitions. Since, in practice, RLMs work in real-time, processing a symbol at every time step, we treat the above result as an upper bound on the expressivity of RLMs. We also provide a lower bound by showing that under the restriction to real-time computation, such models can simulate deterministic real-time rational PTMs.
AI · May 2
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
Zhouhao Sun, Xuan Zhang, Xiao Ding et al.
Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to possess capabilities for detecting process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To mitigate this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRM performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is found to be markedly weaker by comparison. (2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational errors. We hope GR-Ben can foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
LG · Sep 24, 2024
Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns
Yang Zhao, Li Du, Xiao Ding et al.
LLMs' performance on complex tasks is still unsatisfactory. A key issue is that LLMs presently learn in a data-driven schema, while the instructions about these complex tasks are both scarce and hard to collect or construct. In contrast, a prominent phenomenon is that LLMs can learn rather fast on simpler tasks with adequate prior knowledge captured during the pretraining stage. Thus, if the prerequisites and mechanism of such rapid generalization could be elucidated, it could enhance the efficiency and effectiveness of the LLM's ability to learn complex tasks. In this paper, we therefore employ a gradient-based method to dissect how the SFT process adapts LLMs to downstream tasks from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, experiments are conducted to actually enhance the efficiency and effectiveness of SFT.
LG · Jul 5, 2024
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
Xingrun Xing, Boyan Gao, Zheng Zhang et al.
Recent advancements in large language models (LLMs) with billions of parameters have improved performance in various applications, but their inference processes demand significant energy and computational resources. In contrast, the human brain, with approximately 86 billion neurons, is much more energy-efficient than LLMs with a similar number of parameters. Inspired by this, we redesign 7 to 70 billion parameter LLMs using bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose the first spiking large language model, SpikeLLM. Coupled with the proposed model, two essential approaches are proposed to improve spike training efficiency: Generalized Integrate-and-Fire (GIF) neurons to compress spike length from $T$ to $\frac{T}{L} \log_2 L$ bits, and an Optimal Brain Spiking framework to divide outlier channels and allocate different $T$ for GIF neurons, which further compresses spike length to approximately $\log_2 T$ bits. The necessity of spike-driven LLMs is demonstrated by comparison with quantized LLMs with similar operations. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 11.01% and improves the accuracy of common scene reasoning by 2.55% on a LLAMA-7B W4A4 model. In the GPTQ pipeline, SpikeLLM achieves direct additive in linear layers, significantly exceeding PB-LLMs.
CLAug 14, 2022
Text Difficulty Study: Do machines behave the same as humans regarding text difficulty?Bowen Chen, Xiao Ding, Li Du et al.
Given a task, humans learn from easy to hard, whereas models learn in random order. Undeniably, difficulty-insensitive learning has led to great success in NLP, but little attention has been paid to the effect of text difficulty in NLP. In this research, we propose the Human Learning Matching Index (HLM Index) to investigate the effect of text difficulty. Experimental results show: (1) LSTM has more human-like learning behavior than BERT. (2) UID-SuperLinear gives the best evaluation of text difficulty among four text difficulty criteria. (3) Among nine tasks, performance on some is related to text difficulty, whereas performance on others is not. (4) A model trained on easy data performs best on easy and medium data, whereas a model trained on hard data performs well only on hard data. (5) Training the model from easy to hard leads to fast convergence.
CLJun 5, 2023
Structured Voronoi SamplingAfra Amini, Li Du, Ryan Cotterell
Gradient-based sampling algorithms have demonstrated their effectiveness in text generation, especially in the context of controlled text generation. However, there exists a lack of theoretically grounded and principled approaches for this task. In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods. We use discrete distributions given by language models to define densities and develop an algorithm based on Hamiltonian Monte Carlo to sample from them. We name our gradient-based technique Structured Voronoi Sampling (SVS). In an experimental setup where the reference distribution is known, we show that the empirical distribution of SVS samples is closer to the reference distribution compared to alternative sampling schemes. Furthermore, in a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods.
CLAug 21, 2024
Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful LearningKai Xiong, Xiao Ding, Li Du et al.
Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability by mining and learning information from extensive unlabeled text. However, they still exhibit reasoning mistakes, often stemming from knowledge deficiencies, which can affect their trustworthiness and reliability. Although users can provide diverse and comprehensive queries, obtaining sufficient and effective feedback is demanding. Furthermore, evaluating LLMs comprehensively with limited labeled samples is difficult. This makes it a challenge to diagnose and remedy the deficiencies of LLMs through rich label-free user queries. To tackle this challenge, we propose a label-free curricular meaningful learning framework (LaMer). LaMer first employs relative entropy to automatically diagnose and quantify the knowledge deficiencies of LLMs in a label-free setting. Next, to remedy the diagnosed knowledge deficiencies, we apply curricular meaningful learning: first, we adopt meaningful learning to adaptively synthesize augmentation data according to the severity of the deficiencies, and then design a curricular deficiency remedy strategy to remedy the knowledge deficiencies of LLMs progressively. Experiments show that LaMer efficiently and effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning and language understanding benchmarks, achieving comparable results to baselines with just 40\% training data. LaMer even surpasses methods that rely on labeled datasets for deficiency diagnosis. In application, our label-free method can offer an effective knowledge deficiency diagnostic tool for efficient LLM development.
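The abstract above names relative entropy (KL divergence) as the diagnostic signal but does not spell out which distributions are compared. The sketch below only illustrates the quantity itself; the idea of comparing a model's answer distribution with and without a piece of knowledge supplied, and all the numbers, are assumptions made for illustration rather than LaMer's actual procedure.

```python
import math

def relative_entropy(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # eps guards against division by zero when q assigns no mass.
            total += pi * math.log(pi / max(qi, eps))
    return total

# Hypothetical answer distributions for one query (assumed numbers):
# the model alone, and the model with a relevant fact placed in the prompt.
without_knowledge = [0.25, 0.25, 0.25, 0.25]
with_knowledge = [0.70, 0.10, 0.10, 0.10]

# A large gap would suggest the fact changes the answer a lot, i.e. the
# model may be missing it; a gap near zero suggests it already knows it.
gap = relative_entropy(with_knowledge, without_knowledge)
```

Because the score needs only model outputs and no gold labels, a quantity of this kind fits the label-free setting the abstract describes.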
CLDec 31, 2025
Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge StatementsYiming Liang, Yizhi Li, Yantao Du et al.
Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution--reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
CLApr 17, 2025Code
Syntactic and Semantic Control of Large Language Models via Sequential Monte CarloJoão Loula, Benjamin LeBrun, Li Du et al.
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution -- which can differ substantially from the LM's base distribution -- is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains -- Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis -- we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
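To make the propose, weight, and resample loop of sequential Monte Carlo concrete, here is a toy sketch. The three-token vocabulary, the fixed next-token probabilities, and the "never emit c" constraint are all invented for illustration; the paper's system works with real language models and richer potentials.

```python
import random

VOCAB = ["a", "b", "c"]
LM_PROBS = [0.5, 0.3, 0.2]  # toy "LM": the same next-token distribution at every step

def potential(prefix):
    """Constraint expressed as a 0/1 potential: the text must never contain 'c'."""
    return 0.0 if "c" in prefix else 1.0

def smc_generate(num_particles=50, steps=6, seed=0):
    rng = random.Random(seed)
    particles = [""] * num_particles
    for _ in range(steps):
        # Propose: extend every particle with one token sampled from the base LM.
        particles = [p + rng.choices(VOCAB, weights=LM_PROBS)[0] for p in particles]
        # Weight: score each extended prefix against the constraint.
        weights = [potential(p) for p in particles]
        if sum(weights) == 0.0:
            raise RuntimeError("all particles violated the constraint")
        # Resample: reallocate computation from dead particles to surviving ones.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return particles

samples = smc_generate()
```

Resampling at every step is the simplest choice; practical implementations often resample only when the effective sample size drops, which preserves more particle diversity.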
LGJul 3, 2024
SFC: Achieve Accurate Fast Convolution under Low-precision ArithmeticLiulu He, Yufei Zhao, Rui Gao et al.
Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we propose SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational numbers and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.
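The "invalid circular convolution outputs" mentioned above can be seen in a small example. The sketch below uses an ordinary floating-point DFT, not SFC's symbolic transform or its correction terms, and the signal and kernel values are arbitrary. It shows that multiplying DFTs yields a circular convolution whose first outputs wrap around, while the remaining outputs agree with the ordinary linear convolution.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_conv(a, b):
    """Circular convolution of equal-length sequences via the convolution theorem."""
    prod = [x * y for x, y in zip(dft(a), dft(b))]
    return [round(v.real, 9) for v in dft(prod, inverse=True)]

def linear_conv(a, b):
    """Direct (full) linear convolution, used here as the reference."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 1.0, 1.0, 0.0]          # 3-tap filter zero-padded to length 4

circ = circular_conv(signal, kernel)
lin = linear_conv(signal, kernel[:3])

# With a 3-tap kernel, the first 2 circular outputs mix in wrapped-around
# terms; only outputs at indices 2 and 3 equal the linear convolution.
valid = circ[2:]
wrapped = circ[:2]
```

For a 4-point DFT the twiddle factors are 1, -j, -1, and j, so the transform itself needs no irrational multiplications, which is the kind of transform point the abstract refers to.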
CLJun 9, 2025Code
Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language ModelsJijie Li, Li Du, Hanyu Zhao et al.
Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
AIJan 12
Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient ConcentrationYang Zhao, Yangou Ouyang, Xiao Ding et al.
While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22$\times$. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
LGJan 12
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward OptimizationYang Zhao, Hepeng Wang, Xiao Ding et al.
Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives - such as creativity versus factuality - where rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal. Across seven benchmarks, MAESTRO consistently outperforms single-reward and static multi-objective baselines, while preserving the efficiency advantages of GRPO, and in some settings even reducing redundant generation.
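The interaction between reward scalarization and group-relative advantages can be sketched in a few lines. The group-relative advantage below follows the usual GRPO normalization (reward minus group mean, divided by group standard deviation); the two reward dimensions, their values, and the weight logits are assumed for illustration, and the learned Conductor network is replaced here by hand-picked logits.

```python
import math
import statistics

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scalarize(reward_vectors, weights):
    """Collapse each multi-objective reward vector into one scalar."""
    return [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Assumed (creativity, factuality) rewards for four completions of one prompt.
group_rewards = [(0.9, 0.2), (0.4, 0.8), (0.6, 0.6), (0.1, 0.9)]

# Hand-picked stand-ins for context-dependent scalarization weights.
creative_weights = softmax([2.0, 0.0])
factual_weights = softmax([0.0, 2.0])

adv_creative = group_relative_advantages(scalarize(group_rewards, creative_weights))
adv_factual = group_relative_advantages(scalarize(group_rewards, factual_weights))
```

The same four completions receive opposite rankings under the two weightings, which is why a single static weight vector can be a poor fit when prompts differ in what they prioritize.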
CVMay 18
MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time AdaptationRonyu Zhang, Aosong Cheng, Gaole Dai et al.
Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
LGNov 2, 2025
Tool Zero: Training Tool-Augmented LLMs via Pure RL from ScratchYirong Zeng, Xiao Ding, Yutai Hou et al.
Training tool-augmented LLMs has emerged as a promising approach to enhancing language models' capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, the reinforcement learning (RL) paradigm has been shown to endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: Can pure RL be used to effectively elicit a model's intrinsic reasoning capabilities and enhance tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series models. These models are trained to enable LLMs to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.
CLAug 13, 2024
AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out StrategiesBo-Wen Zhang, Liangdong Wang, Ye Yuan et al.
In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
ARMar 31
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory NetworkAojie Jiang, Kang Zhu, Zhiheng Zhang et al.
In-network computing techniques, exemplified by NVLink Sharp (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations, such as All-Reduce, to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger reduction operations, which means that the data reduced in the switch must be additionally transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS must operate at FP16/BF16 precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for shared-memory networks of AI accelerators, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of initiating memory-semantic operations for in-network processing, together with a co-designed communication fabric that incurs negligible protocol overhead. By eliminating redundant data movement, SCIN delivers lower All-Reduce latency than NVLS. Moreover, by integrating a quantization module into the ISA, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a prototype of SCIN on a multi-FPGA system to demonstrate its feasibility and effectiveness. Experimental results show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, leading to up to 1.74x faster TTFT and 1.34x faster TPOT on LLaMA-2 models.
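The bandwidth argument for 8-bit All-Reduce can be checked with a small simulation. The sketch below uses a simple symmetric per-vector int8 scheme chosen for illustration; it is not SCIN's INQ design, and the vector values, the one-scale-per-vector layout, and the 4-byte scale size are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-vector quantization to integer codes in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def quantized_all_reduce(rank_vectors):
    """Sum the vectors of all ranks after each is sent as int8 codes plus a scale."""
    total = [0.0] * len(rank_vectors[0])
    for vec in rank_vectors:
        codes, scale = quantize_int8(vec)          # what a rank would put on the wire
        for i, v in enumerate(dequantize(codes, scale)):
            total[i] += v                          # accumulation at the reducer
    return total

ranks = [[0.5, -1.25, 2.0], [1.5, 0.25, -0.75], [-0.25, 0.5, 1.0]]
exact = [sum(col) for col in zip(*ranks)]
approx = quantized_all_reduce(ranks)
max_error = max(abs(a - e) for a, e in zip(approx, exact))

# Payload per rank for a 4096-element vector.
elements = 4096
fp16_bytes = elements * 2
int8_bytes = elements * 1 + 4   # assumed: one 4-byte scale per vector
payload_ratio = fp16_bytes / int8_bytes
```

The payload shrinks by slightly less than 2x because of the scale metadata, which is consistent with the abstract's "nearly doubling" of bandwidth.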
LGMar 30
Key-Embedded Privacy for Decentralized AI in Biomedical OmicsRongyu Zhang, Hongyu Dong, Gaole Dai et al.
The rapid adoption of data-driven methods in biomedicine has intensified concerns over privacy, governance, and regulation, limiting raw data sharing and hindering the assembly of representative cohorts for clinically relevant AI. This landscape necessitates practical, efficient privacy solutions, as cryptographic defenses often impose heavy overhead and differential privacy can degrade performance, leading to sub-optimal outcomes in real-world settings. Here, we present a lightweight federated learning method, INFL, based on Implicit Neural Representations that addresses these challenges. Our approach integrates plug-and-play, coordinate-conditioned modules into client models, embeds a secret key directly into the architecture, and supports seamless aggregation across heterogeneous sites. Across diverse biomedical omics tasks, including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics with both public and private data, we demonstrate that INFL achieves strong, controllable privacy while maintaining utility, preserving the performance necessary for downstream scientific and clinical applications.
DCJan 7
A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP SystemsQi Wu, Chao Fang, Jiayuan Chen et al.
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously towards edge low-batch scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve 2.41x on average and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
CLJul 29, 2025Code
AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement LearningYifan Wei, Xiaoyan Yu, Yixuan Weng et al.
Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibits superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.
LGApr 13, 2024Code
T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model FinetuningRongyu Zhang, Yijiang Liu, Huanrui Yang et al.
Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. Mixture-of-experts (MoE) provides a promising solution with a dynamic architecture, enabling effective task decoupling. However, scaling up the number of MoE experts incurs substantial parameter and computational overheads and suffers from limited performance gain due to naive routing mechanisms. In this paper, we design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX), which leverages the combination of ultra-low rank experts to construct LoRA weights on pretrained LLMs. The rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal efficiency. In addition, T-REX offers implicit guidance to the router, leveraging the inherent semantic clustering of training embeddings as prior knowledge, enabling optimized feature allocation across experts for a smoother convergence. Extensive theoretical and empirical results demonstrate that T-REX achieves superior efficiency and generalizability across diverse tasks. Compared with other LoRA-based methods, T-REX achieves up to 1.78% mean accuracy improvement with around 30%-40% fewer trainable parameters across 14 public datasets. Code is available at https://github.com/RoyZry98/T-REX-Pytorch.
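One way to read the "quadratic expansion with linear parameter overhead" claim is that any stored column vector can be paired with any stored row vector to form a rank-1 expert. The sketch below illustrates that reading only; the vectors, gate values, and function name are hypothetical and are not taken from the T-REX code.

```python
def mix_rank_one_experts(us, vs, gates):
    """Weight update sum over (i, j) of gates[i][j] * outer(us[i], vs[j])."""
    rows, cols = len(us[0]), len(vs[0])
    delta = [[0.0] * cols for _ in range(rows)]
    for i, u in enumerate(us):
        for j, v in enumerate(vs):
            g = gates[i][j]
            if g == 0.0:
                continue  # unselected pairing contributes nothing
            for r in range(rows):
                for c in range(cols):
                    delta[r][c] += g * u[r] * v[c]
    return delta

# Three column vectors and three row vectors (assumed toy values, dimension 2).
us = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 1.0], [1.0, -1.0], [2.0, 0.0]]

# A hypothetical router output that selects two of the nine possible pairings.
gates = [[0.5, 0.0, 0.0],
         [0.0, 0.5, 0.0],
         [0.0, 0.0, 0.0]]
delta_w = mix_rank_one_experts(us, vs, gates)

n, d = len(us), len(us[0])
num_pairings = n * n              # 9 distinct rank-1 experts available
shared_params = 2 * n * d         # 12 numbers stored when vectors are shared
separate_params = n * n * 2 * d   # 36 numbers if every expert stored its own pair
```

Under this reading, the number of available experts grows as n squared while storage grows as n, and the summed update can still have rank above one when several pairings are active.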
CVJan 31, 2024Code
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity AnalysisJianing Li, Xi Nan, Ming Lu et al.
Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities, primarily due to the exceptional in-context understanding and multi-task learning strengths of large language models (LLMs). The advent of visual instruction tuning has further enhanced MLLMs' performance in vision-language understanding. However, while existing MLLMs adeptly recognize what objects are in an image, they still face challenges in effectively discerning where these objects are, particularly along the distance (scene depth) axis. To overcome this limitation in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images. The framework operates in two phases: the first phase focuses on guiding the models to understand the relative depth of objects, and the second phase further encourages the models to infer the proximity relationships between objects based on their depth perceptions. We also propose a VQA dataset called Proximity-110K, containing additional instructions that incorporate depth information and the proximity relationships of objects. We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis, outperforming other state-of-the-art MLLMs. Code and dataset will be released at https://github.com/NorthSummer/ProximityQA.git.
LGDec 14, 2023Code
BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual PolynomialsXingrun Xing, Li Du, Xinyuan Wang et al.
Pretrained foundation models offer substantial benefits for a wide range of downstream tasks and may be among the most promising routes toward artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge has brought about computational challenges, especially on resource-limited devices such as mobile phones. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which reduces operations by 56 times and memory by 28 times. In contrast to previous task-specific binary transformers, BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs), promoting BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of binarization error. To simulate full-precision self-attention, we define binarization error as binarization residual polynomials, and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, surpassing task-specific baselines by 15.4% average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation; consequently, it generalizes across various NLU tasks and simplifies the downstream pipeline of BNNs. Our code and pretrained models are publicly available at https://github.com/Xingrun-Xing/BiPFT.
CLSep 11, 2024
Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and DependencyHanyu Zhao, Li Du, Yiming Ju et al.
With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to fine-tune large language models (LLMs). Previous research mainly focuses on selecting individual high-quality instructions. However, these works overlook the joint interactions and dependencies between different categories of instructions, leading to suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, let alone how to optimize the instruction set with regard to them. To fill these gaps, in this paper, we: (1) systematically investigate interaction and dependency patterns between different categories of instructions, and (2) optimize the instruction set with respect to the interaction patterns using a linear programming-based method, and optimize the learning schema of SFT using curriculum learning guided by an instruction dependency taxonomy. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.
CLJan 7
Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data SynthesisYifan Wei, Li Du, Xiaoyan Yu et al.
Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
ARJun 27, 2025Code
Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into NetlistsHaohang Xu, Chengjie Liu, Qihang Wang et al.
Large language models (LLMs) exhibit great potential in the design of analog integrated circuits (ICs) because of their strength in abstracting and generalizing knowledge. However, further development of LLM-based analog IC design relies heavily on textual descriptions of analog ICs, while existing analog ICs are mostly illustrated as image-based circuit diagrams rather than text-based netlists. Converting circuit diagrams to netlists helps LLMs enrich their knowledge of analog ICs. Nevertheless, previously proposed conversion frameworks are difficult to apply further because of their limited support for image styles and circuit elements. Effectively converting complex circuit diagrams into netlists therefore remains a challenging task. To this end, this paper constructs and open-sources a new dataset with a rich variety of circuit diagram styles and a balanced distribution of simple and complex analog ICs. A hybrid framework, named Image2Net, is proposed for practical conversion from circuit diagrams to netlists. The netlist edit distance (NED) is also introduced to precisely assess the difference between the converted netlists and the ground truth. On our benchmark, Image2Net achieves an 80.77% success rate, which is 34.62%-45.19% higher than previous works. Specifically, the proposed work shows an average NED of 0.116, which is 62.1%-69.6% lower than state-of-the-art methods.
CVJan 28, 2025Code
SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice RepresentationJianing Li, Ming Lu, Hao Wang et al.
3D semantic occupancy prediction is a crucial task in visual perception, as it requires the simultaneous comprehension of both scene geometry and semantics. It plays an important role in understanding 3D scenes and has great potential for various applications, such as robotic vision perception and autonomous driving. Many existing works utilize planar-based representations such as Bird's Eye View (BEV) and Tri-Perspective View (TPV). These representations aim to simplify the complexity of 3D scenes while preserving essential object information, thereby facilitating efficient scene representation. However, in dense indoor environments with prevalent occlusions, directly applying these planar-based methods often leads to difficulties in capturing global semantic occupancy, ultimately degrading model performance. In this paper, we present a new vertical slice representation that divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes. To utilize these slice features, we propose SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. SliceOcc utilizes pairs of slice queries and cross-attention mechanisms to extract planar features from input images. These local planar features are then fused to form a global scene representation, which is employed for indoor occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves an mIoU of 15.45% across 81 indoor categories, setting a new state of the art among RGB camera-based models for indoor 3D semantic occupancy prediction. Code is available at https://github.com/NorthSummer/SliceOcc.
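For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed over per-voxel class labels. It is not taken from the SliceOcc code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are all assumptions made for illustration, and the EmbodiedScan evaluation protocol may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    `pred` and `target` are flat sequences of integer class labels, one per
    voxel (hypothetical layout chosen for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # voxels where both agree on class c
    pred_count = [0] * num_classes    # voxels predicted as class c
    target_count = [0] * num_classes  # voxels labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with three classes and six voxels.
    print(mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Averaging per class rather than per voxel is what makes the metric sensitive to rare categories.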
CVJul 2, 2025Code
CI-VID: A Coherent Interleaved Text-Video DatasetYiming Ju, Jijin Hu, Zhengxiong Luo et al.
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated T2V generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset. We release the CI-VID dataset and the accompanying code for data construction and evaluation at https://github.com/ymju-BAAI/CI-VID.
CLJun 8, 2025Code
Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language ModelsKai Xiong, Xiao Ding, Yixin Cao et al.
Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple knowledge (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark, Com$^2$, focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory (e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.