CVSep 25, 2023
Data Upcycling Knowledge Distillation for Image Super-ResolutionYun Zhang, Wei Li, Simiao Li et al.
Knowledge distillation (KD) compresses deep neural networks by transferring task-related knowledge from cumbersome pre-trained teacher models to compact student models. However, current KD methods for super-resolution (SR) networks overlook the nature of SR task that the outputs of the teacher model are noisy approximations to the ground-truth distribution of high-quality images (GT), which shades the teacher model's knowledge to result in limited KD effects. To utilize the teacher model beyond the GT upper-bound, we present the Data Upcycling Knowledge Distillation (DUKD), to transfer the teacher model's knowledge to the student model through the upcycled in-domain data derived from training data. Besides, we impose label consistency regularization to KD for SR by the paired invertible augmentations to improve the student model's performance and robustness. Comprehensive experiments demonstrate that the DUKD method significantly outperforms previous arts on several SR tasks.
CVMar 3Code
Toward Early Quality Assessment of Text-to-Image Diffusion ModelsHuanlei Guo, Hongxin Wei, Bingyi Jing
Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a ``generate--then--select'' mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure, object layout and spatial arrangement--that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20\% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60\% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
CLOct 24, 2024Code
ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language ModelsHengxiang Zhang, Hongfu Gao, Qiang Hu et al.
With the rapid development of Large language models (LLMs), understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of large language models. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark. Additionally, we release a test set comprising 200,000 examples, which is publicly accessible at https://huggingface.co/datasets/SUSTech/ChineseSafe.
DCMay 11
Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model TrainingTing Sun, Junjie Zhang, Xiao Yan et al.
Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present Lakestream, a brokerless, object-store-native training data plane with three key properties. First, it introduces the Transactional Global Batch (TGB), which builds on lakehouse-style ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Second, it realizes recovery and retention directly in the storage layer, by inlining producer state in the manifest and tying reclamation to distributed checkpoint state. Third, its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication. Evaluations on large-scale multimodal pre-training and SFT workloads using 64 GPUs show that Lakestream outperforms colocated dataloader throughput while providing full failure isolation, outperforms Apache Kafka in ingestion throughput, and achieves lower consumer read latency than Kafka.
CLFeb 12
Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM SystemsWanxing Wu, He Zhu, Yixia Li et al.
Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
AIJan 30
Conditional Performance Guarantee for Large Reasoning ModelsJianguo Huang, Hao Zeng, Bingyi Jing et al.
Large reasoning models have shown strong performance through extended chain-of-thought reasoning, yet their computational cost remains significant. Probably approximately correct (PAC) reasoning provides statistical guarantees for efficient reasoning by adaptively switching between thinking and non-thinking models, but the guarantee holds only in the marginal case and does not provide exact conditional coverage. We propose G-PAC reasoning, a practical framework that provides PAC-style guarantees at the group level by partitioning the input space. We develop two instantiations: Group PAC (G-PAC) reasoning for known group structures and Clustered PAC (C-PAC) reasoning for unknown groupings. We prove that both G-PAC and C-PAC achieve group-conditional risk control, and that grouping can strictly improve efficiency over marginal PAC reasoning in heterogeneous settings. Our experiments on diverse reasoning benchmarks demonstrate that G-PAC and C-PAC successfully achieve group-conditional risk control while maintaining substantial computational savings.
LGJan 30
HyPAC: Cost-Efficient LLMs-Human Hybrid Annotation with PAC Error GuaranteesHao Zeng, Huipeng Huang, Xinhao Qu et al.
Data annotation often involves multiple sources with different cost-quality trade-offs, such as fast large language models (LLMs), slow reasoning models, and human experts. In this work, we study the problem of routing inputs to the most cost-efficient annotation source while controlling the labeling error on test instances. We propose \textbf{HyPAC}, a method that adaptively labels inputs to the most cost-efficient annotation source while providing distribution-free guarantees on annotation error. HyPAC calibrates two decision thresholds using importance sampling and upper confidence bounds, partitioning inputs into three regions based on uncertainty and routing each to the appropriate annotation source. We prove that HyPAC achieves the minimum expected cost with a probably approximately correct (PAC) guarantee on the annotation error, free of data distribution and pre-trained models. Experiments on common benchmarks demonstrate the effectiveness of our method, reducing the annotation cost by 78.51\% while tightly controlling the annotation error.
AIJan 22
Natural Language-Driven Global Mapping of Martian LandformsYiran Wang, Shuoyuan Wang, Zhaoran Wei et al.
Planetary surfaces are typically analyzed using high-level semantic concepts in natural language, yet vast orbital image archives remain organized at the pixel level. This mismatch limits scalable, open-ended exploration of planetary surfaces. Here we present MarScope, a planetary-scale vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval, enabling arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978. Applications further show that it extends beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale. MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets.
AIJan 30
Anytime Safe PAC Efficient ReasoningChengyao Yu, Hao Zeng, Youxin Zhu et al.
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks but suffer from high computational costs and latency. While selective thinking strategies improve efficiency by routing easy queries to non-thinking models, existing approaches often incur uncontrollable errors, especially in online settings where the performance loss of a non-thinking model is only partially observed and data are non-stationary. To address this, we propose Betting Probably Approximately Correct (B-PAC) reasoning, a principled method that enables anytime safe and efficient online reasoning under partial feedback. Specifically, we utilize inverse propensity scoring estimators to construct test supermartingales for candidate thresholds, and then dynamically adjust the routing threshold based on the accumulated statistical evidence of safety. Theoretically, we establish the anytime-valid performance loss control and the efficiency of B-PAC reasoning. Extensive experiments demonstrate that B-PAC reasoning significantly reduces computational overhead, decreasing thinking model usage by up to 81.01\%, while controlling the performance loss below the user-specified level.
CLDec 8, 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual ObjectsJunyu Lu, Dixiang Zhang, Songxin Zhang et al.
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios. However, the absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors. In this paper, we propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration. Building on the foundation of BLIP-2, Lyrics infuses local visual features extracted from a visual refiner that includes image tagging, object detection and semantic segmentation modules into the Querying Transformer, while on the text side, the language inputs equip the boundary boxes and tags derived from the visual refiner. We further introduce a two-stage training scheme, in which the pre-training stage bridges the modality gap through explicit and comprehensive vision-language alignment targets. During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects. Our approach achieves robust performance on 13 datasets across various vision-language tasks, and demonstrates promising multi-modal understanding, perception and conversation capabilities in 11 scenario-based benchmark toolkits.
LGFeb 5, 2025
Parametric Scaling Law of Tuning Bias in Conformal PredictionHao Zeng, Kangdao Liu, Bingyi Jing et al.
Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional holdout set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, making it ambiguous in practical applications. In this work, we empirically find that the tuning bias - the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe the scaling law of the tuning bias: this bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof for the scaling law of the tuning bias by deriving its upper bound. In the end, we discuss how to reduce the tuning bias, guided by the theories we developed.
LGFeb 20, 2024
TorchCP: A Python Library for Conformal PredictionJianguo Huang, Jianqing Song, Xuanning Zhou et al.
Conformal prediction (CP) is a powerful statistical framework that generates prediction intervals or sets with guaranteed coverage probability. While CP algorithms have evolved beyond traditional classifiers and regressors to sophisticated deep learning models like deep neural networks (DNNs), graph neural networks (GNNs), and large language models (LLMs), existing CP libraries often lack the model support and scalability for large-scale DL scenarios. This paper introduces TorchCP, a PyTorch-native library designed to integrate state-of-the-art CP algorithms into deep learning techniques, including DNN-based classifier/regressor, GNN, and LLM. Released under the LGPL-3.0 license, TorchCP comprises about 16k lines of code, validated with 100% unit test coverage and detailed documentation. Notably, TorchCP enables CP-specific training algorithms, online prediction, and GPU-accelerated batch processing, achieving up to 90% reduction in inference time on large datasets. With its low-coupling design, comprehensive suite of advanced methods, and full GPU scalability, TorchCP empowers researchers and practitioners to enhance uncertainty quantification across cutting-edge applications.
CVApr 3, 2024
Knowledge Distillation with Multi-granularity Mixture of Priors for Image Super-ResolutionSimiao Li, Yun Zhang, Wei Li et al.
Knowledge distillation (KD) is a promising yet challenging model compression technique that transfers rich learning representations from a well-performing but cumbersome teacher model to a compact student model. Previous methods for image super-resolution (SR) mostly compare the feature maps directly or after standardizing the dimensions with basic algebraic operations (e.g. average, dot-product). However, the intrinsic semantic differences among feature maps are overlooked, which are caused by the disparate expressive capacity between the networks. This work presents MiPKD, a multi-granularity mixture of prior KD framework, to facilitate efficient SR model through the feature mixture in a unified latent space and stochastic network block mixture. Extensive experiments demonstrate the effectiveness of the proposed MiPKD method.
LGFeb 8, 2024
Exploring Learning Complexity for Efficient Downstream Dataset PruningWenyu Jiang, Zhenlong Liu, Zejian Xie et al.
The ever-increasing fine-tuning cost of large-scale pre-trained models gives rise to the importance of dataset pruning, which aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC), to identify informative images and instructions from the downstream dataset efficiently. Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and utilize a lightweight weights masking process for fast estimation, instead of the costly SGD optimization. Based on DLC, we further design a flexible under-sampling with randomness (dubbed FlexRand), replacing the top-K strategy, to alleviate the severe subset distribution shift. Extensive experiments with downstream image and instructions dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. In the images pruning benchmark, DLC significantly reduces the pruning time by 35x while establishing state-of-the-art performance with FlexRand.
AIOct 10, 2025
PAC Reasoning: Controlling the Performance Loss for Efficient ReasoningHao Zeng, Jianguo Huang, Bingyi Jing et al.
Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A popular direction of efficiency improvement is to switch the LRM between thinking and nonthinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning that controls the performance loss under the user-specified performance loss tolerance. In particular, we construct an upper confidence bound on the performance loss, formulated as a monotone function of the uncertainty score, and subsequently determine a threshold for switching to the nonthinking model. Theoretically, using the threshold to switch between the thinking and nonthinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method can save computational budgets and control the user-specified performance loss.
AIOct 9, 2025
Multi-Condition Conformal SelectionQingyang Hao, Wenbo Liao, Bingyi Jing et al.
Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y > c) and overlooks practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini-Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.
LGMay 27, 2025
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity ScoreXuanning Zhou, Hao Zeng, Xiaobo Xia et al.
Conformal prediction (CP) is a powerful framework for uncertainty quantification, providing prediction sets with coverage guarantees when calibrated on sufficient labeled data. However, in real-world applications where labeled data is often limited, standard CP can lead to coverage deviation and output overly large prediction sets. In this paper, we extend CP to the semi-supervised setting and propose SemiCP, leveraging both labeled data and unlabeled data for calibration. Specifically, we introduce a novel nonconformity score function, NNM, designed for unlabeled data. This function selects labeled data with similar pseudo-label scores to estimate nonconformity scores, integrating them into the calibration process to overcome sample size limitations. We theoretically demonstrate that, under mild assumptions, SemiCP provide asymptotically coverage guarantee for prediction sets. Extensive experiments further validate that our approach effectively reduces instability and inefficiency under limited calibration data, can be adapted to conditional coverage settings, and integrates seamlessly with existing CP methods.
CLFeb 6, 2025
Exploring Imbalanced Annotations for Effective In-Context LearningHongfu Gao, Feipeng Zhang, Hao Zeng et al.
Large language models (LLMs) have shown impressive performance on downstream tasks through in-context learning (ICL), which heavily relies on the demonstrations selected from annotated datasets. However, these datasets often exhibit long-tailed class distributions in real-world scenarios, leading to biased demonstration selection. In this work, we show that such class imbalances significantly degrade the ICL performance across various tasks, regardless of selection methods. Moreover, classical rebalancing methods, which focus solely on class weights, yield poor performance due to neglecting condition bias--skewed feature distributions within classes. To address this, we propose Reweighting with Conditional Bias (dubbed RCB), a simple and complementary approach to enhance ICL performance under class imbalance. In particular, RCB estimates conditional bias using a balanced subset and re-weights demonstration scores based on both class weight and conditional bias. In effect, RCB prevents over-selection from dominant classes while preserving the efficacy of current selection methods. Extensive experiments on common benchmarks demonstrate the effectiveness of our method, improving the average accuracy of current selection methods by up to 5.42%.
MLSep 2, 2017
Adaptive ScalingTing Li, Bingyi Jing, Ningchen Ying et al.
Preprocessing data is an important step before any data analysis. In this paper, we focus on one particular aspect, namely scaling or normalization. We analyze various scaling methods in common use and study their effects on different statistical learning models. We will propose a new two-stage scaling method. First, we use some training data to fit linear regression model and then scale the whole data based on the coefficients of regression. Simulations are conducted to illustrate the advantages of our new scaling method. Some real data analysis will also be given.