CVAug 25, 2023Code
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary DetectionYiyang Yao, Peng Liu, Tiancheng Zhao et al. · cmu
Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at \url{https://github.com/om-ai-lab/OVDEval}
CVJul 6, 2024Code
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video UnderstandingTiancheng Zhao, Qianqian Zhang, Kyusong Lee et al. · cmu
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a prompting strategy for unifying complex multimodal inputs including single image text, multi-image text and videos, and achieving competitive performance on single-image benchmarks. To further evaluate the model's capabilities, we proposed a benchmark dataset named Temporal Visual Needle in a Haystack. This dataset assesses OmChat's ability to comprehend temporal visual details within long videos. Our analysis highlights several key factors contributing to OmChat's success: support for any-aspect high image resolution, the active progressive pretraining strategy, and high-quality supervised fine-tuning datasets. This report provides a detailed overview of OmChat's capabilities and the strategies that enhance its performance in visual understanding.
CVApr 26, 2023Code
Non-rigid Point Cloud Registration for Middle Ear Diagnostics with Endoscopic Optical Coherence TomographyPeng Liu, Jonas Golde, Joseph Morgenstern et al.
Purpose: Middle ear infection is the most prevalent inflammatory disease, especially among the pediatric population. Current diagnostic methods are subjective and depend on visual cues from an otoscope, which is limited for otologists to identify pathology. To address this shortcoming, endoscopic optical coherence tomography (OCT) provides both morphological and functional in-vivo measurements of the middle ear. However, due to the shadow of prior structures, interpretation of OCT images is challenging and time-consuming. To facilitate fast diagnosis and measurement, improvement in the readability of OCT data is achieved by merging morphological knowledge from ex-vivo middle ear models with OCT volumetric data, so that OCT applications can be further promoted in daily clinical settings. Methods: We propose C2P-Net: a two-staged non-rigid registration pipeline for complete to partial point clouds, which are sampled from ex-vivo and in-vivo OCT models, respectively. To overcome the lack of labeled training data, a fast and effective generation pipeline in Blender3D is designed to simulate middle ear shapes and extract in-vivo noisy and partial point clouds. Results: We evaluate the performance of C2P-Net through experiments on both synthetic and real OCT datasets. The results demonstrate that C2P-Net is generalized to unseen middle ear point clouds and capable of handling realistic noise and incompleteness in synthetic and real OCT data. Conclusion: In this work, we aim to enable diagnosis of middle ear structures with the assistance of OCT images. We propose C2P-Net: a two-staged non-rigid registration pipeline for point clouds to support the interpretation of in-vivo noisy and partial OCT images for the first time. Code is available at: https://gitlab.com/nct\_tso\_public/c2p-net.
CRJul 12, 2023Code
SoK: Comparing Different Membership Inference Attacks with a Comprehensive BenchmarkJun Niu, Xiaoyan Zhu, Moxuan Zeng et al.
Membership inference (MI) attacks threaten user privacy through determining if a given data example has been used to train a target model. However, it has been increasingly recognized that the "comparing different MI attacks" methodology used in the existing works has serious limitations. Due to these limitations, we found (through the experiments in this work) that some comparison results reported in the literature are quite misleading. In this paper, we seek to develop a comprehensive benchmark for comparing different MI attacks, called MIBench, which consists not only the evaluation metrics, but also the evaluation scenarios. And we design the evaluation scenarios from four perspectives: the distance distribution of data samples in the target dataset, the distance between data samples of the target dataset, the differential distance between two datasets (i.e., the target dataset and a generated dataset with only nonmembers), and the ratio of the samples that are made no inferences by an MI attack. The evaluation metrics consist of ten typical evaluation metrics. We have identified three principles for the proposed "comparing different MI attacks" methodology, and we have designed and implemented the MIBench benchmark with 84 evaluation scenarios for each dataset. In total, we have used our benchmark to fairly and systematically compare 15 state-of-the-art MI attack algorithms across 588 evaluation scenarios, and these evaluation scenarios cover 7 widely used datasets and 7 representative types of models. All codes and evaluations of MIBench are publicly available at https://github.com/MIBench/MIBench.github.io/blob/main/README.md.
AIOct 24, 2022Code
Classifying Ambiguous Identities in Hidden-Role Stochastic Games with Multi-Agent Reinforcement LearningShijie Han, Siyuan Li, Bo An et al.
Multi-agent reinforcement learning (MARL) is a prevalent learning paradigm for solving stochastic games. In most MARL studies, agents in a game are defined as teammates or enemies beforehand, and the relationships among the agents remain fixed throughout the game. However, in real-world problems, the agent relationships are commonly unknown in advance or dynamically changing. Many multi-party interactions start off by asking: who is on my team? This question arises whether it is the first day at the stock exchange or the kindergarten. Therefore, training policies for such situations in the face of imperfect information and ambiguous identities is an important problem that needs to be addressed. In this work, we develop a novel identity detection reinforcement learning (IDRL) framework that allows an agent to dynamically infer the identities of nearby agents and select an appropriate policy to accomplish the task. In the IDRL framework, a relation network is constructed to deduce the identities of other agents by observing the behaviors of the agents. A danger network is optimized to estimate the risk of false-positive identifications. Beyond that, we propose an intrinsic reward that balances the need to maximize external rewards and accurate identification. After identifying the cooperation-competition pattern among the agents, IDRL applies one of the off-the-shelf MARL methods to learn the policy. To evaluate the proposed method, we conduct experiments on Red-10 card-shedding game, and the results show that IDRL achieves superior performance over other state-of-the-art MARL methods. Impressively, the relation network has the par performance to identify the identities of agents with top human players; the danger network reasonably avoids the risk of imperfect identification. The code to reproduce all the reported results is available online at https://github.com/MR-BENjie/IDRL.
CVAug 10, 2023Code
Double-chain Constraints for 3D Human Pose Estimation in Images and VideosHongbo Kang, Yong Wang, Mengyuan Liu et al.
Reconstructing 3D poses from 2D poses lacking depth information is particularly challenging due to the complexity and diversity of human motion. The key is to effectively model the spatial constraints between joints to leverage their inherent dependencies. Thus, we propose a novel model, called Double-chain Graph Convolutional Transformer (DC-GCT), to constrain the pose through a double-chain design consisting of local-to-global and global-to-local chains to obtain a complex representation more suitable for the current human pose. Specifically, we combine the advantages of GCN and Transformer and design a Local Constraint Module (LCM) based on GCN and a Global Constraint Module (GCM) based on self-attention mechanism as well as a Feature Interaction Module (FIM). The proposed method fully captures the multi-level dependencies between human body joints to optimize the modeling capability of the model. Moreover, we propose a method to use temporal information into the single-frame model by guiding the video sequence embedding through the joint embedding of the target frame, with negligible increase in computational cost. Experimental results demonstrate that DC-GCT achieves state-of-the-art performance on two challenging datasets (Human3.6M and MPI-INF-3DHP). Notably, our model achieves state-of-the-art performance on all action categories in the Human3.6M dataset using detected 2D poses from CPN, and our code is available at: https://github.com/KHB1698/DC-GCT.
CLMay 5, 2022
Balancing Multi-Domain Corpora Learning for Open-Domain Response GenerationYujie Xing, Jinglun Cai, Nils Barlaug et al. · amazon-science
Open-domain conversational systems are assumed to generate equally good responses on multiple domains. Previous work achieved good performance on the single corpus, but training and evaluating on multiple corpora from different domains are less studied. This paper explores methods of generating relevant responses for each of multiple multi-domain corpora. We first examine interleaved learning which intermingles multiple corpora as the baseline. We then investigate two multi-domain learning methods, labeled learning and multi-task labeled learning, which encode each corpus through a unique corpus embedding. Furthermore, we propose Domain-specific Frequency (DF), a novel word-level importance weight that measures the relative importance of a word for a specific corpus compared to other corpora. Based on DF, we propose weighted learning, a method that integrates DF to the loss function. We also adopt DF as a new evaluation metric. Extensive experiments show that our methods gain significant improvements on both automatic and human evaluation. We share our code and data for reproducibility
CVSep 10, 2022
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection networkTiancheng Zhao, Peng Liu, Kyusong Lee · cmu
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. This work introduces OmDet, a novel language-aware object detection architecture, and an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training. Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets, unifying the task as a language-conditioned detection framework. Our multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes to numerous training datasets without manual label taxonomy merging. We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of our deep fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing.
CRMay 28
Strengthening Polymorphic Prompt Assembling: Dynamic Separator Generation Against Emerging Prompt Injection AttacksNima Dorzhiev, Peng Liu
Polymorphic Prompt Assembling (PPA) defends LLM agents against prompt injections by randomly selecting separator pairs from a fixed pool to isolate user input from system instructions. Although effective, static pool reuse exposes a blast-radius vulnerability: once a separator leaks, it can be exploited in future requests. We propose a dynamic per-request separator generation using domain-separated SHA-256 digests keyed on the timestamp, session identifier, and cryptographic nonce. Each assembled prompt receives a unique (BEGIN, END) canary pair, thereby limiting leakage exposure to a single request. We evaluated our extension against 16 injection payloads on Llama-3.3-70B-Instruct-Turbo, with cross-model validation on DeepSeek-V4-Flash model. Against the M1 obfuscation payload (leetspeak + urgency), the dynamic mode reduces the Attack Success Rate (ASR) from 0.88 to 0.38, yielding a statistically significant 2.3 x mitigation verified by non-overlapping 95% Wilson confidence intervals. Against format_breakout_salad, static separator leakage (leak_rate = 0.467) is eliminated entirely in the dynamic mode (0.000), confirming the blast-radius reduction in practice. The implementation requires no model fine-tuning, adds 2.7 microseconds prompt-assembly overhead per request, and is backward compatible with the existing PPA SDK.
CVSep 25, 2024Code
BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained DevicesYongqi Xu, Yujian Lee, Gao Yi et al.
Deep neural networks (DNNs) are powerful for cognitive tasks such as image classification, object detection, and scene segmentation. One drawback however is the significant high computational complexity and memory consumption, which makes them unfeasible to run real-time on embedded platforms because of the limited hardware resources. Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden owing to their capability to effectively capture the broad data distribution of DNN models. Unfortunately, prior works on BFP-based quantization empirically choose the block size and the precision that preserve accuracy. In this paper, we develop a BFP-based bitwidth-aware analytical modeling framework (called ``BitQ'') for the best BFP implementation of DNN inference on embedded platforms. We formulate and resolve an optimization problem to identify the optimal BFP block size and bitwidth distribution by the trade-off of both accuracy and performance loss. Experimental results show that compared with an equal bitwidth setting, the BFP DNNs with optimized bitwidth allocation provide efficient computation, preserving accuracy on famous benchmarks. The source code and data are available at https://github.com/Cheliosoops/BitQ.
CRJul 24, 2023
How Does Naming Affect LLMs on Code Analysis Tasks?Zhilong Wang, Lan Zhang, Chen Cao et al.
The Large Language Models (LLMs), such as GPT and BERT, were proposed for natural language processing (NLP) and have shown promising results as general-purpose language models. An increasing number of industry professionals and researchers are adopting LLMs for program analysis tasks. However, one significant difference between programming languages and natural languages is that a programmer has the flexibility to assign any names to variables, methods, and functions in the program, whereas a natural language writer does not. Intuitively, the quality of naming in a program affects the performance of LLMs in program analysis tasks. This paper investigates how naming affects LLMs on code analysis tasks. Specifically, we create a set of datasets with code containing nonsense or misleading names for variables, methods, and functions, respectively. We then use well-trained models (CodeBERT) to perform code analysis tasks on these datasets. The experimental results show that naming has a significant impact on the performance of code analysis tasks based on LLMs, indicating that code representation learning based on LLMs heavily relies on well-defined names in code. Additionally, we conduct a case study on some special code analysis tasks using GPT, providing further insights.
LGOct 16, 2022
Nowhere to Hide: A Lightweight Unsupervised Detector against Adversarial ExamplesHui Liu, Bo Zhao, Kehuan Zhang et al.
Although deep neural networks (DNNs) have shown impressive performance on many perceptual tasks, they are vulnerable to adversarial examples that are generated by adding slight but maliciously crafted perturbations to benign images. Adversarial detection is an important technique for identifying adversarial examples before they are entered into target DNNs. Previous studies to detect adversarial examples either targeted specific attacks or required expensive computation. How design a lightweight unsupervised detector is still a challenging problem. In this paper, we propose an AutoEncoder-based Adversarial Examples (AEAE) detector, that can guard DNN models by detecting adversarial examples with low computation in an unsupervised manner. The AEAE includes only a shallow autoencoder but plays two roles. First, a well-trained autoencoder has learned the manifold of benign examples. This autoencoder can produce a large reconstruction error for adversarial images with large perturbations, so we can detect significantly perturbed adversarial examples based on the reconstruction error. Second, the autoencoder can filter out the small noise and change the DNN's prediction on adversarial examples with small perturbations. It helps to detect slightly perturbed adversarial examples based on the prediction distance. To cover these two cases, we utilize the reconstruction error and prediction distance from benign images to construct a two-tuple feature set and train an adversarial detector using the isolation forest algorithm. We show empirically that the AEAE is unsupervised and inexpensive against the most state-of-the-art attacks. Through the detection in these two cases, there is nowhere to hide adversarial examples.
LGJul 7, 2022
Learning-based Autonomous Channel Access in the Presence of Hidden TerminalsYulin Shao, Yucheng Cai, Taotao Wang et al.
We consider the problem of autonomous channel access (AutoCA), where a group of terminals tries to discover a communication strategy with an access point (AP) via a common wireless channel in a distributed fashion. Due to the irregular topology and the limited communication range of terminals, a practical challenge for AutoCA is the hidden terminal problem, which is notorious in wireless networks for deteriorating the throughput and delay performances. To meet the challenge, this paper presents a new multi-agent deep reinforcement learning paradigm, dubbed MADRL-HT, tailored for AutoCA in the presence of hidden terminals. MADRL-HT exploits topological insights and transforms the observation space of each terminal into a scalable form independent of the number of terminals. To compensate for the partial observability, we put forth a look-back mechanism such that the terminals can infer behaviors of their hidden terminals from the carrier sensed channel states as well as feedback from the AP. A window-based global reward function is proposed, whereby the terminals are instructed to maximize the system throughput while balancing the terminals' transmission opportunities over the course of learning. Extensive numerical experiments verified the superior performance of our solution benchmarked against the legacy carrier-sense multiple access with collision avoidance (CSMA/CA) protocol.
LGAug 14, 2023
IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy ReuseSiyuan Li, Hao Li, Jin Zhang et al. · tsinghua
Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy with limited samples to guide target policy learning has been a challenge. Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark tasks, and improves final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.
CLJan 20, 2023
Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation LearningLan Zhang, Chen Cao, Zhilong Wang et al.
The Bidirectional Encoder Representations from Transformers (BERT) were proposed in the natural language process (NLP) and shows promising results. Recently researchers applied the BERT to source-code representation learning and reported some good news on several downstream tasks. However, in this paper, we illustrated that current methods cannot effectively understand the logic of source codes. The representation of source code heavily relies on the programmer-defined variable and function names. We design and implement a set of experiments to demonstrate our conjecture and provide some insights for future works.
LGMay 19, 2022
Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud ClustersYang Xiang, Zhihua Wu, Weibao Gong et al.
The ever-growing model size and scale of compute have attracted increasing interests in training deep learning models over multiple nodes. However, when it comes to training on cloud clusters, especially across remote clusters, huge challenges are faced. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters, the connections between which are low-bandwidth wide area networks (WANs). We took natural language processing (NLP) as an example to show how Nebula-I works in different training phases that include: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which run through the most popular paradigm of recent deep learning. To balance the accuracy and communication efficiency, in Nebula-I, parameter-efficient training strategies, hybrid parallel computing methods and adaptive communication acceleration techniques are jointly applied. Meanwhile, security strategies are employed to guarantee the safety, reliability and privacy in intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g. GPU and NPU. Experiments demonstrate that the proposed framework could substantially maximize the training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimum developments, and the utility of existed large pre-trained models could be further promoted. We also introduced new state-of-the-art results on cross-lingual natural language inference tasks, which are generated based upon a novel learning framework and Nebula-I.
LGAug 17, 2022
Gradient-Based Meta-Learning Using Uncertainty to Weigh Loss for Few-Shot LearningLin Ding, Peng Liu, Wenfeng Shen et al.
Model-Agnostic Meta-Learning (MAML) is one of the most successful meta-learning techniques for few-shot learning. It uses gradient descent to learn commonalities between various tasks, enabling the model to learn the meta-initialization of its own parameters to quickly adapt to new tasks using a small amount of labeled training data. A key challenge to few-shot learning is task uncertainty. Although a strong prior can be obtained from meta-learning with a large number of tasks, a precision model of the new task cannot be guaranteed because the volume of the training dataset is normally too small. In this study, first,in the process of choosing initialization parameters, the new method is proposed for task-specific learner adaptively learn to select initialization parameters that minimize the loss of new tasks. Then, we propose two improved methods for the meta-loss part: Method 1 generates weights by comparing meta-loss differences to improve the accuracy when there are few classes, and Method 2 introduces the homoscedastic uncertainty of each task to weigh multiple losses based on the original gradient descent,as a way to enhance the generalization ability to novel classes while ensuring accuracy improvement. Compared with previous gradient-based meta-learning methods, our model achieves better performance in regression tasks and few-shot classification and improves the robustness of the model to the learning rate and query sets in the meta-test set.
LGMay 24, 2022
Phased Progressive Learning with Coupling-Regulation-Imbalance Loss for Imbalanced Data ClassificationLiang Xu, Yi Cheng, Fan Zhang et al.
Deep convolutional neural networks often perform poorly when faced with datasets that suffer from quantity imbalances and classification difficulties. Despite advances in the field, existing two-stage approaches still exhibit dataset bias or domain shift. To counter this, a phased progressive learning schedule has been proposed that gradually shifts the emphasis from representation learning to training the upper classifier. This approach is particularly beneficial for datasets with larger imbalances or fewer samples. Another new method a coupling-regulation-imbalance loss function is proposed, which combines three parts: a correction term, Focal loss, and LDAM loss. This loss is effective in addressing quantity imbalances and outliers, while regulating the focus of attention on samples with varying classification difficulties. These approaches have yielded satisfactory results on several benchmark datasets, including Imbalanced CIFAR10, Imbalanced CIFAR100, ImageNet-LT, and iNaturalist 2018, and can be easily generalized to other imbalanced classification models.
IVMar 24
L-UNet: An LSTM Network for Remote Sensing Image Change DetectionShuting Sun, Lin Mu, Lizhe Wang et al.
Change detection of high-resolution remote sensing images is an important task in earth observation and was extensively investigated. Recently, deep learning has shown to be very successful in plenty of remote sensing tasks. The current deep learning-based change detection method is mainly based on conventional long short-term memory (Conv-LSTM), which does not have spatial characteristics. Since change detection is a process with both spatiality and temporality, it is necessary to propose an end-to-end spatiotemporal network. To achieve this, Conv-LSTM, an extension of the Conv-LSTM structure, is introduced. Since it shares similar spatial characteristics with the convolutional layer, L-UNet, which substitutes partial convolution layers of UNet-to-Conv-LSTM and Atrous L-UNet (AL-UNet), which further using Atrous structure to multiscale spatial information is proposed. Experiments on two data sets are conducted and the proposed methods show the advantages both in quantity and quality when compared with some other methods.
CVMay 7
X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and InteractionXiaoming Ren, Ru Zhen, Chao Li et al.
Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.
DCJul 10, 2023
FedDCT: A Dynamic Cross-Tier Federated Learning Framework in Wireless NetworksYouquan Xian, Xiaoyun Gan, Chuanjian Yao et al.
Federated Learning (FL), as a privacy-preserving machine learning paradigm, trains a global model across devices without exposing local data. However, resource heterogeneity and inevitable stragglers in wireless networks severely impact the efficiency and accuracy of FL training. In this paper, we propose a novel Dynamic Cross-Tier Federated Learning framework (FedDCT). Firstly, we design a dynamic tiering strategy that dynamically partitions devices into different tiers based on their response times and assigns specific timeout thresholds to each tier to reduce single-round training time. Then, we propose a cross-tier device selection algorithm that selects devices that respond quickly and are conducive to model convergence to improve convergence efficiency and accuracy. Experimental results demonstrate that the proposed approach under wireless networks outperforms the baseline approach, with an average reduction of 54.7\% in convergence time and an average improvement of 1.83\% in convergence accuracy.
CVJun 14, 2023
C$^3$PS: Context-aware Conditional Cross Pseudo Supervision for Semi-supervised Medical Image SegmentationPeng Liu, Guoyan Zheng
Semi-supervised learning (SSL) methods, which can leverage a large amount of unlabeled data for improved performance, has attracted increasing attention recently. In this paper, we introduce a novel Context-aware Conditional Cross Pseudo Supervision method (referred as C$^3$PS) for semi-supervised medical image segmentation. Unlike previously published Cross Pseudo Supervision (CPS) works, this paper introduces a novel Conditional Cross Pseudo Supervision (CCPS) mechanism where the cross pseudo supervision is conditioned on a given class label. Context-awareness is further introduced in the CCPS to improve the quality of pseudo-labels for cross pseudo supervision. The proposed method has the additional advantage that in the later training stage, it can focus on the learning of hard organs. Validated on two typical yet challenging medical image segmentation tasks, our method demonstrates superior performance over the state-of-the-art methods.
LGMar 21, 2023
Indeterminate Probability TheoryTao Yang, Chuang Liu, Xiaofeng Ma et al.
Complex continuous or mixed joint distributions (e.g., P(Y | z_1, z_2, ..., z_N)) generally lack closed-form solutions, often necessitating approximations such as MCMC. This paper proposes Indeterminate Probability Theory (IPT), which makes the following contributions: (1) An observer-centered framework in which experimental outcomes are represented as distributions combining ground truth with observation error; (2) The introduction of three independence candidate axioms that enable a two-phase probabilistic inference framework; (3) The derivation of closed-form solutions for arbitrary complex joint distributions under this framework. Both the Indeterminate Probability Neural Network (IPNN) model and the non-neural multivariate time series forecasting application demonstrate IPT's effectiveness in modeling high-dimensional distributions, with successful validation up to 1000 dimensions. Importantly, IPT is consistent with classical probability theory and subsumes the frequentist equation in the limit of vanishing observation error.
IVAug 10, 2022
KiPA22 Report: U-Net with Contour Regularization for Renal Structures SegmentationKangqing Ye, Peng Liu, Xiaoyang Zou et al.
Three-dimensional (3D) integrated renal structures (IRS) segmentation is important in clinical practice. With the advancement of deep learning techniques, many powerful frameworks focusing on medical image segmentation are proposed. In this challenge, we utilized the nnU-Net framework, which is the state-of-the-art method for medical image segmentation. To reduce the outlier prediction for the tumor label, we combine contour regularization (CR) loss of the tumor label with Dice loss and cross-entropy loss to improve this phenomenon.
CLFeb 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active ParametersAilin Huang, Ang Li, Aobo Kong et al.
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
ROMay 6
Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-TuningYingnan Zhao, Xinmiao Wang, Dewei Wang et al.
Humanoid robots are promising to learn a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC) that adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback in performing adaptive behaviors on more diverse terrains, enhancing terrain adaptability for the controller. We conduct experiments in both simulation and real-world experiments in Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains. Project website: https://ahc-humanoid.github.io.
LGNov 11, 2025
An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based ModelsJinbo Li, Peng Liu, Long Chen et al.
The integration of different learning paradigms has long been a focus of machine learning research, aimed at overcoming the inherent limitations of individual methods. Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields. However, they face challenges such as complex design specifications and scalability issues with large datasets. The fusion of different techniques and strategies, particularly Gradient Boosting, with Fuzzy Rule-Based Models offers a robust solution to these challenges. This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability. At each iteration, a Fuzzy Rule-Based Model is constructed and controlled by a dynamic factor to optimize its contribution to the overall ensemble. This control factor serves multiple purposes: it prevents model dominance, encourages diversity, acts as a regularization parameter, and provides a mechanism for dynamic tuning based on model performance, thus mitigating the risk of overfitting. Additionally, the framework incorporates a sample-based correction mechanism that allows for adaptive adjustments based on feedback from a validation set. Experimental results substantiate the efficacy of the presented gradient boosting framework for fuzzy rule-based models, demonstrating performance enhancement, especially in terms of mitigating overfitting and complexity typically associated with many rules. By leveraging an optimal factor to govern the contribution of each model, the framework improves performance, maintains interpretability, and simplifies the maintenance and update of the models.
CVApr 10, 2025Code
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language ModelHaozhan Shen, Peng Liu, Jingcheng Li et al. · cmu
Recently DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
CVApr 7Code
A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language PromptingYongchuan Cui, Peng Liu
Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: https://github.com/yc-cui/LLaRS
LGAug 16, 2024
Beam Prediction based on Large Language ModelsYucheng Sheng, Kai Huang, Le Liang et al.
In this letter, we use large language models (LLMs) to develop a high-performing and robust beam prediction method. We formulate the millimeter wave (mmWave) beam prediction problem as a time series forecasting task, where the historical observations are aggregated through cross-variable attention and then transformed into text-based representations using a trainable tokenizer. By leveraging the prompt-as-prefix (PaP) technique for contextual enrichment, our method harnesses the power of LLMs to predict future optimal beams. Simulation results demonstrate that our LLM-based approach outperforms traditional learning-based models in prediction accuracy as well as robustness, highlighting the significant potential of LLMs in enhancing wireless communication systems.
CLFeb 17, 2025Code
Step-Audio: Unified Understanding and Generation in Intelligent Speech InteractionAilin Huang, Boyong Wu, Bruce Wang et al.
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
LGMay 22
Curriculum reinforcement learning with measurable task representation learningYongyan Wen, Siyuan Li, Mingjian Fu et al.
In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Among the rich CRL literature, the interpolation-based CRL paradigm is a main body, which automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in task space with meaningful distance metrics (i.e., can measure the task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex task, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure the similarity, we propose to transform the task space to a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we achieve a latent task representation with a task similarity measurement property, and two close task embeddings correspond to two similar tasks in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks more and more similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experiment results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.
CVJan 5
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and GenerationHuichao Zhang, Liao Qu, Yiheng Liu et al.
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
CLJul 22, 2025Code
Step-Audio 2 Technical ReportBoyong Wu, Chao Yan, Chen Hu et al.
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
CLDec 23, 2025
Step-DeepResearch Technical ReportChen Hu, Haikuo Du, Heng Wang et al.
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
IRMay 20
MemConflict: Evaluating Long-Term Memory Systems Under Memory ConflictsZhen Tao, Jinxiang Zhao, Peng Liu et al.
Long-term memory systems enable conversational agents based on large language models (LLMs) to retain, retrieve, and apply user-specific information across multi-session interactions. However, existing evaluations mainly assess outcome-level performance or temporal updating, providing limited insight into how systems retrieve and rank temporally valid, factually correct, and contextually applicable memory evidence under conflicting alternatives. To address this gap, we propose MemConflict, a diagnostic framework that treats memory validity as a query-conditioned fitness-for-use problem. MemConflict formalizes dynamic, static, and conditional conflicts over temporal validity, factual correctness, and contextual applicability. It simulates controlled long-horizon histories from structured user profiles, introduces cross-session conflicts, and injects semantically similar distractors to create competition among memory candidates. The resulting multi-session dialogue benchmark supports black-box evaluation of final answers and white-box analysis of supporting-memory retrieval and ranking. Experiments on six representative long-term memory systems show uneven strengths across conflict types, with answer correctness often diverging from memory retrieval and ranking. Sensitivity analyses reveal that longer histories, distractors, implicit queries, and larger conflict distances degrade performance. Diagnostics show failures from missing supporting memories and ineffective use of retrieved memories. Collectively, MemConflict advances principled long-term memory governance through retrieval-aware, conflict-aware reliability assessment.
CVMar 11, 2024Code
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion HeadTiancheng Zhao, Peng Liu, Xuan He et al. · cmu
End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves a 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. Notably, in zero-shot scenarios on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: \url{https://github.com/om-ai-lab/OmDet}
CVNov 30, 2025
RS-ISRefiner: Towards Better Adapting Vision Foundation Models for Interactive Segmentation of Remote Sensing ImagesDeliang Wang, Peng Liu
Interactive image segmentation(IIS) plays a critical role in generating precise annotations for remote sensing imagery, where objects often exhibit scale variations, irregular boundaries and complex backgrounds. However, existing IIS methods, primarily designed for natural images, struggle to generalize to remote sensing domains due to limited annotated data and computational overhead. To address these challenges, we proposed RS-ISRefiner, a novel click-based IIS framework tailored for remote sensing images. The framework employs an adapter-based tuning strategy that preserves the general representations of Vision Foundation Models while enabling efficient learning of remote sensing-specific spatial and boundary characteristics. A hybrid attention mechanism integrating convolutional local modeling with Transformer-based global reasoning enhances robustness against scale diversity and scene complexity. Furthermore, an improved probability map modulation scheme effectively incorporates historical user interactions, yielding more stable iterative refinement and higher boundary fidelity. Comprehensive experiments on six remote sensing datasets, including iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban and WHUBuilding, demonstrate that RS-ISRefiner consistently outperforms state-of-the-art IIS methods in terms of segmentation accuracy, efficiency and interaction cost. These results confirm the effectiveness and generalizability of our framework, making it highly suitable for high-quality instance segmentation in practical remote sensing scenarios.
LGOct 22, 2023
Robust Visual Imitation Learning with Inverse Dynamics RepresentationsSiyuan Li, Xun Wang, Rongchang Zuo et al.
Imitation learning (IL) has achieved considerable success in solving complex sequential decision-making problems. However, current IL methods mainly assume that the environment for learning policies is the same as the environment for collecting expert datasets. Therefore, these methods may fail to work when there are slight differences between the learning and expert environments, especially for challenging problems with high-dimensional image observations. However, in real-world scenarios, it is rare to have the chance to collect expert trajectories precisely in the target learning environment. To address this challenge, we propose a novel robust imitation learning approach, where we develop an inverse dynamics state representation learning objective to align the expert environment and the learning environment. With the abstract state representation, we design an effective reward function, which thoroughly measures the similarity between behavior data and expert data not only element-wise, but also from the trajectory level. We conduct extensive experiments to evaluate the proposed approach under various visual perturbations and in diverse visual control tasks. Our approach can achieve a near-expert performance in most environments, and significantly outperforms the state-of-the-art visual IL methods and robust IL methods.
CRMay 18
Not What You Asked For: Typographic Attacks in Household Robot ManipulationAli Iranmanesh, Peng Liu
Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success Rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, under uncontrolled viewing angles and occlusion with no perceptual optimization. Critically, we find that perceptual errors propagate through the persistent 3D semantic map to produce kinetic failures, defined here as physically executed grasping and transport of the wrong object driven by an adversarially poisoned semantic state. In these cases, the robot physically grasps and delivers the wrong object to a target receptacle. These results establish typographic misclassification as a real, measurable, and physically consequential threat to the safety of modular manipulation pipelines that prior typographic attack research has left unexamined.
CLNov 5, 2025
HaluMem: Evaluating Hallucinations in Memory Systems of AgentsDing Chen, Simin Niu, Kehang Li et al.
Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
LGJan 23
Robust Categorical Data Clustering Guided by Multi-Granular Competitive LearningShenghong Cai, Yiqun Zhang, Xiaopeng Luo et al.
Data set composed of categorical features is very common in big data analysis tasks. Since categorical features are usually with a limited number of qualitative possible values, the nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data. That is, data objects frequently overlap in space or subspace to form small compact clusters, and similar small clusters often form larger clusters. However, the distance space cannot be well-defined like the Euclidean distance due to the qualitative categorical data values, which brings great challenges to the cluster analysis of categorical data. In view of this, we design a Multi-Granular Competitive Penalization Learning (MGCPL) algorithm to allow potential clusters to interactively tune themselves and converge in stages with different numbers of naturally compact clusters. To leverage MGCPL, we also propose a Cluster Aggregation strategy based on MGCPL Encoding (CAME) to first encode the data objects according to the learned multi-granular distributions, and then perform final clustering on the embeddings. It turns out that the proposed MGCPL-guided Categorical Data Clustering (MCDC) approach is competent in automatically exploring the nested distribution of multi-granular clusters and highly robust to categorical data sets from various domains. Benefiting from its linear time complexity, MCDC is scalable to large-scale data sets and promising in pre-partitioning data sets or compute nodes for boosting distributed computing. Extensive experiments with statistical evidence demonstrate its superiority compared to state-of-the-art counterparts on various real public data sets.
CLMar 7, 2024
Yi: Open Foundation Models by 01.AI01. AI, Alex Young, Bei Chen et al.
We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.
CRApr 14
From IOCs to Regex: Automating CTI Operationalization for SOC with LLMsPei-Yu Tseng, Lan Zhang, ZihDwo Yeh et al.
Cyber Threat Intelligence (CTI) reports contain Indicators of Compromise (IOCs) that are critical for security operations. To operationalize these IOCs across heterogeneous logs, analysts often convert them into regular expressions (regexes) for tasks such as digital forensics, log parsing, and SIEM rule creation. However, regex construction is still largely manual, requiring analysts to extract IOCs from CTI reports and transform them into syntactically valid and semantically precise patterns. This process is slow, error-prone, and increasingly impractical as CTI volumes grow. Although recent studies have applied Large Language Models (LLMs) to IOC extraction, they typically output plain strings rather than regexes, limiting practical deployment. Plain IOCs cannot effectively capture variations in system context, log format, or attacker behavior. To address this gap, we propose IOCRegex-gen, a fully automated LLM-based regex generation system that converts IOCs into regexes. The system introduces two key innovations: (i) a group-aware mechanism that identifies which IOC segments should be represented as capture or non-capture groups, and (ii) an iterative reasoning and multi-stage validation pipeline to ensure syntactic validity and semantic correctness. Experiments on over 3,000 real CTI reports and 2,400 ground-truth strings from the MITRE ATT&CK Evaluation framework show that IOCRegex-gen achieves an average hit rate of 99.1% and a false-positive rate of only 0.8%, demonstrating its effectiveness for large-scale CTI processing and automated regex generation.
ROMar 25
PCHC: Enabling Preference Conditioned Humanoid Control via Multi-Objective Reinforcement LearningHuanyu Li, Dewei Wang, Xinmiao Wang et al.
Humanoid robots often need to balance competing objectives, such as maximizing speed while minimizing energy consumption. While current reinforcement learning (RL) methods can master complex skills like fall recovery and perceptive locomotion, they are constrained by fixed weighting strategies that produce a single suboptimal policy, rather than providing a diverse set of solutions for sophisticated multi-objective control. In this paper, we propose a novel framework leveraging Multi-Objective Reinforcement Learning (MORL) to achieve Preference-Conditioned Humanoid Control (PCHC). Unlike conventional methods that require training a series of policies to approximate the Pareto front, our framework enables a single, preference-conditioned policy to exhibit a wide spectrum of diverse behaviors. To effectively integrate these requirements, we introduce a Beta distribution-based alignment mechanism based on preference vectors modulating a Mixture-of-Experts (MoE) module. We validated our approach on two representative humanoid tasks. Extensive simulations and real-world experiments demonstrate that the proposed framework allows the robot to adaptively shift its objective priorities in real-time based on the input preference condition.
CVMar 10
IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic FrameworkFeiyu Wang, Jiayuan Yang, Zhiyuan Zhao et al.
Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
CLSep 26, 2023
Towards Data-efficient Customer Intent Recognition with Prompt-based Learning ParadigmHengyu Luo, Peng Liu, Stefan Esping
Recognizing customer intent accurately with language models based on customer-agent conversational data is essential in today's digital customer service marketplace, but it is often hindered by the lack of sufficient labeled data. In this paper, we introduce the prompt-based learning paradigm that significantly reduces the dependency on extensive datasets. Utilizing prompted training combined with answer mapping techniques, this approach allows small language models to achieve competitive intent recognition performance with only a minimal amount of training data. Furthermore, We enhance the performance by integrating active sampling and ensemble learning strategies in the prompted training pipeline. Additionally, preliminary tests in a zero-shot setting demonstrate that, with well-crafted and detailed prompts, small language models show considerable instruction-following potential even without any further training. These results highlight the viability of semantic modeling of conversational data in a more data-efficient manner with minimal data use, paving the way for advancements in AI-driven customer service.
CRMay 16
Stop Starving or Stuffing Me: Boosting Firmware Fuzzing Efficiency with On-demand Input DeliveryShandian Shen, Wei Zhou, Keming Zhao et al.
Firmware fuzzing has gained attention for identifying firmware bugs. However, current approaches often directly integrate fuzzing tools for general software. General software receives input as it encounters I/O functions, but firmware input can be received asynchronously and independently of the firmware's execution, with uncertain timing and quantity. Without full awareness of firmware's exceptions, existing solutions often imprudently deliver fuzzer-generated input to the firmware in an ad-hoc way. This either overwhelms the processing function of the firmware (stuffing) or fails to deliver enough input data to trigger input processing functions (starving). In both cases, fuzzing capability is weakened. In this paper, we comprehensively investigate the input delivery issue. To determine the optimal timing and quantity for delivering test cases, we leverage the fact that firmware has to check input availability before using data. So we employ static and dynamic analysis to map each input processing route into three stages: input retrieval, availability check, and processing. This recovered semantic information allows the fuzzer to accurately deliver input at the availability check points within the expected length range. For multiple input routes problem, we also optimize the scheduling algorithm to reach more diverse routes. Our prototype, named FIDO, can serve as an add-on to existing firmware fuzzers to enhance their test-case delivery effectiveness. Compared to ad-hoc input delivery methods used in Fuzzware and MULTIFUZZ, FIDO increases their median code coverage by up to 115% and 54%, respectively. Compared to SEmu, which requires humans to manually specify input delivery points, FIDO still improves its coverage by up to 19%. As a result, FIDO discovers known bugs significantly faster and also identifies five previously unknown bugs.
CVMar 19
Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QARuizhi Yu, Keyang Zhong, Peng Liu et al.
Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.
CVMar 27
AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass EditingTianyu Liu, Weitao Xiong, Kunming Luo et al.
Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.