IVApr 26, 2022Code
Estimating the Resize Parameter in End-to-end Learned Image CompressionLi-Heng Chen, Christos G. Bampis, Zhi Li et al.
We describe a search-free resizing framework that can further improve the rate-distortion tradeoff of recent learned image compression models. Our approach is simple: compose a pair of differentiable downsampling/upsampling layers that sandwich a neural compression model. To determine resize factors for different inputs, we utilize another neural network jointly trained with the compression model, with the end goal of minimizing the rate-distortion objective. Our results suggest that "compression friendly" downsampled representations can be quickly determined during encoding by using an auxiliary network and differentiable image warping. By conducting extensive experimental tests on existing deep image compression models, we show results that our new resizing parameter estimation framework can provide Bjøntegaard-Delta rate (BD-rate) improvement of about 10% against leading perceptual quality engines. We also carried out a subjective quality study, the results of which show that our new approach yields favorable compressed images. To facilitate reproducible research in this direction, the implementation used in this paper is being made freely available online at: https://github.com/treammm/ResizeCompression.
CLMay 31Code
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and DialectsSicheng Yang, Shulan Ruan, Shiwei Wu et al.
While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.
CVMar 24, 2023Code
PFT-SSR: Parallax Fusion Transformer for Stereo Image Super-ResolutionHansheng Guo, Juncheng Li, Guangwei Gao et al.
Stereo image super-resolution aims to boost the performance of image super-resolution by exploiting the supplementary information provided by binocular systems. Although previous methods have achieved promising results, they did not fully utilize the information of cross-view and intra-view. To further unleash the potential of binocular images, in this letter, we propose a novel Transformerbased parallax fusion module called Parallax Fusion Transformer (PFT). PFT employs a Cross-view Fusion Transformer (CVFT) to utilize cross-view information and an Intra-view Refinement Transformer (IVRT) for intra-view feature refinement. Meanwhile, we adopted the Swin Transformer as the backbone for feature extraction and SR reconstruction to form a pure Transformer architecture called PFT-SSR. Extensive experiments and ablation studies show that PFT-SSR achieves competitive results and outperforms most SOTA methods. Source code is available at https://github.com/MIVRC/PFT-PyTorch.
AIMay 27Code
MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD GenerationXiaoyu Dong, Zhi Li, Xiao-Ming Wu
Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
AIJun 3
Agents' Last ExamYiyou Sun, Xinyang Han, Weichen Zhang et al.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
CVMay 11, 2022
HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact GuidanceSoshi Shimada, Vladislav Golyanik, Zhi Li et al.
Marker-less monocular 3D human motion capture (MoCap) with scene interactions is a challenging research topic relevant for extended reality, robotics and virtual avatar generation. Due to the inherent depth ambiguity of monocular settings, 3D motions captured with existing methods often contain severe artefacts such as incorrect body-scene inter-penetrations, jitter and body floating. To tackle these issues, we propose HULC, a new approach for 3D human MoCap which is aware of the scene geometry. HULC estimates 3D poses and dense body-environment surface contacts for improved 3D localisations, as well as the absolute scale of the subject. Furthermore, we introduce a 3D pose trajectory optimisation based on a novel pose manifold sampling that resolves erroneous body-environment inter-penetrations. Although the proposed method requires less structured inputs compared to existing scene-aware monocular MoCap algorithms, it produces more physically-plausible poses: HULC significantly and consistently outperforms the existing approaches in various experiments and on different metrics. Project page: https://vcai.mpi-inf.mpg.de/projects/HULC/.
LGFeb 20, 2023
FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series ClassificationMingyue Cheng, Qi Liu, Zhiding Liu et al.
Deep learning-based algorithms, e.g., convolutional networks, have significantly facilitated multivariate time series classification (MTSC) task. Nevertheless, they suffer from the limitation in modeling long-range dependence due to the nature of convolution operations. Recent advancements have shown the potential of transformers to capture long-range dependence. However, it would incur severe issues, such as fixed scale representations, temporal-invariant and quadratic time complexity, with transformers directly applicable to the MTSC task because of the distinct properties of time series data. To tackle these issues, we propose FormerTime, an hierarchical representation model for improving the classification capacity for the MTSC task. In the proposed FormerTime, we employ a hierarchical network architecture to perform multi-scale feature maps. Besides, a novel transformer encoder is further designed, in which an efficient temporal reduction attention layer and a well-informed contextual positional encoding generating strategy are developed. To sum up, FormerTime exhibits three aspects of merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strength of both transformers and convolutional networks, and (3) tacking the efficiency challenges incurred by the self-attention mechanism. Extensive experiments performed on $10$ publicly available datasets from UEA archive verify the superiorities of the FormerTime compared to previous competitive baselines.
CVAug 17, 2022
MoCapDeform: Monocular 3D Human Motion Capture in Deformable ScenesZhi Li, Soshi Shimada, Bernt Schiele et al.
3D human motion capture from monocular RGB images respecting interactions of a subject with complex and possibly deformable environments is a very challenging, ill-posed and under-explored problem. Existing methods address it only weakly and do not model possible surface deformations often occurring when humans interact with scene surfaces. In contrast, this paper proposes MoCapDeform, i.e., a new framework for monocular 3D human motion capture that is the first to explicitly model non-rigid deformations of a 3D scene for improved 3D human pose estimation and deformable environment reconstruction. MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the camera space. It first localises a subject in the input monocular video along with dense contact labels using a new raycasting based strategy. Next, our human-environment interaction constraints are leveraged to jointly optimise global 3D human poses and non-rigid surface deformations. MoCapDeform achieves superior accuracy than competing methods on several datasets, including our newly recorded one with deforming background scenes.
CVAug 10, 2022
Benchmarking Joint Face Spoofing and Forgery Detection with Visual and Physiological CuesZitong Yu, Rizhao Cai, Zhi Li et al.
Face anti-spoofing (FAS) and face forgery detection play vital roles in securing face biometric systems from presentation attacks (PAs) and vicious digital manipulation (e.g., deepfakes). Despite promising performance upon large-scale data and powerful deep models, the generalization problem of existing approaches is still an open issue. Most of recent approaches focus on 1) unimodal visual appearance or physiological (i.e., remote photoplethysmography (rPPG)) cues; and 2) separated feature representation for FAS or face forgery detection. On one side, unimodal appearance and rPPG features are respectively vulnerable to high-fidelity face 3D mask and video replay attacks, inspiring us to design reliable multi-modal fusion mechanisms for generalized face attack detection. On the other side, there are rich common features across FAS and face forgery detection tasks (e.g., periodic rPPG rhythms and vanilla appearance for bonafides), providing solid evidence to design a joint FAS and face forgery detection system in a multi-task learning fashion. In this paper, we establish the first joint face spoofing and forgery detection benchmark using both visual appearance and physiological rPPG cues. To enhance the rPPG periodicity discrimination, we design a two-branch physiological network using both facial spatio-temporal rPPG signal map and its continuous wavelet transformed counterpart as inputs. To mitigate the modality bias and improve the fusion efficacy, we conduct a weighted batch and layer normalization for both appearance and rPPG features before multi-modal fusion. We find that the generalization capacities of both unimodal (appearance or rPPG) and multi-modal (appearance+rPPG) models can be obviously improved via joint training on these two tasks. We hope this new benchmark will facilitate the future research of both FAS and deepfake detection communities.
CVMay 8, 2022
One-Class Knowledge Distillation for Face Presentation Attack DetectionZhi Li, Rizhao Cai, Haoliang Li et al.
Face presentation attack detection (PAD) has been extensively studied by research communities to enhance the security of face recognition systems. Although existing methods have achieved good performance on testing data with similar distribution as the training data, their performance degrades severely in application scenarios with data of unseen distributions. In situations where the training and testing data are drawn from different domains, a typical approach is to apply domain adaptation techniques to improve face PAD performance with the help of target domain data. However, it has always been a non-trivial challenge to collect sufficient data samples in the target domain, especially for attack samples. This paper introduces a teacher-student framework to improve the cross-domain performance of face PAD with one-class domain adaptation. In addition to the source domain data, the framework utilizes only a few genuine face samples of the target domain. Under this framework, a teacher network is trained with source domain samples to provide discriminative feature representations for face PAD. Student networks are trained to mimic the teacher network and learn similar representations for genuine face samples of the target domain. In the test phase, the similarity score between the representations of the teacher and student networks is used to distinguish attacks from genuine ones. To evaluate the proposed framework under one-class domain adaptation settings, we devised two new protocols and conducted extensive experiments. The experimental results show that our method outperforms baselines under one-class domain adaptation settings and even state-of-the-art methods with unsupervised domain adaptation.
FLU-DYNJul 29, 2023
Rapid Flood Inundation Forecast Using Fourier Neural OperatorAlexander Y. Sun, Zhi Li, Wonhyun Lee et al.
Flood inundation forecast provides critical information for emergency planning before and during flood events. Real time flood inundation forecast tools are still lacking. High-resolution hydrodynamic modeling has become more accessible in recent years, however, predicting flood extents at the street and building levels in real-time is still computationally demanding. Here we present a hybrid process-based and data-driven machine learning (ML) approach for flood extent and inundation depth prediction. We used the Fourier neural operator (FNO), a highly efficient ML method, for surrogate modeling. The FNO model is demonstrated over an urban area in Houston (Texas, U.S.) by training using simulated water depths (in 15-min intervals) from six historical storm events and then tested over two holdout events. Results show FNO outperforms the baseline U-Net model. It maintains high predictability at all lead times tested (up to 3 hrs) and performs well when applying to new sites, suggesting strong generalization skill.
CVAug 14, 2023Code
OpenGCD: Assisting Open World Recognition with Generalized Category DiscoveryFulin Gao, Weimin Zhong, Zhixing Cao et al.
A desirable open world recognition (OWR) system requires performing three tasks: (1) Open set recognition (OSR), i.e., classifying the known (classes seen during training) and rejecting the unknown (unseen$/$novel classes) online; (2) Grouping and labeling these unknown as novel known classes; (3) Incremental learning (IL), i.e., continual learning these novel classes and retaining the memory of old classes. Ideally, all of these steps should be automated. However, existing methods mostly assume that the second task is completely done manually. To bridge this gap, we propose OpenGCD that combines three key ideas to solve the above problems sequentially: (a) We score the origin of instances (unknown or specifically known) based on the uncertainty of the classifier's prediction; (b) For the first time, we introduce generalized category discovery (GCD) techniques in OWR to assist humans in grouping unlabeled data; (c) For the smooth execution of IL and GCD, we retain an equal number of informative exemplars for each class with diversity as the goal. Moreover, we present a new performance evaluation metric for GCD called harmonic clustering accuracy. Experiments on two standard classification benchmarks and a challenging dataset demonstrate that OpenGCD not only offers excellent compatibility but also substantially outperforms other baselines. Code: https://github.com/Fulin-Gao/OpenGCD.
CVMar 16, 2023
Rehearsal-Free Domain Continual Face Anti-Spoofing: Generalize More and Forget LessRizhao Cai, Yawen Cui, Zhi Li et al.
Face Anti-Spoofing (FAS) is recently studied under the continual learning setting, where the FAS models are expected to evolve after encountering the data from new domains. However, existing methods need extra replay buffers to store previous data for rehearsal, which becomes infeasible when previous data is unavailable because of privacy issues. In this paper, we propose the first rehearsal-free method for Domain Continual Learning (DCL) of FAS, which deals with catastrophic forgetting and unseen domain generalization problems simultaneously. For better generalization to unseen domains, we design the Dynamic Central Difference Convolutional Adapter (DCDCA) to adapt Vision Transformer (ViT) models during the continual learning sessions. To alleviate the forgetting of previous domains without using previous data, we propose the Proxy Prototype Contrastive Regularization (PPCR) to constrain the continual learning with previous domain knowledge from the proxy prototypes. Simulate practical DCL scenarios, we devise two new protocols which evaluate both generalization and anti-forgetting performance. Extensive experimental results show that our proposed method can improve the generalization performance in unseen domains and alleviate the catastrophic forgetting of the previous knowledge. The codes and protocols will be released soon.
CVJun 14, 2023
Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph PropagationLikang Wu, Zhi Li, Hongke Zhao et al.
Zero-Shot Learning (ZSL), which aims at automatically recognizing unseen objects, is a promising learning paradigm to understand new real-world knowledge for machines continuously. Recently, the Knowledge Graph (KG) has been proven as an effective scheme for handling the zero-shot task with large-scale and non-attribute data. Prior studies always embed relationships of seen and unseen objects into visual information from existing knowledge graphs to promote the cognitive ability of the unseen data. Actually, real-world knowledge is naturally formed by multimodal facts. Compared with ordinary structural knowledge from a graph perspective, multimodal KG can provide cognitive systems with fine-grained knowledge. For example, the text description and visual content can depict more critical details of a fact than only depending on knowledge triplets. Unfortunately, this multimodal fine-grained knowledge is largely unexploited due to the bottleneck of feature alignment between different modalities. To that end, we propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings via a designed dense attention module and self-calibration loss. It makes the semantic transfer process of our ZSL framework learns more differentiated knowledge between entities. Our model also gets rid of the performance limitation of only using rough global features. We conduct extensive experiments and evaluate our model on large-scale real-world data. The experimental results clearly demonstrate the effectiveness of the proposed model in standard zero-shot classification tasks.
IRMar 1, 2023
GUESR: A Global Unsupervised Data-Enhancement with Bucket-Cluster Sampling for Sequential RecommendationYongqiang Han, Likang Wu, Hao Wang et al.
Sequential Recommendation is a widely studied paradigm for learning users' dynamic interests from historical interactions for predicting the next potential item. Although lots of research work has achieved remarkable progress, they are still plagued by the common issues: data sparsity of limited supervised signals and data noise of accidentally clicking. To this end, several works have attempted to address these issues, which ignored the complex association of items across several sequences. Along this line, with the aim of learning representative item embedding to alleviate this dilemma, we propose GUESR, from the view of graph contrastive learning. Specifically, we first construct the Global Item Relationship Graph (GIRG) from all interaction sequences and present the Bucket-Cluster Sampling (BCS) method to conduct the sub-graphs. Then, graph contrastive learning on this reduced graph is developed to enhance item representations with complex associations from the global view. We subsequently extend the CapsNet module with the elaborately introduced target-attention mechanism to derive users' dynamic preferences. Extensive experimental results have demonstrated our proposed GUESR could not only achieve significant improvements but also could be regarded as a general enhancement strategy to improve the performance in combination with other sequential recommendation methods.
CVMar 24, 2022
A Preliminary Research on Space Situational Awareness Based on Event CamerasKun Xiao, Pengju Li, Guohui Wang et al.
Event camera is a new type of sensor that is different from traditional cameras. Each pixel is triggered asynchronously by an event. The trigger event is the change of the brightness irradiated on the pixel. If the increment or decrement is higher than a certain threshold, the event is output. Compared with traditional cameras, event cameras have the advantages of high temporal resolution, low latency, high dynamic range, low bandwidth and low power consumption. We carried out a series of observation experiments in a simulated space lighting environment. The experimental results show that the event camera can give full play to the above advantages in space situational awareness. This article first introduces the basic principles of the event camera, then analyzes its advantages and disadvantages, then introduces the observation experiment and analyzes the experimental results, and finally, a workflow of space situational awareness based on event cameras is given.
ROMar 16, 2023
Energy Management of Multi-mode Plug-in Hybrid Electric Vehicle using Multi-agent Deep Reinforcement LearningMin Hua, Cetengfei Zhang, Fanggang Zhang et al.
The recently emerging multi-mode plug-in hybrid electric vehicle (PHEV) technology is one of the pathways making contributions to decarbonization, and its energy management requires multiple-input and multipleoutput (MIMO) control. At the present, the existing methods usually decouple the MIMO control into singleoutput (MISO) control and can only achieve its local optimal performance. To optimize the multi-mode vehicle globally, this paper studies a MIMO control method for energy management of the multi-mode PHEV based on multi-agent deep reinforcement learning (MADRL). By introducing a relevance ratio, a hand-shaking strategy is proposed to enable two learning agents to work collaboratively under the MADRL framework using the deep deterministic policy gradient (DDPG) algorithm. Unified settings for the DDPG agents are obtained through a sensitivity analysis of the influencing factors to the learning performance. The optimal working mode for the hand-shaking strategy is attained through a parametric study on the relevance ratio. The advantage of the proposed energy management method is demonstrated on a software-in-the-loop testing platform. The result of the study indicates that the learning rate of the DDPG agents is the greatest influencing factor for learning performance. Using the unified DDPG settings and a relevance ratio of 0.2, the proposed MADRL system can save up to 4% energy compared to the single-agent learning system and up to 23.54% energy compared to the conventional rule-based system.
CVDec 2, 2025Code
MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS RegistrationYaqi Wang, Zhi Li, Chengyu Wu et al.
Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.
LGJul 14, 2023
Multi-Dimensional Ability Diagnosis for Machine Learning AlgorithmsQi Liu, Zheng Gong, Zhenya Huang et al.
Machine learning algorithms have become ubiquitous in a number of applications (e.g. image classification). However, due to the insufficient measurement of traditional metrics (e.g. the coarse-grained Accuracy of each classifier), substantial gaps are usually observed between the real-world performance of these algorithms and their scores in standardized evaluations. In this paper, inspired by the psychometric theories from human measurement, we propose a task-agnostic evaluation framework Camilla, where a multi-dimensional diagnostic metric Ability is defined for collaboratively measuring the multifaceted strength of each machine learning algorithm. Specifically, given the response logs from different algorithms to data samples, we leverage cognitive diagnosis assumptions and neural networks to learn the complex interactions among algorithms, samples and the skills (explicitly or implicitly pre-defined) of each sample. In this way, both the abilities of each algorithm on multiple skills and some of the sample factors (e.g. sample difficulty) can be simultaneously quantified. We conduct extensive experiments with hundreds of machine learning algorithms on four public datasets, and our experimental results demonstrate that Camilla not only can capture the pros and cons of each algorithm more precisely, but also outperforms state-of-the-art baselines on the metric reliability, rank consistency and rank stability.
CLNov 9, 2022
Nested Named Entity Recognition from Medical Texts: An Adaptive Shared Network Architecture with Attentive CRFJunzhe Jiang, Mingyue Cheng, Qi Liu et al.
Recognizing useful named entities plays a vital role in medical information processing, which helps drive the development of medical area research. Deep learning methods have achieved good results in medical named entity recognition (NER). However, we find that existing methods face great challenges when dealing with the nested named entities. In this work, we propose a novel method, referred to as ASAC, to solve the dilemma caused by the nested phenomenon, in which the core idea is to model the dependency between different categories of entity recognition. The proposed method contains two key modules: the adaptive shared (AS) part and the attentive conditional random field (ACRF) module. The former part automatically assigns adaptive weights across each task to achieve optimal recognition accuracy in the multi-layer network. The latter module employs the attention operation to model the dependency between different entities. In this way, our model could learn better entity representations by capturing the implicit distinctions and relationships between different categories of entities. Extensive experiments on public datasets verify the effectiveness of our method. Besides, we also perform ablation analyses to deeply understand our methods.
ROMay 28
Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion ControlHao Ren, Zetong Bi, Yiming Zeng et al.
Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.
CVNov 10, 2025
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video GenerationTianrui Feng, Zhi Li, Shuo Yang et al.
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live streaming products but have hit limits on temporal consistency due to the foundation of image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Besides, scalable multi-GPU serving for real-time streams remains largely unresolved so far. To address this, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token--guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1--4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible--from individual creators to enterprise-scale platforms.
SIApr 18, 2022
Preference Enhanced Social Influence Modeling for Network-Aware Cascade PredictionLikang Wu, Hao Wang, Enhong Chen et al.
Network-aware cascade size prediction aims to predict the final reposted number of user-generated information via modeling the propagation process in social networks. Estimating the user's reposting probability by social influence, namely state activation plays an important role in the information diffusion process. Therefore, Graph Neural Networks (GNN), which can simulate the information interaction between nodes, has been proved as an effective scheme to handle this prediction task. However, existing studies including GNN-based models usually neglect a vital factor of user's preference which influences the state activation deeply. To that end, we propose a novel framework to promote cascade size prediction by enhancing the user preference modeling according to three stages, i.e., preference topics generation, preference shift modeling, and social influence activation. Our end-to-end method makes the user activating process of information diffusion more adaptive and accurate. Extensive experiments on two large-scale real-world datasets have clearly demonstrated the effectiveness of our proposed model compared to state-of-the-art baselines.
CLFeb 4
ERNIE 5.0 Technical ReportHaifeng Wang, Hua Wu, Tian Wu et al.
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
ARJul 6, 2023
TL-nvSRAM-CIM: Ultra-High-Density Three-Level ReRAM-Assisted Computing-in-nvSRAM with DC-Power Free Restore and Ternary MAC OperationsDengfeng Wang, Liukai Xu, Songyuan Liu et al.
Accommodating all the weights on-chip for large-scale NNs remains a great challenge for SRAM based computing-in-memory (SRAM-CIM) with limited on-chip capacity. Previous non-volatile SRAM-CIM (nvSRAM-CIM) addresses this issue by integrating high-density single-level ReRAMs on the top of high-efficiency SRAM-CIM for weight storage to eliminate the off-chip memory access. However, previous SL-nvSRAM-CIM suffers from poor scalability for an increased number of SL-ReRAMs and limited computing efficiency. To overcome these challenges, this work proposes an ultra-high-density three-level ReRAMs-assisted computing-in-nonvolatile-SRAM (TL-nvSRAM-CIM) scheme for large NN models. The clustered n-selector-n-ReRAM (cluster-nSnRs) is employed for reliable weight-restore with eliminated DC power. Furthermore, a ternary SRAM-CIM mechanism with differential computing scheme is proposed for energy-efficient ternary MAC operations while preserving high NN accuracy. The proposed TL-nvSRAM-CIM achieves 7.8x higher storage density, compared with the state-of-art works. Moreover, TL-nvSRAM-CIM shows up to 2.9x and 1.9x enhanced energy-efficiency, respectively, compared to the baseline designs of SRAM-CIM and ReRAM-CIM, respectively.
CVJul 24, 2022
Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic ActionsZhi Li, Lu He, Huijuan Xu
Action understanding has evolved into the era of fine granularity, as most human behaviors in real life have only minor differences. To detect these fine-grained actions accurately in a label-efficient way, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Without the careful design to capture subtle differences between fine-grained actions, previous weakly-supervised models for general action detection cannot perform well in the fine-grained setting. We propose to model actions as the combinations of reusable atomic actions which are automatically discovered from data through self-supervised clustering, in order to capture the commonality and individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels leveraging the semantic label hierarchy. Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level and coarse action class level, with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of our proposed weakly-supervised model for fine-grained action detection, and it achieves state-of-the-art results.
LGDec 27, 2025Code
Collaborative Optimization of Multiclass Imbalanced Learning: Density-Aware and Region-Guided BoostingChuantao Li, Zhi Li, Jiahao Xu et al.
Numerous studies attempt to mitigate classification bias caused by class imbalance. However, existing studies have yet to explore the collaborative optimization of imbalanced learning and model training. This constraint hinders further performance improvements. To bridge this gap, this study proposes a collaborative optimization Boosting model of multiclass imbalanced learning. This model is simple but effective by integrating the density factor and the confidence factor, this study designs a noise-resistant weight update mechanism and a dynamic sampling strategy. Rather than functioning as independent components, these modules are tightly integrated to orchestrate weight updates, sample region partitioning, and region-guided sampling. Thus, this study achieves the collaborative optimization of imbalanced learning and model training. Extensive experiments on 20 public imbalanced datasets demonstrate that the proposed model significantly outperforms eight state-of-the-art baselines. The code for the proposed model is available at: https://github.com/ChuantaoLi/DARG.
CRApr 19
Characterizing Trust Boundary Vulnerabilities in TEE Containers: An Empirical StudyWeijie Liu, Hongbo Chen, Shuo Huai et al.
Trusted Execution Environments (TEEs) have become a cornerstone of confidential computing, attracting significant attention from academia and industry. To support secure and scalable application deployment on confidential clouds, TEE containers (Tcons) have been introduced as middleware to shield applications from malicious operating systems and orchestration layers while preserving usability. In this paper, we present the first comprehensive analysis of Tcons, focusing on three critical layers: OS interfaces, encrypted I/O, and orchestration mechanisms. To enable systematic evaluation, we design TBouncer, an automated analyzer that precisely exercises and benchmarks Tcon isolation boundaries. Our study uncovers fundamental flaws in existing Tcons, leading to exploitable vulnerabilities such as code execution, denial-of-service, and information leakage. In total, we identify six attack vectors, twelve new bugs, and three CVEs. These findings provide new insights into the underestimated attack surface of Tcons and highlight key directions for building more secure and trustworthy container solutions.
LGFeb 10, 2023
ShapeWordNet: An Interpretable Shapelet Neural Network for Physiological Signal ClassificationWenqiang He, Mingyue Cheng, Qi Liu et al.
Physiological signals are high-dimensional time series of great practical values in medical and healthcare applications. However, previous works on its classification fail to obtain promising results due to the intractable data characteristics and the severe label sparsity issues. In this paper, we try to address these challenges by proposing a more effective and interpretable scheme tailored for the physiological signal classification task. Specifically, we exploit the time series shapelets to extract prominent local patterns and perform interpretable sequence discretization to distill the whole-series information. By doing so, the long and continuous raw signals are compressed into short and discrete token sequences, where both local patterns and global contexts are well preserved. Moreover, to alleviate the label sparsity issue, a multi-scale transformation strategy is adaptively designed to augment data and a cross-scale contrastive learning mechanism is accordingly devised to guide the model training. We name our method as ShapeWordNet and conduct extensive experiments on three real-world datasets to investigate its effectiveness. Comparative results show that our proposed scheme remarkably outperforms four categories of cutting-edge approaches. Visualization analysis further witnesses the good interpretability of the sequence discretization idea based on shapelets.
LGSep 6, 2022
Energy Management of Multi-mode Hybrid Electric Vehicles based on Hand-shaking Multi-agent LearningMin Hua, Zhi Li, Quan Zhou
The future transportation system will be a multi-agent network where connected AI agents can work together to address the grand challenges in our age, e.g., mitigation of real-world driving energy consumption. Distinguished from the existing research on vehicle energy management, which decoupled multiple inputs and multiple outputs (MIMO) control into single-output(MISO) control, this paper studied a multi-agent deep reinforcement learning (MADRL) framework to deal with multiple control outputs simultaneously. A new hand-shaking strategy is proposed for the DRL agents by introducing an independence ratio, and a parametric study is conducted to obtain the best setting for the MADRL framework. The study suggested that the MADRL with an independence ratio of 0.2 is the best, and more than 2.4% of energy can be saved over the conventional DRL framework.
LGJan 27Code
From Observations to Events: Event-Aware World Model for Reinforcement LearningZhao-Han Peng, Shaohui Li, Zhi Li et al.
While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision-making. Motivated by this principle, we propose the Event-Aware World Model (EAWM), a general framework that learns event-aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio-temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, and DeepMind Control 500K, DMC-GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%-45%, setting new state-of-the-art results across benchmarks. Our code is released at https://github.com/MarquisDarwin/EAWM.
CVDec 30, 2025Code
Physically-Grounded Manifold Projection Model for Generalizable Metal Artifact Reduction in Dental CBCTZhi Li, Yaqi Wang, Bingtao Ma et al.
Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP.
LGFeb 26, 2024Code
Generative Pretrained Hierarchical Transformer for Time Series ForecastingZhiding Liu, Jiqian Yang, Mingyue Cheng et al.
Recent efforts have been dedicated to enhancing time series forecasting accuracy by introducing advanced network architectures and self-supervised pretraining strategies. Nevertheless, existing approaches still exhibit two critical drawbacks. Firstly, these methods often rely on a single dataset for training, limiting the model's generalizability due to the restricted scale of the training data. Secondly, the one-step generation schema is widely followed, which necessitates a customized forecasting head and overlooks the temporal dependencies in the output series, and also leads to increased training costs under different horizon length settings. To address these issues, we propose a novel generative pretrained hierarchical transformer architecture for forecasting, named \textbf{GPHT}. There are two aspects of key designs in GPHT. On the one hand, we advocate for constructing a mixed dataset under the channel-independent assumption for pretraining our model, comprising various datasets from diverse data scenarios. This approach significantly expands the scale of training data, allowing our model to uncover commonalities in time series data and facilitating improved transfer to specific datasets. On the other hand, GPHT employs an auto-regressive forecasting approach, effectively modeling temporal dependencies in the output series. Importantly, no customized forecasting head is required, enabling \textit{a single model to forecast at arbitrary horizon settings.} We conduct sufficient experiments on eight datasets with mainstream self-supervised pretraining models and supervised models. The results demonstrated that GPHT surpasses the baseline models across various fine-tuning and zero/few-shot learning settings in the traditional long-term forecasting task. We make our codes publicly available\footnote{https://github.com/icantnamemyself/GPHT}.
MTRL-SCINov 26, 2025
Lattice-to-total thermal conductivity ratio: a phonon-glass electron-crystal descriptor for data-driven thermoelectric designYifan Sun, Zhi Li, Tetsuya Imamura et al.
Thermoelectrics (TEs) are promising candidates for energy harvesting with performance quantified by figure of merit, $ZT$. To accelerate the discovery of high-$ZT$ materials, efforts have focused on identifying compounds with low thermal conductivity $κ$. Using a curated dataset of 71,913 entries, we show that high-$ZT$ materials reside not only in the low-$κ$ regime but also cluster near a lattice-to-total thermal conductivity ratio ($κ_\mathrm{L}/κ$) of approximately 0.5, consistent with the phonon-glass electron-crystal design concept. Building on this insight, we construct a framework consisting of two machine learning models for the lattice and electronic components of thermal conductivity that jointly provide both $κ$ and $κ_\mathrm{L}/κ$ for screening and guiding the optimization of TE materials. Among 104,567 compounds screened, our models identify 2,522 ultralow-$κ$ candidates. Follow-up case studies demonstrate that this framework can reliably provide optimization strategies by suggesting new dopants and alloys that shift pristine materials toward the $κ_\mathrm{L}/κ$ approaching 0.5 regime. Ultimately, by integrating rapid screening with PGEC-guided optimization, our data-driven framework effectively bridges the critical gap between materials discovery and performance enhancement.
LGMar 3, 2024Code
ConvTimeNet: A Deep Hierarchical Fully Convolutional Model for Multivariate Time Series AnalysisMingyue Cheng, Jiqian Yang, Tingyue Pan et al.
Designing effective models for learning time series representations is foundational for time series analysis. Many previous works have explored time series representation modeling approaches and have made progress in this area. Despite their effectiveness, they lack adaptive perception of local patterns in temporally dependent basic units and fail to capture the multi-scale dependency among these units. Instead of relying on prevalent methods centered around self-attention mechanisms, we propose ConvTimeNet, a hierarchical pure convolutional model designed for time series analysis. ConvTimeNet introduces a deformable patch layer that adaptively perceives local patterns of temporally dependent basic units in a data-driven manner. Based on the extracted local patterns, hierarchical pure convolutional blocks are designed to capture dependency relationships among the representations of basic units at different scales. Moreover, a large kernel mechanism is employed to ensure that convolutional blocks can be deeply stacked, thereby achieving a larger receptive field. In this way, local patterns and their multi-scale dependencies can be effectively modeled within a single model. Extensive experiments comparing a wide range of different types of models demonstrate that pure convolutional models still exhibit strong viability, effectively addressing the aforementioned two challenges and showing superior performance across multiple tasks. The code is available for reproducibility.
CLMar 13, 2024Code
Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing PlatformMingyue Cheng, Hao Zhang, Jiqian Yang et al.
Large language model evaluation plays a pivotal role in the enhancement of its capacity. Previously, numerous methods for evaluating large language models have been proposed in this area. Despite their effectiveness, these existing works mainly focus on assessing objective questions, overlooking the capability to evaluate subjective questions which is extremely common for large language models. Additionally, these methods predominantly utilize centralized datasets for evaluation, with question banks concentrated within the evaluation platforms themselves. Moreover, the evaluation processes employed by these platforms often overlook personalized factors, neglecting to consider the individual characteristics of both the evaluators and the models being evaluated. To address these limitations, we propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models that employs a competitive scoring mechanism where users participate in ranking models based on their performance. This platform stands out not only for its support of centralized evaluations to assess the general capabilities of models but also for offering an open evaluation gateway. Through this gateway, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities. Furthermore, our platform introduces personalized evaluation scenarios, leveraging various forms of human-computer interaction to assess large language models in a manner that accounts for individual user preferences and contexts. The demonstration of BingJian can be accessed at https://github.com/Mingyue-Cheng/Bingjian.
ROAug 20, 2023
Efficient Real-time Path Planning with Self-evolving Particle Swarm Optimization in Dynamic ScenariosJinghao Xin, Zhi Li, Yang Zhang et al.
Particle Swarm Optimization (PSO) has demonstrated efficacy in addressing static path planning problems. Nevertheless, such application on dynamic scenarios has been severely precluded by PSO's low computational efficiency and premature convergence downsides. To address these limitations, we proposed a Tensor Operation Form (TOF) that converts particle-wise manipulations to tensor operations, thereby enhancing computational efficiency. Harnessing the computational advantage of TOF, a variant of PSO, designated as Self-Evolving Particle Swarm Optimization (SEPSO) was developed. The SEPSO is underpinned by a novel Hierarchical Self-Evolving Framework (HSEF) that enables autonomous optimization of its own hyper-parameters to evade premature convergence. Additionally, a Priori Initialization (PI) mechanism and an Auto Truncation (AT) mechanism that substantially elevates the real-time performance of SEPSO on dynamic path planning problems were introduced. Comprehensive experiments on four widely used benchmark optimization functions have been initially conducted to corroborate the validity of SEPSO. Following this, a dynamic simulation environment that encompasses moving start/target points and dynamic/static obstacles was employed to assess the effectiveness of SEPSO on the dynamic path planning problem. Simulation results exhibit that the proposed SEPSO is capable of generating superior paths with considerably better real-time performance (67 path planning computations per second in a regular desktop computer) in contrast to alternative methods. The code and video of this paper can be accessed here.
CLJul 31, 2024
Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-ConsistencyTaiji Li, Zhi Li, Yin Zhang
Despite large language models (LLMs) have demonstrated impressive performance in various tasks, they are still suffering from the factual inconsistency problem called hallucinations. For instance, LLMs occasionally generate content that diverges from source article, and prefer to extract information that appears at the beginning and end of the context, especially in long document summarization. Inspired by these findings, we propose to improve the faithfulness of LLMs in summarization by impelling them to process the entire article more fairly and faithfully. We present a novel summary generation strategy, namely SliSum, which exploits the ideas of sliding windows and self-consistency. Specifically, SliSum divides the source article into overlapping windows, and utilizes LLM to generate local summaries for the content in the windows. Finally, SliSum aggregates all local summaries using clustering and majority voting algorithm to produce more faithful summary of entire article. Extensive experiments demonstrate that SliSum significantly improves the faithfulness of diverse LLMs including LLaMA-2, Claude-2 and GPT-3.5 in both short and long text summarization, while maintaining their fluency and informativeness and without additional fine-tuning and resources. We further conduct qualitative and quantitative studies to investigate why SliSum works and impacts of hyperparameters in SliSum on performance.
LGMay 18
HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RLZhi Li, Songkun Yan, Jie Cao et al.
Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management - streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents - Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash - on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329-40,792 km2 ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations - reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data - remote sensing, in-situ time series, and forecaster narrative - makes domain agents a leveraged direction for AI in physical science.
SYMay 3, 2018
Convex Combination of Overlap-Save Frequency-Domain Adaptive FiltersSihai Guan, Zhi Li
In order to decrease the steady-state error and reduce the computational complexity and increase the ability to identify a large unknown system, a convex combination of overlap-save frequency-domain adaptive filters (COSFDAF) algorithm is proposed. From the articles available, most papers discuss convex combinations of adaptive-filter algorithms focusing on the time domain. Those algorithms show better performances in convergence speed and steady-state error. The major defect of those algorithms, however, is the computational complexity. To deal with this problem and motivated by frequency-domain adaptive filters (FDAF) and convex optimization, this paper gives an adaptive filter algorithm, that consists of combining the two FDAFs using the convex combination principles and derives a formula to update the mixing parameter. The computational complexity of the COSFDAF is analyzed theoretically. The simulation results show that no matter what kinds of signal to be processed, whether correlated (i.e. colored noise) or uncorrelated (i.e. white noise), the proposed algorithm has better performance in identify the unknown coefficients when compared to a single overlap-save FDAF or the convex combination of two time-domain adaptive filters.
CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic CapabilitiesGheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
LGAug 14, 2025Code
CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse PreventionQingbin Li, Rongkun Xue, Jie Wang et al.
Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize the original and the branched trajectories. The further comparison with vanilla DAPO shows that the regeneration process achieves a better performance on math reasoning tasks while sustaining a high-level entropy degree for exploration. In the second stage, we continue training with static initial-state sampling by DAPO, intentionally placing the model in a familiar state to gradually strengthen exploitation. Extensive experiments on Qwen-2.5-Math-7B show that, compared to other RLVR methods, CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy. A series of experiments further validate the effectiveness of our approach. Code is available at https://github.com/bytedance/CURE.
CVOct 30, 2023
VDIP-TGV: Blind Image Deconvolution via Variational Deep Image Prior Empowered by Total Generalized VariationTingting Wu, Zhiyan Du, Zhi Li et al.
Recovering clear images from blurry ones with an unknown blur kernel is a challenging problem. Deep image prior (DIP) proposes to use the deep network as a regularizer for a single image rather than as a supervised model, which achieves encouraging results in the nonblind deblurring problem. However, since the relationship between images and the network architectures is unclear, it is hard to find a suitable architecture to provide sufficient constraints on the estimated blur kernels and clean images. Also, DIP uses the sparse maximum a posteriori (MAP), which is insufficient to enforce the selection of the recovery image. Recently, variational deep image prior (VDIP) was proposed to impose constraints on both blur kernels and recovery images and take the standard deviation of the image into account during the optimization process by the variational principle. However, we empirically find that VDIP struggles with processing image details and tends to generate suboptimal results when the blur kernel is large. Therefore, we combine total generalized variational (TGV) regularization with VDIP in this paper to overcome these shortcomings of VDIP. TGV is a flexible regularization that utilizes the characteristics of partial derivatives of varying orders to regularize images at different scales, reducing oil painting artifacts while maintaining sharp edges. The proposed VDIP-TGV effectively recovers image edges and details by supplementing extra gradient information through TGV. Additionally, this model is solved by the alternating direction method of multipliers (ADMM), which effectively combines traditional algorithms and deep learning methods. Experiments show that our proposed VDIP-TGV surpasses various state-of-the-art models quantitatively and qualitatively.
IVDec 20, 2024Code
Efficient MedSAMs: Segment Anything in Medical Images on LaptopJun Ma, Feifei Li, Sumin Kim et al.
Promptable segmentation foundation models have emerged as a transformative approach to addressing the diverse needs in medical images, but most existing models require expensive computing, posing a big barrier to their adoption in clinical practice. In this work, we organized the first international competition dedicated to promptable medical image segmentation, featuring a large-scale dataset spanning nine common imaging modalities from over 20 different institutions. The top teams developed lightweight segmentation foundation models and implemented an efficient inference pipeline that substantially reduced computational requirements while maintaining state-of-the-art segmentation accuracy. Moreover, the post-challenge phase advanced the algorithms through the design of performance booster and reproducibility tasks, resulting in improved algorithms and validated reproducibility of the winning solution. Furthermore, the best-performing algorithms have been incorporated into the open-source software with a user-friendly interface to facilitate clinical adoption. The data and code are publicly available to foster the further development of medical image segmentation foundation models and pave the way for impactful real-world applications.
AIDec 6, 2024Code
TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in MinecraftQian Long, Zhi Li, Ran Gong et al.
Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft-bench/teamcraft.
CLSep 4, 2024
Do Large Language Models Possess Sensitive to Sentiment?Yang Liu, Xichou Zhu, Zhou Shen et al.
Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in text modal. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. Take an example in our findings, in some cases, the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
SYMay 3, 2018
Noise constrained least mean absolute third algorithmSihai Guan, Zhi Li
The learning speed of an adaptive algorithm can be improved by properly constraining the cost function of the adaptive algorithm. Besides, the stabilization of the NCLMF algorithm is more complicated, whose stability depends solely on the input power of the adaptive filter and the NCLMF algorithm with unbounded repressors is not mean square stability even for a small value of the step-size. So, in this paper, a noise variance constrained least mean absolute third (LMAT) algorithm is investigated. The noise constrained LMAT (NCLMAT) algorithm is obtained by constraining the cost function of the standard LMAT algorithm to the third-order moment of the additive noise. And it can eliminate a variety of non-Gaussian distribution of noise, such as Rayleigh noise, Binary noise and so on. The NCLMAT algorithm is a type of variable step-size LMAT algorithm where the step-size rule arises naturally from the constraints. The main aim of this work is first time to derive the NCLMAT adaptive algorithm, analyze its convergence behavior, mean square error (MSE), mean-square deviation (MSD) and assess its performance in different noise environments. Finally, the experimental results in system identification applications presented here illustrate the principle and efficiency of the NCLMAT algorithm.
CLSep 4, 2024
How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical ReviewYang Liu, Xichou Zhu, Zhou Shen et al.
The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs' performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs' capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
CLAug 6, 2024
Leveraging Inter-Chunk Interactions for Enhanced Retrieval in Large Language Model-Based Question AnsweringTiezheng Guo, Chen Wang, Yanyi Liu et al.
Retrieving external knowledge and prompting large language models with relevant information is an effective paradigm to enhance the performance of question-answering tasks. Previous research typically handles paragraphs from external documents in isolation, resulting in a lack of context and ambiguous references, particularly in multi-document and complex tasks. To overcome these challenges, we propose a new retrieval framework IIER, that leverages Inter-chunk Interactions to Enhance Retrieval. This framework captures the internal connections between document chunks by considering three types of interactions: structural, keyword, and semantic. We then construct a unified Chunk-Interaction Graph to represent all external documents comprehensively. Additionally, we design a graph-based evidence chain retriever that utilizes previous paths and chunk interactions to guide the retrieval process. It identifies multiple seed nodes based on the target question and iteratively searches for relevant chunks to gather supporting evidence. This retrieval process refines the context and reasoning chain, aiding the large language model in reasoning and answer generation. Extensive experiments demonstrate that IIER outperforms strong baselines across four datasets, highlighting its effectiveness in improving retrieval and reasoning capabilities.
CLNov 3, 2025
"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI ReviewersQin Zhou, Zhexin Zhang, Zhi Li et al.
With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.