ROJun 2
BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal InteractionsZhe Sun, Meng Wang, Lei Wang et al.
Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map and characters, and the motions are executed by self-navigating swarm robots. This system enhances robot storytelling with flexible scenarios, enabling young children to create robot dramas with everyday objects.
CVSep 2, 2024Code
VQ-Flow: Taming Normalizing Flows for Multi-Class Anomaly Detection via Hierarchical Vector QuantizationYixuan Zhou, Xing Xu, Zhe Sun et al.
Normalizing flows, a category of probabilistic models famed for their capabilities in modeling complex data distributions, have exhibited remarkable efficacy in unsupervised anomaly detection. This paper explores the potential of normalizing flows in multi-class anomaly detection, wherein the normal data is compounded with multiple classes without providing class labels. Through the integration of vector quantization (VQ), we empower the flow models to distinguish different concepts of multi-class normal data in an unsupervised manner, resulting in a novel flow-based unified method, named VQ-Flow. Specifically, our VQ-Flow leverages hierarchical vector quantization to estimate two relative codebooks: a Conceptual Prototype Codebook (CPC) for concept distinction and its concomitant Concept-Specific Pattern Codebook (CSPC) to capture concept-specific normal patterns. The flow models in VQ-Flow are conditioned on the concept-specific patterns captured in CSPC, capable of modeling specific normal patterns associated with different concepts. Moreover, CPC further enables our VQ-Flow for concept-aware distribution modeling, faithfully mimicking the intricate multi-class normal distribution through a mixed Gaussian distribution reparametrized on the conceptual prototypes. Through the introduction of vector quantization, the proposed VQ-Flow advances the state-of-the-art in multi-class anomaly detection within a unified training scheme, yielding the Det./Loc. AUROC of 99.5%/98.3% on MVTec AD. The codebase is publicly available at https://github.com/cool-xuan/vqflow.
CVFeb 6Code
AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information FusionMingyu Dou, Shi Qiu, Ming Hu et al.
Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89\% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at https://github.com/Dmygithub/AdaptOVCD.
CVApr 15, 2024Code
CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot LearningHaojian Huang, Xiaozhen Qiao, Zhuo Chen et al.
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories. This knowledge, typically encapsulated in attribute descriptions, aids in identifying class-specific visual features, thus facilitating visual-semantic alignment and improving ZSL performance. However, real-world challenges such as distribution imbalances and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, the variability in visual presentation within categories can also skew attribute-category associations. In response, we propose a bidirectional cross-modal ZSL approach CREST. It begins by extracting representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure underlying epistemic uncertainty, thereby enhancing the model's resilience against hard negatives. CREST incorporates dual learning pathways, focusing on both visual-category and attribute-category alignments, to ensure robust correlation between latent and observable spaces. Moreover, we introduce an uncertainty-informed cross-modal fusion technique to refine visual-attribute inference. Extensive experiments demonstrate our model's effectiveness and unique explainability across multiple datasets. Our code and data are available at: https://github.com/JethroJames/CREST
CVAug 8, 2024
ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning ModelYifan Chen, Xiaozhen Qiao, Zhe Sun et al.
Contrastive Language-Image Pre-training (CLIP) models excel in integrating semantic information between images and text through contrastive learning techniques. It has achieved remarkable performance in various multimodal tasks. However, the deployment of large CLIP models is hindered in resource-limited environments, while smaller models frequently fail to meet the performance benchmarks required for practical applications. In this paper, we propose a novel approach, ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model, which aims to comprehensively distill the knowledge from a large teacher CLIP model into a smaller student model, ensuring comparable performance with significantly reduced parameters. ComKD-CLIP is composed of two key mechanisms: Image Feature Alignment (IFAlign) and Educational Attention (EduAttention). IFAlign makes the image features extracted by the student model closely match those extracted by the teacher model, enabling the student to learn teacher's knowledge of extracting image features. EduAttention explores the cross-relationships between text features extracted by the teacher model and image features extracted by the student model, enabling the student model to learn how the teacher model integrates text-image features. In addition, ComKD-CLIP can refine the knowledge distilled from IFAlign and EduAttention by leveraging the text-image feature fusion results of the teacher model, ensuring the student model accurately absorbs the teacher's knowledge. Extensive experiments conducted on 11 datasets have demonstrated the superiority of the proposed method.
AIAug 1, 2024
SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer ContextHongjun An, Yifan Chen, Zhe Sun et al.
Current large language models (LLMs) primarily utilize next-token prediction method for inference, which significantly impedes their processing speed. In this paper, we introduce a novel inference methodology termed next-sentence prediction, aiming at enhancing the inference efficiency of LLMs. We present Sentence Variational Autoencoder (SentenceVAE), which includes a Sentence Encoder to compress multiple tokens in a sentence into a single token, and a Sentence Decoder to reconstruct it. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that employ a sentence-by-sentence inference method. In addition, the SentenceVAE module of SLLMs can maintain the integrity of the original semantic content by segmenting the context into sentences, thereby improving accuracy while boosting inference speed. Moreover, compared to previous LLMs, SLLMs process fewer tokens over equivalent context length, significantly reducing memory demands for self-attention computation and facilitating the handling of longer context. Extensive experiments on Wanjuan dataset have revealed that the proposed method can accelerate inference speed by 204~365%, reduce perplexity (PPL) to 46~75% of its original metric, and decrease memory overhead by 86~91% for the equivalent context length, compared to previous token-by-token methods.
ROJan 22
PUMA: Perception-driven Unified Foothold Prior for Mobility Augmented Quadruped ParkourLiang Wang, Kanzhong Yao, Yang Liu et al.
Parkour tasks for quadrupeds have emerged as a promising benchmark for agile locomotion. While human athletes can effectively perceive environmental characteristics to select appropriate footholds for obstacle traversal, endowing legged robots with similar perceptual reasoning remains a significant challenge. Existing methods often rely on hierarchical controllers that follow pre-computed footholds, thereby constraining the robot's real-time adaptability and the exploratory potential of reinforcement learning. To overcome these challenges, we present PUMA, an end-to-end learning framework that integrates visual perception and foothold priors into a single-stage training process. This approach leverages terrain features to estimate egocentric polar foothold priors, composed of relative distance and heading, guiding the robot in active posture adaptation for parkour tasks. Extensive experiments conducted in simulation and real-world environments across various discrete complex terrains, demonstrate PUMA's exceptional agility and robustness in challenging scenarios.
CVMar 17
Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater MissionsYiqiang Zhou, Yifan Chen, Zhe Sun et al.
Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.
CVApr 14, 2024Code
StreakNet-Arch: An Anti-scattering Network-based Architecture for Underwater Carrier LiDAR-Radar ImagingXuelong Li, Hongjun An, Haofei Zhao et al.
In this paper, we introduce StreakNet-Arch, a real-time, end-to-end binary-classification framework based on our self-developed Underwater Carrier LiDAR-Radar (UCLR) that embeds Self-Attention and our novel Double Branch Cross Attention (DBC-Attention) to enhance scatter suppression. Under controlled water tank validation conditions, StreakNet-Arch with Self-Attention or DBC-Attention outperforms traditional bandpass filtering and achieves higher $F_1$ scores than learning-based MP networks and CNNs at comparable model size and complexity. Real-time benchmarks on an NVIDIA RTX 3060 show a constant Average Imaging Time (54 to 84 ms) regardless of frame count, versus a linear increase (58 to 1,257 ms) for conventional methods. To facilitate further research, we contribute a publicly available streak-tube camera image dataset contains 2,695,168 real-world underwater 3D point cloud data. More importantly, we validate our UCLR system in a South China Sea trial, reaching an error of 46mm for 3D target at 1,000 m depth and 20 m range. Source code and data are available at https://github.com/BestAnHongjun/StreakNet .
IVJul 24, 2025Code
Parameter-Efficient Fine-Tuning of 3D DDPM for MRI Image Generation Using Tensor NetworksBinghua Li, Ziqing Chang, Tong Liang et al.
We address the challenge of parameter-efficient fine-tuning (PEFT) for three-dimensional (3D) U-Net-based denoising diffusion probabilistic models (DDPMs) in magnetic resonance imaging (MRI) image generation. Despite its practical significance, research on parameter-efficient representations of 3D convolution operations remains limited. To bridge this gap, we propose Tensor Volumetric Operator (TenVOO), a novel PEFT method specifically designed for fine-tuning DDPMs with 3D convolutional backbones. Leveraging tensor network modeling, TenVOO represents 3D convolution kernels with lower-dimensional tensors, effectively capturing complex spatial dependencies during fine-tuning with few parameters. We evaluate TenVOO on three downstream brain MRI datasets-ADNI, PPMI, and BraTS2021-by fine-tuning a DDPM pretrained on 59,830 T1-weighted brain MRI scans from the UK Biobank. Our results demonstrate that TenVOO achieves state-of-the-art performance in multi-scale structural similarity index measure (MS-SSIM), outperforming existing approaches in capturing spatial dependencies while requiring only 0.3% of the trainable parameters of the original model. Our code is available at: https://github.com/xiaovhua/tenvoo
CVApr 13, 2024Code
MAProtoNet: A Multi-scale Attentive Interpretable Prototypical Part Network for 3D Magnetic Resonance Imaging Brain Tumor ClassificationBinghua Li, Jie Mao, Zhe Sun et al.
Automated diagnosis with artificial intelligence has emerged as a promising area in the realm of medical imaging, while the interpretability of the introduced deep neural networks still remains an urgent concern. Although contemporary works, such as XProtoNet and MProtoNet, has sought to design interpretable prediction models for the issue, the localization precision of their resulting attribution maps can be further improved. To this end, we propose a Multi-scale Attentive Prototypical part Network, termed MAProtoNet, to provide more precise maps for attribution. Specifically, we introduce a concise multi-scale module to merge attentive features from quadruplet attention layers, and produces attribution maps. The proposed quadruplet attention layers can enhance the existing online class activation mapping loss via capturing interactions between the spatial and channel dimension, while the multi-scale module then fuses both fine-grained and coarse-grained information for precise maps generation. We also apply a novel multi-scale mapping loss for supervision on the proposed multi-scale module. Compared to existing interpretable prototypical part networks in medical imaging, MAProtoNet can achieve state-of-the-art performance in localization on brain tumor segmentation (BraTS) datasets, resulting in approximately 4% overall improvement on activation precision score (with a best score of 85.8%), without using additional annotated labels of segmentation. Our code will be released in https://github.com/TUAT-Novice/maprotonet.
LGJul 15, 2024
Static and multivariate-temporal attentive fusion transformer for readmission risk predictionZhe Sun, Runzhi Li, Jing Wang et al.
Background: Accurate short-term readmission prediction of ICU patients is significant in improving the efficiency of resource assignment by assisting physicians in making discharge decisions. Clinically, both individual static static and multivariate temporal data collected from ICU monitors play critical roles in short-term readmission prediction. Informative static and multivariate temporal feature representation capturing and fusion present challenges for accurate readmission prediction. Methods:We propose a novel static and multivariate-temporal attentive fusion transformer (SMTAFormer) to predict short-term readmission of ICU patients by fully leveraging the potential of demographic and dynamic temporal data. In SMTAFormer, we first apply an MLP network and a temporal transformer network to learn useful static and temporal feature representations, respectively. Then, the well-designed static and multivariate temporal feature fusion module is applied to fuse static and temporal feature representations by modeling intra-correlation among multivariate temporal features and constructing inter-correlation between static and multivariate temporal features. Results: We construct a readmission risk assessment (RRA) dataset based on the MIMIC-III dataset. The extensive experiments show that SMTAFormer outperforms advanced methods, in which the accuracy of our proposed method is up to 86.6%, and the area under the receiver operating characteristic curve (AUC) is up to 0.717. Conclusion: Our proposed SMTAFormer can efficiently capture and fuse static and multivariate temporal feature representations. The results show that SMTAFormer significantly improves the short-term readmission prediction performance of ICU patients through comparisons to strong baselines.
CVNov 18, 2025Code
A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and EnhancementYufeng Tian, Yifan Chen, Zhe Sun et al.
Underwater image restoration and enhancement are crucial for correcting color distortion and restoring image details, thereby establishing a fundamental basis for subsequent underwater visual tasks. However, current deep learning methodologies in this area are frequently constrained by the scarcity of high-quality paired datasets. Since it is difficult to obtain pristine reference labels in underwater scenes, existing benchmarks often rely on manually selected results from enhancement algorithms, providing debatable reference images that lack globally consistent color and authentic supervision. This limits the model's capabilities in color restoration, image enhancement, and generalization. To overcome this limitation, we propose using in-air natural images as unambiguous reference targets and translating them into underwater-degraded versions, thereby constructing synthetic datasets that provide authentic supervision signals for model learning. Specifically, we establish a generative data framework based on unpaired image-to-image translation, producing a large-scale dataset that covers 6 representative underwater degradation types. The framework constructs synthetic datasets with precise ground-truth labels, which facilitate the learning of an accurate mapping from degraded underwater images to their pristine scene appearances. Extensive quantitative and qualitative experiments across 6 representative network architectures and 3 independent test sets show that models trained on our synthetic data achieve comparable or superior color restoration and generalization performance to those trained on existing benchmarks. This research provides a reliable and scalable data-driven solution for underwater image restoration and enhancement. The generated dataset is publicly available at: https://github.com/yftian2025/SynUIEDatasets.git.
LGMay 30, 2023Code
AnoOnly: Semi-Supervised Anomaly Detection with the Only Loss on AnomaliesYixuan Zhou, Peiyu Yang, Yi Qu et al.
Semi-supervised anomaly detection (SSAD) methods have demonstrated their effectiveness in enhancing unsupervised anomaly detection (UAD) by leveraging few-shot but instructive abnormal instances. However, the dominance of homogeneous normal data over anomalies biases the SSAD models against effectively perceiving anomalies. To address this issue and achieve balanced supervision between heavily imbalanced normal and abnormal data, we develop a novel framework called AnoOnly (Anomaly Only). Unlike existing SSAD methods that resort to strict loss supervision, AnoOnly suspends it and introduces a form of weak supervision for normal data. This weak supervision is instantiated through the utilization of batch normalization, which implicitly performs cluster learning on normal data. When integrated into existing SSAD methods, the proposed AnoOnly demonstrates remarkable performance enhancements across various models and datasets, achieving new state-of-the-art performance. Additionally, our AnoOnly is natively robust to label noise when suffering from data contamination. Our code is publicly available at https://github.com/cool-xuan/AnoOnly.
AIDec 23, 2025
TongSIM: A General Platform for Simulating Intelligent MachinesZhe Sun, Kunlun Wu, Chuanjian Fu et al.
As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.
AIDec 23, 2025
S$^3$IT: A Benchmark for Spatially Situated Social Intelligence TestZhe Sun, Xueyuan Yang, Yujie Lu et al.
The integration of embodied agents into human environments demands embodied social intelligence: reasoning over both social norms and physical constraints. However, existing evaluations fail to address this integration, as they are limited to either disembodied social reasoning (e.g., in text) or socially-agnostic physical tasks. Both approaches fail to assess an agent's ability to integrate and trade off both physical and social constraints within a realistic, embodied context. To address this challenge, we introduce Spatially Situated Social Intelligence Test (S$^{3}$IT), a benchmark specifically designed to evaluate embodied social intelligence. It is centered on a novel and challenging seat-ordering task, requiring an agent to arrange seating in a 3D environment for a group of large language model-driven (LLM-driven) NPCs with diverse identities, preferences, and intricate interpersonal relationships. Our procedurally extensible framework generates a vast and diverse scenario space with controllable difficulty, compelling the agent to acquire preferences through active dialogue, perceive the environment via autonomous exploration, and perform multi-objective optimization within a complex constraint network. We evaluate state-of-the-art LLMs on S$^{3}$IT and found that they still struggle with this problem, showing an obvious gap compared with the human baseline. Results imply that LLMs have deficiencies in spatial intelligence, yet simultaneously demonstrate their ability to achieve near human-level competence in resolving conflicts that possess explicit textual cues.
CVMar 12, 2025
Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language ModelsXiaozhen Qiao, Peng Huang, Jiakang Yuan et al.
Test-time adaptation (TTA) is crucial in maintaining performance of Vision Language Models (VLMs) when facing distribution shifts, particularly when the source data or target labels are inaccessible. Existing TTA methods predominantly leverage the output probability distribution of CLIP for feature evaluation, resulting in biases under domain shifts, which cause misclassified features due to text priors or incorrect textual associations. To address these issues, we propose \underline{B}idirectional Prototype-Reward co-Evolution (BPRE), a novel VLMs framework with TTA that integrates feature quality assessment with prototype evolution via a synergistic feedback loop. First, the Multi-dimensional Quality-aware Reward Module (MQRM) is designed to evaluate feature quality and guide prototype refinement precisely. The continuous refinement of prototype quality via Prototype-Reward Interactive Evolution (PRIE) enhances the computation more robust. Through this bidirectional interaction, the precision of rewards and prototype evolution mutually reinforce each other, forming a self-evolving feedback cycle. Extensive experiments conducted on 15 diverse recognition datasets demonstrate that our model consistently achieves superior performance compared to other SOTA methods, and advances VLM generalization capabilities through emphasizing comprehensive feature evaluation.
IVMar 10, 2025
Large model enhanced computational ghost imagingYifan Chen, Hongjun An, Zhe Sun et al.
Ghost imaging (GI) achieves 2D image reconstruction through high-order correlation of 1D bucket signals and 2D light field information, particularly demonstrating enhanced detection sensitivity and high-quality image reconstruction via efficient photon collection in scattering media. Recent investigations have established that deep learning (DL) can substantially enhance the ghost imaging reconstruction quality. Furthermore, with the emergence of large models like SDXL, GPT-4, etc., the constraints of conventional DL in parameters and architecture have been transcended, enabling models to comprehensively explore relationships among all distinct positions within feature sequences. This paradigm shift has significantly advanced the capability of DL in restoring severely degraded and low-resolution imagery, making it particularly advantageous for noise-robust image reconstruction in GI applications. In this paper, we propose the first large imaging model with 1.4 billion parameters that incorporates the physical principles of GI (GILM). The proposed GILM implements a skip connection mechanism to mitigate gradient explosion challenges inherent in deep architectures, ensuring sufficient parametric capacity to capture intricate correlations among object single-pixel measurements. Moreover, GILM leverages multi-head attention mechanism to learn spatial dependencies across pixel points during image reconstruction, facilitating the extraction of comprehensive object information for subsequent reconstruction. We validated the effectiveness of GILM through a series of experiments, including simulated object imaging, imaging objects in free space, and imaging object located 52 meters away in underwater environment. The experimental results show that GILM effectively analyzes the fluctuation trends of the collected signals, thereby optimizing the recovery of the object's image from the acquired data.
CVMar 31
Compressive sensing inspired self-supervised single-pixel imagingJijun Lu, Yifan Chen, Libang Chen et al.
Single-pixel imaging (SPI) is a promising imaging modality with distinctive advantages in strongly perturbed environments. Existing SPI methods lack physical sparsity constraints and overlook the integration of local and global features, leading to severe noise vulnerability, structural distortions and blurred details. To address these limitations, we propose SISTA-Net, a compressive sensing-inspired self-supervised method for single-pixel imaging. SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) into an interpretable network consisting of a data fidelity module and a proximal mapping module. The fidelity module adopts a hybrid CNN-Visual State Space Model (VSSM) architecture to integrate local and global feature modeling, enhancing reconstruction integrity and fidelity. We leverage deep nonlinear networks as adaptive sparse transforms combined with a learnable soft-thresholding operator to impose explicit physical sparsity in the latent domain, enabling noise suppression and robustness to interference even at extremely low sampling rates. Extensive experiments on multiple simulation scenarios demonstrate that SISTA-Net outperforms state-of-the-art methods by 2.6 dB in PSNR. Real-world far-field underwater tests yield a 3.4 dB average PSNR improvement, validating its robust anti-interference capability.
ROApr 10, 2025
Learning Long Short-Term Intention within Human Daily BehaviorsZhe Sun, Rujie Wu, Xiaodong Yang et al.
In the domain of autonomous household robots, it is of utmost importance for robots to understand human behaviors and provide appropriate services. This requires the robots to possess the capability to analyze complex human behaviors and predict the true intentions of humans. Traditionally, humans are perceived as flawless, with their decisions acting as the standards that robots should strive to align with. However, this raises a pertinent question: What if humans make mistakes? In this research, we present a unique task, termed "long short-term intention prediction". This task requires robots can predict the long-term intention of humans, which aligns with human values, and the short term intention of humans, which reflects the immediate action intention. Meanwhile, the robots need to detect the potential non-consistency between the short-term and long-term intentions, and provide necessary warnings and suggestions. To facilitate this task, we propose a long short-term intention model to represent the complex intention states, and build a dataset to train this intention model. Then we propose a two-stage method to integrate the intention model for robots: i) predicting human intentions of both value-based long-term intentions and action-based short-term intentions; and 2) analyzing the consistency between the long-term and short-term intentions. Experimental results indicate that the proposed long short-term intention model can assist robots in comprehending human behavioral patterns over both long-term and short-term durations, which helps determine the consistency between long-term and short-term intentions of humans.
SDNov 17, 2025
Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMsZhe Sun, Yujun Cai, Jiayu Yao et al.
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current ALLMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench introduces a controlled question-answering benchmark designed to evaluate whether Audio-Language Models (LALMs) can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
CVNov 24, 2025
Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity CacheYuqiu Jiang, Xiaozhen Qiao, Tianyu Mei et al.
Human-Object Interaction (HOI) detection is a fundamental task in computer vision, empowering machines to comprehend human-object relationships in diverse real-world scenarios. Recent advances in VLMs have significantly improved HOI detection by leveraging rich cross-modal representations. However, most existing VLM-based approaches rely heavily on additional training or prompt tuning, resulting in substantial computational overhead and limited scalability, particularly in long-tailed scenarios where rare interactions are severely underrepresented. In this paper, we propose the Adaptive Diversity Cache (ADC) module, a novel training-free and plug-and-play mechanism designed to mitigate long-tail bias in HOI detection. ADC constructs class-specific caches that accumulate high-confidence and diverse feature representations during inference. The method incorporates frequency-aware cache adaptation that favors rare categories and is designed to enable robust prediction calibration without requiring additional training or fine-tuning. Extensive experiments on HICO-DET and V-COCO datasets show that ADC consistently improves existing HOI detectors, achieving up to +8.57\% mAP gain on rare categories and +4.39\% on the full dataset, demonstrating its effectiveness in mitigating long-tail bias while preserving overall performance.
CVOct 22, 2025
Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language ModelsXiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang et al.
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose \textbf{C}lass-Aware \textbf{P}rototype \textbf{L}earning with \textbf{N}egative \textbf{C}ontrast(\textbf{CPL-NC}), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a \textit{Class-Aware Prototype Cache} Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a \textit{Negative Contrastive Learning} Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
LGSep 27, 2025
DPFNAS: Differential Privacy-Enhanced Federated Neural Architecture Search for 6G Edge IntelligenceYang Lv, Jin Cao, Ben Niu et al.
The Sixth-Generation (6G) network envisions pervasive artificial intelligence (AI) as a core goal, enabled by edge intelligence through on-device data utilization. To realize this vision, federated learning (FL) has emerged as a key paradigm for collaborative training across edge devices. However, the sensitivity and heterogeneity of edge data pose key challenges to FL: parameter sharing risks data reconstruction, and a unified global model struggles to adapt to diverse local distributions. In this paper, we propose a novel federated learning framework that integrates personalized differential privacy (DP) and adaptive model design. To protect training data, we leverage sample-level representations for knowledge sharing and apply a personalized DP strategy to resist reconstruction attacks. To ensure distribution-aware adaptation under privacy constraints, we develop a privacy-aware neural architecture search (NAS) algorithm that generates locally customized architectures and hyperparameters. To the best of our knowledge, this is the first personalized DP solution tailored for representation-based FL with theoretical convergence guarantees. Our scheme achieves strong privacy guarantees for training data while significantly outperforming state-of-the-art methods in model performance. Experiments on benchmark datasets such as CIFAR-10 and CIFAR-100 demonstrate that our scheme improves accuracy by 6.82\% over the federated NAS method PerFedRLNAS, while reducing model size to 1/10 and communication cost to 1/20.
ROMar 25, 2025
Body Discovery of Embodied AIZhe Sun, Pengfei Tian, Xiaozhu Hu et al.
In the pursuit of realizing artificial general intelligence (AGI), the importance of embodied artificial intelligence (AI) becomes increasingly apparent. Following this trend, research integrating robots with AGI has become prominent. As various kinds of embodiments have been designed, adaptability to diverse embodiments will become important to AGI. We introduce a new challenge, termed "Body Discovery of Embodied AI", focusing on tasks of recognizing embodiments and summarizing neural signal functionality. The challenge encompasses the precise definition of an AI body and the intricate task of identifying embodiments in dynamic environments, where conventional approaches often prove inadequate. To address these challenges, we apply causal inference method and evaluate it by developing a simulator tailored for testing algorithms with virtual environments. Finally, we validate the efficacy of our algorithms through empirical testing, demonstrating their robust performance in various scenarios based on virtual environments.
HCJan 28, 2022
Towards Multi-class Pre-movement ClassificationHao Jia, Zhe Sun, Feng Duan et al.
In non-invasive brain-computer interface systems, pre-movement decoding plays an important role in the detection of movement before limbs actually move. Movement-related cortical potential is a kind of brain activity associated with pre-movement decoding. In current studies, patterns decoded from movement are mainly applied to the binary classification between movement state and resting state, such as elbow flexion and rest. The classifications between two movement states and among multiple movement states are still challenging. This study proposes a new method, the star-arrangement spectral filtering (SASF), to solve the multi-class pre-movement classification problem. We first design a referenced task-related component analysis (RTRCA) framework that consists of two modules. This first module is the classification between movement state and resting state; the second module is the classification of multiple movement states. SASF is developed by optimizing the features in RTRCA. In SASF, feature selection on filter banks is used on the first module of RTRCA, and feature selection on time windows is used on the second module of RTRCA. A linear discriminant analysis classifier is used to classify the optimized features. In the binary classification between two motions, the classification accuracy of SASF achieves 0.9670$\pm$0.0522, which is significantly higher than the result provided by the deep convolutional neural network (0.6247$\pm$0.0680) and the discriminative spatial pattern method (0.4400$\pm$0.0700). In the multi-class classification of 7 states, the classification accuracy of SASF is 0.9491$\pm$0.0372. The proposed SASF greatly improves the classification between two motions and enables the classification among multiple motions. The result shows that the movement can be decoded from EEG signals before the actual limb movement.
HCJan 28, 2022
Improving Pre-movement Pattern Detection with Filter Bank SelectionHao Jia, Zhe Sun, Feng Duan et al.
Pre-movement decoding plays an important role in movement detection and is able to detect movement onset with low-frequency electroencephalogram (EEG) signals before the limb moves. In related studies, pre-movement decoding with standard task-related component analysis (STRCA) has been demonstrated to be efficient for classification between movement state and resting state. However, the accuracies of STRCA differ among subbands in the frequency domain. Due to individual differences, the best subband differs among subjects and is difficult to be determined. This study aims to improve the performance of the STRCA method by a feature selection on multiple subbands and avoid the selection of best subbands. This study first compares three frequency range settings ($M_1$: subbands with equally spaced bandwidths; $M_2$: subbands whose high cut-off frequencies are twice the low cut-off frequencies; $M_3$: subbands that start at some specific fixed frequencies and end at the frequencies in an arithmetic sequence.). Then, we develop a mutual information based technique to select the features in these subbands. A binary support vector machine classifier is used to classify the selected essential features. The results show that $M_3$ is a better setting than the other two settings. With the filter banks in $M_3$, the classification accuracy of the proposed FBTRCA achieves 0.8700$\pm$0.1022, which means a significantly improved performance compared to STRCA (0.8287$\pm$0.1101) as well as to the cross validation and testing method (0.8431$\pm$0.1078).
IVAug 21, 2021
L3C-Stereo: Lossless Compression for Stereo ImagesZihao Huang, Zhe Sun, Feng Duan et al.
A large number of autonomous driving tasks need high-definition stereo images, which requires a large amount of storage space. Efficiently executing lossless compression has become a practical problem. Commonly, it is hard to make accurate probability estimates for each pixel. To tackle this, we propose L3C-Stereo, a multi-scale lossless compression model consisting of two main modules: the warping module and the probability estimation module. The warping module takes advantage of two view feature maps from the same domain to generate a disparity map, which is used to reconstruct the right view so as to improve the confidence of the probability estimate of the right view. The probability estimation module provides pixel-wise logistic mixture distributions for adaptive arithmetic coding. In the experiments, our method outperforms the hand-crafted compression methods and the learning-based method on all three datasets used. Then, we show that a better maximum disparity can lead to a better compression effect. Furthermore, thanks to a compression property of our model, it naturally generates a disparity map of an acceptable quality for the subsequent stereo tasks.
CVJun 22, 2021
Serial-EMD: Fast Empirical Mode Decomposition Method for Multi-dimensional Signals Based on SerializationJin Zhang, Fan Feng, Pere Marti-Puig et al.
Empirical mode decomposition (EMD) has developed into a prominent tool for adaptive, scale-based signal analysis in various fields like robotics, security and biomedical engineering. Since the dramatic increase in amount of data puts forward higher requirements for the capability of real-time signal analysis, it is difficult for existing EMD and its variants to trade off the growth of data dimension and the speed of signal analysis. In order to decompose multi-dimensional signals at a faster speed, we present a novel signal-serialization method (serial-EMD), which concatenates multi-variate or multi-dimensional signals into a one-dimensional signal and uses various one-dimensional EMD algorithms to decompose it. To verify the effects of the proposed method, synthetic multi-variate time series, artificial 2D images with various textures and real-world facial images are tested. Compared with existing multi-EMD algorithms, the decomposition time becomes significantly reduced. In addition, the results of facial recognition with Intrinsic Mode Functions (IMFs) extracted using our method can achieve a higher accuracy than those obtained by existing multi-EMD algorithms, which demonstrates the superior performance of our method in terms of the quality of IMFs. Furthermore, this method can provide a new perspective to optimize the existing EMD algorithms, that is, transforming the structure of the input signal rather than being constrained by developing envelope computation techniques or signal decomposition methods. In summary, the study suggests that the serial-EMD technique is a highly competitive and fast alternative for multi-dimensional signal analysis.
AIDec 22, 2020
Generalized Relation Learning with Semantic Correlation Awareness for Link PredictionYao Zhang, Xu Zhang, Jun Wang et al.
Developing link prediction models to automatically complete knowledge graphs has recently been the focus of significant research interest. The current methods for the link prediction taskhavetwonaturalproblems:1)the relation distributions in KGs are usually unbalanced, and 2) there are many unseen relations that occur in practical situations. These two problems limit the training effectiveness and practical applications of the existing link prediction models. We advocate a holistic understanding of KGs and we propose in this work a unified Generalized Relation Learning framework GRL to address the above two problems, which can be plugged into existing link prediction models. GRL conducts a generalized relation learning, which is aware of semantic correlations between relations that serve as a bridge to connect semantically similar relations. After training with GRL, the closeness of semantically similar relations in vector space and the discrimination of dissimilar relations are improved. We perform comprehensive experiments on six benchmarks to demonstrate the superior capability of GRL in the link prediction task. In particular, GRL is found to enhance the existing link prediction models making them insensitive to unbalanced relation distributions and capable of learning unseen relations.
CVMay 29, 2020
Automated Measurements of Key Morphological Features of Human Embryos for IVFBrian D. Leahy, Won-Dong Jang, Helen Y. Yang et al.
A major challenge in clinical In-Vitro Fertilization (IVF) is selecting the highest quality embryo to transfer to the patient in the hopes of achieving a pregnancy. Time-lapse microscopy provides clinicians with a wealth of information for selecting embryos. However, the resulting movies of embryos are currently analyzed manually, which is time consuming and subjective. Here, we automate feature extraction of time-lapse microscopy of human embryos with a machine-learning pipeline of five convolutional neural networks (CNNs). Our pipeline consists of (1) semantic segmentation of the regions of the embryo, (2) regression predictions of fragment severity, (3) classification of the developmental stage, and object instance segmentation of (4) cells and (5) pronuclei. Our approach greatly speeds up the measurement of quantitative, biologically relevant features that may aid in embryo selection.
HCApr 25, 2020
A novel multimodal approach for hybrid brain-computer interfaceZhe Sun, Zihao Huang, Feng Duan et al.
Brain-computer interface (BCI) technologies have been widely used in many areas. In particular, non-invasive technologies such as electroencephalography (EEG) or near-infrared spectroscopy (NIRS) have been used to detect motor imagery, disease, or mental state. It has been already shown in literature that the hybrid of EEG and NIRS has better results than their respective individual signals. The fusion algorithm for EEG and NIRS sources is the key to implement them in real-life applications. In this research, we propose three fusion methods for the hybrid of the EEG and NIRS-based brain-computer interface system: linear fusion, tensor fusion, and $p$th-order polynomial fusion. Firstly, our results prove that the hybrid BCI system is more accurate, as expected. Secondly, the $p$th-order polynomial fusion has the best classification results out of the three methods, and also shows improvements compared with previous studies. For a motion imagery task and a mental arithmetic task, the best detection accuracy in previous papers were 74.20\% and 88.1\%, whereas our accuracy achieved was 77.53\% and 90.19\% . Furthermore, unlike complex artificial neural network methods, our proposed methods are not as computationally demanding.
QUANT-PHOct 8, 2019
A framework for quantum homomorphic encryption with experimental demonstrationYu Zhang, Li Yu, Qi-Ping Su et al.
Quantum homomorphic encryption (QHE) is an encryption method that allows quantum computation to be performed on one party's private data with the program provided by another party, without revealing much information about the data nor the program to the opposite party. We propose a framework for (interactive) QHE based on the universal circuit approach. It contains a subprocedure of calculating a classical linear polynomial, which can be implemented with quantum or classical methods; apart from the subprocedure, the framework has low requirement on the quantum capabilities of the party who provides the circuit. We illustrate the subprocedure using a quite simple classical protocol with some privacy tradeoff. For a special case of such protocol, we obtain a scheme similar to blind quantum computation but with the output on a different party. Another way of implementing the subprocedure is to use a recently studied quantum check-based protocol, which has low requirement on the quantum capabilities of both parties. The subprocedure could also be implemented with a classical additive homomorphic encryption scheme. We demonstrate some key steps of the outer part of the framework in a quantum optics experiment.
CLJun 5, 2018
JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual FeaturesHongru Liang, Haozheng Wang, Jun Wang et al.
Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems, among others. In contrast with previous works that focus mainly on single modal or bi-modal learning, we propose to learn social media content by fusing jointly textual, acoustic, and visual information (JTAV). Effective strategies are proposed to extract fine-grained features of each modality, that is, attBiGRU and DCRNN. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates our proposed model outperforms the state-of-the-art approaches by a large margin.
MLApr 6, 2017
DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic dataZhe Sun, Ting Wang, Ke Deng et al.
Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.