CVSep 13, 2023Code
Beyond Adapting SAM: Towards End-to-End Ultrasound Image Segmentation via Auto PromptingXian Lin, Yangyang Xiang, Li Yu et al.
End-to-end medical image segmentation is of great value for computer-aided diagnosis dominated by task-specific models, usually suffering from poor generalization. With recent breakthroughs brought by the segment anything model (SAM) for universal image segmentation, extensive efforts have been made to adapt SAM for medical imaging but still encounter two major issues: 1) severe performance degradation and limited generalization without proper adaptation, and 2) semi-automatic segmentation relying on accurate manual prompts for interaction. In this work, we propose SAMUS as a universal model tailored for ultrasound image segmentation and further enable it to work in an end-to-end manner denoted as AutoSAMUS. Specifically, in SAMUS, a parallel CNN branch is introduced to supplement local information through cross-branch attention, and a feature adapter and a position adapter are jointly used to adapt SAM from natural to ultrasound domains while reducing training complexity. AutoSAMUS is realized by introducing an auto prompt generator (APG) to replace the manual prompt encoder of SAMUS to automatically generate prompt embeddings. A comprehensive ultrasound dataset, comprising about 30k images and 69k masks and covering six object categories, is collected for verification. Extensive comparison experiments demonstrate the superiority of SAMUS and AutoSAMUS against the state-of-the-art task-specific and SAM-based foundation models. We believe the auto-prompted SAM-based model has the potential to become a new paradigm for end-to-end medical image segmentation and deserves more exploration. Code and data are available at https://github.com/xianlin7/SAMUS.
CVJun 29, 2022Code
The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive PruningXian Lin, Li Yu, Kwang-Ting Cheng et al.
Vision transformers have recently set off a new wave in the field of medical image analysis due to their remarkable performance on various computer vision tasks. However, recent hybrid-/transformer-based approaches mainly focus on the benefits of transformers in capturing long-range dependency while ignoring the issues of their daunting computational complexity, high training costs, and redundant dependency. In this paper, we propose to employ adaptive pruning to transformers for medical image segmentation and propose a lightweight and effective hybrid network APFormer. To our best knowledge, this is the first work on transformer pruning for medical image analysis tasks. The key features of APFormer mainly are self-supervised self-attention (SSA) to improve the convergence of dependency establishment, Gaussian-prior relative position embedding (GRPE) to foster the learning of position information, and adaptive pruning to eliminate redundant computations and perception information. Specifically, SSA and GRPE consider the well-converged dependency distribution and the Gaussian heatmap distribution separately as the prior knowledge of self-attention and position embedding to ease the training of transformers and lay a solid foundation for the following pruning operation. Then, adaptive transformer pruning, both query-wise and dependency-wise, is performed by adjusting the gate control parameters for both complexity reduction and performance improvement. Extensive experiments on two widely-used datasets demonstrate the prominent segmentation performance of APFormer against the state-of-the-art methods with much fewer parameters and lower GFLOPs. More importantly, we prove, through ablation studies, that adaptive pruning can work as a plug-n-play module for performance improvement on other hybrid-/transformer-based methods. Code is available at https://github.com/xianlin7/APFormer.
CVJun 28, 2022Code
FedIIC: Towards Robust Federated Learning for Class-Imbalanced Medical Image ClassificationNannan Wu, Li Yu, Xin Yang et al.
Federated learning (FL), training deep models from decentralized data without privacy leakage, has shown great potential in medical image computing recently. However, considering the ubiquitous class imbalance in medical data, FL can exhibit performance degradation, especially for minority classes (e.g. rare diseases). Existing methods towards this problem mainly focus on training a balanced classifier to eliminate class prior bias among classes, but neglect to explore better representation to facilitate classification performance. In this paper, we present a privacy-preserving FL method named FedIIC to combat class imbalance from two perspectives: feature learning and classifier learning. In feature learning, two levels of contrastive learning are designed to extract better class-specific features with imbalanced data in FL. In classifier learning, per-class margins are dynamically set according to real-time difficulty and class priors, which helps the model learn classes equally. Experimental results on publicly-available datasets demonstrate the superior performance of FedIIC in dealing with both real-world and simulated multi-source medical imaging data under class imbalance. Code is available at https://github.com/wnn2000/FedIIC.
IVSep 9, 2023Code
ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image SegmentationXian Lin, Zengqiang Yan, Xianbo Deng et al.
Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet, relatively limited well-annotated medical image data makes transformers struggle to extract diverse global features, resulting in attention collapse where attention maps become similar or even identical. Comparatively, convolutional neural networks (CNNs) have better convergence properties on small-scale training data but suffer from limited receptive fields. Existing works are dedicated to exploring the combinations of CNN and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose to build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and convolutional feed-forward network (CFFN) corresponding to tokenization, self-attention, and feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction. In this way, CSA takes 2D feature maps as inputs and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, 2D convolution is utilized for feature refinement through CFFN. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer working as a plug-and-play module for consistent performance improvement of transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.
SYJul 19, 2018
A New Approach to Linear/Nonlinear Distributed Fusion Estimation ProblemBo Chen, Guoqiang Hu, Daniel W. C. Ho et al.
Disturbance noises are always bounded in a practical system, while fusion estimation is to best utilize multiple sensor data containing noises for the purpose of estimating a quantity--a parameter or process. However, few results are focused on the information fusion estimation problem under bounded noises. In this paper, we study the distributed fusion estimation problem for linear time-varying systems and nonlinear systems with bounded noises, where the addressed noises do not provide any statistical information, and are unknown but bounded. When considering linear time-varying fusion systems with bounded noises, a new local Kalman-like estimator is designed such that the square error of the estimator is bounded as time goes to $\infty$. A novel constructive method is proposed to find an upper bound of fusion estimation error, then a convex optimization problem on the design of an optimal weighting fusion criterion is established in terms of linear matrix inequalities, which can be solved by standard software packages. Furthermore, according to the design method of linear time-varying fusion systems, each local nonlinear estimator is derived for nonlinear systems with bounded noises by using Taylor series expansion, and a corresponding distributed fusion criterion is obtained by solving a convex optimization problem. Finally, target tracking system and localization of a mobile robot are given to show the advantages and effectiveness of the proposed methods.
CVJul 20, 2024Code
PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing RatesJunjie Shi, Caozhi Shang, Zhaobin Sun et al.
Incomplete multi-modal image segmentation is a fundamental task in medical imaging to refine deployment efficiency when only partial modalities are available. However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. Specifically, we first construct pixel-wise and semantic-wise self-distillation to balance the optimization objective of each modality. Then, we define relative preference to evaluate the dominance of each modality during training, based on which to design task-wise and gradient-wise regularization to balance the convergence rates of different modalities. Experimental results on two publicly available multi-modal datasets demonstrate the superiority of PASSION against existing approaches for modality balancing. More importantly, PASSION is validated to work as a plug-and-play module for consistent performance improvement across different backbones. Code is available at https://github.com/Jun-Jie-Shi/PASSION.
SYJul 1, 2021
Delay-Dependent Distributed Kalman Fusion Estimation with Dimensionality Reduction in Cyber-Physical SystemsBo Chen, Daniel W. C. Ho, Guoqiang Hu et al.
This paper studies the distributed dimensionality reduction fusion estimation problem with communication delays for a class of cyber-physical systems (CPSs). The raw measurements are preprocessed in each sink node to obtain the local optimal estimate (LOE) of a CPS, and the compressed LOE under dimensionality reduction encounters with communication delays during the transmission. Under this case, a mathematical model with compensation strategy is proposed to characterize the dimensionality reduction and communication delays. This model also has the property to reduce the information loss caused by the dimensionality reduction and delays. Based on this model, a recursive distributed Kalman fusion estimator (DKFE) is derived by optimal weighted fusion criterion in the linear minimum variance sense. A stability condition for the DKFE, which can be easily verified by the exiting software, is derived. In addition, this condition can guarantee that estimation error covariance matrix of the DKFE converges to the unique steady-state matrix for any initial values, and thus the steady-state DKFE (SDKFE) is given. Notice that the computational complexity of the SDKFE is much lower than that of the DKFE. Moreover, a probability selection criterion for determining the dimensionality reduction strategy is also presented to guarantee the stability of the DKFE. Two illustrative examples are given to show the advantage and effectiveness of the proposed methods.
CVJul 2, 2024Code
FedIA: Federated Medical Image Segmentation with Heterogeneous Annotation CompletenessYangyang Xiang, Nannan Wu, Li Yu et al.
Federated learning has emerged as a compelling paradigm for medical image segmentation, particularly in light of increasing privacy concerns. However, most of the existing research relies on relatively stringent assumptions regarding the uniformity and completeness of annotations across clients. Contrary to this, this paper highlights a prevalent challenge in medical practice: incomplete annotations. Such annotations can introduce incorrectly labeled pixels, potentially undermining the performance of neural networks in supervised learning. To tackle this issue, we introduce a novel solution, named FedIA. Our insight is to conceptualize incomplete annotations as noisy data (i.e., low-quality data), with a focus on mitigating their adverse effects. We begin by evaluating the completeness of annotations at the client level using a designed indicator. Subsequently, we enhance the influence of clients with more comprehensive annotations and implement corrections for incomplete ones, thereby ensuring that models are trained on accurate data. Our method's effectiveness is validated through its superior performance on two extensively used medical image segmentation datasets, outperforming existing solutions. The code is available at https://github.com/HUSTxyy/FedIA.
MMOct 25, 2022
End-to-end Transformer for Compressed Video Quality EnhancementLi Yu, Wenshuai Chang, Shiyu Wu et al.
Convolutional neural networks have achieved excellent results in compressed video quality enhancement task in recent years. State-of-the-art methods explore the spatiotemporal information of adjacent frames mainly by deformable convolution. However, offset fields in deformable convolution are difficult to train, and its instability in training often leads to offset overflow, which reduce the efficiency of correlation modeling. In this work, we propose a transformer-based compressed video quality enhancement (TVQE) method, consisting of Swin-AutoEncoder based Spatio-Temporal feature Fusion (SSTF) module and Channel-wise Attention based Quality Enhancement (CAQE) module. The proposed SSTF module learns both local and global features with the help of Swin-AutoEncoder, which improves the ability of correlation modeling. Meanwhile, the window mechanism-based Swin Transformer and the encoderdecoder structure greatly improve the execution efficiency. On the other hand, the proposed CAQE module calculates the channel attention, which aggregates the temporal information between channels in the feature map, and finally achieves the efficient fusion of inter-frame information. Extensive experimental results on the JCT-VT test sequences show that the proposed method achieves better performance in average for both subjective and objective quality. Meanwhile, our proposed method outperforms existing ones in terms of both inference speed and GPU consumption.
CVJun 29, 2022
BATFormer: Towards Boundary-Aware Lightweight Transformer for Efficient Medical Image SegmentationXian Lin, Li Yu, Kwang-Ting Cheng et al.
Objective: Transformers, born to remedy the inadequate receptive fields of CNNs, have drawn explosive attention recently. However, the daunting computational complexity of global representation learning, together with rigid window partitioning, hinders their deployment in medical image segmentation. This work aims to address the above two issues in transformers for better medical image segmentation. Methods: We propose a boundary-aware lightweight transformer (BATFormer) that can build cross-scale global interaction with lower computational complexity and generate windows flexibly under the guidance of entropy. Specifically, to fully explore the benefits of transformers in long-range dependency establishment, a cross-scale global transformer (CGT) module is introduced to jointly utilize multiple small-scale feature maps for richer global features with lower computational complexity. Given the importance of shape modeling in medical image segmentation, a boundary-aware local transformer (BLT) module is constructed. Different from rigid window partitioning in vanilla transformers which would produce boundary distortion, BLT adopts an adaptive window partitioning scheme under the guidance of entropy for both computational complexity reduction and shape preservation. Results: BATFormer achieves the best performance in Dice of 92.84%, 91.97%, 90.26%, and 96.30% for the average, right ventricle, myocardium, and left ventricle respectively on the ACDC dataset and the best performance in Dice, IoU, and ACC of 90.76%, 84.64%, and 96.76% respectively on the ISIC 2018 dataset. More importantly, BATFormer requires the least amount of model parameters and the lowest computational complexity compared to the state-of-the-art approaches. Conclusion and Significance: Our results demonstrate the necessity of developing customized transformers for efficient and better medical image segmentation.
CVMar 17, 2022
Surface Defect Detection and Evaluation for Marine Vessels using Multi-Stage Deep LearningLi Yu, Kareem Metwaly, James Z. Wang et al.
Detecting and evaluating surface coating defects is important for marine vessel maintenance. Currently, the assessment is carried out manually by qualified inspectors using international standards and their own experience. Automating the processes is highly challenging because of the high level of variation in vessel type, paint surface, coatings, lighting condition, weather condition, paint colors, areas of the vessel, and time in service. We present a novel deep learning-based pipeline to detect and evaluate the percentage of corrosion, fouling, and delamination on the vessel surface from normal photographs. We propose a multi-stage image processing framework, including ship section segmentation, defect segmentation, and defect classification, to automatically recognize different types of defects and measure the coverage percentage on the ship surface. Experimental results demonstrate that our proposed pipeline can objectively perform a similar assessment as a qualified inspector.
CVNov 26, 2025Code
MobileI2V: Fast and High-Resolution Image-to-Video on Mobile DevicesShuai Zhang, Bao Tang, Siyuan Yu et al.
Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.
SPJan 14, 2023
Hand Gesture Recognition through Reflected Infrared Light Wave SignalsMd Zobaer Islam, Li Yu, Hisham Abuella et al.
In this study, we present a wireless (non-contact) gesture recognition method using only incoherent light wave signals reflected from a human subject. In comparison to existing radar, light shadow, sound and camera-based sensing systems, this technology uses a low-cost ubiquitous light source (e.g., infrared LED) to send light towards the subject's hand performing gestures and the reflected light is collected by a light sensor (e.g., photodetector). This light wave sensing system recognizes different gestures from the variations of the received light intensity within a 20-35cm range. The hand gesture recognition results demonstrate up to 96% accuracy on average. The developed system can be utilized in numerous Human-computer Interaction (HCI) applications as a low-cost and non-contact gesture recognition technology.
LGSep 6, 2023
Marketing Budget Allocation with Offline Constrained Deep Reinforcement LearningTianchi Cai, Jiyan Jiang, Wenpeng Zhang et al.
We study the budget allocation problem in online marketing campaigns that utilize previously collected offline data. We first discuss the long-term effect of optimizing marketing budget allocation decisions in the offline setting. To overcome the challenge, we propose a novel game-theoretic offline value-based reinforcement learning method using mixed policies. The proposed method reduces the need to store infinitely many policies in previous methods to only constantly many policies, which achieves nearly optimal policy efficiency, making it practical and favorable for industrial usage. We further show that this method is guaranteed to converge to the optimal policy, which cannot be achieved by previous value-based reinforcement learning methods for marketing budget allocation. Our experiments on a large-scale marketing campaign with tens-of-millions users and more than one billion budget verify the theoretical results and show that the proposed method outperforms various baseline methods. The proposed method has been successfully deployed to serve all the traffic of this marketing campaign.
CLJan 20Code
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM InferenceZhiyuan Shi, Qibo Qiu, Feng Xue et al.
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift, and trigger an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context. Our code will be open-source.
CVDec 9, 2024Code
Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any GranularityHuaxin Zhang, Xiaohao Xu, Xiang Wang et al.
How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.
LGDec 20, 2023Code
FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image Segmentation against Heterogeneous Annotation NoiseNannan Wu, Zhaobin Sun, Zengqiang Yan et al.
Federated learning (FL) has emerged as a promising paradigm for training segmentation models on decentralized medical data, owing to its privacy-preserving property. However, existing research overlooks the prevalent annotation noise encountered in real-world medical datasets, which limits the performance ceilings of FL. In this paper, we, for the first time, identify and tackle this problem. For problem formulation, we propose a contour evolution for modeling non-independent and identically distributed (Non-IID) noise across pixels within each client and then extend it to the case of multi-source data to form a heterogeneous noise model (i.e., Non-IID annotation noise across clients). For robust learning from annotations with such two-level Non-IID noise, we emphasize the importance of data quality in model aggregation, allowing high-quality clients to have a greater impact on FL. To achieve this, we propose Federated learning with Annotation quAlity-aware AggregatIon, named FedA3I, by introducing a quality factor based on client-wise noise estimation. Specifically, noise estimation at each client is accomplished through the Gaussian mixture model and then incorporated into model aggregation in a layer-wise manner to up-weight high-quality clients. Extensive experiments on two real-world medical image segmentation datasets demonstrate the superior performance of FedA$^3$I against the state-of-the-art approaches in dealing with cross-client annotation noise. The code is available at https://github.com/wnn2000/FedAAAI.
LGNov 13, 2025
AgentEvolver: Towards Efficient Self-Evolving Agent SystemYunpeng Zhai, Shuchang Tao, Cheng Chen et al.
Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
CVAug 25, 2024
MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image SegmentationChaowei Chen, Li Yu, Shiquan Min et al.
State Space Models (SSMs), especially Mamba, have shown great promise in medical image segmentation due to their ability to model long-range dependencies with linear computational complexity. However, accurate medical image segmentation requires the effective learning of both multi-scale detailed feature representations and global contextual dependencies. Although existing works have attempted to address this issue by integrating CNNs and SSMs to leverage their respective strengths, they have not designed specialized modules to effectively capture multi-scale feature representations, nor have they adequately addressed the directional sensitivity problem when applying Mamba to 2D image data. To overcome these limitations, we propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet. Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder and better handle 2D visual data. Additionally, the large kernel patch expanding (LKPE) layers achieve more efficient upsampling of feature maps by simultaneously integrating spatial and channel information. Extensive experiments on the Synapse and ACDC datasets demonstrate that our approach is more effective than some state-of-the-art methods in capturing and aggregating multi-scale feature representations and modeling long-range dependencies between pixels.
CVNov 13, 2025
Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic AlignmentWenti Yin, Huaxin Zhang, Xiang Wang et al.
Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
LGApr 27, 2024Code
From Optimization to Generalization: Fair Federated Learning against Quality Shift via Inter-Client Sharpness MatchingNannan Wu, Zhuo Kuang, Zengqiang Yan et al.
Due to escalating privacy concerns, federated learning has been recognized as a vital approach for training deep neural networks with decentralized medical data. In practice, it is challenging to ensure consistent imaging quality across various institutions, often attributed to equipment malfunctions affecting a minority of clients. This imbalance in image quality can cause the federated model to develop an inherent bias towards higher-quality images, thus posing a severe fairness issue. In this study, we pioneer the identification and formulation of this new fairness challenge within the context of the imaging quality shift. Traditional methods for promoting fairness in federated learning predominantly focus on balancing empirical risks across diverse client distributions. This strategy primarily facilitates fair optimization across different training data distributions, yet neglects the crucial aspect of generalization. To address this, we introduce a solution termed Federated learning with Inter-client Sharpness Matching (FedISM). FedISM enhances both local training and global aggregation by incorporating sharpness-awareness, aiming to harmonize the sharpness levels across clients for fair generalization. Our empirical evaluations, conducted using the widely-used ICH and ISIC 2019 datasets, establish FedISM's superiority over current state-of-the-art federated learning methods in promoting fairness. Code is available at https://github.com/wnn2000/FFL4MIA.
CVMar 20, 2024Code
SAMCT: Segment Any CT Allowing Labor-Free Task-Indicator PromptsXian Lin, Yangyang Xiang, Zhehao Wang et al.
Segment anything model (SAM), a foundation model with superior versatility and generalization across diverse segmentation tasks, has attracted widespread attention in medical imaging. However, it has been proved that SAM would encounter severe performance degradation due to the lack of medical knowledge in training and local feature encoding. Though several SAM-based models have been proposed for tuning SAM in medical imaging, they still suffer from insufficient feature extraction and highly rely on high-quality prompts. In this paper, we construct a large CT dataset consisting of 1.1M CT images and 5M masks from public datasets and propose a powerful foundation model SAMCT allowing labor-free prompts. Specifically, based on SAM, SAMCT is further equipped with a U-shaped CNN image encoder, a cross-branch interaction module, and a task-indicator prompt encoder. The U-shaped CNN image encoder works in parallel with the ViT image encoder in SAM to supplement local features. Cross-branch interaction enhances the feature expression capability of the CNN image encoder and the ViT image encoder by exchanging global perception and local features from one to the other. The task-indicator prompt encoder is a plug-and-play component to effortlessly encode task-related indicators into prompt embeddings. In this way, SAMCT can work in an automatic manner in addition to the semi-automatic interactive strategy in SAM. Extensive experiments demonstrate the superiority of SAMCT against the state-of-the-art task-specific and SAM-based medical foundation models on various tasks. The code, data, and models are released at https://github.com/xianlin7/SAMCT.
LGFeb 6, 2023
Cross-Fusion Rule for Personalized Federated LearningWangzhuo Yang, Bo Chen, Yijun Shen et al.
Data scarcity and heterogeneity pose significant performance challenges for personalized federated learning, and these challenges are mainly reflected in overfitting and low precision in existing methods. To overcome these challenges, a multi-layer multi-fusion strategy framework is proposed in this paper, i.e., the server adopts the network layer parameters of each client upload model as the basic unit of fusion for information-sharing calculation. Then, a new fusion strategy combining personalized and generic is purposefully proposed, and the network layer number fusion threshold of each fusion strategy is designed according to the network layer function. Under this mechanism, the L2-Norm negative exponential similarity metric is employed to calculate the fusion weights of the corresponding feature extraction layer parameters for each client, thus improving the efficiency of heterogeneous data personalized collaboration. Meanwhile, the federated global optimal model approximation fusion strategy is adopted in the network full-connect layer, and this generic fusion strategy alleviates the overfitting introduced by forceful personalized. Finally, the experimental results show that the proposed method is superior to the state-of-the-art methods.
CVMay 13
OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact SuppressionChengyan Deng, Pengbin Yu, Zhentao Chen et al.
Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
CVApr 1Code
TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object DetectionZhijin He, Shuo Jin, Siyue Yu et al.
Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they still remain constrained by the closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs) to address CoSOD, which demonstrate a strong generalized ability and robust saliency understanding. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7\% gains over the recent training-free method). Codes are available at https://github.com/hzz-yy/TF-SSD.
AIDec 11, 2025
Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent EvolutionZouying Cao, Jiaji Deng, Li Yu et al.
Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose $\textbf{ReMe}$ ($\textit{Remember Me, Refine Me}$), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) $\textit{multi-faceted distillation}$, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) $\textit{context-adaptive reuse}$, which tailors historical insights to new contexts via scenario-aware indexing; and 3) $\textit{utility-based refinement}$, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory system. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the $\texttt{reme.library}$ dataset to facilitate further research.
AIJan 29Code
Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail StoresZhiyong Shen, Gongpeng Zhao, Jun Zhou et al.
Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
CVNov 25, 2025Code
Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled PairsBao Tang, Shuai Zhang, Yueting Zhu et al.
Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.
AIOct 17, 2025Code
Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement LearningBoyin Liu, Zhuo Zhang, Sen Huang et al.
Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence particularly preference cycles has been largely unaddressed. To address this gap, this work introduces an end to end framework to systematically detect and resolve these inconsistencies within the reinforcement learning training loop. Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that eliminates cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback. The code for our method is available at https://github.com/modelscope/RM-Gallery.
CVAug 19, 2025Code
LENS: Learning to Segment Anything with Unified Reinforced ReasoningLianghui Zhu, Bin Ouyang, Yuxuan Zhang et al.
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM). Code is available at https://github.com/hustvl/LENS.
IVMar 12, 2025Code
Fair Federated Medical Image Classification Against Quality Shift via Inter-Client Progressive State MatchingNannan Wu, Zhuo Kuang, Zengqiang Yan et al.
Despite the potential of federated learning in medical applications, inconsistent imaging quality across institutions-stemming from lower-quality data from a minority of clients-biases federated models toward more common high-quality images. This raises significant fairness concerns. Existing fair federated learning methods have demonstrated some effectiveness in solving this problem by aligning a single 0th- or 1st-order state of convergence (e.g., training loss or sharpness). However, we argue in this work that fairness based on such a single state is still not an adequate surrogate for fairness during testing, as these single metrics fail to fully capture the convergence characteristics, making them suboptimal for guiding fair learning. To address this limitation, we develop a generalized framework. Specifically, we propose assessing convergence using multiple states, defined as sharpness or perturbed loss computed at varying search distances. Building on this comprehensive assessment, we propose promoting fairness for these states across clients to achieve our ultimate fairness objective. This is accomplished through the proposed method, FedISM+. In FedISM+, the search distance evolves over time, progressively focusing on different states. We then incorporate two components in local training and global aggregation to ensure cross-client fairness for each state. This gradually makes convergence equitable for all states, thereby improving fairness during testing. Our empirical evaluations, performed on the well-known RSNA ICH and ISIC 2019 datasets, demonstrate the superiority of FedISM+ over existing state-of-the-art methods for fair federated learning. The code is available at https://github.com/wnn2000/FFL4MIA.
LGJun 27, 2024Code
FedMLP: Federated Multi-Label Medical Image Classification under Task HeterogeneityZhaobin Sun, Nannan Wu, Junjie Shi et al.
Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level of medical knowledge and the prevalence of diseases, each institution may diagnose only partial categories, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate such a realistic label missing setting in the multi-label FL domain and propose a two-stage method FedMLP to combat class missing from two aspects: pseudo label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting missing class knowledge. Experiments on two publicly-available medical datasets validate the superiority of FedMLP against the state-of-the-art both federated semi-supervised and noisy label learning approaches under task heterogeneity. Code is available at https://github.com/szbonaldo/FedMLP.
CRFeb 20, 2021Code
Spotting Silent Buffer Overflows in Execution Trace through Graph Neural Network Assisted Data Flow AnalysisZhilong Wang, Li Yu, Suhang Wang et al.
A software vulnerability could be exploited without any visible symptoms. When no source code is available, although such silent program executions could cause very serious damage, the general problem of analyzing silent yet harmful executions is still an open problem. In this work, we propose a graph neural network (GNN) assisted data flow analysis method for spotting silent buffer overflows in execution traces. The new method combines a novel graph structure (denoted DFG+) beyond data-flow graphs, a tool to extract {\tt DFG+} from execution traces, and a modified Relational Graph Convolutional Network as the GNN model to be trained. The evaluation results show that a well-trained model can be used to analyze vulnerabilities in execution traces (of previously-unseen programs) without support of any source code. Our model achieves 94.39\% accuracy on the test data and successfully locates 29 out of 30 real-world silent buffer overflow vulnerabilities. Leveraging deep learning, the proposed method is, to our best knowledge, the first general-purpose analysis method for silent buffer overflows. It is also the first method to spot silent buffer overflows in global variables, stack variables, or heap variables without crossing the boundary of allocated chunks.
CVOct 8, 2023
Boosting Facial Action Unit Detection Through Jointly Learning Facial Landmark Detection and Domain Separation and ReconstructionZiqiao Shang, Li Yu
Recently how to introduce large amounts of unlabeled facial images in the wild into supervised Facial Action Unit (AU) detection frameworks has become a challenging problem. In this paper, we propose a new AU detection framework where multi-task learning is introduced to jointly learn AU domain separation and reconstruction and facial landmark detection by sharing the parameters of homostructural facial extraction modules. In addition, we propose a new feature alignment scheme based on contrastive learning by simple projectors and an improved contrastive loss, which adds four additional intermediate supervisors to promote the feature reconstruction process. Experimental results on two benchmarks demonstrate our superiority against the state-of-the-art methods for AU detection in the wild.
CVSep 11, 2024
Bridging Domain Gap of Point Cloud Representations via Self-Supervised Geometric AugmentationLi Yu, Hongchao Zhong, Longkun Zou et al.
Recent progress of semantic point clouds analysis is largely driven by synthetic data (e.g., the ModelNet and the ShapeNet), which are typically complete, well-aligned and noisy free. Therefore, representations of those ideal synthetic point clouds have limited variations in the geometric perspective and can gain good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain invariant geometric patterns from incomplete and noisy point clouds. To address such a problem, we introduce a novel scheme for induced geometric invariance of point cloud representations across domains, via regularizing representation learning with two self-supervised geometric augmentation tasks. On one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate centroid shift of point clouds due to occlusion and noises. On the other hand, we pioneer an integration of the relational self-supervised learning on geometrically-augmented point clouds in a cascade manner, utilizing the intrinsic relationship of augmented variants and other samples as extra constraints of cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, achieving the state-of-the-art performance.
CVApr 21, 2024
Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud ScenesKang You, Kai Liu, Li Yu et al.
Despite considerable progress being achieved in point cloud geometry compression, there still remains a challenge in effectively compressing large-scale scenes with sparse surfaces. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world application. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high-performance and extremely low-decoding-latency simultaneously. Inspired by conventional Trisoup codec, a point model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates with fairly fast speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90$\sim$160$\times$ faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9MB), which is attractive for industrial practitioners.
CVApr 27, 2024
Unsupervised Anomaly Detection via Masked Diffusion Posterior SamplingDi Wu, Shicai Fan, Xue Zhou et al.
Reconstruction-based methods have been commonly used for unsupervised anomaly detection, in which a normal image is reconstructed and compared with the given test image to detect and locate anomalies. Recently, diffusion models have shown promising applications for anomaly detection due to their powerful generative ability. However, these models lack strict mathematical support for normal image reconstruction and unexpectedly suffer from low reconstruction quality. To address these issues, this paper proposes a novel and highly-interpretable method named Masked Diffusion Posterior Sampling (MDPS). In MDPS, the problem of normal image reconstruction is mathematically modeled as multiple diffusion posterior sampling for normal images based on the devised masked noisy observation model and the diffusion-based normal image prior under Bayesian framework. Using a metric designed from pixel-level and perceptual-level perspectives, MDPS can effectively compute the difference map between each normal posterior sample and the given test image. Anomaly scores are obtained by averaging all difference maps for multiple posterior samples. Exhaustive experiments on MVTec and BTAD datasets demonstrate that MDPS can achieve state-of-the-art performance in normal image reconstruction quality as well as anomaly detection and localization.
IVFeb 8, 2024
Joint End-to-End Image Compression and Denoising: Leveraging Contrastive Learning and Multi-Scale Self-ONNsYuxin Xie, Li Yu, Farhad Pakdaman et al.
Noisy images are a challenge to image compression algorithms due to the inherent difficulty of compressing noise. As noise cannot easily be discerned from image details, such as high-frequency signals, its presence leads to extra bits needed for compression. Since the emerging learned image compression paradigm enables end-to-end optimization of codecs, recent efforts were made to integrate denoising into the compression model, relying on clean image features to guide denoising. However, these methods exhibit suboptimal performance under high noise levels, lacking the capability to generalize across diverse noise types. In this paper, we propose a novel method integrating a multi-scale denoiser comprising of Self Organizing Operational Neural Networks, for joint image compression and denoising. We employ contrastive learning to boost the network ability to differentiate noise from high frequency signal components, by emphasizing the correlation between noisy and clean counterparts. Experimental results demonstrate the effectiveness of the proposed method both in rate-distortion performance, and codec speed, outperforming the current state-of-the-art.
IVFeb 5, 2024
Panoramic Image Inpainting With Gated Convolution And Contextual Reconstruction LossLi Yu, Yanjun Gao, Farhad Pakdaman et al.
Deep learning-based methods have demonstrated encouraging results in tackling the task of panoramic image inpainting. However, it is challenging for existing methods to distinguish valid pixels from invalid pixels and find suitable references for corrupted areas, thus leading to artifacts in the inpainted results. In response to these challenges, we propose a panoramic image inpainting framework that consists of a Face Generator, a Cube Generator, a side branch, and two discriminators. We use the Cubemap Projection (CMP) format as network input. The generator employs gated convolutions to distinguish valid pixels from invalid ones, while a side branch is designed utilizing contextual reconstruction (CR) loss to guide the generators to find the most suitable reference patch for inpainting the missing region. The proposed method is compared with state-of-the-art (SOTA) methods on SUN360 Street View dataset in terms of PSNR and SSIM. Experimental results and ablation study demonstrate that the proposed method outperforms SOTA both quantitatively and qualitatively.
CVDec 13, 2024
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object DetectionQibo Chen, Weizhong Jin, Jianyue Ge et al.
Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.
BMApr 22, 2024
ControlMol: Adding Substructure Control To Molecule Diffusion ModelsQi Zhengyang, Liu Zijing, Zhang Jiying et al.
Due to the vast design space of molecules, generating molecules conditioned on a specific sub-structure relevant to a particular function or therapeutic target is a crucial task in computer-aided drug design. Existing works mainly focus on specific tasks, such as linker design or scaffold hopping, each task requires training a model from scratch, and many well-pretrained De Novo molecule generation model parameters are not effectively utilized. To this end, we propose a two-stage training approach, consisting of condition learning and condition optimization. In the condition learning stage, we adopt the idea of ControlNet and design some meaningful adjustments to make the unconditional generative model learn sub-structure conditioned generation. In the condition optimization stage, by using human preference learning, we further enhance the stability and robustness of sub-structure control. In our experiments, only trained on randomly partitioned sub-structure data, the proposed method outperforms previous techniques by generating more valid and diverse molecules. Our method is easy to implement and can be quickly applied to various pre-trained molecule generation models.
CVApr 19, 2025
Text-Audio-Visual-conditioned Diffusion Model for Video Saliency PredictionLi Yu, Xuanzhe Sun, Wei Zhou et al.
Video saliency prediction is crucial for downstream applications, such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide the gaze of viewers to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, we attempt to simultaneously analyze visual, auditory, and textual modalities in this paper, and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To effectively utilize text, a large multimodal model is used to generate textual descriptions for video frames and introduce a saliency-oriented image-text response (SITR) mechanism to generate image-text response maps. It is used as conditional information to guide the model to localize the visual regions that are semantically related to the textual description. Regarding the auditory modality, it is used as another conditional information for directing the model to focus on salient regions indicated by sounds. At the same time, since the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level. To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving 1.03\%, 2.35\%, 2.71\% and 0.33\% on SIM, CC, NSS and AUC-J metrics, respectively.
CVApr 16, 2025
DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality AssessmentLi Yu, Situo Wang, Wei Zhou et al.
Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework are proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis in ventral stream, as well as spatial relationship analysis in dorsal stream. However, CLIP is originally designed for images and lacks the ability to capture temporal and motion information inherent in videos. To address the limitation, this paper propose a Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components, and integrates them into different stages of the NR-VQA pipeline. Specifically, a Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream. Additionally, a Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling. On the ventral stream side, a Basic Visual Feature Extraction Module is employed to strengthen detail analysis. Finally, a text-guided adaptive fusion strategy is proposed to enable dynamic weighting of features, facilitating more effective integration of spatial and temporal information.
CVDec 17, 2024
Structural Pruning via Spatial-aware Information Redundancy for Semantic SegmentationDongyue Wu, Zilin Guo, Li Yu et al.
In recent years, semantic segmentation has flourished in various applications. However, the high computational cost remains a significant challenge that hinders its further adoption. The filter pruning method for structured network slimming offers a direct and effective solution for the reduction of segmentation networks. Nevertheless, we argue that most existing pruning methods, originally designed for image classification, overlook the fact that segmentation is a location-sensitive task, which consequently leads to their suboptimal performance when applied to segmentation networks. To address this issue, this paper proposes a novel approach, denoted as Spatial-aware Information Redundancy Filter Pruning~(SIRFP), which aims to reduce feature redundancy between channels. First, we formulate the pruning process as a maximum edge weight clique problem~(MEWCP) in graph theory, thereby minimizing the redundancy among the remaining features after pruning. Within this framework, we introduce a spatial-aware redundancy metric based on feature maps, thus endowing the pruning process with location sensitivity to better adapt to pruning segmentation networks. Additionally, based on the MEWCP, we propose a low computational complexity greedy strategy to solve this NP-hard problem, making it feasible and efficient for structured pruning. To validate the effectiveness of our method, we conducted extensive comparative experiments on various challenging datasets. The results demonstrate the superior performance of SIRFP for semantic segmentation tasks.
CVNov 18, 2024
Relevance-guided Audio Visual Fusion for Video Saliency PredictionLi Yu, Xuanzhe Sun, Pan Gao et al.
Audio data, often synchronized with video frames, plays a crucial role in guiding the audience's visual attention. Incorporating audio information into video saliency prediction tasks can enhance the prediction of human visual behavior. However, existing audio-visual saliency prediction methods often directly fuse audio and visual features, which ignore the possibility of inconsistency between the two modalities, such as when the audio serves as background music. To address this issue, we propose a novel relevance-guided audio-visual saliency prediction network dubbed AVRSP. Specifically, the Relevance-guided Audio-Visual feature Fusion module (RAVF) dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements, thereby refining the integration process with visual features. Furthermore, the Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales. The Multi-scale Regulator Gate (MRG) could transfer crucial fusion information to visual features, thus optimizing the utilization of multi-scale visual features. Extensive experiments on six audio-visual eye movement datasets have demonstrated that our AVRSP network achieves competitive performance in audio-visual saliency prediction.
CVNov 11, 2024
High-Frequency Enhanced Hybrid Neural Representation for Video CompressionLi Yu, Zhihui Li, Jimin Xiao et al.
Neural Representations for Videos (NeRV) have simplified the video codec process and achieved swift decoding speeds by encoding video content into a neural network, presenting a promising solution for video compression. However, existing work overlooks the crucial issue that videos reconstructed by these methods lack high-frequency details. To address this problem, this paper introduces a High-Frequency Enhanced Hybrid Neural Representation Network. Our method focuses on leveraging high-frequency information to improve the synthesis of fine details by the network. Specifically, we design a wavelet high-frequency encoder that incorporates Wavelet Frequency Decomposer (WFD) blocks to generate high-frequency feature embeddings. Next, we design the High-Frequency Feature Modulation (HFM) block, which leverages the extracted high-frequency embeddings to enhance the fitting process of the decoder. Finally, with the refined Harmonic decoder block and a Dynamic Weighted Frequency Loss, we further reduce the potential loss of high-frequency information. Experiments on the Bunny and UVG datasets demonstrate that our method outperforms other methods, showing notable improvements in detail preservation and compression performance.
CVDec 14, 2023
Exploration of visual prompt in Grounded pre-trained open-set detectionQibo Chen, Weizhong Jin, Shuchang Li et al.
Text prompts are crucial for generalizing pre-trained open-set object detection models to new categories. However, current methods for text prompts are limited as they require manual feedback when generalizing to new categories, which restricts their ability to model complex scenes, often leading to incorrect detection results. To address this limitation, we propose a novel visual prompt method that learns new category knowledge from a few labeled images, which generalizes the pre-trained detection model to the new category. To allow visual prompts to represent new categories adequately, we propose a statistical-based prompt construction module that is not limited by predefined vocabulary lengths, thus allowing more vectors to be used when representing categories. We further utilize the category dictionaries in the pre-training dataset to design task-specific similarity dictionaries, which make visual prompts more discriminative. We evaluate the method on the ODinW dataset and show that it outperforms existing prompt learning methods and performs more consistently in combinatorial inference.
CVNov 20, 2025
Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed VideoLi Yu, Yingbo Zhao, Shiyu Wu et al.
Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
IRSep 3, 2025
RankGraph: Unified Heterogeneous Graph Learning for Cross-Domain RecommendationRenzhi Wu, Junjie Yang, Li Chen et al.
Cross-domain recommendation systems face the challenge of integrating fine-grained user and item relationships across various product domains. To address this, we introduce RankGraph, a scalable graph learning framework designed to serve as a core component in recommendation foundation models (FMs). By constructing and leveraging graphs composed of heterogeneous nodes and edges across multiple products, RankGraph enables the integration of complex relationships between users, posts, ads, and other entities. Our framework employs a GPU-accelerated Graph Neural Network and contrastive learning, allowing for dynamic extraction of subgraphs such as item-item and user-user graphs to support similarity-based retrieval and real-time clustering. Furthermore, RankGraph integrates graph-based pretrained representations as contextual tokens into FM sequence models, enriching them with structured relational knowledge. RankGraph has demonstrated improvements in click (+0.92%) and conversion rates (+2.82%) in online A/B tests, showcasing its effectiveness in cross-domain recommendation scenarios.
AIJul 26, 2025
Digital Twin Channel-Enabled Online Resource Allocation for 6G: Principle, Architecture and ApplicationTongjie Li, Jianhua Zhang, Li Yu et al.
Emerging applications such as holographic communication, autonomous driving, and the industrial Internet of Things impose stringent requirements on flexible, low-latency, and reliable resource allocation in 6G networks. Conventional methods, which rely on statistical modeling, have proven effective in general contexts but may fail to achieve optimal performance in specific and dynamic environments. Furthermore, acquiring real-time channel state information (CSI) typically requires excessive pilot overhead. To address these challenges, a digital twin channel (DTC)-enabled online optimization framework is proposed, in which DTC is employed to predict CSI based on environmental sensing. The predicted CSI is then utilized by lightweight game-theoretic algorithms to perform online resource allocation in a timely and efficient manner. Simulation results based on a digital replica of a realistic industrial workshop demonstrate that the proposed method achieves throughput improvements of up to 11.5\% compared with pilot-based ideal CSI schemes, validating its effectiveness for scalable, low-overhead, and environment-aware communication in future 6G networks.