CVNov 3, 2023Code
Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image FusionXilai Li, Xiaosong Li, Tao Ye et al.
Multi-modal image fusion (MMIF) integrates valuable information from different modality images into a fused one. However, the fusion of multiple visible images with different focal regions and infrared images is a unprecedented challenge in real MMIF applications. This is because of the limited depth of the focus of visible optical lenses, which impedes the simultaneous capture of the focal information within the same scene. To address this issue, in this paper, we propose a MMIF framework for joint focused integration and modalities information extraction. Specifically, a semi-sparsity-based smoothing filter is introduced to decompose the images into structure and texture components. Subsequently, a novel multi-scale operator is proposed to fuse the texture components, capable of detecting significant information by considering the pixel focus attributes and relevant data from various modal images. Additionally, to achieve an effective capture of scene luminance and reasonable contrast maintenance, we consider the distribution of energy information in the structural components in terms of multi-directional frequency variance and information entropy. Extensive experiments on existing MMIF datasets, as well as the object detection and depth estimation tasks, consistently demonstrate that the proposed algorithm can surpass the state-of-the-art methods in visual perception and quantitative evaluation. The code is available at https://github.com/ixilai/MFIF-MMIF.
IVMay 15Code
Degradation-Aware Blur-Segmentation of Brain TumorYuchun Wang, Xiaosong Li, Gefei Liang et al.
Multimodal 3D MRI brain tumor segmentation is a pivotal step in radiotherapy target delineation, surgical planning and post-treatment assessment. Existing methods often assume artifact-free MRI images. However, inevitable patient motion during scanning introduces artifacts and blur that degrade boundary and texture features, leading to poor segmentation performance. To bridge this gap, we introduce Degradation-Aware Blur-Segmentation Net (DABSeg), a synchronous deblurring 3D multimodal MRI segmentation network that unifies blur removal and accurate segmentation. Specifically, we propose a feature-domain motion-deblurring stem to compensate for blur and rebalance intensity. Concurrently, the backbone network embeds a blur-aware cross-modal cross-attention module and multi-scale residual aggregation to yield effective modality complementarity. Notably, we optimize a joint loss that combines weighted Dice with a clear-reference reconstruction term, where imbalanced weights are applied to small targets to boost learning intensity and predictive stability for small lesions and border regions. Systematic comparisons and ablation experiments on the BraTS2020 dataset under both clear and degenerative conditions consistently demonstrate that DABSeg surpasses state-of-the-art methods in tumor Dice score and boundary precision. These results validate the effectiveness of degenerative-aware cross-task collaborative learning in improving the robustness and clinical utility of multi-modal 3D brain tumor segmentation under realistic degenerative conditions. The source code is available at https://github.com/YuchunWang24/DABSeg_ICPR
CLMar 20
DLLM Agent: See Farther, Run FasterHuiling Zhen, Weizhe Lin, Renxi Liu et al.
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.
CVMar 3Code
CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restorationHuichun Liu, Xiaosong Li, Zhuangfan Huang et al.
Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) to enhance degraded visible features and extracts global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real-world adverse weather perception. The source code will be available at https://github.com/Feecuin/CAWM-Mamba.
IVApr 26, 2024Code
Simultaneous Tri-Modal Medical Image Fusion and Super-Resolution using Conditional Diffusion ModelYushen Xu, Xiaosong Li, Yuchan Jie et al.
In clinical practice, tri-modal medical image fusion, compared to the existing dual-modal technique, can provide a more comprehensive view of the lesions, aiding physicians in evaluating the disease's shape, location, and biological activity. However, due to the limitations of imaging equipment and considerations for patient safety, the quality of medical images is usually limited, leading to sub-optimal fusion performance, and affecting the depth of image analysis by the physician. Thus, there is an urgent need for a technology that can both enhance image resolution and integrate multi-modal information. Although current image processing methods can effectively address image fusion and super-resolution individually, solving both problems synchronously remains extremely challenging. In this paper, we propose TFS-Diff, a simultaneously realize tri-modal medical image fusion and super-resolution model. Specially, TFS-Diff is based on the diffusion model generation of a random iterative denoising process. We also develop a simple objective function and the proposed fusion super-resolution loss, effectively evaluates the uncertainty in the fusion and ensures the stability of the optimization process. And the channel attention module is proposed to effectively integrate key information from different modalities for clinical diagnosis, avoiding information loss caused by multiple image processing. Extensive experiments on public Harvard datasets show that TFS-Diff significantly surpass the existing state-of-the-art methods in both quantitative and visual evaluations. Code is available at https://github.com/XylonXu01/TFS-Diff.
CVSep 11, 2025Code
FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolutionYuchan Jie, Yushen Xu, Xiaosong Li et al.
As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network istrained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
CVFeb 3, 2024Code
All-weather Multi-Modality Image Fusion: Unified Framework and 100k BenchmarkXilai Li, Wuyang Liu, Xiaosong Li et al.
Multi-modality image fusion (MMIF) combines complementary information from different image modalities to provide a more comprehensive and objective interpretation of scenes. However, existing MMIF methods lack the ability to resist different weather interferences in real-world scenes, preventing them from being useful in practical applications such as autonomous driving. To bridge this research gap, we proposed an all-weather MMIF model. Achieving effective multi-tasking in this context is particularly challenging due to the complex and diverse nature of weather conditions. A key obstacle lies in the 'black box' nature of current deep learning architectures, which restricts their multi-tasking capabilities. To overcome this, we decompose the network into two modules: a fusion module and a restoration module. For the fusion module, we introduce a learnable low-rank representation model to decompose images into low-rank and sparse components. This interpretable feature separation allows us to better observe and understand images. For the restoration module, we propose a physically-aware clear feature prediction module based on an atmospheric scattering model that can deduce variations in light transmittance from both scene illumination and reflectance. We also construct a large-scale multi-modality dataset with 100,000 image pairs across rain, haze, and snow conditions, covering various degradation levels and diverse scenes to thoroughly evaluate image fusion methods in adverse weather. Experimental results in both real-world and synthetic scenes show that the proposed algorithm excels in detail recovery and multi-modality feature extraction. The code is available at https://github.com/ixilai/AWFusion.
CVOct 14, 2024Code
Interaction-Guided Two-Branch Image Dehazing NetworkHuichun Liu, Xiaosong Li, Tianshu Tan
Image dehazing aims to restore clean images from hazy ones. Convolutional Neural Networks (CNNs) and Transformers have demonstrated exceptional performance in local and global feature extraction, respectively, and currently represent the two mainstream frameworks in image dehazing. In this paper, we propose a novel dual-branch image dehazing framework that guides CNN and Transformer components interactively. We reconsider the complementary characteristics of CNNs and Transformers by leveraging the differential relationships between global and local features for interactive guidance. This approach enables the capture of local feature positions through global attention maps, allowing the CNN to focus solely on feature information at effective positions. The single-branch Transformer design ensures the network's global information recovery capability. Extensive experiments demonstrate that our proposed method yields competitive qualitative and quantitative evaluation performance on both synthetic and real public datasets. Codes are available at https://github.com/Feecuin/Two-Branch-Dehazing
CLMar 27
AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM AgentsWenbo Gao, Renxi Liu, Xian Wang et al.
Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent's own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
CVFeb 3, 2024Code
UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion FrameworkXilai Li, Xiaosong Li, Tianshu Tan et al.
Infrared and visible image fusion has emerged as a prominent research area in computer vision. However, little attention has been paid to the fusion task in complex scenes, leading to sub-optimal results under interference. To fill this gap, we propose a unified framework for infrared and visible images fusion in complex scenes, termed UMCFuse. Specifically, we classify the pixels of visible images from the degree of scattering of light transmission, allowing us to separate fine details from overall intensity. Maintaining a balance between interference removal and detail preservation is essential for the generalization capacity of the proposed method. Therefore, we propose an adaptive denoising strategy for the fusion of detail layers. Meanwhile, we fuse the energy features from different modalities by analyzing them from multiple directions. Extensive fusion experiments on real and synthetic complex scenes datasets cover adverse weather conditions, noise, blur, overexposure, fire, as well as downstream tasks including semantic segmentation, object detection, salient object detection, and depth estimation, consistently indicate the superiority of the proposed method compared with the recent representative methods. Our code is available at https://github.com/ixilai/UMCFuse.
CLMay 27, 2025Code
Pangu Pro MoE: Mixture of Grouped Experts for Efficient SparsityYehui Tang, Xiaosong Li, Fangcheng Liu et al.
The surgence of Mixture of Experts (MoE) in Large Language Models promises a small price of execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and balances the expert workload better than MoE in nature. It constrains tokens to activate an equal number of experts within each predefined expert group. When a model execution is distributed on multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
CVApr 2Code
MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse InteractionXilai Li, Weijun Jiang, Xiaosong Li et al.
Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16\,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
CVApr 2Code
FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal PerturbationXilai Li, Chusheng Fang, Xiaosong Li
Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
DCJan 5
EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On AscendFan Bai, Pai Peng, Zhengzhi Tang et al.
With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that tightly couple the Encode, Prefill, and Decode stages on homogeneous hardware, neglecting the heterogeneous computational characteristics of each stage. This design leads to inefficient resource utilization and limited system throughput. To address these issues, we propose EPD-Serve, a stage-level disaggregated inference serving system for multimodal models. EPD-Serve decouples the inference pipeline into independent Encode, Prefill, and Decode stages, enabling logical isolation and flexible co-located deployment through dynamic orchestration. Leveraging the Ascend interconnect topology, EPD-Serve introduces asynchronous feature prefetching between Encode and Prefill stages and a hierarchical grouped KV cache transmission mechanism between Prefill and Decode stages to improve cross-node communication efficiency. In addition, EPD-Serve incorporates multi-route scheduling, instance-level load balancing, and multi-stage hardware co-location with spatial multiplexing to better support diverse multimodal workloads. Comprehensive experiments on multimodal understanding models demonstrate that, under high-concurrency scenarios, EPD-Serve improves end-to-end throughput by 57.37-69.48% compared to PD-disaggregated deployment, while satisfying strict SLO constraints, including TTFT below 2000 ms and TPOT below 50 ms. These results highlight the effectiveness of stage-level disaggregation for optimizing multimodal large model inference systems.
CVOct 28, 2025Code
A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene DatasetZhuangfan Huang, Xiaosong Li, Gao Wang et al.
Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, which has important applications in camouflage recognition, tissue pathology analysis, surface defect detection and other fields. To intergrate coL-Splementary information from different polarized images in complex luminance environment, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we propose a multi-scale spatial weight matrix through a brightness-branch , which dynamically weighted inject the luminance into the feature maps, solving the problem of inherent contrast difference in polarized images. The global-local feature fusion mechanism is designed at the bottleneck layer to perform windowed self-attention computation, to balance the global context and local details through residual linking in the feature dimension restructuring stage. In the decoder stage, to further improve the adaptability to complex lighting, we propose a Brightness-Enhancement module, establishing the mapping relationship between luminance distribution and texture features, realizing the nonlinear luminance correction of the fusion result. We also present MSP, an 1000 pairs of polarized images that covers 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, solving the scarcity of high-quality datasets in polarization image fusion. Extensive experiment on MSP, PIF and GAND datasets verify that the proposed MLSN outperms the state-of-the-art methods in subjective and objective evaluations, and the MS-SSIM and SD metircs are higher than the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset is avalable at https://github.com/1hzf/MLS-UNet.
CVSep 26, 2025Code
Gradient-based multi-focus image fusion with focus-aware saliency enhancementHaoyu Li, XiaoSong Li
Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications cover sur-veillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and focused details loss. To solve this problem, we propose a MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus in-formation. Particularly, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively pre-serve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the ini-tial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary infor-mation across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. We have realized codes in https://github.com/Lihyua/GICI
CVSep 17, 2025Code
NDLPNet: A Location-Aware Nighttime Deraining Network and a Real-World Benchmark DatasetHuichun Liu, Xiaosong Li, Yang Liu et al.
Visual degradation caused by rain streak artifacts in low-light conditions significantly hampers the performance of nighttime surveillance and autonomous navigation. Existing image deraining techniques are primarily designed for daytime conditions and perform poorly under nighttime illumination due to the spatial heterogeneity of rain distribution and the impact of light-dependent stripe visibility. In this paper, we propose a novel Nighttime Deraining Location-enhanced Perceptual Network(NDLPNet) that effectively captures the spatial positional information and density distribution of rain streaks in low-light environments. Specifically, we introduce a Position Perception Module (PPM) to capture and leverage spatial contextual information from input data, enhancing the model's capability to identify and recalibrate the importance of different feature channels. The proposed nighttime deraining network can effectively remove the rain streaks as well as preserve the crucial background information. Furthermore, We construct a night scene rainy (NSR) dataset comprising 900 image pairs, all based on real-world nighttime scenes, providing a new benchmark for nighttime deraining task research. Extensive qualitative and quantitative experimental evaluations on both existing datasets and the NSR dataset consistently demonstrate our method outperform the state-of-the-art (SOTA) methods in nighttime deraining tasks. The source code and dataset is available at https://github.com/Feecuin/NDLPNet.
CVAug 23, 2025Code
AWM-Fuse: Multi-Modality Image Fusion for Adverse Weather via Global and Local Text PerceptionXilai Li, Huichun Liu, Xiaosong Li et al.
Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although less studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels, thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse.
CLJun 16, 2025Code
EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware OptimizationZhongqian Fu, Ning Ding, Kai Han et al.
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning by efficiently distributing computation and enhancing performance. However, their unique architecture-characterized by sparse expert activation and dynamic routing mechanisms-introduces inherent complexities that challenge conventional quantization techniques. Existing post-training quantization (PTQ) methods struggle to address activation outliers, router consistency and sparse expert calibration, leading to significant performance degradation. To bridge this gap, we propose EAQuant, a novel PTQ framework tailored for MoE architectures. Our method systematically tackles these challenges through three key innovations: (1) expert-aware smoothing aggregation to suppress activation outliers and stabilize quantization, (2) router logits distribution alignment to preserve expert selection consistency post-quantization, and (3) expert-level calibration data balance to optimize sparsely activated experts. Extensive experiments across W4A4 and extreme W3A4 quantization configurations demonstrate that EAQuant significantly outperforms existing methods, achieving average score improvements of 1.15 - 2.28% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.
CVMar 11, 2025Code
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language ModelsBozhi Luan, Wengang Zhou, Hao Feng et al.
As the computational needs of Large Vision-Language Models (LVLMs) increase, visual token pruning has proven effective in improving inference speed and memory efficiency. Traditional pruning methods in LVLMs predominantly focus on attention scores to determine token relevance, overlooking critical aspects such as spatial position and token similarity. To this end, we introduce AdaptPrune, a novel plug-and-play training-free pruning method that builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our method is based on several observed phenomena in large models: the positional bias in the model's image attention and the redundancy of token information ignored by previous approaches. By integrating attention, spatial, and similarity information, our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions. Our method has been extensively tested across various LVLMs and benchmarks, confirming its robustness and adaptability. The results demonstrate that AdaptPrune consistently outperforms existing methods across various pruning ratios. Code is available at https://github.com/bzluan/AdaptPrune.
CVJan 16, 2024Code
SAMF: Small-Area-Aware Multi-focus Image Fusion for Object DetectionXilai Li, Xiaosong Li, Haishu Tan et al.
Existing multi-focus image fusion (MFIF) methods often fail to preserve the uncertain transition region and detect small focus areas within large defocused regions accurately. To address this issue, this study proposes a new small-area-aware MFIF algorithm for enhancing object detection capability. First, we enhance the pixel attributes within the small focus and boundary regions, which are subsequently combined with visual saliency detection to obtain the pre-fusion results used to discriminate the distribution of focused pixels. To accurately ensure pixel focus, we consider the source image as a combination of focused, defocused, and uncertain regions and propose a three-region segmentation strategy. Finally, we design an effective pixel selection rule to generate segmentation decision maps and obtain the final fusion results. Experiments demonstrated that the proposed method can accurately detect small and smooth focus areas while improving object detection performance, outperforming existing methods in both subjective and objective evaluations. The source code is available at https://github.com/ixilai/SAMF.
CVDec 23, 2025
JDPNet: A Network Based on Joint Degradation Processing for Underwater Image EnhancementTao Ye, Hongbin Ren, Chongbing Zhang et al.
Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network, that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.
AIApr 23
SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data SynthesisQiang Gao, Zhenping Li, Anqi Zhuo et al.
Existing text-to-SQL synthesis pipelines still conflate executability with semantic validity: syntactic checks and execution-based validation can retain queries that execute successfully while violating database semantics. To address these limitations, we propose SemanticAgent, a semantic-aware synthesis framework. SemanticAgent organizes synthesis around three specialized modules: an analyzer, a synthesizer, and a verifier. Through a three-stage protocol of semantic analysis, stepwise synthesis, and diagnostic refinement, SemanticAgent transforms execution-based validation alone into a traceable reasoning process. Our framework generates synthetic data that consistently outperforms prior synthesis methods under semantic-quality evaluation, leading to stronger downstream fine-tuning performance, especially on semantically demanding benchmarks.
CVFeb 2, 2024
TSJNet: A Multi-modality Target and Semantic Awareness Joint-driven Image Fusion NetworkYuchan Jie, Yushen Xu, Xiaosong Li et al.
Multi-modality image fusion involves integrating complementary information from different modalities into a single image. Current methods primarily focus on enhancing image fusion with a single advanced task such as incorporating semantic or object-related information into the fusion process. This method creates challenges in achieving multiple objectives simultaneously. We introduce a target and semantic awareness joint-driven fusion network called TSJNet. TSJNet comprises fusion, detection, and segmentation subnetworks arranged in a series structure. It leverages object and semantically relevant information derived from dual high-level tasks to guide the fusion network. Additionally, We propose a local significant feature extraction module with a double parallel branch structure to fully capture the fine-grained features of cross-modal images and foster interaction among modalities, targets, and segmentation information. We conducted extensive experiments on four publicly available datasets (MSRS, M3FD, RoadScene, and LLVIP). The results demonstrate that TSJNet can generate visually pleasing fused results, achieving an average increase of 2.84% and 7.47% in object detection and segmentation mAP @0.5 and mIoU, respectively, compared to the state-of-the-art methods.
LGMay 22, 2025
TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time ScalingWeizhe Lin, Xing Li, Zhiyuan Yang et al.
Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key inefficiency source is LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four MATH500, AIME24, AIME25, and GPQA benchmarks, the reasoning runtime of Pangu Pro MoE, Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.
CVApr 27, 2024
MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image FusionJingxue Huang, Xilai Li, Tianshu Tan et al.
Multi-modal image fusion (MMIF) maps useful information from various modalities into the same representation space, thereby producing an informative fused image. However, the existing fusion algorithms tend to symmetrically fuse the multi-modal images, causing the loss of shallow information or bias towards a single modality in certain regions of the fusion results. In this study, we analyzed the spatial distribution differences of information in different modalities and proved that encoding features within the same network is not conducive to achieving simultaneous deep feature space alignment for multi-modal images. To overcome this issue, a Multi-Modal Asymmetric UNet (MMA-UNet) was proposed. We separately trained specialized feature encoders for different modal and implemented a cross-scale fusion strategy to maintain the features from different modalities within the same representation space, ensuring a balanced information fusion process. Furthermore, extensive fusion and downstream task experiments were conducted to demonstrate the efficiency of MMA-UNet in fusing infrared and visible image information, producing visually natural and semantically rich fusion results. Its performance surpasses that of the state-of-the-art comparison fusion methods.
CVNov 16, 2025
Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image FusionXilai Li, Xiaosong Li, Weijun Jiang
Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we proposed the Geometric Affine Modulation Module (GAM), which uses original modal features to apply affine transformations on initial fusion features to maintain the feature encoder modal discriminability. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.
CVSep 11, 2025
FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion modelYushen Xu, Xiaosong Li, Yuchun Wang et al.
Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can end-to-end process two-modal and tri-modal medical image fusion under the same weight. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compared the latest two and tri-modal medical image fusion methods, tested them on Harvard datasets, and evaluated them using nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers, and compared them with the perspective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.
CVAug 11, 2025
MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual TasksYushen Xu, Xiaosong Li, Zhenyu Kuang et al.
The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.
HCJan 12, 2018
A Quantitative Approach in Heuristic Evaluation of E-commerce WebsitesXiaosong Li, Ye Liu, Zizhou Fan et al.
This paper presents a pilot study on developing an instrument to predict the quality of e-commerce websites. The 8C model was adopted as the reference model of the heuristic evaluation. Each dimension of the 8C was mapped into a set of quantitative website elements, selected websites were scraped to get the quantitative website elements, and the score of each dimension was calculated. A software was developed in PHP for the experiments. In the training process, 10 experiments were conducted and quantitative analyses were regressively conducted between the experiments. The conversion rate was used to verify the heuristic evaluation of an e-commerce website after each experiment. The results showed that the mapping revisions between the experiments improved the performance of the evaluation instrument, therefore the experiment process and the quantitative mapping revision guideline proposed was on the right track. The software resulted from the experiment 10 can serve as the aimed e-commerce website evaluation instrument. The experiment results and the future work have been discussed.