Rongchang Zhao

CV
h-index: 2
5 papers
60 citations
Novelty: 58%
AI Score: 43

5 Papers

CV · Nov 10, 2025 · Code
Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

Jianyu Qi, Ding Zou, Wenrui Yan et al.

Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of DeepSeek-R1, researchers have extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) the lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization, and (2) suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address these gaps, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate that GRPO applied to difficulty-stratified samples consistently outperforms conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.

IV · Jul 2, 2025
SWinMamba: Serpentine Window State Space Model for Vascular Segmentation

Rongchang Zhao, Huanchi Liu, Jian Zhang

Vascular segmentation in medical images is crucial for disease diagnosis and surgical navigation. However, the segmented vascular structure is often discontinuous due to its slender nature and inadequate prior modeling. In this paper, we propose a novel Serpentine Window Mamba (SWinMamba) to achieve accurate vascular segmentation. The proposed SWinMamba innovatively models the continuity of slender vascular structures by incorporating serpentine window sequences into bidirectional state space models. The serpentine window sequences enable efficient feature capturing by adaptively guiding global visual context modeling to the vascular structure. Specifically, the Serpentine Window Tokenizer (SWToken) adaptively splits the input image using overlapping serpentine window sequences, enabling flexible receptive fields (RFs) for vascular structure modeling. The Bidirectional Aggregation Module (BAM) integrates coherent local features in the RFs for vascular continuity representation. In addition, dual-domain learning with Spatial-Frequency Fusion Unit (SFFU) is designed to enhance the feature representation of vascular structure. Extensive experiments on three challenging datasets demonstrate that the proposed SWinMamba achieves superior performance with complete and connected vessels.
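To make the idea of a serpentine sequence concrete, the sketch below generates a plain boustrophedon (snake) ordering over a grid of windows, so that consecutive tokens in the sequence are always spatially adjacent. This is only an illustration of the general scan pattern: the function name `serpentine_order` and the fixed grid are assumptions, and it is not the paper's SWToken, which the abstract describes as adaptive and overlapping.

```python
def serpentine_order(rows, cols):
    """Return (row, col) window indices in a snake-like scan order.

    Hypothetical helper for illustration only: even rows run left to
    right and odd rows run right to left, so neighbours in the output
    sequence are also neighbours on the grid.
    """
    order = []
    for r in range(rows):
        if r % 2 == 0:
            col_range = range(cols)               # left to right
        else:
            col_range = range(cols - 1, -1, -1)   # right to left
        order.extend((r, c) for c in col_range)
    return order


if __name__ == "__main__":
    for step, (r, c) in enumerate(serpentine_order(3, 4)):
        print(step, r, c)
```

Compared with a raster scan, which jumps from the end of one row to the start of the next, this ordering keeps every step within one window of the previous one, which is the continuity property a sequence model over slender structures would want to preserve.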

IV · Sep 26, 2019
A Refined Equilibrium Generative Adversarial Network for Retinal Vessel Segmentation

Yukun Zhou, Zailiang Chen, Hailan Shen et al.

Objective: Recognizing retinal vessel abnormalities is vital to the early diagnosis of ophthalmological diseases and cardiovascular events. However, segmentation results are highly influenced by elusive vessels, especially in low-contrast backgrounds and lesion regions. In this work, we present an end-to-end synthetic neural network, containing a symmetric equilibrium generative adversarial network (SEGAN), multi-scale features refine blocks (MSFRB), and an attention mechanism (AM), to enhance the performance of vessel segmentation. Method: The proposed network is granted powerful multi-scale representation capability to extract detail information. First, SEGAN constructs a symmetric adversarial architecture, which forces the generator to produce more realistic images with local details. Second, MSFRB are devised to prevent high-resolution features from being obscured, thereby merging multi-scale features better. Finally, the AM is employed to encourage the network to concentrate on discriminative features. Results: On the public datasets DRIVE, STARE, CHASEDB1, and HRF, we evaluate our network quantitatively and compare it with state-of-the-art works. The ablation experiment shows that SEGAN, MSFRB, and AM all contribute to the desirable performance. Conclusion: The proposed network outperforms the mature methods and functions effectively in elusive vessel segmentation, achieving the highest scores in Sensitivity, G-Mean, Precision, and F1-Score while maintaining the top level in other metrics. Significance: The appreciable performance and computational efficiency offer great potential in clinical retinal vessel segmentation applications. Meanwhile, the network could be utilized to extract detail information in other biomedical issues.
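For readers unfamiliar with the metrics named in the conclusion, the following minimal sketch computes them from pixel-level confusion-matrix counts. It assumes the definitions most commonly used in vessel segmentation work (in particular, G-Mean as the geometric mean of sensitivity and specificity); the abstract does not spell out its own definitions, and the function name and counts are illustrative.

```python
import math


def segmentation_metrics(tp, fp, tn, fn):
    """Binary segmentation metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both vessel and
    background pixels are present and at least one pixel is
    predicted as vessel.
    """
    sensitivity = tp / (tp + fn)   # recall on vessel pixels
    specificity = tn / (tn + fp)   # recall on background pixels
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(segmentation_metrics(tp=80, fp=20, tn=880, fn=20))
```

Because vessel pixels are a small minority of a retinal image, accuracy alone can look high even when thin vessels are missed, which is why sensitivity, G-Mean, and F1 are typically reported alongside it.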

CV · Sep 11, 2019
Dual-attention Focused Module for Weakly Supervised Object Localization

Yukun Zhou, Zailiang Chen, Hailan Shen et al.

Research on recognizing the most discriminative regions provides referential information for weakly supervised object localization with only image-level annotations. However, the most discriminative regions usually conceal the other parts of the object, thereby impeding recognition and localization of the entire object. To tackle this problem, the Dual-attention Focused Module (DFM) is proposed to enhance object localization performance. Specifically, we present a dual attention module for information fusion, consisting of a position branch and a channel branch. In each branch, the input feature map is deduced into an enhancement map and a mask map, thereby highlighting the most discriminative parts or hiding them. For the position mask map, we introduce a focused matrix to enhance it, which utilizes the principle that the pixels of an object are continuous. Between these two branches, the enhancement map is integrated with the mask map, aiming to partially compensate for the lost information and diversify the features. With the dual-attention module and focused matrix, the entire object region can be precisely recognized with implicit information. We demonstrate the superior results of DFM in experiments. In particular, DFM achieves state-of-the-art localization accuracy on ILSVRC 2016 and CUB-200-2011.

MM · Dec 27, 2017
Robust and discriminative zero-watermark scheme based on invariant feature and similarity-based retrieval for protecting large-scale DIBR 3D videos

Xiyao Liu, Yifang Wang, Ziqiang Sun et al.

Digital rights management (DRM) of depth-image-based rendering (DIBR) 3D video is an emerging area of research. Existing schemes for DIBR 3D video cause video distortions, are vulnerable to severe signal and geometric attacks, cannot protect the 2D frame and depth map independently, or can hardly deal with large-scale videos. To address these issues, a novel zero-watermark scheme based on invariant feature and similarity-based retrieval for protecting DIBR 3D video (RZW-SR3D) is proposed in this study. In RZW-SR3D, invariant features are extracted to generate master and ownership shares for providing distortion-free, robust and discriminative copyright identification under various attacks. Unlike traditional zero-watermark schemes, features and ownership shares are stored correlatively, and a similarity-based retrieval phase is designed to provide effective solutions for large-scale videos. In addition, flexible mechanisms based on attention-based fusion are designed to protect the 2D frame and depth map independently and simultaneously. Experimental results demonstrate that RZW-SR3D achieves better DRM performance than existing schemes. First, RZW-SR3D can extract the ownership shares relevant to a particular 3D video precisely and reliably for effective copyright identification of large-scale videos. Second, RZW-SR3D ensures lossless, precise, reliable and flexible copyright identification for the 2D frame and depth map of 3D videos.