Shengxuming Zhang

CV
h-index20
4papers
15citations
Novelty53%
AI Score36

4 Papers

CVJul 16, 2025Code
Dataset Ownership Verification for Pre-trained Masked Models

Yuechen Xie, Jie Song, Yicheng Shan et al.

High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches. Code is available at https://github.com/xieyc99/DOV4MM.

CVMar 3, 2025
Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models

Yi Wang, Mushui Liu, Wanggui He et al.

Unified generative models have shown remarkable performance in text and image generation. For image synthesis tasks, they adopt straightforward text-to-image (T2I) generation. However, direct T2I generation limits the models in handling complex compositional instructions, which frequently occur in real-world scenarios. Although this issue is vital, existing works mainly focus on improving the basic image generation capability of the models. While such improvements help to some extent, they still fail to adequately resolve the problem. Inspired by Chain of Thought (CoT) solving complex problems step by step, this work aims to introduce CoT into unified generative models to address the challenges of complex image generation that direct T2I generation cannot effectively solve, thereby endowing models with enhanced image generation ability. To achieve this, we first propose Functionality-oriented eXperts (FoXperts), an expert-parallel architecture in our model FoX, which assigns experts by function. FoXperts disentangles potential conflicts in mainstream modality-oriented designs and provides a solid foundation for CoT. When introducing CoT, the first question is how to design it for complex image generation. To this end, we emulate a human-like artistic workflow -- planning, acting, reflection, and correction -- and propose the Multimodal Chain of Thought (MCoT) approach, as the data involves both text and image. To address the subsequent challenge -- designing an effective MCoT training paradigm -- we develop a multi-task joint training scheme that equips the model with all capabilities required for each MCoT step in a disentangled manner. This paradigm avoids the difficulty of collecting consistent multi-step data tuples. Extensive experiments show that FoX consistently outperforms existing unified models on various T2I benchmarks, delivering notable improvements in complex image generation.

CVDec 14, 2024
SEW: Self-calibration Enhanced Whole Slide Pathology Image Analysis

Haoming Luo, Xiaotian Yu, Shengxuming Zhang et al.

Pathology images are considered the ``gold standard" for cancer diagnosis and treatment, with gigapixel images providing extensive tissue and cellular information. Existing methods fail to simultaneously extract global structural and local detail features for comprehensive pathology image analysis efficiently. To address these limitations, we propose a self-calibration enhanced framework for whole slide pathology image analysis, comprising three components: a global branch, a focus predictor, and a detailed branch. The global branch initially classifies using the pathological thumbnail, while the focus predictor identifies relevant regions for classification based on the last layer features of the global branch. The detailed extraction branch then assesses whether the magnified regions correspond to the lesion area. Finally, a feature consistency constraint between the global and detail branches ensures that the global branch focuses on the appropriate region and extracts sufficient discriminative features for final identification. These focused discriminative features prove invaluable for uncovering novel prognostic tumor markers from the perspective of feature cluster uniqueness and tissue spatial distribution. Extensive experiment results demonstrate that the proposed framework can rapidly deliver accurate and explainable results for pathological grading and prognosis tasks.

CVDec 12, 2024
Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Pathology Analysis

Shengxuming Zhang, Weihan Li, Tianhong Gao et al.

Pathological diagnosis is vital for determining disease characteristics, guiding treatment, and assessing prognosis, relying heavily on detailed, multi-scale analysis of high-resolution whole slide images (WSI). However, existing large vision-language models (LVLMs) are limited by input resolution constraints, hindering their efficiency and accuracy in pathology image analysis. To overcome these issues, we propose two innovative strategies: the mixed task-guided feature enhancement, which directs feature extraction toward lesion-related details across scales, and the prompt-guided detail feature completion, which integrates coarse- and fine-grained features from WSI based on specific prompts without compromising inference speed. Leveraging a comprehensive dataset of 490K samples from diverse pathology tasks, we trained the pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that this model significantly outperforms existing methods in diagnostic accuracy and efficiency, providing an interactive, clinically aligned approach for auxiliary diagnosis in a wide range of pathology applications.