Ziwei Qin

CL
h-index14
5papers
999citations
Novelty43%
AI Score51

5 Papers

39.4CVMay 25Code
Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?

Jun Li, Ziwei Qin

Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that the current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, since multiple benchmark datasets lack dedicated validation sets, some studies use the test set for validation as well, leading to inflated performance estimates. Subsequent methods, compelled to employ the same strategy to surpass reported SOTA, trigger an arms race of overfitting. This raises concerns that the impressive numerical gains in the community may reflect overfitting rather than genuine progress. Thus, we propose a tri-space calibrated segmentation framework founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results under multiple-run protocols, thereby establishing more rigorous benchmarks with a more realistic perspective. Code will be available: github.com/DirkLiii/TCSeg.

CVDec 23, 2025
MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

Ziwei Qin, Xuhui Song, Deqing Huang et al.

Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.

CLMay 24, 2023Code
ImageNetVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

Heming Xia, Qingxiu Dong, Lei Li et al.

Recently, Large Language Models (LLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current LLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a human-annotated dataset specifically designed for zero- and few-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we benchmark the fundamental visual commonsense knowledge of both unimodal LLMs and VaLMs. Furthermore, we analyze the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.

CLMay 24, 2023Code
Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion

Shaoxiang Wu, Damai Dai, Ziwei Qin et al.

Video multimodal fusion aims to integrate multimodal signals in videos, such as visual, audio and text, to make a complementary prediction with multiple modalities contents. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both visual and audio modalities. Prior denoising methods like forget gate are coarse in the granularity of noise filtering. They often suppress the redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restrained receptive field. On the other hand, we use a mutual information maximization module to regulate the filter-out module to preserve key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. It proves that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at https://github.com/WSXRHFG/DBF.

CLMay 15, 2021
Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues

Qingxiu Dong, Ziwei Qin, Heming Xia et al.

It is a common practice for recent works in vision language cross-modal reasoning to adopt a binary or multi-choice classification formulation taking as input a set of source image(s) and textual query. In this work, we take a sober look at such an unconditional formulation in the sense that no prior knowledge is specified with respect to the source image(s). Inspired by the designs of both visual commonsense reasoning and natural language inference tasks, we propose a new task termed Premise-based Multi-modal Reasoning(PMR) where a textual premise is the background presumption on each source image. The PMR dataset contains 15,360 manually annotated samples which are created by a multi-phase crowd-sourcing process. With selected high-quality movie screenshots and human-curated premise templates from 6 pre-defined categories, we ask crowd-source workers to write one true hypothesis and three distractors (4 choices) given the premise and image through a cross-check procedure. Besides, we generate adversarial samples to alleviate the annotation artifacts and double the size of PMR. We benchmark various state-of-the-art (pretrained) multi-modal inference models on PMR and conduct comprehensive experimental analyses to showcase the utility of our dataset.