CVJun 21, 2022
Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer GroundingJunwen Pan, Guanlin Chen, Yi Liu et al.
Answer grounding aims to reveal the visual evidence for visual question answering (VQA), which entails highlighting relevant positions in the image when answering questions about images. Previous attempts typically tackle this problem using pretrained object detectors, but without the flexibility for objects not in the predefined vocabulary. However, these black-box methods solely concentrate on the linguistic generation, ignoring the visual interpretability. In this paper, we propose Dual Visual-Linguistic Interaction (DaVI), a novel unified end-to-end framework with the capability for both linguistic answering and visual grounding. DaVI innovatively introduces two visual-linguistic interaction mechanisms: 1) visual-based linguistic encoder that understands questions incorporated with visual features and produces linguistic-oriented evidence for further answer decoding, and 2) linguistic-based visual decoder that focuses visual features on the evidence-related regions for answer grounding. This way, our approach ranked the 1st place in the answer grounding track of 2022 VizWiz Grand Challenge.
CVAug 13, 2023
Unsupervised Adaptation of Polyp Segmentation Models via Coarse-to-Fine Self-SupervisionJiexiang Wang, Chaoqi Chen
Unsupervised Domain Adaptation~(UDA) has attracted a surge of interest over the past decade but is difficult to be used in real-world applications. Considering the privacy-preservation issues and security concerns, in this work, we study a practical problem of Source-Free Domain Adaptation (SFDA), which eliminates the reliance on annotated source data. Current SFDA methods focus on extracting domain knowledge from the source-trained model but neglects the intrinsic structure of the target domain. Moreover, they typically utilize pseudo labels for self-training in the target domain, but suffer from the notorious error accumulation problem. To address these issues, we propose a new SFDA framework, called Region-to-Pixel Adaptation Network~(RPANet), which learns the region-level and pixel-level discriminative representations through coarse-to-fine self-supervision. The proposed RPANet consists of two modules, Foreground-aware Contrastive Learning (FCL) and Confidence-Calibrated Pseudo-Labeling (CCPL), which explicitly address the key challenges of ``how to distinguish'' and ``how to refine''. To be specific, FCL introduces a supervised contrastive learning paradigm in the region level to contrast different region centroids across different target images, which efficiently involves all pseudo labels while robust to noisy samples. CCPL designs a novel fusion strategy to reduce the overconfidence problem of pseudo labels by fusing two different target predictions without introducing any additional network modules. Extensive experiments on three cross-domain polyp segmentation tasks reveal that RPANet significantly outperforms state-of-the-art SFDA and UDA methods without access to source data, revealing the potential of SFDA in medical applications.
CVDec 5, 2023
Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image RestorationYuang Ai, Huaibo Huang, Xiaoqiang Zhou et al.
Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.
CLMay 23, 2025
LeTS: Learning to Think-and-Search via Process-and-Outcome Reward HybridizationQi Zhang, Shouqing Yang, Lirong Gao et al.
Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs' reasoning ability via RL under other scenarios. The code will be released soon.
IVJul 18, 2020
Multi-Task Neural Networks with Spatial Activation for Retinal Vessel Segmentation and Artery/Vein ClassificationWenao Ma, Shuang Yu, Kai Ma et al.
Retinal artery/vein (A/V) classification plays a critical role in the clinical biomarker study of how various systemic and cardiovascular diseases affect the retinal vessels. Conventional methods of automated A/V classification are generally complicated and heavily depend on the accurate vessel segmentation. In this paper, we propose a multi-task deep neural network with spatial activation mechanism that is able to segment full retinal vessel, artery and vein simultaneously, without the pre-requirement of vessel segmentation. The input module of the network integrates the domain knowledge of widely used retinal preprocessing and vessel enhancement techniques. We specially customize the output block of the network with a spatial activation mechanism, which takes advantage of a relatively easier task of vessel segmentation and exploits it to boost the performance of A/V classification. In addition, deep supervision is introduced to the network to assist the low level layers to extract more semantic information. The proposed network achieves pixel-wise accuracy of 95.70% for vessel segmentation, and A/V classification accuracy of 94.50%, which is the state-of-the-art performance for both tasks on the AV-DRIVE dataset. Furthermore, we have also tested the model performance on INSPIRE-AVR dataset, which achieves a skeletal A/V classification accuracy of 91.6%.
IVJun 22, 2020
Cardiac Segmentation on Late Gadolinium Enhancement MRI: A Benchmark Study from Multi-Sequence Cardiac MR Segmentation ChallengeXiahai Zhuang, Jiahang Xu, Xinzhe Luo et al.
Accurate computing, analysis and modeling of the ventricles and myocardium from medical images are important, especially in the diagnosis and treatment management for patients suffering from myocardial infarction (MI). Late gadolinium enhancement (LGE) cardiac magnetic resonance (CMR) provides an important protocol to visualize MI. However, automated segmentation of LGE CMR is still challenging, due to the indistinguishable boundaries, heterogeneous intensity distribution and complex enhancement patterns of pathological myocardium from LGE CMR. Furthermore, compared with the other sequences LGE CMR images with gold standard labels are particularly limited, which represents another obstacle for developing novel algorithms for automatic segmentation of LGE CMR. This paper presents the selective results from the Multi-Sequence Cardiac MR (MS-CMR) Segmentation challenge, in conjunction with MICCAI 2019. The challenge offered a data set of paired MS-CMR images, including auxiliary CMR sequences as well as LGE CMR, from 45 patients who underwent cardiomyopathy. It was aimed to develop new algorithms, as well as benchmark existing ones for LGE CMR segmentation and compare them objectively. In addition, the paired MS-CMR images could enable algorithms to combine the complementary information from the other sequences for the segmentation of LGE CMR. Nine representative works were selected for evaluation and comparisons, among which three methods are unsupervised methods and the other six are supervised. The results showed that the average performance of the nine methods was comparable to the inter-observer variations. The success of these methods was mainly attributed to the inclusion of the auxiliary sequences from the MS-CMR images, which provide important label information for the training of deep neural networks.
IVOct 28, 2019
Multi-sequence Cardiac MR Segmentation with Adversarial Domain Adaptation NetworkJiexiang Wang, Hongyu Huang, Chaoqi Chen et al.
Automatic and accurate segmentation of the ventricles and myocardium from multi-sequence cardiac MRI (CMR) is crucial for the diagnosis and treatment management for patients suffering from myocardial infarction (MI). However, due to the existence of domain shift among different modalities of datasets, the performance of deep neural networks drops significantly when the training and testing datasets are distinct. In this paper, we propose an unsupervised domain alignment method to explicitly alleviate the domain shifts among different modalities of CMR sequences, \emph{e.g.,} bSSFP, LGE, and T2-weighted. Our segmentation network is attention U-Net with pyramid pooling module, where multi-level feature space and output space adversarial learning are proposed to transfer discriminative domain knowledge across different datasets. Moreover, we further introduce a group-wise feature recalibration module to enforce the fine-grained semantic-level feature alignment that matching features from different networks but with the same class label. We evaluate our method on the multi-sequence cardiac MR Segmentation Challenge 2019 datasets, which contain three different modalities of MRI sequences. Extensive experimental results show that the proposed methods can obtain significant segmentation improvements compared with the baseline models.
SPAug 22, 2019
Decentralized Cooperative Online Estimation With Random Observation Matrices, Communication Graphs and Time DelaysJiexiang Wang, Tao Li, Xiwei Zhang
We analyze convergence of decentralized cooperative online estimation algorithms by a network of multiple nodes via information exchanging in an uncertain environment. Each node has a linear observation of an unknown parameter with randomly time-varying observation matrices. The underlying communication network is modeled by a sequence of random digraphs and is subjected to nonuniform random time-varying delays in channels. Each node runs an online estimation algorithm consisting of a consensus term taking a weighted sum of its own estimate and neighbours' delayed estimates, and an innovation term processing its own new measurement at each time step. By stochastic time-varying system, martingale convergence theories and the binomial expansion of random matrix products, we transform the convergence analysis of the algorithm into that of the mathematical expectation of random matrix products. Firstly, for the delay-free case, we show that the algorithm gains can be designed properly such that all nodes' estimates converge to the true parameter in mean square and almost surely if the observation matrices and communication graphs satisfy the stochastic spatiotemporal persistence of excitation condition. Secondly, for the case with time delays, we introduce delay matrices to model the random time-varying communication delays between nodes. It is shown that under the stochastic spatio-temporal persistence of excitation condition, for any given boundeddelays, proper algorithm gains can be designed to guarantee mean square convergence for the case with conditionally balanced digraphs.
CVAug 22, 2019
Uncertainty-Guided Domain Alignment for Layer Segmentation in OCT ImagesJiexiang Wang, Cheng Bian, Meng Li et al.
Automatic and accurate segmentation for retinal and choroidal layers of Optical Coherence Tomography (OCT) is crucial for detection of various ocular diseases. However, because of the variations in different equipments, OCT data obtained from different manufacturers might encounter appearance discrepancy, which could lead to performance fluctuation to a deep neural network. In this paper, we propose an uncertainty-guided domain alignment method to aim at alleviating this problem to transfer discriminative knowledge across distinct domains. We disign a novel uncertainty-guided cross-entropy loss for boosting the performance over areas with high uncertainty. An uncertainty-guided curriculum transfer strategy is developed for the self-training (ST), which regards uncertainty as efficient and effective guidance to optimize the learning process in target domain. Adversarial learning with feature recalibration module (FRM) is applied to transfer informative knowledge from the domain feature spaces adaptively. The experiments on two OCT datasets show that the proposed methods can obtain significant segmentation improvements compared with the baseline models.
CVOct 25, 2018
An Adversarial Learning Approach to Medical Image Synthesis for Lesion DetectionLiyan Sun, Jiexiang Wang, Yue Huang et al.
The identification of lesion within medical image data is necessary for diagnosis, treatment and prognosis. Segmentation and classification approaches are mainly based on supervised learning with well-paired image-level or voxel-level labels. However, labeling the lesion in medical images is laborious requiring highly specialized knowledge. We propose a medical image synthesis model named abnormal-to-normal translation generative adversarial network (ANT-GAN) to generate a normal-looking medical image based on its abnormal-looking counterpart without the need for paired training data. Unlike typical GANs, whose aim is to generate realistic samples with variations, our more restrictive model aims at producing a normal-looking image corresponding to one containing lesions, and thus requires a special design. Being able to provide a "normal" counterpart to a medical image can provide useful side information for medical imaging tasks like lesion segmentation or classification validated by our experiments. In the other aspect, the ANT-GAN model is also capable of producing highly realistic lesion-containing image corresponding to the healthy one, which shows the potential in data augmentation verified in our experiments.