CV | Apr 22, 2022
Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation
Jie Liu, Yanqi Bao, Guo-Sen Xie et al.
The key challenge for few-shot semantic segmentation (FSS) is how to tailor a desirable interaction among support and query features and/or their prototypes under the episodic training scenario. Most existing FSS methods implement such support-query interactions solely through plain operations, e.g., cosine similarity and feature concatenation, for segmenting the query objects. However, these interaction approaches usually fail to capture the intrinsic object details in the query images that are widely encountered in FSS; e.g., if the query object to be segmented has holes and slots, inaccurate segmentation almost always happens. To this end, we propose a dynamic prototype convolution network (DPCN) to fully capture the aforementioned intrinsic details for accurate FSS. Specifically, in DPCN, a dynamic convolution module (DCM) is first proposed to generate dynamic kernels from the support foreground; information interaction is then achieved by convolution operations over query features using these kernels. Moreover, we equip DPCN with a support activation module (SAM) and a feature filtering module (FFM) to generate a pseudo mask and filter out background information for the query images, respectively. SAM and FFM together can mine enriched context information from the query features. Our DPCN is also flexible and efficient under the k-shot FSS setting. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ show that DPCN yields superior performance under both 1-shot and 5-shot settings.
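To make the contrast in this abstract concrete, here is a minimal NumPy sketch of the two interaction styles it mentions: a "plain" prototype plus cosine-similarity match, and a convolution over query features with kernels derived from the support foreground. This is not the authors' DCM. The function names, the grid-pooling rule in `support_kernels`, and the depthwise layout are all assumptions chosen for illustration.

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """Average a (C, H, W) feature map over the foreground of an (H, W) mask."""
    mask = mask.astype(feat.dtype)
    return (feat * mask).sum(axis=(1, 2)) / (mask.sum() + 1e-6)

def cosine_similarity_map(query_feat, prototype):
    """Plain interaction: per-pixel cosine similarity between query and prototype."""
    q_norm = np.linalg.norm(query_feat, axis=0) + 1e-6
    p_norm = np.linalg.norm(prototype) + 1e-6
    return np.einsum("chw,c->hw", query_feat, prototype) / (q_norm * p_norm)

def support_kernels(support_feat, support_mask, k=3):
    """Hypothetical kernel generator: pool the masked support features
    onto a k x k grid, giving one k x k kernel per channel.
    Assumes H and W are divisible by k."""
    C, H, W = support_feat.shape
    fg = support_feat * support_mask.astype(support_feat.dtype)
    return fg.reshape(C, k, H // k, k, W // k).mean(axis=(2, 4))

def dynamic_depthwise_conv(query_feat, kernels):
    """Depthwise cross-correlation (no kernel flip) with zero 'same' padding."""
    C, H, W = query_feat.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(query_feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C, H, W), dtype=float)
    for dy in range(k):
        for dx in range(k):
            # Weight each shifted view of the query by the matching kernel tap.
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + H, dx:dx + W]
    return out

rng = np.random.default_rng(0)
support = rng.normal(size=(4, 6, 6))
query = rng.normal(size=(4, 6, 6))
mask = np.zeros((6, 6))
mask[1:5, 1:5] = 1.0

prototype = masked_average_pooling(support, mask)
similarity = cosine_similarity_map(query, prototype)   # shape (6, 6)
kernels = support_kernels(support, mask, k=3)          # shape (4, 3, 3)
response = dynamic_depthwise_conv(query, kernels)      # shape (4, 6, 6)
```

The similarity map collapses the support foreground into a single vector, whereas the kernel-based response keeps some spatial layout of the support features, which is the kind of information the abstract argues plain operations lose.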
CV | Aug 5, 2023 | Code
Where and How: Mitigating Confusion in Neural Radiance Fields from Sparse Inputs
Yanqi Bao, Yuxin Li, Jing Huo et al.
Neural Radiance Fields from Sparse Inputs (NeRF-S) have shown great potential in synthesizing novel views with a limited number of observed viewpoints. However, due to the inherent limitations of sparse inputs and the gap between non-adjacent views, rendering results often suffer from over-fitting and foggy surfaces, a phenomenon we refer to as "CONFUSION" during volume rendering. In this paper, we analyze the root cause of this confusion and attribute it to two fundamental questions: "WHERE" and "HOW". To this end, we present a novel learning framework, WaH-NeRF, which effectively mitigates confusion by tackling the following challenges: (i) "WHERE" to Sample? in NeRF-S: we introduce a Deformable Sampling strategy and a Weight-based Mutual Information Loss to address sample-position confusion arising from the limited number of viewpoints; and (ii) "HOW" to Predict? in NeRF-S: we propose a Semi-Supervised NeRF learning Paradigm based on pose perturbation and a Pixel-Patch Correspondence Loss to alleviate prediction confusion caused by the disparity between training and testing viewpoints. By integrating our proposed modules and loss functions, WaH-NeRF outperforms previous methods under the NeRF-S setting. Code is available at https://github.com/bbbbby-99/WaH-NeRF.
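For readers unfamiliar with the volume rendering step this abstract refers to, the sketch below computes the standard NeRF per-sample weights along one ray and the resulting pixel color. It only illustrates the generic rendering equation; it does not implement WaH-NeRF's Deformable Sampling or any of its losses, and the function names are hypothetical.

```python
import numpy as np

def render_weights(sigma, deltas):
    """Standard NeRF weights along one ray.
    sigma:  (N,) densities at the sampled points
    deltas: (N,) distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigma * deltas)           # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return trans * alpha

def render_color(sigma, deltas, colors):
    """Composite (N, 3) per-sample colors into one RGB value."""
    return render_weights(sigma, deltas) @ colors

sigma = np.array([0.0, 1000.0, 0.0])   # one effectively opaque sample in the middle
deltas = np.ones(3)
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
weights = render_weights(sigma, deltas)
pixel = render_color(sigma, deltas, colors)
```

Because each weight depends on where the samples are placed, the choice of sample positions directly shapes the rendered pixel, which is the "WHERE" question the abstract raises.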
CV | Aug 26, 2023 | Code
InsertNeRF: Instilling Generalizability into NeRF with HyperNet Modules
Yanqi Bao, Tianyu Ding, Jing Huo et al.
Generalizing Neural Radiance Fields (NeRF) to new scenes is a significant challenge that existing approaches struggle to address without extensive modifications to the vanilla NeRF framework. We introduce InsertNeRF, a method for INStilling gEneRalizabiliTy into NeRF. By utilizing multiple plug-and-play HyperNet modules, InsertNeRF dynamically tailors NeRF's weights to specific reference scenes, transforming multi-scale sampling-aware features into scene-specific representations. This novel design allows for more accurate and efficient representations of complex appearances and geometries. Experiments show that this method not only achieves superior generalization performance but also provides a flexible pathway for integration with other NeRF-like systems, even in sparse input settings. Code will be available at https://github.com/bbbbby-99/InsertNeRF.
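The idea of tailoring a network's weights to a reference scene can be illustrated with a generic hypernetwork layer, where a small generator maps a scene feature vector to the weights and bias of a target linear layer. The sketch below is an assumption-laden toy in NumPy: the class `HyperLinear`, its dimensions, and the single linear generator are hypothetical and are not taken from the InsertNeRF code.

```python
import numpy as np

class HyperLinear:
    """A linear layer whose weights are generated from a scene feature vector."""

    def __init__(self, feat_dim, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        # Generator parameters: these are what would be trained across scenes.
        self.w_gen = rng.normal(0.0, 0.1, size=(feat_dim, in_dim * out_dim))
        self.b_gen = rng.normal(0.0, 0.1, size=(feat_dim, out_dim))

    def generate(self, scene_feat):
        """Map a (feat_dim,) scene feature to scene-specific weights and bias."""
        weight = (scene_feat @ self.w_gen).reshape(self.in_dim, self.out_dim)
        bias = scene_feat @ self.b_gen
        return weight, bias

    def __call__(self, x, scene_feat):
        """Apply the generated layer to a batch of (B, in_dim) inputs, with ReLU."""
        weight, bias = self.generate(scene_feat)
        return np.maximum(x @ weight + bias, 0.0)

layer = HyperLinear(feat_dim=8, in_dim=3, out_dim=16)
rng = np.random.default_rng(1)
points = rng.normal(size=(5, 3))          # e.g. sampled 3D positions
scene_a = rng.normal(size=8)              # stand-in features from two reference scenes
scene_b = rng.normal(size=8)
hidden_a = layer(points, scene_a)         # shape (5, 16)
hidden_b = layer(points, scene_b)         # same inputs, different effective weights
```

The same input points pass through different effective weights depending on the scene feature, so no per-scene retraining of the target layer is needed in this toy setup.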
CV | Jan 9, 2023
Few-shot Semantic Segmentation with Support-induced Graph Convolutional Network
Jie Liu, Yanqi Bao, Wenzhe Yin et al.
Few-shot semantic segmentation (FSS) aims to segment novel objects with only a few annotated samples and has made great progress recently. Most existing FSS models focus on feature matching between support and query to tackle FSS. However, the appearance variations between objects from the same category can be extremely large, leading to unreliable feature matching and query mask prediction. To this end, we propose a Support-induced Graph Convolutional Network (SiGCN) to explicitly excavate latent context structure in query images. Specifically, we propose a Support-induced Graph Reasoning (SiGR) module to capture salient query object parts at different semantic levels with a Support-induced GCN. Furthermore, an instance association (IA) module is designed to capture high-order instance context from both support and query instances. By integrating the two proposed modules, SiGCN can learn a rich query context representation and is thus more robust to appearance variations. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate that our SiGCN achieves state-of-the-art performance.
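As background for the graph reasoning described here, the sketch below shows one generic graph convolution layer with the symmetric normalization commonly used for GCNs. How SiGCN builds its support-induced graph is not specified in this abstract, so the adjacency matrix here is simply given as input, and the function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    inv_sqrt_deg = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def gcn_layer(adj, node_feats, weight):
    """One graph convolution: ReLU(A_norm @ X @ W).
    adj:        (N, N) symmetric adjacency without self-loops
    node_feats: (N, F_in) node features
    weight:     (F_in, F_out) learnable projection
    """
    return np.maximum(normalize_adjacency(adj) @ node_feats @ weight, 0.0)

# Two connected nodes: each output is the average of both inputs.
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
node_feats = np.array([[2.0],
                       [4.0]])
weight = np.array([[1.0]])
out = gcn_layer(adj, node_feats, weight)
```

Each node's output mixes in its neighbors' features, which is how a graph layer can propagate context between object parts instead of matching each location in isolation.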
CV | Jul 24, 2024
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities
Yanqi Bao, Tianyu Ding, Jing Huo et al.
3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussians through efficient training and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.
CV | Nov 26, 2024 | Code
Distractor-free Generalizable 3D Gaussian Splatting
Yanqi Bao, Jing Liao, Jing Huo et al.
We present DGGS, a novel framework that addresses a previously unexplored challenge: Distractor-free Generalizable 3D Gaussian Splatting (3DGS). It mitigates 3D inconsistency and training instability caused by distractor data in the cross-scene generalizable training setting while enabling feedforward inference for 3DGS and distractor masks from references in unseen scenes. To achieve these objectives, DGGS proposes a scene-agnostic reference-based mask prediction and refinement module during the training phase, effectively eliminating the impact of distractors on training stability. Moreover, we combat distractor-induced artifacts and holes at inference time through a novel two-stage inference framework for reference scoring and re-selection, complemented by a distractor pruning mechanism that further removes the influence of residual distractor 3DGS primitives. Extensive feedforward experiments on real data and our synthetic data show DGGS's reconstruction capability when dealing with novel distractor scenes. Moreover, our generalizable mask prediction even achieves an accuracy superior to existing scene-specific training methods. Homepage is https://github.com/bbbbby-99/DGGS.
CV | Jun 7, 2024
Training-Free Video Editing via Optical Flow-Enhanced Score Distillation
Lianghan Zhu, Yanqi Bao, Jing Huo et al.
The rapid advancement in visual generation, particularly the emergence of pre-trained text-to-image and text-to-video models, has catalyzed growing interest in training-free video editing research. Mirroring training-free image editing techniques, current approaches preserve original video information through video input inversion and manipulate intermediate features and attention during the inference process to achieve content editing. Although they have demonstrated promising results, the lossy nature of the inversion process poses significant challenges in maintaining unedited regions of the video. Furthermore, feature and attention manipulation during inference can lead to unintended over-editing and faces challenges in both local temporal continuity and global content consistency. To address these challenges, this study proposes a score distillation paradigm based on pre-trained text-to-video models, where the original video is iteratively optimized through multiple steps, guided by editing gradients provided by score distillation, to ultimately obtain the target video. The iterative optimization starting from the original video, combined with a content preservation loss, ensures the maintenance of unedited regions in the original video and suppresses over-editing. To further guarantee video content consistency and temporal continuity, we additionally introduce a global consistency auxiliary loss and optical flow prediction-based local editing gradient smoothing. Experiments demonstrate that these strategies effectively address the aforementioned challenges, achieving comparable or superior performance across multiple dimensions, including preservation of unedited regions, local temporal continuity, and global content consistency of editing results, compared to state-of-the-art methods.