IV | Jun 16, 2022
AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation
Yuanfeng Ji, Haotian Bai, Jie Yang et al.
Despite considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which limits the power of modern deep models and makes it difficult to provide a comprehensive and fair assessment of the various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of existing methods on this new, challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.
CV | Mar 24, 2023 | Code
CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout
Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang et al.
Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, one enduring challenge is their inadequate capability to accurately parse and regenerate consistent multi-object environments. Specifically, these models have difficulty accurately representing the quantity and style prompted by multi-object texts, often resulting in a collapse of rendering fidelity that fails to match the semantic intricacies. Moreover, amalgamating these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distribution inherent in diffusion models. To tackle the issue of 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, that integrates an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It begins by interpreting a complex text as a layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs, promoting consistency, while the dual-level text guidance reduces ambiguity and boosts accuracy. Notably, our composition design permits decomposition. This enables flexible scene editing and recomposition into new scenes based on the edited layout or text prompts. Utilizing the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Remarkably, our framework achieves up to a 54% improvement on the multi-view CLIP score metric. Our user study indicates that our method significantly improves semantic accuracy, multi-view consistency, and individual recognizability for multi-object scene generation.
CV | Jul 21, 2022 | Code
Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration
Haotian Bai, Ruimao Zhang, Jiong Wang et al.
Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies leverage the long-range dependency modeling of self-attention in the visual Transformer to re-activate semantic regions, aiming to avoid the partial activation seen in traditional class activation mapping (CAM). However, the long-range modeling in the Transformer neglects the inherent spatial coherence of the object, and it usually diffuses the semantic-aware regions far from the object boundary, making localization results significantly larger or far smaller than the object. To address this issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter to dynamically adjust the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computation cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. It enables the generated attention maps to capture sharper object boundaries and filter out object-irrelevant background areas. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both the CUB-200 and ImageNet-1K benchmarks. The code is available at https://github.com/164140757/SCM.
CV | Mar 23, 2023
Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective
Jinjing Zhu, Haotian Bai, Lin Wang
Efforts have recently been made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt the cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention relies heavily on the quality of pseudo labels for target samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game-theoretic perspective with the proposed model, dubbed PMTrans, which bridges the source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., a probability distribution, by learning to sample patches from both domains based on game-theoretical models. In this way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, namely the feature extractor, the classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively.
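The patch-level mixing idea in this abstract can be illustrated with a minimal Python sketch. This is not the PMTrans implementation: the helper name `patch_mix`, the per-patch Beta-distributed weights, and the fixed `alpha`/`beta` values are assumptions made for illustration, whereas the abstract states that the sampling is learned through the min-max game.

```python
import random


def patch_mix(source_patches, target_patches, alpha=1.0, beta=1.0, rng=None):
    """Blend two equally sized lists of patch vectors into one mixed sample.

    Hypothetical helper: each patch gets its own weight drawn from
    Beta(alpha, beta). In a learned variant, alpha and beta would be trainable.
    """
    if not source_patches or len(source_patches) != len(target_patches):
        raise ValueError("inputs must be non-empty and have the same number of patches")
    rng = rng or random.Random()
    # One mixing weight per patch: 1.0 keeps the source patch, 0.0 the target patch.
    weights = [rng.betavariate(alpha, beta) for _ in source_patches]
    mixed = [
        [w * s + (1.0 - w) * t for s, t in zip(src, tgt)]
        for w, src, tgt in zip(weights, source_patches, target_patches)
    ]
    # The average weight is the share of the source domain in the mixed sample;
    # a mixup-style loss could use it to weight source and target labels.
    source_share = sum(weights) / len(weights)
    return mixed, weights, source_share
```

Calling `patch_mix(src, tgt, rng=random.Random(0))` returns the mixed patches, the per-patch weights, and the source share, with a seeded generator for reproducibility.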
CV | Jul 28, 2023
Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF
Haotian Bai, Yiqi Lin, Yize Chen et al.
The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction for applications such as virtual reality and gaming. In particular, PlenOctree (POT) [1], an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal, as the scene complexity evolves continuously with updates to cached color and density, necessitating refinement of the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the Dynamic PlenOctree (DOT), which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. First, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT enhances visual quality, reduces parameters by over 55.15%/68.84%, and provides 1.7/1.9 times the FPS on NeRF-synthetic and Tanks & Temples, respectively. Project homepage: https://vlislab22.github.io/DOT. [1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
CV | May 21, 2022
Deep Learning for Omnidirectional Vision: A Survey and New Perspectives
Hao Ai, Zidong Cao, Jinjing Zhu et al.
Omnidirectional image (ODI) data is captured with a 360°x180° field-of-view, which is much wider than that of pinhole cameras, and contains richer spatial information than conventional planar images. Accordingly, omnidirectional vision has attracted booming attention due to its advantageous performance in numerous applications, such as autonomous driving and virtual reality. In recent years, the availability of consumer-level 360° cameras has made omnidirectional vision more popular, and the advance of deep learning (DL) has significantly sparked its research and applications. This paper presents a systematic and comprehensive review and analysis of the recent progress in DL methods for omnidirectional vision. Our work covers four main topics: (i) an introduction to the principle of omnidirectional imaging, the convolution methods on the ODI, and datasets, to highlight the differences and difficulties compared with 2D planar image data; (ii) a structural and hierarchical taxonomy of the DL methods for omnidirectional vision; (iii) a summary of the latest novel learning strategies and applications; (iv) an insightful discussion of the challenges and open problems, highlighting potential research directions to trigger more research in the community.
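To make the 360° by 180° field-of-view concrete, the following Python sketch maps a pixel of an equirectangular panorama to a viewing direction on the unit sphere. The equirectangular layout, the axis convention (y up, z forward at the image center), and the function name `equirect_pixel_to_direction` are assumptions for illustration; the abstract does not prescribe a particular projection or convention.

```python
import math


def equirect_pixel_to_direction(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit 3D direction.

    Assumed convention: longitude spans [-pi, pi) left to right, latitude
    spans [pi/2, -pi/2] top to bottom, sampled at pixel centers.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```

Because every image row covers the full 360° of longitude, rows near the top and bottom are heavily stretched, which is one source of the distortion that distinguishes ODI data from planar images.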
CV | Jun 1, 2023
FMapping: Factorized Efficient Neural Field Mapping for Real-Time Dense RGB SLAM
Tongyan Hua, Haotian Bai, Zidong Cao et al.
In this paper, we introduce FMapping, an efficient neural field mapping framework that facilitates the continuous estimation of a colorized point cloud map in real-time dense RGB SLAM. A key hurdle in achieving this challenging goal without depth is how to improve efficiency and reduce the mapping uncertainty of the RGB SLAM system. To this end, we first build up a theoretical analysis by decomposing the SLAM system into tracking and mapping parts, and the mapping uncertainty is explicitly defined within the framework of neural representations. Based on the analysis, we then propose an effective factorization scheme for scene representation and introduce a sliding window strategy to reduce the uncertainty in scene reconstruction. Specifically, we leverage the factorized neural field to decompose uncertainty into a lower-dimensional space, which enhances robustness to noise and improves training efficiency. We then propose the sliding window sampler to reduce uncertainty by incorporating coherent geometric cues from observed frames during map initialization to enhance convergence. Our factorized neural mapping approach offers several advantages, such as low memory consumption, more efficient computation, and fast convergence during map initialization. Experiments on two benchmark datasets show that our method can update the map of high-fidelity colorized point clouds in around 2 seconds in real time while requiring no customized CUDA kernels. Additionally, it uses 20x fewer parameters than the most concise prior neural implicit mapping method for SLAM, e.g., iMAP [31], and around 1000x fewer parameters than the state-of-the-art approach, e.g., NICE-SLAM [42]. For more details, please refer to our project homepage: https://vlis2022.github.io/fmap/.
CV | Sep 20, 2024
High-Fidelity Mask-free Neural Surface Reconstruction for Virtual Reality
Haotian Bai, Yize Chen, Lin Wang
Object-centric surface reconstruction from multi-view images is crucial in creating editable digital assets for AR/VR. Due to the lack of geometric constraints, existing methods, e.g., NeuS, necessitate annotating the object masks to reconstruct compact surfaces in mesh processing. Mask annotation, however, incurs considerable labor costs due to its cumbersome nature. This paper presents Hi-NeuS, a novel rendering-based framework for neural implicit surface reconstruction, aiming to recover compact and precise surfaces without multi-view object masks. Our key insight is that the overlapping regions in the object-centric views naturally highlight the object of interest as the camera orbits around objects. The object of interest can be specified by estimating the distribution of the rendering weights accumulated from multiple views, which implicitly identifies the surface that a user intends to capture. This inspires us to design a geometric refinement approach, which takes multi-view rendering weights to guide the signed distance functions (SDF) of neural surfaces in a self-supervised manner. Specifically, it retains these weights to resample a pseudo surface based on their distribution. This facilitates the alignment of the SDF to the object of interest. We then regularize the SDF's bias for geometric consistency. Moreover, we propose to use unmasked Chamfer Distance (CD) to measure the extracted mesh without post-processing for more precise evaluation. Our approach has been validated with NeuS and its variant Neuralangelo, demonstrating its adaptability across different NeuS backbones. Extensive benchmarking on the DTU dataset shows that our method reduces surface noise by about 20% and improves the unmasked CD by around 30%, achieving better surface details. The superiority of Hi-NeuS is further validated on BlendedMVS and handheld camera captures for content creation.
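For readers unfamiliar with the Chamfer Distance used as the evaluation metric here, the following Python sketch computes one common symmetric form between two point sets. Conventions vary (squared versus unsquared distances, sum versus mean of the two directions), and the abstract does not state which variant or sampling procedure is used, so the averaging choice and the function name `chamfer_distance` are assumptions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    Assumed variant: mean unsquared nearest-neighbor distance in each
    direction, averaged over the two directions. Brute force, O(len(a) * len(b)).
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # Average distance from each point in src to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))
```

In this setting, one set would hold points sampled from the extracted mesh and the other the reference scan. Evaluating "unmasked" would then mean that no mask-based cropping is applied to the mesh before sampling.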
CV | Jan 6, 2024
Hi-Map: Hierarchical Factorized Radiance Field for High-Fidelity Monocular Dense Mapping
Tongyan Hua, Haotian Bai, Zidong Cao et al.
In this paper, we introduce Hi-Map, a novel monocular dense mapping approach based on Neural Radiance Field (NeRF). Hi-Map is exceptional in its capacity to achieve efficient and high-fidelity mapping using only posed RGB inputs. Our method eliminates the need for external depth priors derived from, e.g., a depth estimation model. Our key idea is to represent the scene as a hierarchical feature grid that encodes the radiance and then to factorize it into feature planes and vectors. As such, the scene representation becomes simpler and more generalizable for fast and smooth convergence on new observations. This allows for efficient computation while alleviating noise patterns by reducing the complexity of the scene representation. Buttressed by the hierarchical factorized representation, we leverage the Signed Distance Field (SDF) as a proxy of rendering for inferring the volume density, demonstrating high mapping fidelity. Moreover, we introduce a dual-path encoding strategy to strengthen the photometric cues and further boost the mapping quality, especially for distant and textureless regions. Extensive experiments demonstrate our method's superiority in geometric and textural accuracy over the state-of-the-art NeRF-based monocular mapping methods.
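The abstract does not say how the signed distance is converted to volume density. As an illustration only, the Python sketch below uses the Laplace-CDF mapping from VolSDF, a known formulation that is swapped in here and is not confirmed as Hi-Map's method. The function name `sdf_to_density`, the default `beta`, and the sign convention (positive outside the surface) are assumptions.

```python
import math


def sdf_to_density(signed_distance, beta=0.1):
    """VolSDF-style density: (1 / beta) * LaplaceCDF(-signed_distance; scale=beta).

    Assumes the signed distance is positive outside the surface and negative
    inside. A smaller beta gives a sharper transition at the surface.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    s = -signed_distance
    if s <= 0:
        cdf = 0.5 * math.exp(s / beta)          # outside or on the surface
    else:
        cdf = 1.0 - 0.5 * math.exp(-s / beta)   # inside the surface
    return cdf / beta
```

With this mapping, density is near zero in free space, equals `0.5 / beta` exactly on the surface, and saturates at `1 / beta` deep inside the object.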
CV | Jun 19, 2024
PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Mobius Spatial Augmentation
Zidong Cao, Jinjing Zhu, Weiming Zhang et al.
Recently, Depth Anything Models (DAMs), a type of depth foundation model, have demonstrated impressive zero-shot capabilities across diverse perspective images. Despite their success, it remains an open question how DAMs perform on panorama images, which enjoy a large field-of-view (180°x360°) but suffer from spherical distortions. To address this gap, we conduct an empirical analysis to evaluate the performance of DAMs on panoramic images and identify their limitations. For this, we undertake comprehensive experiments to assess the performance of DAMs with respect to three key factors: panoramic representations, 360° camera positions for capturing scenarios, and spherical spatial transformations. In this way, we reveal some key findings, e.g., DAMs are sensitive to spatial transformations. We then propose a semi-supervised learning (SSL) framework to learn a panoramic DAM, dubbed PanDA. Under the umbrella of SSL, PanDA first learns a teacher model by fine-tuning DAM through joint training on synthetic indoor and outdoor panoramic datasets. Then, a student model is trained using large-scale unlabeled data, leveraging pseudo-labels generated by the teacher model. To enhance PanDA's generalization capability, Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the depth maps predicted from the original and spatially transformed panoramas. This improves the student model's robustness to various spatial transformations, even under severe distortions. Extensive experiments demonstrate that PanDA exhibits remarkable zero-shot capability across diverse scenes and outperforms the data-specific panoramic depth estimation methods on two popular real-world benchmarks.