CV | Aug 20, 2023 | Code
Domain Reduction Strategy for Non Line of Sight Imaging
Hyunbo Shim, In Cho, Daekyu Kwon et al.
This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under general setups with significantly reduced reconstruction time. In NLOS imaging, the visible surfaces of the target objects are notably sparse. To mitigate unnecessary computations arising from empty regions, we design our method to render the transients through partial propagations from a continuously sampled set of points from the hidden space. Our method is capable of accurately and efficiently modeling the view-dependent reflectance using surface normals, which enables us to obtain surface geometry as well as albedo. In this pipeline, we propose a novel domain reduction strategy to eliminate superfluous computations in empty regions. During the optimization process, our domain reduction procedure periodically prunes the empty regions from our sampling domain in a coarse-to-fine manner, leading to substantial improvement in efficiency. We demonstrate the effectiveness of our method in various NLOS scenarios with sparse scanning patterns. Experiments conducted on both synthetic and real-world data support the efficacy of our method in general NLOS scenarios and its improved efficiency compared to previous optimization-based solutions. Our code is available at https://github.com/hyunbo9/domain-reduction-strategy.
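The coarse-to-fine pruning idea described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a discrete cube grid instead of the paper's continuously sampled points, and the names `subdivide`, `reduce_domain`, `toy_density`, and `HIDDEN_POINT` are hypothetical and introduced only for illustration.

```python
from itertools import product

# Hypothetical toy scene: a single occupied point inside the unit cube.
HIDDEN_POINT = (0.3, 0.3, 0.3)


def subdivide(cell):
    """Split an axis-aligned cube (x, y, z, size) into its 8 child cubes."""
    x, y, z, size = cell
    half = size / 2.0
    return [
        (x + dx * half, y + dy * half, z + dz * half, half)
        for dx, dy, dz in product((0, 1), repeat=3)
    ]


def toy_density(cell):
    """Stand-in for an optimized density: 1.0 if the cell contains HIDDEN_POINT."""
    x, y, z, size = cell
    px, py, pz = HIDDEN_POINT
    inside = (
        x <= px < x + size
        and y <= py < y + size
        and z <= pz < z + size
    )
    return 1.0 if inside else 0.0


def reduce_domain(cells, density_fn, threshold):
    """Drop cells judged empty, then refine the survivors to a finer level."""
    kept = [c for c in cells if density_fn(c) >= threshold]
    refined = []
    for c in kept:
        refined.extend(subdivide(c))
    return refined


# Two pruning rounds, which a real optimizer would interleave with training.
cells = subdivide((0.0, 0.0, 0.0, 1.0))
for _ in range(2):
    cells = reduce_domain(cells, toy_density, threshold=0.5)
```

In this toy setting, each round discards seven of every eight cells before refining, so the sampled volume shrinks quickly while the resolution around the occupied region increases.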
CV | Jul 26, 2024 | Code
Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging
In Cho, Hyunbo Shim, Seon Joo Kim
This paper aims to facilitate more practical NLOS imaging by reducing the number of samplings and scan areas. To this end, we introduce a phasor-based enhancement network that is capable of predicting clean and full measurements from noisy partial observations. We leverage a denoising autoencoder scheme to acquire rich and noise-robust representations in the measurement space. Through this pipeline, our enhancement network is trained to accurately reconstruct complete measurements from their corrupted and partial counterparts. However, we observe that the naive application of denoising often yields degraded and over-smoothed results, caused by unnecessary and spurious frequency signals present in measurements. To address this issue, we introduce a phasor-based pipeline designed to limit the spectrum of our network to the frequency range of interest, where the majority of informative signals are detected. The phasor wavefronts at the aperture, which are band-limited signals, are employed as inputs and outputs of the network, guiding our network to learn from the frequency range of interest and discard unnecessary information. The experimental results in more practical acquisition scenarios demonstrate that we can look around corners with 16x or 64x fewer samplings and 4x smaller apertures. Our code is available at https://github.com/join16/LEAP.
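To make "band-limited" concrete, the sketch below restricts a 1D signal to a chosen range of frequency bins using a plain discrete Fourier transform. It is only an illustration of band-limiting in general, not the phasor-field computation used in the paper; the helper names `dft`, `idft`, and `band_limit` and the bin indices are assumptions made for this example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def band_limit(signal, k_min, k_max):
    """Zero every DFT bin whose frequency index lies outside [k_min, k_max]."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        freq = min(k, n - k)  # fold negative frequencies onto positive ones
        if not k_min <= freq <= k_max:
            spectrum[k] = 0
    return [v.real for v in idft(spectrum)]


# A low-frequency component plus an unwanted high-frequency component.
n = 16
low = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
high = [math.cos(2 * math.pi * 6 * t / n) for t in range(n)]
mixed = [a + b for a, b in zip(low, high)]
filtered = band_limit(mixed, 1, 3)  # keeps only the low-frequency component
```

Here the band keeps bin 2 and removes bin 6, so `filtered` matches `low` up to floating-point error.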
CV | Aug 1, 2024 | Code
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos
Subin Jeon, In Cho, Minsu Kim et al.
We propose a new framework for creating and easily manipulating 3D models of arbitrary objects using casually captured videos. Our core ingredient is a novel hierarchical deformation model, which captures motions of objects with tree-structured bones. Our hierarchy system decomposes motions based on their granularity and reveals the correlations between parts without exploiting any prior structural knowledge. We further propose to regularize the bones to be positioned at the basis of motions, centers of parts, sufficiently covering related surfaces of the part. This is achieved by our bone occupancy function, which identifies whether a given 3D point is placed within the bone. Coupling the proposed components, our framework offers several clear advantages: (1) users can obtain animatable 3D models of arbitrary objects with improved quality from their casual videos, (2) users can manipulate 3D models in an intuitive manner at minimal cost, and (3) users can interactively add or delete control points as necessary. The experimental results demonstrate the efficacy of our framework on diverse instances in terms of reconstruction quality, interpretability, and ease of manipulation. Our code is available at https://github.com/subin6/HSNB.
CV | Mar 11, 2025 | Code
Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models
In Cho, Youngbeom Yoo, Subin Jeon et al.
Constructing a compressed latent space through a variational autoencoder (VAE) is key to efficient 3D diffusion models. This paper introduces COD-VAE, which encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing the computational overhead of neural field decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables a 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation. The code is available at https://github.com/join16/COD-VAE.
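The general idea behind uncertainty-guided token pruning, spending computation only on the tokens that appear hardest, can be sketched as a top-k selection. This is a minimal illustration under stated assumptions rather than COD-VAE's actual procedure: the function name `select_uncertain_tokens`, the `keep_ratio` parameter, and the per-token uncertainty scores are all hypothetical.

```python
import math


def select_uncertain_tokens(uncertainties, keep_ratio):
    """Return indices of the most uncertain tokens, in their original order.

    Tokens that are not selected would be skipped by later, more expensive
    layers in a pruning scheme of this kind.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(uncertainties) * keep_ratio))
    # Rank token indices from most to least uncertain.
    ranked = sorted(
        range(len(uncertainties)),
        key=lambda i: uncertainties[i],
        reverse=True,
    )
    return sorted(ranked[:n_keep])


# Hypothetical scores for four tokens; keep the most uncertain half.
scores = [0.1, 0.9, 0.4, 0.8]
kept = select_uncertain_tokens(scores, keep_ratio=0.5)
```

With these scores the selection keeps tokens 1 and 3, so only half of the tokens would reach the costly part of the decoder.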
CV | Nov 26, 2024
4D Scaffold Gaussian Splatting with Dynamic-Aware Anchor Growing for Efficient and High-Fidelity Dynamic Scene Reconstruction
Woong Oh Cho, In Cho, Seoha Kim et al.
Modeling dynamic scenes through 4D Gaussians offers high visual fidelity and fast rendering speeds, but comes with significant storage overhead. Recent approaches mitigate this cost by aggressively reducing the number of Gaussians. However, this inevitably removes Gaussians essential for high-quality rendering, leading to severe degradation in dynamic regions. In this paper, we introduce a novel 4D anchor-based framework that tackles the storage cost from a different perspective. Rather than reducing the number of Gaussians, our method retains a sufficient quantity to accurately model dynamic content, while compressing them into compact, grid-aligned 4D anchor features. Each anchor is processed by an MLP to spawn a set of neural 4D Gaussians, which represent a local spatiotemporal region. We design these neural 4D Gaussians to capture temporal changes with minimal parameters, making them well-suited for the MLP-based spawning. Moreover, we introduce a dynamic-aware anchor growing strategy to effectively assign additional anchors to under-reconstructed dynamic regions. Our method adjusts the accumulated gradients with Gaussians' temporal coverage, significantly improving reconstruction quality in dynamic regions. Experimental results highlight that our method achieves state-of-the-art visual quality in dynamic regions, outperforming all baselines by a large margin with practical storage costs.
CV | Aug 8, 2025
ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors
Minsu Kim, Subin Jeon, In Cho et al.
Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering from viewpoints that deviate from the training trajectory, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints. Project page: https://exploregs.github.io
CV | Jul 16, 2025
Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
Subin Jeon, In Cho, Junyoung Hong et al.
This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoint estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoint estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal to provide 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations. This transforms implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse aspects and datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
CV | Oct 4, 2019
Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction
Yunji Kim, Seonghyeon Nam, In Cho et al.
We propose a deep video prediction model conditioned on a single image and an action class. To generate future frames, we first detect keypoints of a moving object and predict future motion as a sequence of keypoints. The input image is then translated following the predicted keypoint sequence to compose future frames. Detecting the keypoints is central to our algorithm, and our method is trained to detect the keypoints of arbitrary objects in an unsupervised manner. Moreover, the detected keypoints of the original videos are used as pseudo-labels to learn the motion of objects. Experimental results show that our method is successfully applied to various datasets without the cost of labeling keypoints in videos. The detected keypoints are similar to human-annotated labels, and the prediction results are more realistic than those of previous methods.