Zhiguo Cao

CV
h-index32
99papers
3,867citations
Novelty56%
AI Score62

99 Papers

CVAug 3, 2023Code
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Weiyun Wang, Min Shi, Qingyun Li et al.

We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.

CVAug 29, 2023Code
Learning to Upsample by Learning to Sample

Wenze Liu, Hao Lu, Hongtao Fu et al.

We present DySample, an ultra-lightweight and effective dynamic upsampler. While impressive performance gains have been witnessed from recent kernel-based dynamic upsamplers such as CARAFE, FADE, and SAPA, they introduce much workload, mostly due to the time-consuming dynamic convolution and the additional sub-network used to generate dynamic kernels. Further, the need for high-res feature guidance of FADE and SAPA somehow limits their application scenarios. To address these concerns, we bypass dynamic convolution and formulate upsampling from the perspective of point sampling, which is more resource-efficient and can be easily implemented with the standard built-in function in PyTorch. We first showcase a naive design, and then demonstrate how to strengthen its upsampling behavior step by step towards our new upsampler, DySample. Compared with former kernel-based dynamic upsamplers, DySample requires no customized CUDA package and has much fewer parameters, FLOPs, GPU memory, and latency. Besides the light-weight characteristics, DySample outperforms other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation. Code is available at https://github.com/tiny-smart/dysample.

CVAug 26, 2023Code
Point-Query Quadtree for Crowd Counting, Localization, and More

Chengxin Liu, Hao Lu, Zhiguo Cao et al.

We show that crowd counting can be viewed as a decomposable point querying process. This formulation enables arbitrary points as input and jointly reasons whether the points are crowd and where they locate. The querying processing, however, raises an underlying problem on the number of necessary querying points. Too few imply underestimation; too many increase computational overhead. To address this dilemma, we introduce a decomposable structure, i.e., the point-query quadtree, and propose a new counting model, termed Point quEry Transformer (PET). PET implements decomposable point querying via data-dependent quadtree splitting, where each querying point could split into four new points when necessary, thus enabling dynamic processing of sparse and dense regions. Such a querying process yields an intuitive, universal modeling of crowd as both the input and output are interpretable and steerable. We demonstrate the applications of PET on a number of crowd-related tasks, including fully-supervised crowd counting and localization, partial annotation learning, and point annotation refinement, and also report state-of-the-art performance. For the first time, we show that a single counting model can address multiple crowd-related tasks across different learning paradigms. Code is available at https://github.com/cxliu0/PET.

CVAug 1, 2022Code
DoF-NeRF: Depth-of-Field Meets Neural Radiance Fields

Zijin Wu, Xingyi Li, Juewen Peng et al.

Neural Radiance Field (NeRF) and its variants have exhibited great success on representing 3D scenes and synthesizing photo-realistic novel views. However, they are generally based on the pinhole camera model and assume all-in-focus inputs. This limits their applicability as images captured from the real world often have finite depth-of-field (DoF). To mitigate this issue, we introduce DoF-NeRF, a novel neural rendering approach that can deal with shallow DoF inputs and can simulate DoF effect. In particular, it extends NeRF to simulate the aperture of lens following the principles of geometric optics. Such a physical guarantee allows DoF-NeRF to operate views with different focus configurations. Benefiting from explicit aperture modeling, DoF-NeRF also enables direct manipulation of DoF effect by adjusting virtual aperture and focus parameters. It is plug-and-play and can be inserted into NeRF-based frameworks. Experiments on synthetic and real-world datasets show that, DoF-NeRF not only performs comparably with NeRF in the all-in-focus setting, but also can synthesize all-in-focus novel views conditioned on shallow DoF inputs. An interesting application of DoF-NeRF to DoF rendering is also demonstrated. The source code will be made available at https://github.com/zijinwuzijin/DoF-NeRF.

CVJul 31, 2022Code
Less is More: Consistent Video Depth Estimation with Masked Frames Modeling

Yiran Wang, Zhiyu Pan, Xingyi Li et al.

Temporal consistency is the key challenge of video depth estimation. Previous works are based on additional optical flow or camera poses, which is time-consuming. By contrast, we derive consistency with less information. Since videos inherently exist with heavy temporal redundancy, a missing frame could be recovered from neighboring ones. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer network predicting the depth of masked frames based on their neighboring frames. By reconstructing masked temporal features, the FMNet can learn intrinsic inter-frame correlations, which leads to consistency. Compared with prior arts, experimental results demonstrate that our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation. Our official project page is https://github.com/RaymondWang987/FMNet.

CVSep 26, 2022Code
SAPA: Similarity-Aware Point Affiliation for Feature Upsampling

Hao Lu, Wenze Liu, Zixuan Ye et al.

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features. In this way, the encoder feature point can function as a cue to inform the semantic cluster of upsampled feature points. To embody the formulation, we further instantiate a lightweight upsampling operator, termed Similarity-Aware Point Affiliation (SAPA), and investigate its variants. SAPA invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, depth estimation, and image matting. Code is available at: https://github.com/poppinace/sapa

CVJul 21, 2022Code
FADE: Fusing the Assets of Decoder and Encoder for Task-Agnostic Upsampling

Hao Lu, Wenze Liu, Hongtao Fu et al.

We consider the problem of task-agnostic feature upsampling in dense prediction where an upsampling operator is required to facilitate both region-sensitive tasks like semantic segmentation and detail-sensitive tasks such as image matting. Existing upsampling operators often can work well in either type of the tasks, but not both. In this work, we present FADE, a novel, plug-and-play, and task-agnostic upsampling operator. FADE benefits from three design choices: i) considering encoder and decoder features jointly in upsampling kernel generation; ii) an efficient semi-shift convolutional operator that enables granular control over how each feature point contributes to upsampling kernels; iii) a decoder-dependent gating mechanism for enhanced detail delineation. We first study the upsampling properties of FADE on toy data and then evaluate it on large-scale semantic segmentation and image matting. In particular, FADE reveals its effectiveness and task-agnostic characteristic by consistently outperforming recent dynamic upsampling operators in different tasks. It also generalizes well across convolutional and transformer architectures with little computational overhead. Our work additionally provides thoughtful insights on what makes for task-agnostic upsampling. Code is available at: http://lnkiy.in/fade_in

CVJul 15, 2022Code
3D Instances as 1D Kernels

Yizheng Wu, Min Shi, Shuaiyuan Du et al.

We introduce a 3D instance representation, termed instance kernels, where instances are represented by one-dimensional vectors that encode the semantic, positional, and shape information of 3D instances. We show that instance kernels enable easy mask inference by simply scanning kernels over the entire scenes, avoiding the heavy reliance on proposals or heuristic clustering algorithms in standard 3D instance segmentation pipelines. The idea of instance kernel is inspired by recent success of dynamic convolutions in 2D/3D instance segmentation. However, we find it non-trivial to represent 3D instances due to the disordered and unstructured nature of point cloud data, e.g., poor instance localization can significantly degrade instance representation. To remedy this, we construct a novel 3D instance encoding paradigm. First, potential instance centroids are localized as candidates. Then, a candidate merging scheme is devised to simultaneously aggregate duplicated candidates and collect context around the merged centroids to form the instance kernels. Once instance kernels are available, instance masks can be reconstructed via dynamic convolutions whose weights are conditioned on instance kernels. The whole pipeline is instantiated with a dynamic kernel network (DKNet). Results show that DKNet outperforms the state of the arts on both ScanNetV2 and S3DIS datasets with better instance localization. Code is available: https://github.com/W1zheng/DKNet.

CVSep 29, 2022Code
SymmNeRF: Learning to Explore Symmetry Prior for Single-View View Synthesis

Xingyi Li, Chaoyi Hong, Yiran Wang et al.

We study the problem of novel view synthesis of objects from a single image. Existing methods have demonstrated the potential in single-view view synthesis. However, they still fail to recover the fine appearance details, especially in self-occluded areas. This is because a single view only provides limited information. We observe that manmade objects usually exhibit symmetric appearances, which introduce additional prior knowledge. Motivated by this, we investigate the potential performance gains of explicitly embedding symmetry into the scene representation. In this paper, we propose SymmNeRF, a neural radiance field (NeRF) based framework that combines local and global conditioning under the introduction of symmetry priors. In particular, SymmNeRF takes the pixel-aligned image features and the corresponding symmetric features as extra inputs to the NeRF, whose parameters are generated by a hypernetwork. As the parameters are conditioned on the image-encoded latent codes, SymmNeRF is thus scene-independent and can generalize to new scenes. Experiments on synthetic and real-world datasets show that SymmNeRF synthesizes novel views with more details regardless of the pose transformation, and demonstrates good generalization when applied to unseen objects. Code is available at: https://github.com/xingyi-li/SymmNeRF.

CVNov 7, 2022
Efficient Single-Image Depth Estimation on Mobile Devices, Mobile AI & AIM 2022 Challenge: Report

Andrey Ignatov, Grigory Malivenko, Radu Timofte et al. · tencent-ai

Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable to generated depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices, their detailed description is provided in this paper.

CVJul 20, 2022Code
Robust Object Detection With Inaccurate Bounding Boxes

Chengxin Liu, Kewei Wang, Hao Lu et al.

Learning accurate object detectors often requires large-scale training data with precise object bounding boxes. However, labeling such data is expensive and time-consuming. As the crowd-sourcing labeling process and the ambiguities of the objects may raise noisy bounding box annotations, the object detectors will suffer from the degenerated training data. In this work, we aim to address the challenge of learning robust object detectors with inaccurate bounding boxes. Inspired by the fact that localization precision suffers significantly from inaccurate bounding boxes while classification accuracy is less affected, we propose leveraging classification as a guidance signal for refining localization results. Specifically, by treating an object as a bag of instances, we introduce an Object-Aware Multiple Instance Learning approach (OA-MIL), featured with object-aware instance selection and object-aware instance extension. The former aims to select accurate instances for training, instead of directly using inaccurate box annotations. The latter focuses on generating high-quality instances for selection. Extensive experiments on synthetic noisy datasets (i.e., noisy PASCAL VOC and MS-COCO) and a real noisy wheat head dataset demonstrate the effectiveness of our OA-MIL. Code is available at https://github.com/cxliu0/OA-MIL.

CVJul 31, 2022Code
Design What You Desire: Icon Generation from Orthogonal Application and Theme Labels

Yinpeng Chen, Zhiyu Pan, Min Shi et al.

Generative adversarial networks (GANs) have been trained to be professional artists able to create stunning artworks such as face generation and image style transfer. In this paper, we focus on a realistic business scenario: automated generation of customizable icons given desired mobile applications and theme styles. We first introduce a theme-application icon dataset, termed AppIcon, where each icon has two orthogonal theme and app labels. By investigating a strong baseline StyleGAN2, we observe mode collapse caused by the entanglement of the orthogonal labels. To solve this challenge, we propose IconGAN composed of a conditional generator and dual discriminators with orthogonal augmentations, and a contrastive feature disentanglement strategy is further designed to regularize the feature space of the two discriminators. Compared with other approaches, IconGAN indicates a superior advantage on the AppIcon benchmark. Further analysis also justifies the effectiveness of disentangling app and theme representations. Our project will be released at: https://github.com/architect-road/IconGAN.

CVJul 17, 2023Code
Box-DETR: Understanding and Boxing Conditional Spatial Queries

Wenze Liu, Hao Lu, Yuliang Liu et al.

Conditional spatial queries are recently introduced into DEtection TRansformer (DETR) to accelerate convergence. In DAB-DETR, such queries are modulated by the so-called conditional linear projection at each decoder stage, aiming to search for positions of interest such as the four extremities of the box. Each decoder stage progressively updates the box by predicting the anchor box offsets, while in cross-attention only the box center is informed as the reference point. The use of only box center, however, leaves the width and height of the previous box unknown to the current stage, which hinders accurate prediction of offsets. We argue that the explicit use of the entire box information in cross-attention matters. In this work, we propose Box Agent to condense the box into head-specific agent points. By replacing the box center with the agent point as the reference point in each head, the conditional cross-attention can search for positions from a more reasonable starting point by considering the full scope of the previous box, rather than always from the previous box center. This significantly reduces the burden of the conditional linear projection. Experimental results show that the box agent leads to not only faster convergence but also improved detection performance, e.g., our single-scale model achieves $44.2$ AP with ResNet-50 based on DAB-DETR. Our Box Agent requires minor modifications to the code and has negligible computational workload. Code is available at https://github.com/tiny-smart/box-detr.

CVJul 17, 2023Code
On Point Affiliation in Feature Upsampling

Wenze Liu, Hao Lu, Yuliang Liu et al.

We introduce the notion of point affiliation into feature upsampling. By abstracting a feature map into non-overlapped semantic clusters formed by points of identical semantic meaning, feature upsampling can be viewed as point affiliation -- designating a semantic cluster for each upsampled point. In the framework of kernel-based dynamic upsampling, we show that an upsampled point can resort to its low-res decoder neighbors and high-res encoder point to reason the affiliation, conditioned on the mutual similarity between them. We therefore present a generic formulation for generating similarity-aware upsampling kernels and prove that such kernels encourage not only semantic smoothness but also boundary sharpness. This formulation constitutes a novel, lightweight, and universal upsampling solution, Similarity-Aware Point Affiliation (SAPA). We show its working mechanism via our preliminary designs with window-shape kernel. After probing the limitations of the designs on object detection, we reveal additional insights for upsampling, leading to SAPA with the dynamic kernel shape. Extensive experiments demonstrate that SAPA outperforms prior upsamplers and invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation. Code is made available at: https://github.com/tiny-smart/sapa

CVDec 27, 2022Code
Infusing Definiteness into Randomness: Rethinking Composition Styles for Deep Image Matting

Zixuan Ye, Yutong Dai, Chaoyi Hong et al.

We study the composition style in deep image matting, a notion that characterizes a data generation flow on how to exploit limited foregrounds and random backgrounds to form a training dataset. Prior art executes this flow in a completely random manner by simply going through the foreground pool or by optionally combining two foregrounds before foreground-background composition. In this work, we first show that naive foreground combination can be problematic and therefore derive an alternative formulation to reasonably combine foregrounds. Our second contribution is an observation that matting performance can benefit from a certain occurrence frequency of combined foregrounds and their associated source foregrounds during training. Inspired by this, we introduce a novel composition style that binds the source and combined foregrounds in a definite triplet. In addition, we also find that different orders of foreground combination lead to different foreground patterns, which further inspires a quadruplet-based composition style. Results under controlled experiments on four matting baselines show that our composition styles outperform existing ones and invite consistent performance improvement on both composited and real-world datasets. Code is available at: https://github.com/coconuthust/composition_styles

CVMar 16, 2022
Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting

Min Shi, Hao Lu, Chen Feng et al.

Class-agnostic counting (CAC) aims to count all instances in a query image given few exemplars. A standard pipeline is to extract visual features from exemplars and match them with query images to infer object counts. Two essential components in this pipeline are feature representation and similarity metric. Existing methods either adopt a pretrained network to represent features or learn a new one, while applying a naive similarity metric with fixed inner product. We find this paradigm leads to noisy similarity matching and hence harms counting performance. In this work, we propose a similarity-aware CAC framework that jointly learns representation and similarity metric. We first instantiate our framework with a naive baseline called Bilinear Matching Network (BMNet), whose key component is a learnable bilinear similarity metric. To further embody the core of our framework, we extend BMNet to BMNet+ that models similarity from three aspects: 1) representing the instances via their self-similarity to enhance feature robustness against intra-class variations; 2) comparing the similarity dynamically to focus on the key patterns of each exemplar; 3) learning from a supervision signal to impose explicit constraints on matching results. Extensive experiments on a recent CAC dataset FSC147 show that our models significantly outperform state-of-the-art CAC approaches. In addition, we also validate the cross-dataset generality of BMNet and BMNet+ on a car counting dataset CARPK. Code is at tiny.one/BMNet

CVJul 17, 2023Code
NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation

Yiran Wang, Min Shi, Jiaqi Li et al.

Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, which proves inefficient and lacks robustness. An alternative is learning to enforce consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also elaborate a large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Additionally, a bidirectional inference strategy is designed to improve consistency by adaptively fusing forward and backward predictions. We instantiate a model family ranging from small to large scales for different applications. The method is evaluated on VDW dataset and three public benchmarks. To further prove the versatility, we extend NVDS+ to video semantic segmentation and several downstream applications like bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation. Code and dataset are available at: https://github.com/RaymondWang987/NVDS

CVSep 29, 2023Code
When Epipolar Constraint Meets Non-local Operators in Multi-View Stereo

Tianqi Liu, Xinyi Ye, Weiyue Zhao et al.

Learning-based multi-view stereo (MVS) method heavily relies on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., Transformer. Albeit useful, these techniques introduce heavy computation overheads for MVS. Each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point only attends the corresponding pair of epipolar lines. Our idea takes inspiration from the classic epipolar geometry, which shows that one point with different depth hypotheses will be projected to the epipolar line on the other view. This constraint reduces the 2D search space into the epipolar line in stereo matching. Similarly, this suggests that the matching of MVS is to distinguish a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmark with high efficiency. Code is available at https://github.com/TQTQliu/ET-MVSNet.

CVJul 18, 2024Code
FADE: A Task-Agnostic Upsampling Operator for Encoder-Decoder Architectures

Hao Lu, Wenze Liu, Hongtao Fu et al.

The goal of this work is to develop a task-agnostic feature upsampling operator for dense prediction where the operator is required to facilitate not only region-sensitive tasks like semantic segmentation but also detail-sensitive tasks such as image matting. Prior upsampling operators often can work well in either type of the tasks, but not both. We argue that task-agnostic upsampling should dynamically trade off between semantic preservation and detail delineation, instead of having a bias between the two properties. In this paper, we present FADE, a novel, plug-and-play, lightweight, and task-agnostic upsampling operator by fusing the assets of decoder and encoder features at three levels: i) considering both the encoder and decoder feature in upsampling kernel generation; ii) controlling the per-point contribution of the encoder/decoder feature in upsampling kernels with an efficient semi-shift convolutional operator; and iii) enabling the selective pass of encoder features with a decoder-dependent gating mechanism for compensating details. To improve the practicality of FADE, we additionally study parameter- and memory-efficient implementations of semi-shift convolution. We analyze the upsampling behavior of FADE on toy data and show through large-scale experiments that FADE is task-agnostic with consistent performance improvement on a number of dense prediction tasks with little extra cost. For the first time, we demonstrate robust feature upsampling on both region- and detail-sensitive tasks successfully. Code is made available at: https://github.com/poppinace/fade

CVJun 25, 2022
BokehMe: When Neural Rendering Meets Classical Rendering

Juewen Peng, Zhiguo Cao, Xianrui Luo et al.

We propose BokehMe, a hybrid bokeh rendering framework that marries a neural renderer with a classical physically motivated renderer. Given a single image and a potentially imperfect disparity map, BokehMe generates high-resolution photo-realistic bokeh effects with adjustable blur size, focal plane, and aperture shape. To this end, we analyze the errors from the classical scattering-based method and derive a formulation to calculate an error map. Based on this formulation, we implement the classical renderer by a scattering-based method and propose a two-stage neural renderer to fix the erroneous areas from the classical renderer. The neural renderer employs a dynamic multi-scale scheme to efficiently handle arbitrary blur sizes, and it is trained to handle imperfect disparity input. Experiments show that our method compares favorably against previous methods on both synthetic image data and real image data with predicted disparity. A user study is further conducted to validate the advantage of our method.

CVApr 7, 2023
A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB Image

Changlong Jiang, Yang Xiao, Cunlin Wu et al.

3D interacting hand pose estimation from a single RGB image is a challenging task, due to serious self-occlusion and inter-occlusion towards hands, confusing similar appearance patterns between 2 hands, ill-posed joint position mapping from 2D to 3D, etc.. To address these, we propose to extend A2J-the state-of-the-art depth-based 3D single hand pose estimation method-to RGB domain under interacting hand condition. Our key idea is to equip A2J with strong local-global aware ability to well capture interacting hands' local fine details and global articulated clues among joints jointly. To this end, A2J is evolved under Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds 3 main advantages over A2J. First, self-attention across local anchor points is built to make them global spatial context aware to better capture joints' articulation clues for resisting occlusion. Secondly, each anchor point is regarded as learnable query with adaptive feature learning for facilitating pattern fitting capacity, instead of having the same local representation with the others. Last but not least, anchor point locates in 3D space instead of 2D as in A2J, to leverage 3D pose prediction. Experiments on challenging InterHand 2.6M demonstrate that, A2J-Transformer can achieve state-of-the-art model-free performance (3.38mm MPJPE advancement in 2-hand case) and can also be applied to depth domain with strong generalization.

CVAug 3, 2024Code
iControl3D: An Interactive System for Controllable 3D Scene Generation

Xingyi Li, Yizheng Wu, Jun Cen et al.

3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system. The code will be made available at https://github.com/xingyi-li/iControl3D.

CVAug 20, 2024Code
Training Matting Models without Alpha Labels

Wenze Liu, Zixuan Ye, Hao Lu et al.

The labelling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps coarsely indicating the foreground/background as supervision. We present that the cooperation between learned semantics from indicated known regions and proper assumed matting rules can help infer alpha values at transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) at each pixel neighborhood to constrain the alpha values conditioned on the input image. DDC loss forces the distance of similar pairs on the alpha matte and on its corresponding image to be consistent. In this way, the alpha values can be propagated from learned known regions to unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a known loss and the proposed DDC loss. Experiments on AM-2K and P3M-10K dataset show that our paradigm achieves comparable performance with the fine-label-supervised baseline, while sometimes offers even more satisfying results than human-labelled ground truth. Code is available at \url{https://github.com/poppuppy/alpha-free-matting}.

CVJul 18, 2022
MPIB: An MPI-Based Bokeh Rendering Framework for Realistic Partial Occlusion Effects

Juewen Peng, Jianming Zhang, Xianrui Luo et al.

Partial occlusion effects are a phenomenon that blurry objects near a camera are semi-transparent, resulting in partial appearance of occluded background. However, it is challenging for existing bokeh rendering methods to simulate realistic partial occlusion effects due to the missing information of the occluded area in an all-in-focus image. Inspired by the learnable 3D scene representation, Multiplane Image (MPI), we attempt to address the partial occlusion by introducing a novel MPI-based high-resolution bokeh rendering framework, termed MPIB. To this end, we first present an analysis on how to apply the MPI representation to bokeh rendering. Based on this analysis, we propose an MPI representation module combined with a background inpainting module to implement high-resolution scene representation. This representation can then be reused to render various bokeh effects according to the controlling parameters. To train and test our model, we also design a ray-tracing-based bokeh generator for data generation. Extensive experiments on synthesized and real-world images validate the effectiveness and flexibility of this framework.

CVMar 28, 2023
Real-time Multi-person Eyeblink Detection in the Wild for Untrimmed Video

Wenzheng Zeng, Yang Xiao, Sicheng Wei et al.

Real-time eyeblink detection in the wild can widely serve for fatigue detection, face anti-spoofing, emotion analysis, etc. The existing research efforts generally focus on single-person cases towards trimmed video. However, multi-person scenario within untrimmed videos is also important for practical applications, which has not been well concerned yet. To address this, we shed light on this research field for the first time with essential contributions on dataset, theory, and practices. In particular, a large-scale dataset termed MPEblink that involves 686 untrimmed videos with 8748 eyeblink events is proposed under multi-person conditions. The samples are captured from unconstrained films to reveal "in the wild" characteristics. Meanwhile, a real-time multi-person eyeblink detection method is also proposed. Being different from the existing counterparts, our proposition runs in a one-stage spatio-temporal way with end-to-end learning capacity. Specifically, it simultaneously addresses the sub-tasks of face detection, face tracking, and human instance-level eyeblink detection. This paradigm holds 2 main advantages: (1) eyeblink features can be facilitated via the face's global context (e.g., head pose and illumination condition) with joint optimization and interaction, and (2) addressing these sub-tasks in parallel instead of sequential manner can save time remarkably to meet the real-time running requirement. Experiments on MPEblink verify the essential challenges of real-time multi-person eyeblink detection in the wild for untrimmed video. Our method also outperforms existing approaches by large margins and with a high inference speed.

CVOct 27, 2023Code
End-to-end Video Gaze Estimation via Capturing Head-face-eye Spatial-temporal Interaction Context

Yiran Guan, Zhuoguang Chen, Wenzheng Zeng et al.

In this letter, we propose a new method, Multi-Clue Gaze (MCGaze), to facilitate video gaze estimation via capturing spatial-temporal interaction context among head, face, and eye in an end-to-end learning way, which has not been well concerned yet. The main advantage of MCGaze is that the tasks of clue localization of head, face, and eye can be solved jointly for gaze estimation in a one-step way, with joint optimization to seek optimal performance. During this, spatial-temporal context exchange happens among the clues on the head, face, and eye. Accordingly, the final gazes obtained by fusing features from various queries can be aware of global clues from heads and faces, and local clues from eyes simultaneously, which essentially leverages performance. Meanwhile, the one-step running way also ensures high running efficiency. Experiments on the challenging Gaze360 dataset verify the superiority of our proposition. The source code will be released at https://github.com/zgchen33/MCGaze.

CVJul 24, 2023
Fast Full-frame Video Stabilization with Iterative Optimization

Weiyue Zhao, Xin Li, Zhan Peng et al.

Video stabilization refers to the problem of transforming a shaky video into a visually pleasing one. The question of how to strike a good trade-off between visual quality and computational speed has remained one of the open challenges in video stabilization. Inspired by the analogy between wobbly frames and jigsaw puzzles, we propose an iterative optimization-based learning approach using synthetic datasets for video stabilization, which consists of two interacting submodules: motion trajectory smoothing and full-frame outpainting. First, we develop a two-level (coarse-to-fine) stabilizing algorithm based on the probabilistic flow field. The confidence map associated with the estimated optical flow is exploited to guide the search for shared regions through backpropagation. Second, we take a divide-and-conquer approach and propose a novel multiframe fusion strategy to render full-frame stabilized views. An important new insight brought about by our iterative optimization approach is that the target video can be interpreted as the fixed point of nonlinear mapping for video stabilization. We formulate video stabilization as a problem of minimizing the amount of jerkiness in motion trajectories, which guarantees convergence with the help of fixed-point theory. Extensive experimental results are reported to demonstrate the superiority of the proposed approach in terms of computational speed and visual quality. The code will be available on GitHub.

CVMar 28, 2023
Learning Second-Order Attentive Context for Efficient Correspondence Pruning

Xinyi Ye, Weiyue Zhao, Hao Lu et al.

Correspondence pruning aims to search consistent correspondences (inliers) from a set of putative correspondences. It is challenging because of the disorganized spatial distribution of numerous outliers, especially when putative correspondences are largely dominated by outliers. It's more challenging to ensure effectiveness while maintaining efficiency. In this paper, we propose an effective and efficient method for correspondence pruning. Inspired by the success of attentive context in correspondence problems, we first extend the attentive context to the first-order attentive context and then introduce the idea of attention in attention (ANA) to model second-order attentive context for correspondence pruning. Compared with first-order attention that focuses on feature-consistent context, second-order attention dedicates to attention weights itself and provides an additional source to encode consistent context from the attention map. For efficiency, we derive two approximate formulations for the naive implementation of second-order attention to optimize the cubic complexity to linear complexity, such that second-order attention can be used with negligible computational overheads. We further implement our formulations in a second-order context layer and then incorporate the layer in an ANA block. Extensive experiments demonstrate that our method is effective and efficient in pruning outliers, especially in high-outlier-ratio cases. Compared with the state-of-the-art correspondence pruning approach LMCNet, our method runs 14 times faster while maintaining a competitive accuracy.

CVFeb 17, 2023
Find Beauty in the Rare: Contrastive Composition Feature Clustering for Nontrivial Cropping Box Regression

Zhiyu Pan, Yinpeng Chen, Jiale Zhang et al.

Automatic image cropping algorithms aim to recompose images like human-being photographers by generating the cropping boxes with improved composition quality. Cropping box regression approaches learn the beauty of composition from annotated cropping boxes. However, the bias of annotations leads to quasi-trivial recomposing results, which has an obvious tendency to the average location of training samples. The crux of this predicament is that the task is naively treated as a box regression problem, where rare samples might be dominated by normal samples, and the composition patterns of rare samples are not well exploited. Observing that similar composition patterns tend to be shared by the cropping boundaries annotated nearly, we argue to find the beauty of composition from the rare samples by clustering the samples with similar cropping boundary annotations, ie, similar composition patterns. We propose a novel Contrastive Composition Clustering (C2C) to regularize the composition features by contrasting dynamically established similar and dissimilar pairs. In this way, common composition patterns of multiple images can be better summarized, which especially benefits the rare samples and endows our model with better generalizability to render nontrivial results. Extensive experimental results show the superiority of our model compared with prior arts. We also illustrate the philosophy of our design with an interesting analytical visualization.

CVMar 10, 2023
3D Cinemagraphy from a Single Image

Xingyi Li, Zhiguo Cao, Huiqiang Sun et al.

We present 3D Cinemagraphy, a new technique that marries 2D image animation with 3D photography. Given a single still image as input, our goal is to generate a video that contains both visual content animation and camera motion. We empirically find that naively combining existing 2D image animation and 3D photography methods leads to obvious artifacts or inconsistent animation. Our key insight is that representing and animating the scene in 3D space offers a natural solution to this task. To this end, we first convert the input image into feature-based layered depth images using predicted depth values, followed by unprojecting them to a feature point cloud. To animate the scene, we perform motion estimation and lift the 2D motion into the 3D scene flow. Finally, to resolve the problem of hole emergence as points move forward, we propose to bidirectionally displace the point cloud as per the scene flow and synthesize novel views by separately projecting them into target image planes and blending the results. Extensive experiments demonstrate the effectiveness of our method. A user study is also conducted to validate the compelling rendering results of our method.

CVJun 7, 2023
Learning Probabilistic Coordinate Fields for Robust Correspondences

Weiyue Zhao, Hao Lu, Xinyi Ye et al.

We introduce Probabilistic Coordinate Fields (PCFs), a novel geometric-invariant coordinate representation for image correspondence problems. In contrast to standard Cartesian coordinates, PCFs encode coordinates in correspondence-specific barycentric coordinate systems (BCS) with affine invariance. To know \textit{when and where to trust} the encoded coordinates, we implement PCFs in a probabilistic network termed PCF-Net, which parameterizes the distribution of coordinate fields as Gaussian mixture models. By jointly optimizing coordinate fields and their confidence conditioned on dense flows, PCF-Net can work with various feature descriptors when quantifying the reliability of PCFs by confidence maps. An interesting observation of this work is that the learned confidence map converges to geometrically coherent and semantically consistent regions, which facilitates robust coordinate representation. By delivering the confident coordinates to keypoint/feature descriptors, we show that PCF-Net can be used as a plug-in to existing correspondence-dependent approaches. Extensive experiments on both indoor and outdoor datasets suggest that accurate geometric invariant coordinates help to achieve the state of the art in several correspondence problems, such as sparse feature matching, dense image registration, camera pose estimation, and consistency filtering. Further, the interpretable confidence map predicted by PCF-Net can also be leveraged to other novel applications from texture transfer to multi-homography classification.

CVJun 5, 2023
A2B: Anchor to Barycentric Coordinate for Robust Correspondence

Weiyue Zhao, Hao Lu, Zhiguo Cao et al.

There is a long-standing problem of repeated patterns in correspondence problems, where mismatches frequently occur because of inherent ambiguity. The unique position information associated with repeated patterns makes coordinate representations a useful supplement to appearance representations for improving feature correspondences. However, the issue of appropriate coordinate representation has remained unresolved. In this study, we demonstrate that geometric-invariant coordinate representations, such as barycentric coordinates, can significantly reduce mismatches between features. The first step is to establish a theoretical foundation for geometrically invariant coordinates. We present a seed matching and filtering network (SMFNet) that combines feature matching and consistency filtering with a coarse-to-fine matching strategy in order to acquire reliable sparse correspondences. We then introduce DEGREE, a novel anchor-to-barycentric (A2B) coordinate encoding approach, which generates multiple affine-invariant correspondence coordinates from paired images. DEGREE can be used as a plug-in with standard descriptors, feature matchers, and consistency filters to improve the matching quality. Extensive experiments in synthesized indoor and outdoor datasets demonstrate that DEGREE alleviates the problem of repeated patterns and helps achieve state-of-the-art performance. Furthermore, DEGREE also reports competitive performance in the third Image Matching Challenge at CVPR 2021. This approach offers a new perspective to alleviate the problem of repeated patterns and emphasizes the importance of choosing coordinate representations for feature correspondences.

CVJul 18, 2024Code
Exposure Completing for Temporally Consistent Neural High Dynamic Range Video Rendering

Jiahao Cui, Wei Jiang, Zhan Peng et al.

High dynamic range (HDR) video rendering from low dynamic range (LDR) videos where frames are of alternate exposure encounters significant challenges, due to the exposure change and absence at each time stamp. The exposure change and absence make existing methods generate flickering HDR results. In this paper, we propose a novel paradigm to render HDR frames via completing the absent exposure information, hence the exposure information is complete and consistent. Our approach involves interpolating neighbor LDR frames in the time dimension to reconstruct LDR frames for the absent exposures. Combining the interpolated and given LDR frames, the complete set of exposure information is available at each time stamp. This benefits the fusing process for HDR results, reducing noise and ghosting artifacts therefore improving temporal consistency. Extensive experimental evaluations on standard benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting the importance of absent exposure completing in HDR video rendering. The code is available at https://github.com/cuijiahao666/NECHDR.

CVAug 20, 2023
Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Liao Shen, Xingyi Li, Huiqiang Sun et al.

We study the problem of synthesizing a long-term dynamic video from only a single image. This is challenging since it requires consistent visual content movements given large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. To address these issues, it is essential to estimate the underlying 4D (including 3D geometry and scene motion) and fill in the occluded regions. To this end, we present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image. On the one hand, we utilize layered depth images (LDIs) to represent a scene, and they are then unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced based on the scene flow derived from motion estimation and the corresponding camera pose. Such 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in the occluded regions by using a pretrained diffusion model to inpaint and outpaint the input image. This enables our method to work under large camera motions. Benefiting from our design, our method can be training-free which saves a significant amount of training time. Experimental results demonstrate the effectiveness of our approach, which showcases compelling rendering results.

CVApr 11, 2023
Point-and-Shoot All-in-Focus Photo Synthesis from Smartphone Camera Pair

Xianrui Luo, Juewen Peng, Weiyue Zhao et al.

All-in-Focus (AIF) photography is expected to be a commercial selling point for modern smartphones. Standard AIF synthesis requires manual, time-consuming operations such as focal stack compositing, which is unfriendly to ordinary people. To achieve point-and-shoot AIF photography with a smartphone, we expect that an AIF photo can be generated from one shot of the scene, instead of from multiple photos captured by the same camera. Benefiting from the multi-camera module in modern smartphones, we introduce a new task of AIF synthesis from main (wide) and ultra-wide cameras. The goal is to recover sharp details from defocused regions in the main-camera photo with the help of the ultra-wide-camera one. The camera setting poses new challenges such as parallax-induced occlusions and inconsistent color between cameras. To overcome the challenges, we introduce a predict-and-refine network to mitigate occlusions and propose dynamic frequency-domain alignment for color correction. To enable effective training and evaluation, we also build an AIF dataset with 2686 unique scenes. Each scene includes two photos captured by the main camera, one photo captured by the ultrawide camera, and a synthesized AIF photo. Results show that our solution, termed EasyAIF, can produce high-quality AIF photos and outperforms strong baselines quantitatively and qualitatively. For the first time, we demonstrate point-and-shoot AIF photo synthesis successfully from main and ultra-wide cameras.

CVJun 7, 2023
Defocus to focus: Photo-realistic bokeh rendering by fusing defocus and radiance priors

Xianrui Luo, Juewen Peng, Ke Xian et al.

We consider the problem of realistic bokeh rendering from a single all-in-focus image. Bokeh rendering mimics aesthetic shallow depth-of-field (DoF) in professional photography, but these visual effects generated by existing methods suffer from simple flat background blur and blurred in-focus regions, giving rise to unrealistic rendered results. In this work, we argue that realistic bokeh rendering should (i) model depth relations and distinguish in-focus regions, (ii) sustain sharp in-focus regions, and (iii) render physically accurate Circle of Confusion (CoC). To this end, we present a Defocus to Focus (D2F) framework to learn realistic bokeh rendering by fusing defocus priors with the all-in-focus image and by implementing radiance priors in layered fusion. Since no depth map is provided, we introduce defocus hallucination to integrate depth by learning to focus. The predicted defocus map implies the blur amount of bokeh and is used to guide weighted layered rendering. In layered rendering, we fuse images blurred by different kernels based on the defocus map. To increase the reality of the bokeh, we adopt radiance virtualization to simulate scene radiance. The scene radiance used in weighted layered rendering reassigns weights in the soft disk kernel to produce the CoC. To ensure the sharpness of in-focus regions, we propose to fuse upsampled bokeh images and original images. We predict the initial fusion mask from our defocus map and refine the mask with a deep network. We evaluate our model on a large-scale bokeh dataset. Extensive experiments show that our approach is capable of rendering visually pleasing bokeh effects in complex scenes. In particular, our solution receives the runner-up award in the AIM 2020 Rendering Realistic Bokeh Challenge.

CVAug 4, 2023
Diffusion-Augmented Depth Prediction with Sparse Annotations

Jiaqi Li, Yiran Wang, Zihao Huang et al.

Depth estimation aims to predict dense depth maps. In autonomous driving scenes, sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information. They overfit to valid pixels and fail to restore spatial structures. Self-supervised methods are proposed for the problem. Their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion model to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.

CVMar 3Code
Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Zhiyu Pan, Yizheng Wu, Jiashen Hua et al.

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge:~visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.

CVSep 26, 2024
Self-Distilled Depth Refinement with Noisy Poisson Fusion

Jiaqi Li, Yiran Wang, Jinghong Zheng et al.

Depth refinement aims to infer high-resolution depth with fine-grained edges and details, refining low-resolution results of depth estimation models. The prevailing methods adopt tile-based manners by merging numerous patches, which lacks efficiency and produces inconsistency. Besides, prior arts suffer from fuzzy depth boundaries and limited generalizability. Analyzing the fundamental reasons for these limitations, we model depth refinement as a noisy Poisson fusion problem with local inconsistency and edge deformation noises. We propose the Self-distilled Depth Refinement (SDDR) framework to enforce robustness against the noises, which mainly consists of depth edge representation and edge-based guidance. With noisy depth predictions as input, SDDR generates low-noise depth edge representations as pseudo-labels by coarse-to-fine self-distillation. Edge-based guidance with edge-guided gradient loss and edge-based fusion loss serves as the optimization objective equivalent to Poisson fusion. When depth maps are better refined, the labels also become more noise-free. Our model can acquire strong robustness to the noises, achieving significant improvements in accuracy, edge quality, efficiency, and generalizability on five different benchmarks. Moreover, directly training another model with edge labels produced by SDDR brings improvements, suggesting that our method could help with training robust refinement models in future works.

CVJul 2, 2024
WildAvatar: Learning In-the-wild 3D Avatars from the Web

Zihao Huang, Shoukang Hu, Guangcong Wang et al.

Existing research on avatar creation is typically limited to laboratory datasets, which require high costs against scalability and exhibit insufficient representation of the real world. On the other hand, the web abounds with off-the-shelf real-world human videos, but these videos vary in quality and require accurate annotations for avatar creation. To this end, we propose an automatic annotating pipeline with filtering protocols to curate these humans from the web. Our pipeline surpasses state-of-the-art methods on the EMDB benchmark, and the filtering protocols boost verification metrics on web videos. We then curate WildAvatar, a web-scale in-the-wild human avatar creation dataset extracted from YouTube, with $10000+$ different human subjects and scenes. WildAvatar is at least $10\times$ richer than previous datasets for 3D human avatar creation and closer to the real world. To explore its potential, we demonstrate the quality and generalizability of avatar creation methods on WildAvatar. We will publicly release our code, data source links and annotations to push forward 3D human avatar creation and other related fields for real-world applications.

CVJul 18, 2023
Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells

Xinyi Ye, Weiyue Zhao, Tianqi Liu et al.

Learning-based multi-view stereo (MVS) methods deal with predicting accurate depth maps to achieve an accurate and complete 3D representation. Despite the excellent performance, existing methods ignore the fact that a suitable depth geometry is also critical in MVS. In this paper, we demonstrate that different depth geometries have significant performance gaps, even using the same depth prediction error. Therefore, we introduce an ideal depth geometry composed of Saddle-Shaped Cells, whose predicted depth map oscillates upward and downward around the ground-truth surface, rather than maintaining a continuous and smooth depth plane. To achieve it, we develop a coarse-to-fine framework called Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane. Technically, we predict two depth values for each pixel (Dual-Depth), and propose a novel loss function and a checkerboard-shaped selecting strategy to constrain the predicted depth geometry. Compared to existing methods,DMVSNet achieves a high rank on the DTU benchmark and obtains the top performance on challenging scenes of Tanks and Temples, demonstrating its strong performance and generalization ability. Our method also points to a new research direction for considering depth geometry in MVS.

CVSep 15, 2024
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Liao Shen, Tianqi Liu, Huiqiang Sun et al.

We study the problem of generating intermediate images from image pairs with large motion while maintaining semantic consistency. Due to the large motion, the intermediate semantic information may be absent in input images. Existing methods either limit to small motion or focus on topologically similar objects, leading to artifacts and inconsistency in the interpolation results. To overcome this challenge, we delve into pre-trained image diffusion models for their capabilities in semantic cognition and representations, ensuring consistent expression of the absent intermediate semantic representations with the input. To this end, we propose DreamMover, a novel image interpolation framework with three main components: 1) A natural flow estimator based on the diffusion model that can implicitly reason about the semantic correspondence between two images. 2) To avoid the loss of detailed information during fusion, our key insight is to fuse information in two parts, high-level space and low-level space. 3) To enhance the consistency between the generated images and input, we propose the self-attention concatenation and replacement approach. Lastly, we present a challenging benchmark dataset InterpBench to evaluate the semantic consistency of generated results. Extensive experiments demonstrate the effectiveness of our method. Our project is available at https://dreamm0ver.github.io .

CVJul 8, 2024
Dynamic Neural Radiance Field From Defocused Monocular Video

Xianrui Luo, Huiqiang Sun, Juewen Peng et al.

Dynamic Neural Radiance Field (NeRF) from monocular videos has recently been explored for space-time novel view synthesis and achieved excellent results. However, defocus blur caused by depth variation often occurs in video capture, compromising the quality of dynamic reconstruction because the lack of sharp details interferes with modeling temporal consistency between input views. To tackle this issue, we propose D2RF, the first dynamic NeRF method designed to restore sharp novel views from defocused monocular videos. We introduce layered Depth-of-Field (DoF) volume rendering to model the defocus blur and reconstruct a sharp NeRF supervised by defocused views. The blur model is inspired by the connection between DoF rendering and volume rendering. The opacity in volume rendering aligns with the layer visibility in DoF rendering. To execute the blurring, we modify the layered blur kernel to the ray-based kernel and employ an optimized sparse kernel to gather the input rays efficiently and render the optimized rays with our layered DoF volume rendering. We synthesize a dataset with defocused dynamic scenes for our task, and extensive experiments on our dataset show that our method outperforms existing approaches in synthesizing all-in-focus novel views from defocus blur while maintaining spatial-temporal consistency in the scene.

CVMar 4
ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Zihao Huang, Tianqi Liu, Zhaoxi Chen et al.

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.

CVAug 12, 2024
Towards Robust Monocular Depth Estimation in Non-Lambertian Surfaces

Junrui Zhang, Jiaqi Li, Yachuan Huang et al.

In the field of monocular depth estimation (MDE), many models with excellent zero-shot performance in general scenes emerge recently. However, these methods often fail in predicting non-Lambertian surfaces, such as transparent or mirror (ToM) surfaces, due to the unique reflective properties of these regions. Previous methods utilize externally provided ToM masks and aim to obtain correct depth maps through direct in-painting of RGB images. These methods highly depend on the accuracy of additional input masks, and the use of random colors during in-painting makes them insufficiently robust. We are committed to incrementally enabling the baseline model to directly learn the uniqueness of non-Lambertian surface regions for depth estimation through a well-designed training framework. Therefore, we propose non-Lambertian surface regional guidance, which constrains the predictions of MDE model from the gradient domain to enhance its robustness. Noting the significant impact of lighting on this task, we employ the random tone-mapping augmentation during training to ensure the network can predict correct results for varying lighting inputs. Additionally, we propose an optional novel lighting fusion module, which uses Variational Autoencoders to fuse multiple images and obtain the most advantageous input RGB image for depth estimation when multi-exposure images are available. Our method achieves accuracy improvements of 33.39% and 5.21% in zero-shot testing on the Booster and Mirror3D dataset for non-Lambertian surfaces, respectively, compared to the Depth Anything V2. The state-of-the-art performance of 90.75 in delta1.05 within the ToM regions on the TRICKY2024 competition test set demonstrates the effectiveness of our approach.

CVDec 4, 2025
Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Tianqi Liu, Zhaoxi Chen, Zihao Huang et al.

Recent advances in illumination control extend image-based methods to video, yet still facing a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

CVDec 13, 2023Code
Semi-Supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix

Kewei Wang, Yizheng Wu, Zhiyu Pan et al.

Class-agnostic motion prediction methods aim to comprehend motion within open-world scenarios, holding significance for autonomous driving systems. However, training a high-performance model in a fully-supervised manner always requires substantial amounts of manually annotated data, which can be both expensive and time-consuming to obtain. To address this challenge, our study explores the potential of semi-supervised learning (SSL) for class-agnostic motion prediction. Our SSL framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data by generating pseudo labels through test-time inference. To improve the quality of pseudo labels, we propose a novel motion selection and re-generation module. This module effectively selects reliable pseudo labels and re-generates unreliable ones. Furthermore, we propose two data augmentation strategies: temporal sampling and BEVMix. These strategies facilitate consistency regularization in SSL. Experiments conducted on nuScenes demonstrate that our SSL method can surpass the self-supervised approach by a large margin by utilizing only a tiny fraction of labeled data. Furthermore, our method exhibits comparable performance to weakly and some fully supervised methods. These results highlight the ability of our method to strike a favorable balance between annotation costs and performance. Code will be available at https://github.com/kwwcv/SSMP.

CVApr 26, 2024Code
Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields

Tianqi Liu, Xinyi Ye, Min Shi et al.

Generalizable NeRF aims to synthesize novel views for unseen scenes. Common practices involve constructing variance-based cost volumes for geometry reconstruction and encoding 3D descriptors for decoding novel views. However, existing methods show limited generalization ability in challenging conditions due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. We address these issues point by point. First, we find the variance-based cost volume exhibits failure patterns as the features of pixels corresponding to the same point can be inconsistent across different views due to occlusions or reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to amplify the contribution of consistent pixel pairs and suppress inconsistent ones. Unlike previous methods that solely fuse 2D features into descriptors, our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D context into descriptors through spatial and inter-view interaction. When decoding the descriptors, we observe the two existing decoding strategies excel in different areas, which are complementary. A Consistency-Aware Fusion (CAF) strategy is proposed to leverage the advantages of both. We incorporate the above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains state-of-the-art performance across multiple datasets. Code is available at https://github.com/TQTQliu/GeFu .

CVMar 23, 2024Code
In-Context Matting

He Guo, Zixuan Ye, Zhiguo Cao et al.

We introduce in-context matting, a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points, scribbles, and masks, in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category, without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting, which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching, we introduce IconMatting, an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching, IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task, we also introduce a novel testing dataset ICM-$57$, covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at https://github.com/tiny-smart/in-context-matting

CVMar 16, 2025Code
Exploring Contextual Attribute Density in Referring Expression Counting

Zhicheng Wang, Zhiyu Pan, Zhan Peng et al.

Referring expression counting (REC) algorithms are for more flexible and interactive counting ability across varied fine-grained text expressions. However, the requirement for fine-grained attribute understanding poses challenges for prior arts, as they struggle to accurately align attribute information with correct visual patterns. Given the proven importance of ''visual density'', it is presumed that the limitations of current REC approaches stem from an under-exploration of ''contextual attribute density'' (CAD). In the scope of REC, we define CAD as the measure of the information intensity of one certain fine-grained attribute in visual regions. To model the CAD, we propose a U-shape CAD estimator in which referring expression and multi-scale visual features from GroundingDINO can interact with each other. With additional density supervision, we can effectively encode CAD, which is subsequently decoded via a novel attention procedure with CAD-refined queries. Integrating all these contributions, our framework significantly outperforms state-of-the-art REC methods, achieves $30\%$ error reduction in counting metrics and a $10\%$ improvement in localization accuracy. The surprising results shed light on the significance of contextual attribute density for REC. Code will be at github.com/Xu3XiWang/CAD-GD.