CV · Apr 2, 2022
IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning
Zhenhuan Liu, Jincan Deng, Liang Li et al. · nvidia, pku
Conditional image generation is an active research topic that includes text-to-image synthesis and image translation. Recently, image manipulation with linguistic instruction has brought new challenges for multimodal conditional generation. However, traditional conditional image generation models mainly focus on generating high-quality, visually realistic images and do not resolve the partial consistency between image and instruction. To address this issue, we propose an Increment Reasoning Generative Adversarial Network (IR-GAN), which aims to reason about the consistency between the visual increment in images and the semantic increment in instructions. First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment. Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image serves as an auxiliary reference. Finally, we propose a reasoning discriminator to measure the consistency between the visual increment and the semantic increment, which purifies the user's intention and keeps the generated target image logically sound. Extensive experiments and visualizations on two datasets show the effectiveness of IR-GAN.
CV · Apr 2, 2022
Unsupervised Coherent Video Cartoonization with Perceptual Motion Consistency
Zhenhuan Liu, Liang Li, Huajie Jiang et al. · nvidia
In recent years, creative content generation tasks such as style transfer and neural photo editing have attracted increasing attention. Among these, cartoonization of real-world scenes has promising applications in entertainment and industry. Unlike image translation, which focuses on improving the style effect of generated images, video cartoonization has additional requirements on temporal consistency. In this paper, we propose a spatially-adaptive semantic alignment framework with perceptual motion consistency for coherent video cartoonization in an unsupervised manner. The semantic alignment module is designed to restore the deformation of semantic structure caused by the spatial information lost in the encoder-decoder architecture. Furthermore, we devise the spatio-temporal correlative map as a style-independent, global-aware regularization on perceptual motion consistency. Derived from a similarity measurement of high-level features in photo and cartoon frames, it captures global semantic information beyond the raw pixel values used in optical flow. In addition, the similarity measurement disentangles temporal relationships from domain-specific style properties, which helps regularize temporal consistency without hurting the style effects of cartoon images. Qualitative and quantitative experiments demonstrate that our method is able to generate highly stylized and temporally consistent cartoon videos.
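The idea of a correlative map built from feature similarity can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the absolute-difference loss between the photo and cartoon maps are all assumptions made for illustration. Each frame is reduced to a short list of per-location feature vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def correlative_map(feats_t, feats_next):
    """Similarity of every location in frame t to every location in frame t+1."""
    return [[cosine(u, v) for v in feats_next] for u in feats_t]

def consistency_loss(map_photo, map_cartoon):
    """Mean absolute difference between two correlative maps (assumed loss form)."""
    diffs = [abs(p - c)
             for row_p, row_c in zip(map_photo, map_cartoon)
             for p, c in zip(row_p, row_c)]
    return sum(diffs) / len(diffs)

# Hypothetical features: two locations whose content swaps places between frames.
photo_t, photo_next = [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
# The "cartoon" features differ only by a scale factor, standing in for a style change.
cartoon_t, cartoon_next = [[3.0, 0.0], [0.0, 3.0]], [[0.0, 3.0], [3.0, 0.0]]

map_photo = correlative_map(photo_t, photo_next)
map_cartoon = correlative_map(cartoon_t, cartoon_next)
loss = consistency_loss(map_photo, map_cartoon)
```

Because cosine similarity ignores feature magnitude, the two maps coincide in this toy case even though the raw features differ, which is one simple way a similarity-based map can be insensitive to some style-dependent changes while still encoding how content moves between frames.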
CV · Sep 8, 2024
CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes
Zhenhuan Liu, Shuai Liu, Zhiwei Ning et al.
Novel view synthesis (NVS) in dynamic scenes faces persistent challenges in memory consumption, model complexity, training efficiency, and rendering quality. Offline methods offer high fidelity but suffer from high memory usage and limited scalability, while online approaches often trade quality for speed and compactness. We propose Continual Dynamic Neural Graphics Primitives (CD-NGP), a continual learning framework that reduces memory overhead and enhances scalability through parameter reuse. To avoid feature interference in dynamic scenes and improve rendering quality, our method combines spatial and temporal hash encodings, which compactly represent scene structures and motion patterns. We also introduce a new dataset comprising multi-view, long-duration ($>1200$ frames) videos with both rigid and non-rigid motion, which is not found in existing benchmarks. CD-NGP is evaluated on public datasets and our long video dataset, demonstrating superior scalability and reconstruction quality. It significantly reduces training memory usage to <14GB and requires only 0.4MB/frame in streaming bandwidth on DyNeRF -- substantially lower than most online baselines.
CV · Dec 18, 2023
T-Code: Simple Temporal Latent Code for Efficient Dynamic View Synthesis
Zhenhuan Liu, Shuai Liu, Jie Yang et al.
Novel view synthesis for dynamic scenes is one of the spotlights in computer vision. The key to efficient dynamic view synthesis is finding a compact representation that stores information across time. Although existing methods achieve fast dynamic view synthesis through tensor decomposition or hash grid feature concatenation, their mixed representations ignore the structural difference between the time domain and the spatial domain, resulting in sub-optimal computation and storage cost. This paper presents T-Code, an efficient decoupled latent code for the time dimension only. The decomposed feature design allows modules to be customized for different scenarios, each with its own specialty, yielding the desired results at lower cost. Based on T-Code, we propose our highly compact hybrid neural graphics primitives (HybridNGP) for the multi-camera setting and deformation neural graphics primitives with T-Code (DNGP-T) for the monocular scenario. Experiments show that HybridNGP delivers high-fidelity results at top processing speed with much less storage consumption, while DNGP-T achieves state-of-the-art quality and high training speed for monocular reconstruction.
CV · Feb 23, 2025
Efficient 4D Gaussian Stream with Low Rank Adaptation
Zhenhuan Liu, Shuai Liu, Yidong Lu et al.
Recent methods have made significant progress in synthesizing novel views with long video sequences. This paper proposes a highly scalable method for dynamic novel view synthesis with continual learning. We leverage 3D Gaussians to represent the scene and a low-rank adaptation-based deformation model to capture the dynamic scene changes. Our method continuously reconstructs the dynamics with chunks of video frames and reduces the streaming bandwidth by $90\%$ while maintaining high rendering quality comparable to offline SOTA methods.
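A minimal sketch of why low-rank adaptation can cut the amount of data streamed per chunk, under stated assumptions: the abstract does not describe the deformation model's layers, so the 4x4 weight, the rank of 1, and the helper names below are hypothetical. The general low-rank adaptation idea is to keep a base weight fixed and send only two thin matrices whose product is the update.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(base, down, up, scale=1.0):
    """Return base + scale * (up @ down), the low-rank-adapted weight."""
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

def param_count(matrix):
    """Number of scalar entries in a matrix stored as a list of rows."""
    return sum(len(row) for row in matrix)

# Hypothetical sizes: a 4x4 base layer (identity here) adapted with rank 1.
base = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # sent once
up = [[1.0], [0.0], [0.0], [0.0]]   # 4 x 1, sent per chunk
down = [[0.0, 2.0, 0.0, 0.0]]       # 1 x 4, sent per chunk

adapted = lora_weight(base, down, up)
per_chunk = param_count(up) + param_count(down)  # entries streamed for each chunk
full = param_count(base)                         # entries if the full weight were resent
```

For a d x d layer and rank r, the per-chunk payload is 2dr entries instead of d squared, so the saving grows as r becomes small relative to d. The numbers in this toy example are not related to the bandwidth figure reported in the abstract.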
CV · Jun 24, 2024
FASTC: A Fast Attentional Framework for Semantic Traversability Classification Using Point Cloud
Yirui Chen, Pengjin Wei, Zhenhuan Liu et al.
Producing traversability maps and understanding the surroundings are crucial prerequisites for autonomous navigation. In this paper, we address the problem of traversability assessment using point clouds. We propose a novel pillar feature extraction module that uses PointNet to capture features from point clouds organized in vertical volumes, together with a 2D encoder-decoder structure that performs traversability classification in place of the widely used 3D convolutions. This lowers the computational cost while achieving even better performance. We then propose a new spatio-temporal attention module to fuse multi-frame information, which properly handles the varying density of LIDAR point clouds and allows our module to assess distant areas more accurately. Comprehensive experimental results on augmented Semantic KITTI and RELLIS-3D datasets show that our method achieves superior performance over existing approaches both quantitatively and qualitatively.
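The pillar idea (points grouped into vertical columns, then pooled into a 2D grid of features) can be sketched as follows. This is an illustrative simplification rather than the paper's module: the cell size, the function names, and the hand-written per-point features standing in for PointNet's learned per-point MLP are all assumptions.

```python
import math
from collections import defaultdict

def pillarize(points, cell_size):
    """Group (x, y, z) points into vertical columns keyed by their 2D grid cell."""
    pillars = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        pillars[key].append((x, y, z))
    return pillars

def point_features(point):
    """Stand-in for a learned per-point MLP: height and horizontal range."""
    x, y, z = point
    return (z, math.hypot(x, y))

def pillar_features(pillars):
    """Max-pool per-point features inside each pillar (order-invariant pooling)."""
    pooled = {}
    for key, pts in pillars.items():
        feats = [point_features(p) for p in pts]
        pooled[key] = tuple(max(channel) for channel in zip(*feats))
    return pooled

# Hypothetical toy scan: two points share cell (0, 0), the others fall in their own cells.
points = [(0.2, 0.3, 0.1), (0.4, 0.1, 1.5), (1.7, 0.2, 0.0), (-0.1, 0.9, 0.3)]
grid = pillar_features(pillarize(points, cell_size=1.0))
```

The result maps each occupied 2D cell to one feature vector, so it can be treated as a sparse 2D image and passed to 2D layers, which is the reason this layout avoids 3D convolutions.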