CVJul 31, 2024Code
Tora: Trajectory-oriented Diffusion Transformer for Video GenerationZhenghao Zhang, Junchao Liao, Menghao Li et al.
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D motion compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that accurately follow designated trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate that Tora excels in achieving high motion fidelity compared to the foundational DiT model, while also accurately simulating the complex movements of the physical world. Code is made available at https://github.com/alibaba/Tora .
CVMar 3, 2022
NeW CRFs: Neural Window Fully-connected CRFs for Monocular Depth EstimationWeihao Yuan, Xiaodong Gu, Zuozhuo Dai et al.
Estimating the accurate depth from a single image is challenging since it is inherently ambiguous and ill-posed. While recent works design increasingly complicated and powerful networks to directly regress the depth map, we take the path of CRFs optimization. Due to the expensive computation, CRFs are usually performed between neighborhoods rather than the whole graph. To leverage the potential of fully-connected CRFs, we split the input into windows and perform the FC-CRFs optimization within each window, which reduces the computation complexity and makes FC-CRFs feasible. To better capture the relationships between nodes in the graph, we exploit the multi-head attention mechanism to compute a multi-head potential function, which is fed to the networks to output an optimized depth map. Then we build a bottom-up-top-down structure, where this neural window FC-CRFs module serves as the decoder, and a vision transformer serves as the encoder. The experiments demonstrate that our method significantly improves the performance across all metrics on both the KITTI and NYUv2 datasets, compared to previous methods. Furthermore, the proposed method can be directly applied to panorama images and outperforms all previous panorama methods on the MatterPort3D dataset. Project page: https://weihaosky.github.io/newcrfs.
CVDec 3, 2024Code
HunyuanVideo: A Systematic Framework For Large Video Generative ModelsWeijie Kong, Qi Tian, Zijian Zhang et al. · tencent-ai, tsinghua
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.
CVNov 21, 2023
AnimateAnything: Fine-Grained Open Domain Image Animation with Motion GuidanceZuozhuo Dai, Zhenghao Zhang, Yao Yao et al.
Image animation is a key task in computer vision which aims to generate dynamic visual content from static image. Recent image animation methods employ neural based rendering technique to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real environments. In this paper, we introduce an open domain image animation method that leverages the motion prior of video diffusion model. Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control the movable area and its motion speed. This results in enhanced alignment between the animated visual elements and the prompting text, thereby facilitating a fine-grained and interactive animation generation process for intricate motion sequences. We validate the effectiveness of our method through rigorous experiments on an open-domain dataset, with the results showcasing its superior performance. Project page can be found at https://animationai.github.io/AnimateAnything.
CVMay 23, 2022
RCP: Recurrent Closest Point for Scene Flow Estimation on 3D Point CloudsXiaodong Gu, Chengzhou Tang, Weihao Yuan et al.
3D motion estimation including scene flow and point cloud registration has drawn increasing interest. Inspired by 2D flow estimation, recent methods employ deep neural networks to construct the cost volume for estimating accurate 3D flow. However, these methods are limited by the fact that it is difficult to define a search window on point clouds because of the irregular data structure. In this paper, we avoid this irregularity by a simple yet effective method.We decompose the problem into two interlaced stages, where the 3D flows are optimized point-wisely at the first stage and then globally regularized in a recurrent network at the second stage. Therefore, the recurrent network only receives the regular point-wise information as the input. In the experiments, we evaluate the proposed method on both the 3D scene flow estimation and the point cloud registration task. For 3D scene flow estimation, we make comparisons on the widely used FlyingThings3D and KITTIdatasets. For point cloud registration, we follow previous works and evaluate the data pairs with large pose and partially overlapping from ModelNet40. The results show that our method outperforms the previous method and achieves a new state-of-the-art performance on both 3D scene flow estimation and point cloud registration, which demonstrates the superiority of the proposed zero-order method on irregular point cloud data.
CVJan 20, 2023
Towards Robust Video Instance Segmentation with Temporal-Aware TransformerZhenghao Zhang, Fangtao Shao, Zuozhuo Dai et al.
Most existing transformer based video instance segmentation methods extract per frame features independently, hence it is challenging to solve the appearance deformation problem. In this paper, we observe the temporal information is important as well and we propose TAFormer to aggregate spatio-temporal features both in transformer encoder and decoder. Specifically, in transformer encoder, we propose a novel spatio-temporal joint multi-scale deformable attention module which dynamically integrates the spatial and temporal information to obtain enriched spatio-temporal features. In transformer decoder, we introduce a temporal self-attention module to enhance the frame level box queries with the temporal relation. Moreover, TAFormer adopts an instance level contrastive loss to increase the discriminability of instance query embeddings. Therefore the tracking error caused by visually similar instances can be decreased. Experimental results show that TAFormer effectively leverages the spatial and temporal information to obtain context-aware feature representation and outperforms state-of-the-art methods.
CVJul 14, 2023
Fine-grained Text-Video Retrieval with Frozen Image EncodersZuozhuo Dai, Fangtao Shao, Qingkun Su et al.
State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information as they directly compute attention between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions. Additionally, we employ the frozen CLIP model strategy in fine-grained retrieval, enabling scalability to larger pre-trained vision models like ViT-G, resulting in improved retrieval performance. Experiments on text video retrieval datasets demonstrate the effectiveness and scalability of our proposed CrossTVR compared to state-of-the-art approaches.
CVMar 18, 2024Code
EffiVED:Efficient Video Editing via Text-instruction Diffusion ModelsZhenghao Zhang, Zuozhuo Dai, Long Qin et al.
Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality editing videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. Code can be found at https://github.com/alibaba/EffiVED.
CVMar 21, 2024
Champ: Controllable and Consistent Human Image Animation with 3D Parametric GuidanceShenhao Zhu, Junming Leo Chen, Zuozhuo Dai et al.
In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion guidance in curernt human generative techniques. The methodology utilizes the SMPL(Skinned Multi-Person Linear) model as the 3D human parametric model to establish a unified representation of body shape and pose. This facilitates the accurate capture of intricate human geometry and motion characteristics from source videos. Specifically, we incorporate rendered depth images, normal maps, and semantic maps obtained from SMPL sequences, alongside skeleton-based motion guidance, to enrich the conditions to the latent diffusion model with comprehensive 3D shape and detailed pose attributes. A multi-layer motion fusion module, integrating self-attention mechanisms, is employed to fuse the shape and motion latent representations in the spatial domain. By representing the 3D human parametric model as the motion guidance, we can perform parametric shape alignment of the human body between the reference image and the source video motion. Experimental evaluations conducted on benchmark datasets demonstrate the methodology's superior ability to generate high-quality human animations that accurately capture both pose and shape variations. Furthermore, our approach also exhibits superior generalization capabilities on the proposed in-the-wild dataset. Project page: https://fudan-generative-vision.github.io/champ.
CVDec 6, 2023
Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian ParticleYoutian Lin, Zuozhuo Dai, Siyu Zhu et al.
We introduce Gaussian-Flow, a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. In contrast to the prevalent NeRF-based approaches hampered by slow training and rendering speeds, our approach harnesses recent advancements in point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain Deformation Model (DDDM) is proposed to explicitly model attribute deformations of each Gaussian point, where the time-dependent residual of each attribute is captured by a polynomial fitting in the time domain, and a Fourier series fitting in the frequency domain. The proposed DDDM is capable of modeling complex scene deformations across long video footage, eliminating the need for training separate 3DGS for each frame or introducing an additional implicit neural field to model 3D dynamics. Moreover, the explicit deformation modeling for discretized Gaussian points ensures ultra-fast training and rendering of a 4D scene, which is comparable to the original 3DGS designed for static 3D reconstruction. Our proposed approach showcases a substantial efficiency improvement, achieving a $5\times$ faster training speed compared to the per-frame 3DGS modeling. In addition, quantitative results demonstrate that the proposed Gaussian-Flow significantly outperforms previous leading methods in novel view rendering quality. Project page: https://nju-3dv.github.io/projects/Gaussian-Flow
CVNov 24, 2025Code
HunyuanVideo 1.5 Technical ReportBing Wu, Chang Zou, Changlin Li et al.
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
CVMay 22, 2023Code
UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything ModelZhenghao Zhang, Shengfan Zhang, Zhichao Wei et al.
The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19\&21 datasets demonstrate the superior performance of UVOSAM without mask supervision compared to existing mask-supervised methods, as well as its ability to generalize to weakly-annotated video datasets. Code can be found at https://github.com/alibaba/UVOSAM.
CVMar 24, 2021Code
DRO: Deep Recurrent Optimizer for Video to DepthXiaodong Gu, Weihao Yuan, Zuozhuo Dai et al.
There are increasing interests of studying the video-to-depth (V2D) problem with machine learning techniques. While earlier methods directly learn a mapping from images to depth maps and camera poses, more recent works enforce multi-view geometry constraints through optimization embedded in the learning framework. This paper presents a novel optimization method based on recurrent neural networks to further exploit the potential of neural networks in V2D. Specifically, our neural optimizer alternately updates the depth and camera poses through iterations to minimize a feature-metric cost, and two gated recurrent units iteratively improve the results by tracing historical information. Extensive experimental results demonstrate that our method outperforms previous methods and is more efficient in computation and memory consumption than cost-volume-based methods. In particular, our self-supervised method outperforms previous supervised methods on the KITTI and ScanNet datasets. Our source code is available at https://github.com/aliyun/dro-sfm.
CVMar 22, 2021Code
Cluster Contrast for Unsupervised Person Re-IdentificationZuozhuo Dai, Guangyuan Wang, Weihao Yuan et al.
State-of-the-art unsupervised re-ID methods train the neural networks using a memory-based non-parametric softmax loss. Instance feature vectors stored in memory are assigned pseudo-labels by clustering and updated at instance level. However, the varying cluster sizes leads to inconsistency in the updating progress of each cluster. To solve this problem, we present Cluster Contrast which stores feature vectors and computes contrast loss at the cluster level. Our approach employs a unique cluster representation to describe each cluster, resulting in a cluster-level memory dictionary. In this way, the consistency of clustering can be effectively maintained throughout the pipline and the GPU memory consumption can be significantly reduced. Thus, our method can solve the problem of cluster inconsistency and be applicable to larger data sets. In addition, we adopt different clustering algorithms to demonstrate the robustness and generalization of our framework. The application of Cluster Contrast to a standard unsupervised re-ID pipeline achieves considerable improvements of 9.9%, 8.3%, 12.1% compared to state-of-the-art purely unsupervised re-ID methods and 5.5%, 4.8%, 4.4% mAP compared to the state-of-the-art unsupervised domain adaptation re-ID methods on the Market, Duke, and MSMT17 datasets. Code is available at https://github.com/alibaba/cluster-contrast.
CVOct 17, 2020Code
MeshMVS: Multi-View Stereo Guided Mesh ReconstructionRakesh Shrestha, Zhiwen Fan, Qingkun Su et al.
Deep learning based 3D shape generation methods generally utilize latent features extracted from color images to encode the semantics of objects and guide the shape generation process. These color image semantics only implicitly encode 3D information, potentially limiting the accuracy of the generated shapes. In this paper we propose a multi-view mesh generation method which incorporates geometry information explicitly by using the features from intermediate depth representations of multi-view stereo and regularizing the 3D shapes against these depth images. First, our system predicts a coarse 3D volume from the color images by probabilistically merging voxel occupancy grids from the prediction of individual views. Then the depth images from multi-view stereo along with the rendered depth images of the coarse shape are used as a contrastive input whose features guide the refinement of the coarse shape through a series of graph convolution networks. Notably, we achieve superior results than state-of-the-art multi-view shape generation methods with 34% decrease in Chamfer distance to ground truth and 14% increase in F1-score on ShapeNet dataset.Our source code is available at https://git.io/Jmalg
CVDec 13, 2019
Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo MatchingXiaodong Gu, Zhiwen Fan, Zuozhuo Dai et al.
The deep multi-view stereo (MVS) and stereo matching approaches generally construct 3D cost volumes to regularize and regress the output depth or disparity. These methods are limited when high-resolution outputs are needed since the memory and time costs grow cubically as the volume resolution increases. In this paper, we propose a both memory and time efficient cost volume formulation that is complementary to existing multi-view stereo and stereo matching approaches based on 3D cost volumes. First, the proposed cost volume is built upon a standard feature pyramid encoding geometry and context at gradually finer scales. Then, we can narrow the depth (or disparity) range of each stage by the depth (or disparity) map from the previous stage. With gradually higher cost volume resolution and adaptive adjustment of depth (or disparity) intervals, the output is recovered in a coarser to fine manner. We apply the cascade cost volume to the representative MVS-Net, and obtain a 23.1% improvement on DTU benchmark (1st place), with 50.6% and 74.2% reduction in GPU memory and run-time. It is also the state-of-the-art learning-based method on Tanks and Temples benchmark. The statistics of accuracy, run-time and GPU memory on other representative stereo CNNs also validate the effectiveness of our proposed method.
CVNov 17, 2018
Batch DropBlock Network for Person Re-identification and BeyondZuozhuo Dai, Mingqiang Chen, Xiaodong Gu et al.
Since the person re-identification task often suffers from the problem of pose changes and occlusions, some attentive local features are often suppressed when training CNNs. In this paper, we propose the Batch DropBlock (BDB) Network which is a two branch network composed of a conventional ResNet-50 as the global branch and a feature dropping branch. The global branch encodes the global salient representations. Meanwhile, the feature dropping branch consists of an attentive feature learning module called Batch DropBlock, which randomly drops the same region of all input feature maps in a batch to reinforce the attentive feature learning of local regions. The network then concatenates features from both branches and provides a more comprehensive and spatially distributed feature representation. Albeit simple, our method achieves state-of-the-art on person re-identification and it is also applicable to general metric learning tasks. For instance, we achieve 76.4% Rank-1 accuracy on the CUHK03-Detect dataset and 83.0% Recall-1 score on the Stanford Online Products dataset, outperforming the existing works by a large margin (more than 6%).