CVNov 2, 2023
VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation ModelsHong Chen, Xin Wang, Guanning Zeng et al.
Customized text-to-video generation aims to generate text-guided videos with user-given subjects, which has gained increasing attention. However, existing works are primarily limited to single-subject oriented text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, taking the power of the text-to-image model to generate diversified content. The video generator is further customized for multi-subjects, which leverages the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, to tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy to finetune the temporal modules so that we can generate videos with both customized subjects and motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects.
AISep 23, 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the UnificationXin Wang, Yuwei Zhou, Bin Huang et al.
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for multi-modal understanding; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of multi-modal generation. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions which may contribute to the ongoing advancement of multi-modal generative AI.
MLMay 16
CAST: Causal Anchored Simplex Transport for Distribution-Valued Time SeriesJiecheng Lu, Jieqi Di, Runhua Wu et al.
Many decision-facing stochastic systems are observed through aggregate distributions rather than scalar trajectories: queue occupancies, mobility shares, public-health mixtures, generation-source shares, ecological compositions, and air-quality severity profiles all live on the probability simplex and evolve over time. We study causal (online) forecasting for these distribution-valued time series and argue that the transition operator itself should be structured around the simplex. We introduce CAST (Causal Anchored Simplex Transport), a successor-local operator that (i) retrieves empirical successors from causal context, (ii) stabilizes them with a persistence anchor, and (iii) applies a bounded local stochastic transport on ordered supports; every stage preserves the simplex by construction. We identify a structural failure mode, latent transition-kernel aliasing, where similar observed distributions evolve differently under different contextual regimes, and prove that any forecaster depending only on an aliased summary incurs an irreducible weighted Jensen-Shannon excess-risk lower bound, while the CAST hypothesis class contains the regime-aware Bayes successor; for ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull. On eleven public and simulated benchmarks spanning ecology, energy, diet, mortality, employment, air quality, severe weather, mobility, and G/G/1, G_t/G/1 queue occupancy, CAST attains the best average rank on both one-step KL (1.27) and autoregressive rollout JSD (1.91), winning 8/11 sections on each metric against a broad statistical, compositional, recurrent, convolutional, and Transformer baseline set, and top-2 on all 11 sections for offline KL. Component ablations and a controlled synthetic aliasing experiment corroborate the theory.
CVMay 21, 2024
DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial ControlHong Chen, Xin Wang, Yipeng Zhang et al.
Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
CVDec 21, 2023
LLM4VG: Large Language Models Evaluation for Video GroundingWei Feng, Xin Wang, Hong Chen et al.
Recently, researchers have attempted to investigate the capability of LLMs in handling videos and proposed several video LLM models. However, the ability of LLMs to handle video grounding (VG), which is an important time-related video task requiring the model to precisely locate the start and end timestamps of temporal moments in videos that match the given textual queries, still remains unclear and unexplored in literature. To fill the gap, in this paper, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on our proposed LLM4VG, we design extensive experiments to examine two groups of video LLM models on video grounding: (i) the video LLMs trained on the text-video pairs (denoted as VidLLM), and (ii) the LLMs combined with pretrained visual description models such as the video/image captioning model. We propose prompt methods to integrate the instruction of VG and description from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, prompt designs, etc, as well. Our experimental evaluations lead to two conclusions: (i) the existing VidLLMs are still far away from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models, and (ii) the combination of LLMs and visual models shows preliminary abilities for video grounding with considerable potential for improvement by resorting to more reliable models and further guidance of prompt instructions.
CVFeb 21
Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video UnderstandingHoulun Chen, Xin Wang, Guangyao Li et al.
Long video understanding is challenging due to rich and complicated multimodal clues in long temporal range.Current methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form reasoning.However,the existing literature suffers from the fact that the text-only reasoning under fixed video context may exacerbate hallucinations since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long videos.To address this gap,we propose Video-TwG,a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm,enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary.Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning tracesIn detail,we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels,and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data,we propose TwG-GRPO algorithm which features the fine-grained grounding reward, self-confirmed pseudo reward and accuracy-gated mechanism.Finally,we propose to construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines.Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows our TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.
ROAug 21, 2025
Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and ReasoningYijun Liu, Yuwei Liu, Yuan Meng et al.
Vision-centric hierarchical embodied models have demonstrated strong potential. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through the spatial plan table. Then, we propose a flow-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP substantially outperforms state-of-the-art baselines, achieving over 33% improvement on Meta-World and over 25% improvement on iTHOR, demonstrating strong effectiveness across 23 embodied control tasks. We additionally evaluate SP in real-world robotic experiments to verify its practical viability. SP enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
CVMay 5, 2023
DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image GenerationHong Chen, Yipeng Zhang, Simin Wu et al.
Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design the novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability