CV Nov 26, 2024
Video-Guided Foley Sound Generation with Multimodal Controls
Ziyang Chen, Prem Seetharaman, Bryan Russell et al.
Sound design for video often calls for artistic sound effects that diverge significantly from real-life sources, along with flexible control over the result. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/
CV Mar 21, 2025
Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
Bhishma Dedhia, David Bourgin, Krishna Kumar Singh et al.
Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.
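
The fixed-size encoding described above can be illustrated with a small sketch. This is not the VIN implementation: it assumes identity query/key/value projections, random token values, and hypothetical sizes (4 encoding tokens, 8 channels), and it only shows that a single cross-attention step produces the same number of encoded tokens no matter how many input tokens the video contributes.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """One cross-attention step: latent tokens are queries, input tokens
    are keys and values. Projections are the identity (an assumption made
    to keep the sketch short)."""
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    encoded = []
    for query in latents:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in inputs]
        weights = softmax(scores)
        encoded.append(
            [sum(w * value[c] for w, value in zip(weights, inputs)) for c in range(dim)]
        )
    return encoded


def random_tokens(count, dim, rng):
    """Hypothetical stand-in for real token embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
latents = random_tokens(4, 8, rng)        # fixed-size encoding tokens (hypothetical size)
short_video = random_tokens(16, 8, rng)   # few input tokens
long_video = random_tokens(256, 8, rng)   # many input tokens

enc_short = cross_attend(latents, short_video)
enc_long = cross_attend(latents, long_video)
print(len(enc_short), len(enc_long))
```

In this sketch the cost of the attention step grows linearly with the number of input tokens, while the encoded representation stays at four tokens for both the short and the long input.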
CV Mar 11, 2025
REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
Yitian Zhang, Long Mai, Aniruddha Mahapatra et al.
We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
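
To make the compression figures above concrete, the following back-of-the-envelope sketch compares latent sequence lengths at 32x temporal compression with the 4x implied for leading embedders (32x being 8x higher). The 128-frame clip length, the ceiling-division rule, and the quadratic cost model for full attention are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def latent_frames(num_frames, temporal_ratio):
    """Latent frames left after temporal compression.
    Ceiling division is an assumption; real embedders may treat the
    first frame or padding differently."""
    return math.ceil(num_frames / temporal_ratio)


num_frames = 128                     # hypothetical clip length
compact_ratio = 32                   # ratio reported in the abstract
baseline_ratio = compact_ratio // 8  # "8x higher than leading" implies 4x

baseline_len = latent_frames(num_frames, baseline_ratio)
compact_len = latent_frames(num_frames, compact_ratio)

# With the same spatial layout, the token count scales with latent frames.
token_reduction = baseline_len / compact_len

# Under full self-attention, pairwise interactions scale with the square
# of the token count (a simplified cost model).
attention_reduction = token_reduction ** 2

print(baseline_len, compact_len, token_reduction, attention_reduction)
```

Under these assumptions the 128-frame clip shrinks from 32 latent frames to 4, which is 8 times fewer tokens along the temporal axis and 64 times fewer pairwise attention interactions.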
CV Jan 9, 2025
Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
Aniruddha Mahapatra, Long Mai, David Bourgin et al.
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.
AI Apr 15, 2019
Predicting human decisions with behavioral theories and machine learning
Ori Plonsky, Reut Apel, Eyal Ert et al.
Predicting human decisions under risk and uncertainty remains a fundamental challenge across disciplines. Existing models often struggle even in highly stylized tasks like choice between lotteries. We introduce BEAST Gradient Boosting (BEAST-GB), a hybrid model integrating behavioral theory (BEAST) with machine learning. We first present CPC18, a competition for predicting risky choice, in which BEAST-GB won. Then, using two large datasets, we demonstrate BEAST-GB predicts more accurately than neural networks trained on extensive data and dozens of existing behavioral models. BEAST-GB also generalizes robustly across unseen experimental contexts, surpassing direct empirical generalization, and helps refine and improve the behavioral theory itself. Our analyses highlight the potential of anchoring predictions on behavioral theory even in data-rich settings and even when the theory alone falters. Our results underscore how integrating machine learning with theoretical frameworks, especially those, like BEAST, designed for prediction, can improve our ability to predict and understand human behavior.