LG | May 27
Random Process Flow Matching: Generative Implicit Representations of Multivariate Random Fields
Julien Lalanne, David Picard, Lionel Boillot et al.
Generative modeling provides a powerful framework for learning data distributions. These models initially relied on probabilistic methods such as Gaussian Processes (GP) for uncertainty-aware predictions and shifted towards larger trainable models to learn more complex distributions. In this work, we introduce Random Process (RP) Flow, a Flow Matching-based framework that represents the vector field as a neural implicit function. Unlike the setting of most modern generative methods, ours involves a single observed field, from which only sparse measurements are available. RP Flow uses Random Fourier Features to learn an implicit signal representation that can be queried at arbitrary locations from a limited set of observations, while encoding uncertainty through ensemble sampling. We propose constructing a Bayesian posterior by GP regression in the source space to generate high-quality samples. Our empirical results demonstrate that this framework generates realistic samples along with calibrated uncertainty estimates, even under challenging conditions such as high frequency, high sparsity, or high dimensionality. These findings position RP Flow as a milestone towards generative models for reconstruction tasks where data is scarce and uncertainty must remain traceable.
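The abstract does not spell out how the Random Fourier Features are built, so the following is only a minimal sketch of the general technique, not the authors' implementation. It maps a query location x to [cos(2πBx), sin(2πBx)] with a random Gaussian matrix B; the names `make_rff` and `encode`, the feature count, and the `scale` value are all assumptions made for illustration.

```python
import math
import random


def make_rff(in_dim, num_features, scale, seed=0):
    """Return a function that encodes a point with Random Fourier Features.

    All parameter names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # B has num_features rows and in_dim columns, entries drawn from N(0, scale^2).
    # A larger scale lets a downstream network fit higher-frequency signals.
    B = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
         for _ in range(num_features)]

    def encode(x):
        # Project the location onto each random frequency vector.
        proj = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x))
                for row in B]
        # Concatenate cosine and sine responses: output length is 2 * num_features.
        return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]

    return encode


# Encode one 2-D query location; a neural implicit function would take
# these features as input instead of the raw coordinates.
encode = make_rff(in_dim=2, num_features=16, scale=4.0)
features = encode([0.25, 0.75])
```

Because the encoding is defined for any real-valued coordinate, a model fed with these features can be queried at locations that were never observed, which is the property the abstract relies on.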
CV | Oct 1, 2023
LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts
Julien Lalanne, Raphael Bournet, Yi Yu
Live commenting on video, a popular feature of live streaming platforms, enables viewers to engage with the content and share their comments, reactions, opinions, or questions with the streamer or other viewers while watching the video or live stream. It presents a challenging testbed for AI agents, which involves the simultaneous understanding of audio-visual multimodal contexts from live streams and the ability to interact with human viewers through dialogue. As existing live-streaming comment datasets contain limited categories and lack diversity, we create a large-scale audio-visual multimodal dialogue dataset to facilitate the development of live commenting technologies. The data is collected from Twitch, with 11 different categories and 575 streamers for a total of 438 hours of video and 3.2 million comments. Moreover, we propose a novel multimodal generation model capable of generating live comments that align with the temporal and spatial events within the video, as well as with the ongoing multimodal dialogue context. Our initial results have demonstrated the effectiveness of the proposed model, providing a robust foundation for further research and practical applications in the field of live video interaction.
CV | Oct 30, 2025
Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
Anam Fatima, Yi Yu, Janak Kapuriya et al.
Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect: prioritizing the video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to the ongoing viewer conversation. It employs an efficient weighted sum over frames to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder, whose cross-attention mechanism attends to each modality, ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large-scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
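The weighted frame aggregation described above can be sketched roughly as follows. The abstract only states that frames are weighted by semantic relevance to the conversation and then summed, so the specific choices here (cosine similarity, a softmax with a `temperature` parameter, and the function names `cosine` and `aggregate_frames`) are assumptions for illustration, and the toy vectors merely stand in for CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def aggregate_frames(frame_embs, chat_emb, temperature=0.1):
    """Pool frame embeddings into one vector, weighted by relevance to the chat.

    The softmax-over-cosine weighting is an assumed scheme, not taken from the paper.
    """
    sims = [cosine(f, chat_emb) for f in frame_embs]
    # Subtract the max before exponentiating for numerical stability.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frames: relevant frames dominate the pooled vector.
    dim = len(chat_emb)
    pooled = [sum(w * f[i] for w, f in zip(weights, frame_embs))
              for i in range(dim)]
    return pooled, weights


# Toy 2-D embeddings: the first frame matches the chat, the second is unrelated.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
chat = [1.0, 0.0]
pooled, weights = aggregate_frames(frames, chat)
```

In this toy case the first frame receives the largest weight and the unrelated second frame the smallest, so the pooled vector leans towards the content the viewers are discussing.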