CV | May 25
MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model
Eyal Hanania, Nadav Kirsch, Daniel Arkushin et al.
Detecting laughter in video is essential for affective computing and narrative understanding, yet existing approaches treat it as coarse clip-level classification and fail to capture the precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, fully annotated temporal laughter datasets extending two widely used humor benchmarks. Our annotations cover over 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, along with rich metadata distinguishing speaker vs. audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly-supervised framework for temporal laughter localization. Our architecture combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations during training. Experiments across three datasets demonstrate that our approach substantially outperforms multimodal foundation models including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, our precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The code and the UR-FUNNY-Temporal and SMILE-Temporal datasets are publicly available at https://github.com/WSCSports/MTLLFM-temporal-laughter-localization.
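The weakly-supervised idea in this abstract (per-frame scores pooled into a clip-level prediction, then thresholded at inference to obtain segments) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature parameter, the scalar sigmoid gate, and the thresholding rule are all assumptions made for illustration, and the encoders are replaced by plain lists of scores.

```python
import math

def temporal_softmax_pool(frame_scores, temperature=1.0):
    """Pool per-frame scores into one clip-level score using softmax weights.

    Hypothetical helper: high-scoring frames dominate the clip score, so a
    clip-level loss can push up the frames that likely contain the event.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(frame_scores)
    exps = [math.exp((s - m) / temperature) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    clip_score = sum(w * s for w, s in zip(weights, frame_scores))
    return clip_score, weights

def gated_fusion(audio_score, visual_score, gate_logit):
    """Mix two modality scores with a sigmoid gate (assumed scalar gate)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return g * audio_score + (1.0 - g) * visual_score

def localize(frame_scores, threshold):
    """Return (start, end) frame index pairs whose scores exceed the threshold.

    The end index is exclusive.
    """
    segments, start = [], None
    for i, s in enumerate(frame_scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frame_scores)))
    return segments

# Toy usage: fuse per-frame audio and visual scores, pool, then localize.
audio = [0.0, 0.0, 5.0, 5.0, 0.0]
visual = [0.0, 0.0, 5.0, 5.0, 0.0]
fused = [gated_fusion(a, v, gate_logit=0.0) for a, v in zip(audio, visual)]
clip_score, weights = temporal_softmax_pool(fused)
segments = localize(fused, threshold=2.5)
```

In this toy setup only the clip-level score would receive supervision during training; the frame-level segments fall out of the same per-frame scores at inference time.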
CV | Nov 24, 2022
GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features
Daniel Arkushin, Bar Cohen, Shmuel Peleg et al.
In the Clothes-Changing Re-Identification (CC-ReID) problem, given a query sample of a person, the goal is to determine the correct identity based on a labeled gallery in which the person appears in different clothes. Several models tackle this challenge by extracting clothes-independent features. However, the performance of these models is still lower in the clothes-changing setting than in the same-clothes setting, in which the person appears in the same clothes in the labeled gallery. As clothing-related features are often dominant features in the data, we propose a new process, which we call Gallery Enrichment, to utilize these features. In this process, we enrich the original gallery by adding query samples to it based on their face features, using an unsupervised algorithm. Additionally, we show that combining ReID and face feature extraction modules alongside an enriched gallery results in a more accurate ReID model, even for query samples with new outfits that do not include faces. Moreover, we claim that existing CC-ReID benchmarks do not fully represent real-world scenarios, and propose a new video CC-ReID dataset called 42Street, based on a theater play that includes crowded scenes and numerous clothes changes. When applied to multiple ReID models, our method (GEFF) achieves an average improvement of 33.5% and 6.7% in the Top-1 clothes-changing metric on the PRCC and LTCC benchmarks, respectively. Combined with the latest ReID models, our method achieves new SOTA results on the PRCC, LTCC, CCVID, LaST and VC-Clothes benchmarks and the proposed 42Street dataset.
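The gallery enrichment step described above can be pictured with a small sketch. The abstract does not specify the unsupervised algorithm, so the code below swaps in a simple stand-in: each query with a detected face is pseudo-labeled with its nearest gallery identity by cosine similarity, and kept only if the similarity clears a threshold. The function names, the data layout, and the 0.8 threshold are assumptions for illustration, not the GEFF implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def enrich_gallery(gallery, queries, threshold=0.8):
    """Pseudo-label confident query samples so they can be added to the gallery.

    gallery: list of (identity, face_vector) pairs with known labels.
    queries: list of (query_id, face_vector or None); None means no face found.
    Returns a list of (query_id, identity) pairs for the confident matches.
    """
    added = []
    for query_id, face in queries:
        if face is None:
            continue  # no face features, so this sample cannot be matched here
        best_identity, best_sim = None, -1.0
        for identity, gallery_face in gallery:
            sim = cosine(face, gallery_face)
            if sim > best_sim:
                best_identity, best_sim = identity, sim
        if best_sim >= threshold:
            added.append((query_id, best_identity))
    return added

# Toy usage with 2-D "face features" (hypothetical values).
gallery = [("alice", [1.0, 0.0]), ("bob", [0.0, 1.0])]
queries = [("q1", [0.9, 0.1]), ("q2", None), ("q3", [0.7, 0.7])]
enriched = enrich_gallery(gallery, queries)
```

Here `q1` is close enough to `alice` to be added, `q2` has no face and is skipped, and `q3` is ambiguous between the two identities and falls below the threshold. The samples added this way carry the query's new outfit into the gallery, which is what lets clothing features help on later queries.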