StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing
This work addresses efficient real-time monitoring for live streaming platforms, though the contribution is incremental because it builds on existing VLM and routing techniques.
The paper tackles real-time social signal detection on live streaming platforms by proposing StreamSense, a system that pairs a lightweight encoder with selective routing to a Vision-Language Model and reports higher accuracy than VLM-only methods at lower latency and compute.
Live streaming platforms require real-time monitoring of and reaction to social signals, using partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard or ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term that aligns visual and audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while invoking the VLM only occasionally, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
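The accept/escalate/defer behaviour and the IoU-weighted loss described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the routing criteria or the exact form of the loss, so everything here is an assumption made for illustration: the function names, the confidence threshold `accept_threshold`, the use of a frame count `min_context` as a proxy for "insufficient context", and the choice of multiplying a negative log-likelihood by the temporal IoU. It is not the authors' implementation.

```python
import math
from typing import Sequence, Tuple

# Hypothetical decision labels; the paper does not name them.
ACCEPT, ESCALATE, DEFER = "accept", "escalate", "defer"


def route(probs: Sequence[float], context_frames: int,
          accept_threshold: float = 0.8, min_context: int = 4) -> str:
    """Decide what to do with one timestamp.

    Assumed rule (not from the paper):
      - too little observed context      -> defer the decision
      - encoder is confident             -> accept the encoder's prediction
      - otherwise (hard/ambiguous case)  -> escalate to the VLM expert
    """
    if context_frames < min_context:
        return DEFER
    if max(probs) >= accept_threshold:
        return ACCEPT
    return ESCALATE


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def iou_weighted_nll(probs: Sequence[float], label: int,
                     window: Tuple[float, float],
                     target: Tuple[float, float]) -> float:
    """Negative log-likelihood scaled by window/target overlap.

    Windows that barely overlap the labelled segment contribute little,
    which is one plausible way to down-weight poorly overlapping targets.
    """
    return -temporal_iou(window, target) * math.log(probs[label])


if __name__ == "__main__":
    print(route([0.9, 0.1], context_frames=10))    # confident encoder
    print(route([0.55, 0.45], context_frames=10))  # ambiguous case
    print(route([0.9, 0.1], context_frames=2))     # too little context
    print(iou_weighted_nll([0.5, 0.5], 0, (0.0, 2.0), (1.0, 3.0)))
```

Under this rule the expensive VLM is called only for timestamps that have enough context yet leave the encoder uncertain, which is how occasional escalation could lower average latency and compute.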