CVJul 15, 2022Code
Learning Long-Term Spatial-Temporal Graphs for Active Speaker DetectionKyle Min, Sourya Roy, Subarna Tripathi et al.
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL
85.1CCMay 26
Low Soundness Linearity Testing on the Half-SliceHaakon Larsen, Tushant Mittal, Silas Richelson et al.
Let $f: T\to \{ 0,1 \}$ be a Boolean function on the Boolean half-slice, $T$, \ie elements of $\{0,1\}^n$ with Hamming weight $n/2$. We show that if $f(x)+f(y)=f(x+y)$ holds with probability $\frac{1+δ}{2}$ over a uniform pair $(x,y)$ such that $x,y,x+y\in T$, then $f$ agrees with some linear function on at least $\frac{1+δ}{2}-o(1)$ fraction of the points in $T$. More generally, we show that if $f$ passes the natural $k$-query BLR test with probability $\frac{1+δ}{2}$ for any $k\geq3$, then it must agree with some affine function at $\frac{1+δ^{\frac{1}{k-2}}}{2}-o(1)$ fraction of the points in $T$. The only other known linearity test for the slice in the low soundness regime (i.e., when $δ$ can be arbitrarily small) was given by Kalai, Lifshitz, Minzer, and Ziegler [FOCS'24]. Our result improves upon this result in two significant ways: firstly, it works for $k=3$ queries, instead of requiring $k\geq4$; secondly, our result is sharper, e.g., when $k=4$, we are able to conclude an agreement of $\frac{1+\sqrtδ}{2}-o(1)$ instead of $\frac{1+c\sqrtδ}{2}$ for $c\approx.0035$. In particular, our result matches (up to the $o(1)$ term) the conclusion one obtains over the full hypercube via the classical BLR analysis. Our main technical contribution is a new dense model theorem using bounds on Krawtchouk polynomials. Using these Krawtchouk polynomial bounds, we also obtain a simple $k$-query test ($k\geq 5$) that avoids any use of the dense model machinery. This simplified test naturally extends to the slice over the $q$-ary hypercube, giving the first such result over larger alphabets.
CVDec 2, 2021
Learning Spatial-Temporal Graphs for Active Speaker DetectionSourya Roy, Kyle Min, Subarna Tripathi et al.
We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the Ava-ActiveSpeaker dataset, we demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance. SPELL outperforms several relevant baselines and performs at par with state of the art models while requiring an order of magnitude lower computation cost.
CVAug 6, 2018
Incorporating Scalability in Unsupervised Spatio-Temporal Feature LearningSujoy Paul, Sourya Roy, Amit K. Roy-Chowdhury
Deep neural networks are efficient learning machines which leverage upon a large amount of manually labeled data for learning discriminative features. However, acquiring substantial amount of supervised data, especially for videos can be a tedious job across various computer vision tasks. This necessitates learning of visual features from videos in an unsupervised setting. In this paper, we propose a computationally simple, yet effective, framework to learn spatio-temporal feature embedding from unlabeled videos. We train a Convolutional 3D Siamese network using positive and negative pairs mined from videos under certain probabilistic assumptions. Experimental results on three datasets demonstrate that our proposed framework is able to learn weights which can be used for same as well as cross dataset and tasks.
CVJul 27, 2018
W-TALC: Weakly-supervised Temporal Activity Localization and ClassificationSujoy Paul, Sourya Roy, Amit K Roy-Chowdhury
Most activity localization methods in the literature suffer from the burden of frame-wise annotation requirement. Learning from weak labels may be a potential solution towards reducing such manual labeling effort. Recent years have witnessed a substantial influx of tagged videos on the Internet, which can serve as a rich source of weakly-supervised training data. Specifically, the correlations between videos with similar tags can be utilized to temporally localize the activities. Towards this goal, we present W-TALC, a Weakly-supervised Temporal Activity Localization and Classification framework using only video-level labels. The proposed network can be divided into two sub-networks, namely the Two-Stream based feature extractor network and a weakly-supervised module, which we learn by optimizing two complimentary loss functions. Qualitative and quantitative results on two challenging datasets - Thumos14 and ActivityNet1.2, demonstrate that the proposed method is able to detect activities at a fine granularity and achieve better performance than current state-of-the-art methods.
CVMay 3, 2016
WEPSAM: Weakly Pre-Learnt Saliency ModelAvisek Lahiri, Sourya Roy, Anirban Santara et al.
Visual saliency detection tries to mimic human vision psychology which concentrates on sparse, important areas in natural image. Saliency prediction research has been traditionally based on low level features such as contrast, edge, etc. Recent thrust in saliency prediction research is to learn high level semantics using ground truth eye fixation datasets. In this paper we present, WEPSAM : Weakly Pre-Learnt Saliency Model as a pioneering effort of using domain specific pre-learing on ImageNet for saliency prediction using a light weight CNN architecture. The paper proposes a two step hierarchical learning, in which the first step is to develop a framework for weakly pre-training on a large scale dataset such as ImageNet which is void of human eye fixation maps. The second step refines the pre-trained model on a limited set of ground truth fixations. Analysis of loss on iSUN and SALICON datasets reveal that pre-trained network converges much faster compared to randomly initialized network. WEPSAM also outperforms some recent state-of-the-art saliency prediction models on the challenging MIT300 dataset.
CVApr 17, 2016
Visual saliency detection: a Kalman filter based approachSourya Roy, Pabitra Mitra
In this paper we propose a Kalman filter aided saliency detection model which is based on the conjecture that salient regions are considerably different from our "visual expectation" or they are "visually surprising" in nature. In this work, we have structured our model with an immediate objective to predict saliency in static images. However, the proposed model can be easily extended for space-time saliency prediction. Our approach was evaluated using two publicly available benchmark data sets and results have been compared with other existing saliency models. The results clearly illustrate the superior performance of the proposed model over other approaches.