Abhinav Shukla

h-index: 4

Papers: 6

Citations: 104

Novelty: 57%

AI Score: 30

Ranked #139,757 of 194,257 authors (top 72%); #1,230 in HC (top 49%)

6 Papers

14.4 | RO | Oct 2, 2023 | Code

GRID: A Platform for General Robot Intelligence Development

Sai Vemprala, Shuhang Chen, Abhinav Shukla et al.

Developing machine intelligence abilities in robots and autonomous systems is an expensive and time-consuming process. Existing solutions are tailored to specific applications and are hard to generalize. Furthermore, scarcity of training data adds a layer of complexity in deploying deep machine learning models. We present a new platform for General Robot Intelligence Development (GRID) to address both of these issues. The platform enables robots to learn, compose and adapt skills to their physical capabilities, environmental constraints and goals. The platform addresses AI problems in robotics via foundation models that know the physical world. GRID is designed from the ground up to be extensible to accommodate new types of robots, vehicles, hardware platforms and software protocols. In addition, the modular design enables various deep ML components and existing foundation models to be easily usable in a wider variety of robot-centric problems. We demonstrate the platform in various aerial robotics scenarios and show how it dramatically accelerates the development of machine-intelligent robots.

8.6 | AS | Jul 8, 2020

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

Abhinav Shukla, Stavros Petridis, Maja Pantic

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.

7.3 | AS | Jan 13, 2020

Visually Guided Self Supervised Learning of Speech Representations

Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma et al.

Self-supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works focus on a particular modality or feature alone, and very limited work studies the interaction between the two modalities for learning self-supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel approach to self-supervised learning that has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of unlabelled audiovisual speech training data and have a large number of potentially promising applications.

5.4 | HC | Dec 8, 2018

Engagement Estimation in Advertisement Videos with EEG

Sangeetha Balasubramanian, Shruti Shriya Gullapuram, Abhinav Shukla

Engagement is a vital metric in the advertising industry and its automatic estimation has huge commercial implications. This work presents a simple framework for engagement estimation using EEG (electroencephalography) data recorded specifically while users watch advertisement videos, and is meant to be a first step in a promising line of research. The system combines recent advances in low-cost commercial Brain-Computer Interfaces with modeling of user engagement in response to advertisement videos. We achieve an F1 score of nearly 0.7 for a binary classification of high and low values of self-reported engagement from multiple users. This study illustrates the possibility of seamless engagement measurement in the wild when interacting with media using a non-invasive and readily available commercial EEG device. Performing engagement measurement via implicit tagging in this manner, with direct feedback from physiological signals and thus no additional human effort, demonstrates a novel and potentially commercially relevant application in the area of advertisement video analysis.

4.6 | CV | Aug 14, 2018

Looking Beyond a Clever Narrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements

Abhinav Shukla, Harish Katti, Mohan Kankanhalli et al.

Emotion evoked by an advertisement plays a key role in influencing brand recall and eventual consumer choices. Automatic ad affect recognition has several useful applications. However, the use of content-based feature representations does not give insights into how affect is modulated by aspects such as the ad scene setting, salient object attributes and their interactions. Neither do such approaches inform us on how humans prioritize visual information for ad understanding. Our work addresses these lacunae by decomposing video content into detected objects, coarse scene structure, object statistics and actively attended objects identified via eye-gaze. We measure the importance of each of these information channels by systematically incorporating related information into ad affect prediction models. Contrary to the popular notion that ad affect hinges on the narrative and the clever use of linguistic and social cues, we find that actively attended objects and the coarse scene structure better encode affective information as compared to individual scene objects or conspicuous background elements.

13.4 | HC | Sep 6, 2017

Evaluating Content-centric vs User-centric Ad Affect Recognition

Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti et al.

Despite the fact that advertisements (ads) often include strongly emotional content, very little work has been devoted to affect recognition (AR) from ads. This work explicitly compares content-centric and user-centric ad AR methodologies, and evaluates the impact of enhanced AR on computational advertising via a user study. Specifically, we (1) compile an affective ad dataset capable of evoking coherent emotions across users; (2) explore the efficacy of content-centric convolutional neural network (CNN) features for encoding emotions, and show that CNN features outperform low-level emotion descriptors; (3) examine user-centered ad AR by analyzing Electroencephalogram (EEG) responses acquired from eleven viewers, and find that EEG signals encode emotional information better than content descriptors; (4) investigate the relationship between objective AR and subjective viewer experience while watching an ad-embedded online video stream based on a study involving 12 users. To our knowledge, this is the first work to (a) expressly compare user vs content-centered AR for ads, and (b) study the relationship between modeling of ad emotions and its impact on a real-life advertising application.