Shreyank Jyoti

CV · Dec 11, 2020

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti et al.

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time (60 fps). ViNet does not use audio as input, yet it outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of the ViNet architecture that incorporates audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and produces the same output irrespective of the audio it receives. Interestingly, we also observe similar behaviour in the previous state-of-the-art models (Tsiami et al., 2020) for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
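For readers unfamiliar with the CC and SIM metrics named in the abstract, the following is a minimal sketch of how they are commonly defined for saliency maps: CC as the Pearson correlation between predicted and ground-truth maps, and SIM as the histogram intersection of the two maps after each is normalized to sum to 1. The function names `cc` and `sim` and the list-of-lists map representation are assumptions made for illustration; this is not the authors' evaluation code, and published benchmarks may apply extra preprocessing (such as min-max scaling) that is omitted here.

```python
import math


def _flatten(saliency_map):
    """Flatten a 2D map (list of rows) into a flat list of floats."""
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Pearson correlation coefficient between two saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_g = math.sqrt(sum((b - mean_g) ** 2 for b in g))
    if norm_p == 0 or norm_g == 0:
        # A constant map has no defined correlation; return 0 by convention.
        return 0.0
    return cov / (norm_p * norm_g)


def sim(pred, gt):
    """Histogram intersection of two non-negative maps, each scaled to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    total_p, total_g = sum(p), sum(g)
    if total_p == 0 or total_g == 0:
        # An all-zero map cannot be normalized; return 0 by convention.
        return 0.0
    return sum(min(a / total_p, b / total_g) for a, b in zip(p, g))


if __name__ == "__main__":
    predicted = [[0, 1], [2, 3]]
    ground_truth = [[0, 1], [2, 3]]
    print("CC :", cc(predicted, ground_truth))
    print("SIM:", sim(predicted, ground_truth))
```

Under these definitions, identical maps score close to 1 on both metrics, while maps with no overlapping mass score 0 on SIM.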

CV · Jun 13, 2018

Expression Empowered ResiDen Network for Facial Action Unit Detection

Shreyank Jyoti, Abhinav Dhall

The paper explores the topic of Facial Action Unit (FAU) detection in the wild. In particular, we are interested in answering the following questions: (1) how useful are residual connections across dense blocks for face analysis? (2) how useful is the information from a network trained for categorical Facial Expression Recognition (FER) for the task of FAU detection? The proposed network (ResiDen) exploits dense blocks along with residual connections and uses auxiliary information from a FER network. The experiments are performed on the EmotionNet and DISFA datasets and show the usefulness of facial expression information for AU detection. The proposed network achieves state-of-the-art results on both databases. Analysis of the results under the cross-database protocol shows the effectiveness of the network.
