CVFeb 24, 2025
Dimitra: Audio-driven Diffusion model for Expressive Talking Head GenerationBaptiste Chopin, Tashvik Dhamija, Pranav Balaji et al.
We propose Dimitra, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we train a conditional Motion Diffusion Transformer (cMDT) by modeling facial motion sequences with 3D representation. We condition the cMDT with only two input signals, an audio-sequence, as well as a reference facial image. By extracting additional features directly from audio, Dimitra is able to increase quality and realism of generated videos. In particular, phoneme sequences contribute to the realism of lip motion, whereas text transcript to facial expression and head pose realism. Quantitative and qualitative experiments on two widely employed datasets, VoxCeleb2 and HDTF, showcase that Dimitra is able to outperform existing approaches for generating realistic talking heads imparting lip motion, facial expression, and head pose.
CVNov 27, 2025
AI killed the video star. Audio-driven diffusion model for expressive talking head generationBaptiste Chopin, Tashvik Dhamija, Pranav Balaji et al.
We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.
CVNov 27, 2025
Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech RecognitionMaheswar Bora, Tashvik Dhamija, Shukesh Reddy et al.
Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to attribute - distinguish between generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.
CLApr 23, 2021
Comparative Analysis of Machine Learning and Deep Learning Algorithms for Detection of Online Hate SpeechTashvik Dhamija, Anjum, Rahul Katarya
In the day and age of social media, users have become prone to online hate speech. Several attempts have been made to classify hate speech using machine learning but the state-of-the-art models are not robust enough for practical applications. This is attributed to the use of primitive NLP feature engineering techniques. In this paper, we explored various feature engineering techniques ranging from different embeddings to conventional NLP algorithms. We also experimented with combinations of different features. From our experimentation, we realized that roBERTa (robustly optimized BERT approach) based sentence embeddings classified using decision trees gives the best results of 0.9998 F1 score. In our paper, we concluded that BERT based embeddings give the most useful features for this problem and have the capacity to be made into a practical robust model.