CVNov 12, 2022Code
MARLIN: Masked Autoencoder for facial video Representation LearnINgZhixi Cai, Shreya Ghosh, Kalin Stefanov et al.
This paper proposes a self-supervised approach to learn universal facial representations from videos, that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder, that learns highly robust and generic facial embeddings from abundantly available non-annotated web crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions which mainly include eyes, nose, mouth, lips, and skin to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in low data regime. Our code and models are available at https://github.com/ControlNet/MARLIN .
CVNov 26, 2023Code
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake DatasetZhixi Cai, Shreya Ghosh, Aman Pankaj Adatia et al.
The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M .
CVJul 4, 2024
7th ABAW Competition: Multi-Task Learning and Compound Expression RecognitionDimitrios Kollias, Stefanos Zafeiriou, Irene Kotsia et al.
This paper describes the 7th Affective Behavior Analysis in-the-wild (ABAW) Competition, which is part of the respective Workshop held in conjunction with ECCV 2024. The 7th ABAW Competition addresses novel challenges in understanding human expressions and behaviors, crucial for the development of human-centered technologies. The Competition comprises of two sub-challenges: i) Multi-Task Learning (the goal is to learn at the same time, in a multi-task learning setting, to estimate two continuous affect dimensions, valence and arousal, to recognise between the mutually exclusive classes of the 7 basic expressions and 'other'), and to detect 12 Action Units); and ii) Compound Expression Recognition (the target is to recognise between the 7 mutually exclusive compound expression classes). s-Aff-Wild2, which is a static version of the A/V Aff-Wild2 database and contains annotations for valence-arousal, expressions and Action Units, is utilized for the purposes of the Multi-Task Learning Challenge; a part of C-EXPR-DB, which is an A/V in-the-wild database with compound expression annotations, is utilized for the purposes of the Compound Expression Recognition Challenge. In this paper, we introduce the two challenges, detailing their datasets and the protocols followed for each. We also outline the evaluation metrics, and highlight the baseline systems and their results. Additional information about the competition can be found at \url{https://affective-behavior-analysis-in-the-wild.github.io/7th}.
CVSep 11, 2024Code
1M-Deepfakes Detection ChallengeZhixi Cai, Abhinav Dhall, Shreya Ghosh et al.
The detection and localization of deepfake content, particularly when small fake segments are seamlessly mixed with real videos, remains a significant challenge in the field of digital media security. Based on the recently released AV-Deepfake1M dataset, which contains more than 1 million manipulated videos across more than 2,000 subjects, we introduce the 1M-Deepfakes Detection Challenge. This challenge is designed to engage the research community in developing advanced methods for detecting and localizing deepfake manipulations within the large-scale high-realistic audio-visual dataset. The participants can access the AV-Deepfake1M dataset and are required to submit their inference results for evaluation across the metrics for detection or localization tasks. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection and localization systems. Evaluation scripts, baseline models, and accompanying code will be available on https://github.com/ControlNet/AV-Deepfake1M.
CVSep 10, 2024Code
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context UnderstandingSurbhi Madan, Shreya Ghosh, Lownish Rai Sookha et al.
Estimating the Most Important Person (MIP) in any social event setup is a challenging problem mainly due to contextual complexity and scarcity of labeled data. Moreover, the causality aspects of MIP estimation are quite subjective and diverse. To this end, we aim to address the problem by annotating a large-scale `in-the-wild' dataset for identifying human perceptions about the `Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy, and a thorough data quality analysis. Further, we perform a comprehensive benchmarking of the proposed dataset utilizing state-of-the-art MIP localization methods, indicating a significant drop in performance compared to existing datasets. The performance drop shows that the existing MIP localization algorithms must be more robust with respect to `in-the-wild' situations. We believe the proposed dataset will play a vital role in building the next-generation social situation understanding methods. The code and data is available at https://github.com/surbhimadan92/MIP-GAF.
CLJun 24, 2023
Spatio-temporal Storytelling? Leveraging Generative Models for Semantic Trajectory AnalysisShreya Ghosh, Saptarshi Sengupta, Prasenjit Mitra
In this paper, we lay out a vision for analysing semantic trajectory traces and generating synthetic semantic trajectory data (SSTs) using generative language model. Leveraging the advancements in deep learning, as evident by progress in the field of natural language processing (NLP), computer vision, etc. we intend to create intelligent models that can study the semantic trajectories in various contexts, predicting future trends, increasing machine understanding of the movement of animals, humans, goods, etc. enhancing human-computer interactions, and contributing to an array of applications ranging from urban-planning to personalized recommendation engines and business strategy.
CLSep 27, 2024
Exploring Language Model Generalization in Low-Resource Extractive QASaptarshi Sengupta, Wenpeng Yin, Preslav Nakov et al.
In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize to domains that require specific knowledge such as medicine and law in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to explain the performance gap empirically. Our findings suggest that: (a) LLMs struggle with dataset demands of closed domains such as retrieving long answer spans; (b) Certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements as discriminating between domain-specific senses of words which we link to pre-processing decisions; (c) Scaling model parameters is not always effective for cross domain generalization; and (d) Closed-domain datasets are quantitatively much different than open-domain EQA datasets and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.
CVJul 7, 2022
AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic FacesShreya Ghosh, Abhinav Dhall, Munawar Hayat et al.
In challenging real-life conditions such as extreme head-pose, occlusions, and low-resolution images where the visual information fails to estimate visual attention/gaze direction, audio signals could provide important and complementary information. In this paper, we explore if audio-guided coarse head-pose can further enhance visual attention estimation performance for non-prolific faces. Since it is difficult to annotate audio signals for estimating the head-pose of the speaker, we use off-the-shelf state-of-the-art models to facilitate cross-modal weak-supervision. During the training phase, the framework learns complementary information from synchronized audio-visual modality. Our model can utilize any of the available modalities i.e. audio, visual or audio-visual for task-specific inference. It is interesting to note that, when AV-Gaze is tested on benchmark datasets with these specific modalities, it achieves competitive results on multiple datasets, while being highly adaptive toward challenging scenarios.
CLOct 25, 2023
TOP-Training: Target-Oriented Pretraining for Medical Extractive Question AnsweringSaptarshi Sengupta, Connor Heaton, Shreya Ghosh et al.
We study extractive question-answering in the medical domain (Medical-EQA). This problem has two main challenges: (i) domain specificity, as most AI models lack necessary domain knowledge, and (ii) extraction-based answering style, which restricts most autoregressive LLMs due to potential hallucinations. To handle those challenges, we propose TOP-Training, a target-oriented pre-training paradigm that stands out among all domain adaptation techniques with two desirable features: (i) TOP-Training moves one step further than popular domain-oriented fine-tuning since it not only moves closer to the target domain, but also familiarizes itself with the target dataset, and (ii) it does not assume the existence of a large set of unlabeled instances from the target domain. Specifically, for a target Medical-EQA dataset, we extract its entities and leverage large language models (LLMs) to generate synthetic texts containing those entities; we then demonstrate that pretraining on this synthetic text data yields better performance on the target Medical-EQA benchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new pretraining technique to effectively adapt LLMs to better solve a target problem, (ii) TOP-Training has a wide application scope because it does not require the target problem to have a large set of unlabeled data, and (iii) our experiments highlight the limitations of autoregressive LLMs, emphasizing TOP-Training as a means to unlock the true potential of bidirectional LLMs.
CVAug 4, 2022
RAZE: Region Guided Self-Supervised Gaze Representation LearningNeeru Dubey, Shreya Ghosh, Abhinav Dhall
Automatic eye gaze estimation is an important problem in vision based assistive technology with use cases in different emerging topics such as augmented reality, virtual reality and human-computer interaction. Over the past few years, there has been an increasing interest in unsupervised and self-supervised learning paradigms as it overcomes the requirement of large scale annotated data. In this paper, we propose RAZE, a Region guided self-supervised gAZE representation learning framework which leverage from non-annotated facial image data. RAZE learns gaze representation via auxiliary supervision i.e. pseudo-gaze zone classification where the objective is to classify visual field into different gaze zones (i.e. left, right and center) by leveraging the relative position of pupil-centers. Thus, we automatically annotate pseudo gaze zone labels of 154K web-crawled images and learn feature representations via `Ize-Net' framework. `Ize-Net' is a capsule layer based CNN architecture which can efficiently capture rich eye representation. The discriminative behaviour of the feature representation is evaluated on four benchmark datasets: CAVE, TabletGaze, MPII and RT-GENE. Additionally, we evaluate the generalizability of the proposed network on two other downstream task (i.e. driver gaze estimation and visual attention estimation) which demonstrate the effectiveness of the learnt eye gaze representation.
HCJun 5, 2023
Lumos in the Night Sky: AI-enabled Visual Tool for Exploring Night-Time Light PatternsJakob Hederich, Shreya Ghosh, Zeyu He et al.
We introduce NightPulse, an interactive tool for Night-time light (NTL) data visualization and analytics, which enables researchers and stakeholders to explore and analyze NTL data with a user-friendly platform. Powered by efficient system architecture, NightPulse supports image segmentation, clustering, and change pattern detection to identify urban development and sprawl patterns. It captures temporal trends of NTL and semantics of cities, answering questions about demographic factors, city boundaries, and unusual differences.
CVAug 3, 2022
'Labelling the Gaps': A Weakly Supervised Automatic Eye Gaze EstimationShreya Ghosh, Abhinav Dhall, Jarrod Knibbe et al.
Over the past few years, there has been an increasing interest to interpret gaze direction in an unconstrained environment with limited supervision. Owing to data curation and annotation issues, replicating gaze estimation method to other platforms, such as unconstrained outdoor or AR/VR, might lead to significant drop in performance due to insufficient availability of accurately annotated data for model training. In this paper, we explore an interesting yet challenging problem of gaze estimation method with a limited amount of labelled data. The proposed method distills knowledge from the labelled subset with visual features; including identity-specific appearance, gaze trajectory consistency and motion features. Given a gaze trajectory, the method utilizes label information of only the start and the end frames of a gaze sequence. An extension of the proposed method further reduces the requirement of labelled frames to only the start frame with a minor drop in the generated label's quality. We evaluate the proposed method on four benchmark datasets (CAVE, TabletGaze, MPII and Gaze360) as well as web-crawled YouTube videos. Our proposed method reduces the annotation effort to as low as 2.67%, with minimal impact on performance; indicating the potential of our model enabling gaze estimation 'in-the-wild' setup.
PEJul 21, 2023
Analysis of Elephant Movement in Sub-Saharan Africa: Ecological, Climatic, and Conservation PerspectivesMatthew Hines, Gregory Glatzer, Shreya Ghosh et al.
The interaction between elephants and their environment has profound implications for both ecology and conservation strategies. This study presents an analytical approach to decipher the intricate patterns of elephant movement in Sub-Saharan Africa, concentrating on key ecological drivers such as seasonal variations and rainfall patterns. Despite the complexities surrounding these influential factors, our analysis provides a holistic view of elephant migratory behavior in the context of the dynamic African landscape. Our comprehensive approach enables us to predict the potential impact of these ecological determinants on elephant migration, a critical step in establishing informed conservation strategies. This projection is particularly crucial given the impacts of global climate change on seasonal and rainfall patterns, which could substantially influence elephant movements in the future. The findings of our work aim to not only advance the understanding of movement ecology but also foster a sustainable coexistence of humans and elephants in Sub-Saharan Africa. By predicting potential elephant routes, our work can inform strategies to minimize human-elephant conflict, effectively manage land use, and enhance anti-poaching efforts. This research underscores the importance of integrating movement ecology and climatic variables for effective wildlife management and conservation planning.
HCOct 30, 2023
Empathy Detection from Text, Audiovisual, Audio or Physiological Signals: A Systematic Review of Task Formulations and Machine Learning MethodsMd Rakibul Hasan, Md Zakir Hossain, Shreya Ghosh et al.
Empathy indicates an individual's ability to understand others. Over the past few years, empathy has drawn attention from various disciplines, including but not limited to Affective Computing, Cognitive Science, and Psychology. Detecting empathy has potential applications in society, healthcare and education. Despite being a broad and overlapping topic, the avenue of empathy detection leveraging Machine Learning remains underexplored from a systematic literature review perspective. We collected 849 papers from 10 well-known academic databases, systematically screened them and analysed the final 82 papers. Our analyses reveal several prominent task formulations - including empathy on localised utterances or overall expressions, unidirectional or parallel empathy, and emotional contagion - in monadic, dyadic and group interactions. Empathy detection methods are summarised based on four input modalities - text, audiovisual, audio and physiological signals - thereby presenting modality-specific network architecture design protocols. We discuss challenges, research gaps and potential applications in the Affective Computing-based empathy domain, which can facilitate new avenues of exploration. We further enlist the public availability of datasets and codes. This paper, therefore, provides a structured overview of recent advancements and remaining challenges towards developing a robust empathy detection system that could meaningfully contribute to enhancing human well-being.
CVSep 11, 2024
MRAC Track 1: 2nd Workshop on Multimodal, Generative and Responsible Affective ComputingShreya Ghosh, Zhixi Cai, Abhinav Dhall et al.
With the rapid advancements in multimodal generative technology, Affective Computing research has provoked discussion about the potential consequences of AI systems equipped with emotional intelligence. Affective Computing involves the design, evaluation, and implementation of Emotion AI and related technologies aimed at improving people's lives. Designing a computational model in affective computing requires vast amounts of multimodal data, including RGB images, video, audio, text, and physiological signals. Moreover, Affective Computing research is deeply engaged with ethical considerations at various stages-from training emotionally intelligent models on large-scale human data to deploying these models in specific applications. Fundamentally, the development of any AI system must prioritize its impact on humans, aiming to augment and enhance human abilities rather than replace them, while drawing inspiration from human intelligence in a safe and responsible manner. The MRAC 2024 Track 1 workshop seeks to extend these principles from controlled, small-scale lab environments to real-world, large-scale contexts, emphasizing responsible development. The workshop also aims to highlight the potential implications of generative technology, along with the ethical consequences of its use, to researchers and industry professionals. To the best of our knowledge, this is the first workshop series to comprehensively address the full spectrum of multimodal, generative affective computing from a responsible AI perspective, and this is the second iteration of this workshop. Webpage: https://react-ws.github.io/2024/
SPSep 16, 2024
Machine Learning to Detect Anxiety Disorders from Error-Related Negativity and EEG SignalsRamya Chandrasekar, Md Rakibul Hasan, Shreya Ghosh et al.
Anxiety is a common mental health condition characterised by excessive worry, fear and apprehension about everyday situations. Even with significant progress over the past few years, predicting anxiety from electroencephalographic (EEG) signals, specifically using error-related negativity (ERN), still remains challenging. Following the PRISMA protocol, this paper systematically reviews 54 research papers on using EEG and ERN markers for anxiety detection published in the last 10 years (2013 -- 2023). Our analysis highlights the wide usage of traditional machine learning, such as support vector machines and random forests, as well as deep learning models, such as convolutional neural networks and recurrent neural networks across different data types. Our analysis reveals that the development of a robust and generic anxiety prediction method still needs to address real-world challenges, such as task-specific setup, feature selection and computational modelling. We conclude this review by offering potential future direction for non-invasive, objective anxiety diagnostics, deployed across diverse populations and anxiety sub-types.
CVNov 8, 2025
CSGaze: Context-aware Social Gaze PredictionSurbhi Madan, Shreya Ghosh, Ramanathan Subramanian et al.
A person's gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context aware multimodal approach that leverages facial, scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model's decision-making process. We also demonstrate our model's generalizability by testing our model on open set datasets that demonstrating its robustness across diverse scenarios.
CLNov 12, 2025Code
mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language ModelsArka Mukherjee, Shreya Ghosh
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85\% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce \textbf{mmJEE-Eval}, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84\% accuracy on held-out 2025 questions, open-source models plateau at 37-45\% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontiers from Google and OpenAI show high problem-solving accuracies (up to 100\% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2\% errors). Systematic ablations show mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io
MMJun 1, 2025Code
Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual ManipulationsParul Gupta, Shreya Ghosh, Tom Gedeon et al.
The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale and reasoning capability driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large scale person-centric deepfake dataset, comprising 845,286 images generated through manipulation suggestions and image manipulations both derived from vision-language models (VLM). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on \href{https://github.com/Parul-Gupta/MultiFakeVerse}{GitHub}.
CRJan 7, 2024Code
Improving Transferability of Network Intrusion Detection in a Federated Learning SetupShreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
Network Intrusion Detection Systems (IDS) aim to detect the presence of an intruder by analyzing network packets arriving at an internet connected device. Data-driven deep learning systems, popular due to their superior performance compared to traditional IDS, depend on availability of high quality training data for diverse intrusion classes. A way to overcome this limitation is through transferable learning, where training for one intrusion class can lead to detection of unseen intrusion classes after deployment. In this paper, we provide a detailed study on the transferability of intrusion detection. We investigate practical federated learning configurations to enhance the transferability of intrusion detection. We propose two techniques to significantly improve the transferability of a federated intrusion detection system. The code for this work can be found at https://github.com/ghosh64/transferability.
CLAug 22, 2025Code
Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story GenerationArka Mukherjee, Shreya Ghosh
As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI. We publicly release our codebase and data: https://github.com/ArkaMukherjee0/mmCultural
CRAug 12, 2025Code
Developing a Transferable Federated Network Intrusion Detection SystemAbu Shafin Mohammad Mahdee Jameel, Shreya Ghosh, Aly El Gamal
Intrusion Detection Systems (IDS) are a vital part of a network-connected device. In this paper, we develop a deep learning based intrusion detection system that is deployed in a distributed setup across devices connected to a network. Our aim is to better equip deep learning models against unknown attacks using knowledge from known attacks. To this end, we develop algorithms to maximize the number of transferability relationships. We propose a Convolutional Neural Network (CNN) model, along with two algorithms that maximize the number of relationships observed. One is a two step data pre-processing stage, and the other is a Block-Based Smart Aggregation (BBSA) algorithm. The proposed system succeeds in achieving superior transferability performance while maintaining impressive local detection rates. We also show that our method is generalizable, exhibiting transferability potential across datasets and even with different backbones. The code for this work can be found at https://github.com/ghosh64/tabfidsv2.
LGAug 12, 2025Code
FetFIDS: A Feature Embedding Attention based Federated Network Intrusion Detection AlgorithmShreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
Intrusion Detection Systems (IDS) have an increasingly important role in preventing exploitation of network vulnerabilities by malicious actors. Recent deep learning based developments have resulted in significant improvements in the performance of IDS systems. In this paper, we present FetFIDS, where we explore the employment of feature embedding instead of positional embedding to improve intrusion detection performance of a transformer based deep learning system. Our model is developed with the aim of deployments in edge learning scenarios, where federated learning over multiple communication rounds can ensure both privacy and localized performance improvements. FetFIDS outperforms multiple state-of-the-art intrusion detection systems in a federated environment and demonstrates a high degree of suitability to federated learning. The code for this work can be found at https://github.com/ghosh64/fetfids.
CVJul 30, 2025Code
Gems: Group Emotion Profiling Through Multimodal Situational UnderstandingAnubhav Kataria, Surbhi Madan, Shreya Ghosh et al.
Understanding individual, group and event level emotions along with contextual information is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting fine-grained individual emotion to coarse grained group and event level emotion. We introduce GEMS that leverages a multimodal swin-transformer and S3Attention based architecture, which processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion related benchmarks mainly focus on atomic interactions primarily based on emotion perception over time and group level. To this end, we extend and propose VGAF-GEMS to provide more fine grained and holistic analysis on top of existing group level annotation of VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group and event level perceived emotions. Our benchmarking effort links individual, group and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way of further research. The code and data is available at: https://github.com/katariaak579/GEMS
CVFeb 19, 2025Code
A Racing Dataset and Baseline Model for Track Detection in Autonomous RacingShreya Ghosh, Yi-Huan Chen, Ching-Hsiang Huang et al.
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
CRDec 17, 2023Code
A Study on Transferability of Deep Learning Models for Network Intrusion DetectionShreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
In this paper, we explore transferability in learning between different attack classes in a network intrusion detection setup. We evaluate transferability of attack classes by training a deep learning model with a specific attack class and testing it on a separate attack class. We observe the effects of real and synthetically generated data augmentation techniques on transferability. We investigate the nature of observed transferability relationships, which can be either symmetric or asymmetric. We also examine explainability of the transferability relationships using the recursive feature elimination algorithm. We study data preprocessing techniques to boost model performance. The code for this work can be found at https://github.com/ghosh64/transferability.
CVMay 10, 2023Code
Pavlok-Nudge: A Feedback Mechanism for Atomic Behaviour Modification with Snoring UsecaseMd Rakibul Hasan, Shreya Ghosh, Pradyumna Agrawal et al.
This paper proposes an atomic behaviour intervention strategy using the Pavlok wearable device. Pavlok utilises beeps, vibration and shocks as a mode of aversion technique to help individuals with behaviour modification. While the device can be useful in certain periodic daily life situations, like alarms and exercise notifications, it relies on manual operations that limit its usage. To automate behaviour modification, we propose a framework that first detects targeted behaviours through a lightweight deep learning model and subsequently nudges the user. Our proposed solution is implemented and verified in the context of snoring, which captures audio from the environment following a prediction of whether the audio content is a snore or not using a lightweight 1D convolutional neural network. Based on the prediction, we use Pavlok to nudge users for preventive measures, such as a change in sleeping posture. We believe that this simple solution can help people change their atomic habits, which may lead to long-term health benefits. Our proposed lightweight model (99.8% fewer parameters over SOTA; 790,273$\rightarrow$1,337) achieves SOTA test accuracy of 0.99 on a public benchmark. The code and model are publicly available at https://github.com/hasan-rakibul/pavlok-nudge-snore.
CVMay 3, 2023Code
Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and LocalizationZhixi Cai, Shreya Ghosh, Abhinav Dhall et al.
Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered around the binary classification task of detecting whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications present in the entirety of the video. However, a sophisticated deepfake may include small segments of audio or audio-visual manipulations that can completely change the meaning of the video content. To addresses this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve (i.e. BA-TFD+) the baseline method by replacing the backbone with a Multiscale Vision Transformer and guide the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.
CVAug 12, 2021Code
Automatic Gaze Analysis: A Survey of Deep Learning based ApproachesShreya Ghosh, Abhinav Dhall, Munawar Hayat et al.
Eye gaze analysis is an important research problem in the field of Computer Vision and Human-Computer Interaction. Even with notable progress in the last 10 years, automatic gaze analysis still remains challenging due to the uniqueness of eye appearance, eye-head interplay, occlusion, image quality, and illumination conditions. There are several open questions, including what are the important cues to interpret gaze direction in an unconstrained environment without prior knowledge and how to encode them in real-time. We review the progress across a range of gaze analysis tasks and applications to elucidate these fundamental questions, identify effective methods in gaze analysis, and provide possible future directions. We analyze recent gaze estimation and segmentation methods, especially in the unsupervised and weakly supervised domain, based on their advantages and reported evaluation metrics. Our analysis shows that the development of a robust and generic gaze analysis method still needs to address real-world challenges such as unconstrained setup and learning with less supervision. We conclude by discussing future research directions for designing a real-world gaze analysis system that can propagate to other domains including Computer Vision, Augmented Reality (AR), Virtual Reality (VR), and Human Computer Interaction (HCI). Project Page: https://github.com/i-am-shreya/EyeGazeSurvey}{https://github.com/i-am-shreya/EyeGazeSurvey
CLFeb 27, 2025
TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel PlanningSoumyabrata Chaudhuri, Pranav Purkar, Ritwik Raghav et al.
Recent advancements in probing Large Language Models (LLMs) have explored their latent potential as personalized travel planning agents, yet existing benchmarks remain limited in real world applicability. Existing datasets, such as TravelPlanner and TravelPlanner+, suffer from semi synthetic data reliance, spatial inconsistencies, and a lack of key travel constraints, making them inadequate for practical itinerary generation. To address these gaps, we introduce TripCraft, a spatiotemporally coherent travel planning dataset that integrates real world constraints, including public transit schedules, event availability, diverse attraction categories, and user personas for enhanced personalization. To evaluate LLM generated plans beyond existing binary validation methods, we propose five continuous evaluation metrics, namely Temporal Meal Score, Temporal Attraction Score, Spatial Score, Ordering Score, and Persona Score which assess itinerary quality across multiple dimensions. Our parameter informed setting significantly enhances meal scheduling, improving the Temporal Meal Score from 61% to 80% in a 7 day scenario. TripCraft establishes a new benchmark for LLM driven personalized travel planning, offering a more realistic, constraint aware framework for itinerary generation. Dataset and Codebase will be made publicly available upon acceptance.
AIDec 21, 2025
IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner ModelingJones David, Shreya Ghosh
LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long-term pedagogical support. We introduce IntelliCode, a multi-agent LLM tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents: skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring, each operating as a pure transformation over the shared state under a single-writer policy. This architecture enables auditable mastery updates, proficiency-aware hints, dependency-aware curriculum adaptation, and safety-aligned prompting. The demo showcases an end-to-end tutoring workflow: a learner attempts a DSA problem, receives a conceptual hint when stuck, submits a corrected solution, and immediately sees mastery updates and a personalized review interval. We report validation results with simulated learners, showing stable state updates, improved task success with graduated hints, and diverse curriculum coverage. IntelliCode demonstrates how persistent learner modeling, orchestrated multi-agent reasoning, and principled instructional design can be combined to produce transparent and reliable LLM-driven tutoring.
CVJul 28, 2025
AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World PerturbationsZhixi Cai, Kartik Kuckreja, Shreya Ghosh et al.
The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To encounter this problem, we require datasets that rich in type of generation methods and perturbation strategy which is usually common for online videos. To this end, we propose AV-Deepfake1M++, an extension of the AV-Deepfake1M having 2 million video clips with diversified manipulation strategy and audio-visual perturbation. This paper includes the description of data generation strategies along with benchmarking of AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in Deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.
CLJan 15, 2024
Milestones in Bengali Sentiment Analysis leveraging Transformer-models: Fundamentals, Challenges and Future DirectionsSaptarshi Sengupta, Shreya Ghosh, Prasenjit Mitra et al.
Sentiment Analysis (SA) refers to the task of associating a view polarity (usually, positive, negative, or neutral; or even fine-grained such as slightly angry, sad, etc.) to a given text, essentially breaking it down to a supervised (since we have the view labels apriori) classification task. Although heavily studied in resource-rich languages such as English thus pushing the SOTA by leaps and bounds, owing to the arrival of the Transformer architecture, the same cannot be said for resource-poor languages such as Bengali (BN). For a language spoken by roughly 300 million people, the technology enabling them to run trials on their favored tongue is severely lacking. In this paper, we analyze the SOTA for SA in Bengali, particularly, Transformer-based models. We discuss available datasets, their drawbacks, the nuances associated with Bengali i.e. what makes this a challenging language to apply SA on, and finally provide insights for future direction to mitigate the limitations in the field.
SPMay 10, 2024
FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-TimeMd Jobayer, Md. Mehedi Hasan Shawon, Md Rakibul Hasan et al.
Objective: Heart murmurs are abnormal sounds caused by turbulent blood flow within the heart. Several diagnostic methods are available to detect heart murmurs and their severity, such as cardiac auscultation, echocardiography, phonocardiogram (PCG), etc. However, these methods have limitations, including extensive training and experience among healthcare providers, cost and accessibility of echocardiography, as well as noise interference and PCG data processing. This study aims to develop a novel end-to-end real-time heart murmur detection approach using traditional and depthwise separable convolutional networks. Methods: Continuous wavelet transform (CWT) was applied to extract meaningful features from the PCG data. The proposed network has three parts: the Squeeze net, the Bottleneck, and the Expansion net. The Squeeze net generates a compressed data representation, whereas the Bottleneck layer reduces computational complexity using a depthwise-separable convolutional network. The Expansion net is responsible for up-sampling the compressed data to a higher dimension, capturing tiny details of the representative data. Results: For evaluation, we used four publicly available datasets and achieved state-of-the-art performance in all datasets. Furthermore, we tested our proposed network on two resource-constrained devices: a Raspberry PI and an Android device, stripping it down into a tiny machine learning model (TinyML), achieving a maximum of 99.70%. Conclusion: The proposed model offers a deep learning framework for real-time accurate heart murmur detection within limited resources. Significance: It will significantly result in more accessible and practical medical services and reduced diagnosis time to assist medical professionals. The code is publicly available at TBA.
CLOct 24, 2025
TripTide: A Benchmark for Adaptive Travel Planning under DisruptionsPriyanshu Karmakar, Soumyabrata Chaudhuri, Shubhojit Mallick et al.
Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models ( LLMs) for personalized, constraint aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLM's ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
LGFeb 16, 2025
UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene TranslationArka Mukherjee, Shreya Ghosh
Multimodal fake news detection typically demands complex architectures and substantial computational resources, posing deployment challenges in real-world settings. We introduce UNITE-FND, a novel framework that reframes multimodal fake news detection as a unimodal text classification task. We propose six specialized prompting strategies with Gemini 1.5 Pro, converting visual content into structured textual descriptions, and enabling efficient text-only models to preserve critical visual information. To benchmark our approach, we introduce Uni-Fakeddit-55k, a curated dataset family of 55,000 samples each, each processed through our multimodal-to-unimodal translation framework. Experimental results demonstrate that UNITE-FND achieves 92.52% accuracy in binary classification, surpassing prior multimodal models while reducing computational costs by over 10x (TinyBERT variant: 14.5M parameters vs. 250M+ in SOTA models). Additionally, we propose a comprehensive suite of five novel metrics to evaluate image-to-text conversion quality, ensuring optimal information preservation. Our results demonstrate that structured text-based representations can replace direct multimodal processing with minimal loss of accuracy, making UNITE-FND a practical and scalable alternative for resource-constrained environments.
CVOct 23, 2021
MTGLS: Multi-Task Gaze Estimation with Limited SupervisionShreya Ghosh, Munawar Hayat, Abhinav Dhall et al.
Robust gaze estimation is a challenging task, even for deep CNNs, due to the non-availability of large-scale labeled data. Moreover, gaze annotation is a time-consuming process and requires specialized hardware setups. We propose MTGLS: a Multi-Task Gaze estimation framework with Limited Supervision, which leverages abundantly available non-annotated facial image data. MTGLS distills knowledge from off-the-shelf facial image analysis models, and learns strong feature representations of human eyes, guided by three complementary auxiliary signals: (a) the line of sight of the pupil (i.e. pseudo-gaze) defined by the localized facial landmarks, (b) the head-pose given by Euler angles, and (c) the orientation of the eye patch (left/right eye). To overcome inherent noise in the supervisory signals, MTGLS further incorporates a noise distribution modelling approach. Our experimental results show that MTGLS learns highly generalized representations which consistently perform well on a range of datasets. Our proposed framework outperforms the unsupervised state-of-the-art on CAVE (by 6.43%) and even supervised state-of-the-art methods on Gaze360 (by 6.59%) datasets.
LGApr 4, 2021
STOPPAGE: Spatio-temporal Data Driven Cloud-Fog-Edge Computing Framework for Pandemic Monitoring and ManagementShreya Ghosh, Anwesha Mukherjee, Soumya K Ghosh et al.
Several researches and evidence show the increasing likelihood of pandemics (large-scale outbreaks of infectious disease) which has far reaching sequels in all aspects of human lives ranging from rapid mortality rates to economic and social disruption across the world. In the recent time, COVID-19 (Coronavirus Disease 2019) pandemic disrupted normal human lives, and motivated by the urgent need of combating COVID-19, researchers have put significant efforts in modelling and analysing the disease spread patterns for effective preventive measures (in addition to developing pharmaceutical solutions, like vaccine). In this regards, it is absolutely necessary to develop an analytics framework by extracting and incorporating the knowledge of heterogeneous datasources to deliver insights in improving administrative policy and enhance the preparedness to combat the pandemic. Specifically, human mobility, travel history and other transport statistics have significant impacts on the spread of any infectious disease. In this direction, this paper proposes a spatio-temporal knowledge mining framework, named STOPPAGE to model the impact of human mobility and other contextual information over large geographic area in different temporal scales. The framework has two major modules: (i) Spatio-temporal data and computing infrastructure using fog/edge based architecture; and (ii) Spatio-temporal data analytics module to efficiently extract knowledge from heterogeneous data sources. Typically, we develop a Pandemic-knowledge graph to discover correlations among mobility information and disease spread, a deep learning architecture to predict the next hot-spot zones; and provide necessary support in home-health monitoring utilizing Femtolet and fog/edge based solutions. The experimental evaluations on real-life datasets related to COVID-19 in India illustrate the efficacy of the proposed methods.
CVJul 2, 2020
Low-Power Object Counting with Hierarchical Neural NetworksAbhinav Goel, Caleb Tung, Sara Aghajanzadeh et al.
Deep Neural Networks (DNNs) can achieve state-of-the-art accuracy in many computer vision tasks, such as object counting. Object counting takes two inputs: an image and an object query and reports the number of occurrences of the queried object. To achieve high accuracy on such tasks, DNNs require billions of operations, making them difficult to deploy on resource-constrained, low-power devices. Prior work shows that a significant number of DNN operations are redundant and can be eliminated without affecting the accuracy. To reduce these redundancies, we propose a hierarchical DNN architecture for object counting. This architecture uses a Region Proposal Network (RPN) to propose regions-of-interest (RoIs) that may contain the queried objects. A hierarchical classifier then efficiently finds the RoIs that actually contain the queried objects. The hierarchy contains groups of visually similar object categories. Small DNNs are used at each node of the hierarchy to classify between these groups. The RoIs are incrementally processed by the hierarchical classifier. If the object in an RoI is in the same group as the queried object, then the next DNN in the hierarchy processes the RoI further; otherwise, the RoI is discarded. By using a few small DNNs to process each image, this method reduces the memory requirement, inference time, energy consumption, and number of operations with negligible accuracy loss when compared with the existing object counters.
CVApr 13, 2020
Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation DatasetShreya Ghosh, Abhinav Dhall, Garima Sharma et al.
Labelling of human behavior analysis data is a complex and time consuming task. In this paper, a fully automatic technique for labelling an image based gaze behavior dataset for driver gaze zone estimation is proposed. Domain knowledge is added to the data recording paradigm and later labels are generated in an automatic manner using Speech To Text conversion (STT). In order to remove the noise in the STT process due to different illumination and ethnicity of subjects in our data, the speech frequency and energy are analysed. The resultant Driver Gaze in the Wild (DGW) dataset contains 586 recordings, captured during different times of the day including evenings. The large scale dataset contains 338 subjects with an age range of 18-63 years. As the data is recorded in different lighting conditions, an illumination robust layer is proposed in the Convolutional Neural Network (CNN). The extensive experiments show the variance in the dataset resembling real-world conditions and the effectiveness of the proposed CNN pipeline. The proposed network is also fine-tuned for the eye gaze prediction task, which shows the discriminativeness of the representation learnt by our network on the proposed DGW dataset. Project Page: https://sites.google.com/view/drivergazeprediction/home
CVApr 4, 2019
Unsupervised Learning of Eye Gaze Representation from the WebNeeru Dubey, Shreya Ghosh, Abhinav Dhall
Automatic eye gaze estimation has interested researchers for a while now. In this paper, we propose an unsupervised learning based method for estimating the eye gaze region. To train the proposed network "Ize-Net" in self-supervised manner, we collect a large `in the wild' dataset containing 1,54,251 images from the web. For the images in the database, we divide the gaze into three regions based on an automatic technique based on pupil-centers localization and then use a feature-based technique to determine the gaze region. The performance is evaluated on the Tablet Gaze and CAVE datasets by fine-tuning results of Ize-Net for the task of eye gaze estimation. The feature representation learned is also used to train traditional machine learning algorithms for eye gaze estimation. The results demonstrate that the proposed method learns a rich data representation, which can be efficiently fine-tuned for any eye gaze estimation dataset.
CVDec 31, 2018
Predicting Group Cohesiveness in ImagesShreya Ghosh, Abhinav Dhall, Nicu Sebe et al.
The cohesiveness of a group is an essential indicator of the emotional state, structure and success of a group of people. We study the factors that influence the perception of group-level cohesion and propose methods for estimating the human-perceived cohesion on the group cohesiveness scale. In order to identify the visual cues (attributes) for cohesion, we conducted a user survey. Image analysis is performed at a group-level via a multi-task convolutional neural network. For analyzing the contribution of facial expressions of the group members for predicting the Group Cohesion Score (GCS), a capsule network is explored. We add GCS to the Group Affect database and propose the `GAF-Cohesion database'. The proposed model performs well on the database and is able to achieve near human-level performance in predicting a group's cohesion score. It is interesting to note that group cohesion as an attribute, when jointly trained for group-level emotion prediction, helps in increasing the performance for the later task. This suggests that group-level emotion and cohesion are correlated.
CRAug 16, 2018
Mitigation of Flooding and Slow DDoS Attacks in a Software-Defined NetworkThomas Lukaseder, Shreya Ghosh, Frank Kargl
Distributed denial of service (DDoS) attacks are a constant threat for services in the Internet. This year, the record for the largest DDoS attack ever observed was set at 1.7 Tbps. Meanwhile, detection and mitigation mechanisms are still lacking behind. Many mitigation systems require the assistance by the victim - or the victim's administrator themself has to become active to mitigate attacks. We introduced a system that can detect attacks, identify attackers, and mitigate the attacks purely within the network infrastructure. With the improved flexibility of software-defined networks, new possibilities to mitigate such attacks can be implemented. In addition to our short paper on the mitigation of reflective DDoS attacks on LCN 2018, we also like to demonstrate our work on mitigating flooding attacks presented at LCN 2017 and our mitigation of slow DDoS attacks. In our demo, we show how these systems can be combined and how they work when faced with such different attacks.