Deepu Rajan

h-index: 33

12 papers

971 citations

Novelty: 41%

AI Score: 39

Ranked #79,748 of 194,257 authors (top 41%); #26,974 in CV (top 46%)

12 Papers

23.0 | SD | Mar 29, 2022 | Code

Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

Heqing Zou, Yuke Si, Chen Chen et al.

Speech Emotion Recognition (SER) aims to help machines understand human subjective emotion from audio information alone. However, extracting and utilizing comprehensive in-depth audio information is still a challenging task. In this paper, we propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module. We first extract multi-level acoustic information, including MFCC, spectrogram, and the embedded high-level acoustic information with CNN, BiLSTM and wav2vec2, respectively. Then these extracted features are treated as multimodal inputs and fused by the proposed co-attention mechanism. Experiments are carried out on the IEMOCAP dataset, and our model achieves competitive performance with two different speaker-independent cross-validation strategies. Our code is available on GitHub.

10.3 | MM | Jun 14, 2023 | Code

Towards Balanced Active Learning for Multimodal Classification

Meng Shen, Yizheng Huang, Jianxiong Yin et al.

Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, a novel approach is proposed to achieve more fair data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification.

11.8 | CV | Aug 4, 2025 | Code

IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A

Chen Li, Chinthani Sugandhika, Yeo Keat Ee et al.

Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: https://github.com/LUNAProject22/IMoRe.

2.7 | CL | Mar 25, 2025 | Code

Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

Heqing Zou, Fengmao Lv, Desheng Zheng et al.

Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.

3.3 | MM | Dec 12, 2024

Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning

Meng Shen, Yake Wei, Jianxiong Yin et al.

Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, most AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only using cross-modal pairing information as self-supervision signals. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios where alignment between modalities is overlooked. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of selected multimodal data pairs in AL. Finally, our experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs across three multimodal datasets.
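
The abstract above describes the modality gap as the distance between the centroids of representations from different modalities. The following is a minimal sketch of that quantity only, assuming Euclidean distance and plain Python lists as embeddings; the function names `centroid` and `modality_gap` are hypothetical and are not taken from the paper or its code.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def modality_gap(embeddings_a, embeddings_b):
    """Euclidean distance between the centroids of two embedding sets.

    Assumption: Euclidean distance is used here for illustration; the
    paper may measure the gap differently.
    """
    return math.dist(centroid(embeddings_a), centroid(embeddings_b))

# Toy embeddings for two modalities (hypothetical values).
image_embeddings = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1, 0)
text_embeddings = [[1.0, 3.0], [1.0, 5.0]]    # centroid (1, 4)
gap = modality_gap(image_embeddings, text_embeddings)
```

A large value of `gap` relative to the within-modality spread would indicate that the two modalities occupy separate regions of the embedding space.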

39.3 | CV | May 16, 2023 | Code

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

Heqing Zou, Meng Shen, Chen Chen et al.

Multimodal learning aims to imitate how human beings acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, suffer from sensor noise, and thus reduce multimodal learning performance. In this work, we propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal prediction. Specifically, we first capture task-related unimodal representations and the unimodal predictions from the introduced unimodal prediction task. Then the unimodal representations are aligned with the more effective one by the designed multimodal contrastive method under the supervision of the unimodal predictions. Experimental results with fused features on two image-text classification benchmarks, UPMC-Food-101 and N24News, show that our proposed Unimodality-Supervised MultiModal Contrastive (UniS-MMC) learning method outperforms current state-of-the-art multimodal methods. The detailed ablation study and analysis further demonstrate the advantage of our proposed method.

7.8 | CV | Sep 12, 2018

Are object detection assessment criteria ready for maritime computer vision?

Dilip K. Prasad, Huixu Dong, Deepu Rajan et al.

Maritime vessels equipped with visible and infrared cameras can complement other conventional sensors for object detection. However, the application of computer vision techniques in the maritime domain has received attention only recently. The maritime environment offers its own unique requirements and challenges. Assessment of the quality of detections is a fundamental need in computer vision. However, the conventional assessment metrics suitable for usual object detection are deficient in the maritime setting. Thus, a large body of related work in computer vision appears inapplicable to the maritime setting at first sight. We discuss the problem of defining assessment metrics suitable for maritime computer vision. We consider new bottom edge proximity metrics as assessment metrics for maritime computer vision. These metrics indicate that existing computer vision approaches are indeed promising for maritime computer vision and can play a foundational role in the emerging field of maritime computer vision.
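
For context on what a conventional assessment metric looks like, here is a minimal sketch of intersection-over-union (IoU), a widely used criterion in general object detection. It is an illustration chosen for this listing, not the paper's method: the abstract does not name IoU, and the bottom edge proximity metrics it proposes are not defined here, so they are not reproduced. The box format `(x_min, y_min, x_max, y_max)` and the example boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# The same 2-pixel horizontal offset applied to a small and a large box.
small = iou((0, 0, 4, 4), (2, 0, 6, 4))      # 8 / 24
large = iou((0, 0, 40, 40), (2, 0, 42, 40))  # 1520 / 1680
```

In this toy case the same localization error yields a much lower score for the small box than for the large one, which is a general arithmetic property of IoU rather than a claim made in the abstract.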

2.3 | GR | Apr 15, 2017

A learning-based approach for automatic image and video colorization

Raj Kumar Gupta, Alex Yong-Sang Chia, Deepu Rajan et al.

In this paper, we present a color transfer algorithm to colorize a broad range of gray images without any user intervention. The algorithm uses a machine learning-based approach to automatically colorize grayscale images. The algorithm uses the superpixel representation of the reference color images to learn the relationship between different image features and their corresponding color values. We use this learned information to predict the color value of each grayscale image superpixel. Compared to processing individual image pixels, our use of superpixels helps us achieve a much higher degree of spatial consistency and speeds up the colorization process. The predicted color values of the grayscale image superpixels are used to provide a 'micro-scribble' at the centroid of the superpixels. These color scribbles are refined by using a voting-based approach. To generate the final colorization result, we use an optimization-based approach to smoothly spread the color scribble across all pixels within a superpixel. Experimental results on a broad range of images and the comparison with existing state-of-the-art colorization methods demonstrate the greater effectiveness of the proposed algorithm.

3.8 | CV | Jan 29, 2017

MSCM-LiFe: Multi-scale cross modal linear feature for horizon detection in maritime images

D. K. Prasad, D. Rajan, C. K. Prasath et al.

This paper proposes a new method for horizon detection called the multi-scale cross modal linear feature. This method integrates three different concepts related to the presence of horizon in maritime images to increase the accuracy of horizon detection. Specifically, it uses the persistence of horizon in multi-scale median filtering, and its detection as a linear feature commonly detected by two different methods, namely the Hough transform of the edge map and the intensity gradient. We demonstrate the performance of the method over 13 videos comprising more than 3000 frames and show that the proposed method detects horizon with small error in most of the cases, outperforming three state-of-the-art methods.
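
One of the ingredients named above is the Hough transform of an edge map. The sketch below shows only that single step in its textbook form, with a line parameterised as rho = x*cos(theta) + y*sin(theta). It does not implement MSCM-LiFe: the multi-scale median filtering, the intensity-gradient cue, and the way the paper combines them are all omitted. The function name `hough_strongest_line` and the toy edge points are hypothetical.

```python
import math
from collections import Counter

def hough_strongest_line(edge_points, theta_step_deg=1):
    """Return (rho, theta_deg) of the line that collects the most votes.

    edge_points: iterable of (x, y) pixel coordinates marked as edges.
    Each point votes for every (rho, theta) line passing through it,
    with rho rounded to the nearest integer pixel.
    """
    votes = Counter()
    for x, y in edge_points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, theta_deg)] += 1
    (rho, theta_deg), _count = votes.most_common(1)[0]
    return rho, theta_deg

# Toy edge map: a horizontal "horizon" at y = 5 plus two stray edge pixels.
horizon_pixels = [(x, 5) for x in range(40)]
stray_pixels = [(3, 1), (17, 8)]
rho, theta_deg = hough_strongest_line(horizon_pixels + stray_pixels)
```

For this toy input the strongest line has theta of 90 degrees and rho of 5, which corresponds to the horizontal row y = 5.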

17.2 | CV | Nov 17, 2016

Video Processing from Electro-optical Sensors for Object Detection and Tracking in Maritime Environment: A Survey

D. K. Prasad, D. Rajan, L. Rachmawati et al.

We present a survey on maritime object detection and tracking approaches, which are essential for the development of a navigational system for autonomous ships. The electro-optical (EO) sensor considered here is a video camera that operates in the visible or the infrared spectra. Such cameras conventionally complement radar and sonar and have demonstrated their effectiveness for situational awareness at sea over the last few years. This paper provides a comprehensive overview of various approaches of video processing for object detection and tracking in the maritime environment. We follow an approach-based taxonomy wherein the advantages and limitations of each approach are compared. The object detection system consists of the following modules: horizon detection, static background subtraction and foreground segmentation. Each of these has been studied extensively in maritime situations and has been shown to be challenging due to the presence of background motion, especially due to waves and wakes. The main processes involved in object tracking include video frame registration, dynamic background subtraction, and the object tracking algorithm itself. The challenges for robust tracking arise due to camera motion, dynamic background and low contrast of the tracked object, possibly due to environmental degradation. The survey also discusses multisensor approaches and commercial maritime systems that use EO sensors. The survey also highlights methods from computer vision research which hold promise to perform well in maritime EO data processing. Performance of several maritime and computer vision techniques is evaluated on the newly proposed Singapore Maritime Dataset.

5.3 | CV | Aug 3, 2016

Challenges in video based object detection in maritime scenario using computer vision

D. K. Prasad, C. K. Prasath, D. Rajan et al.

This paper discusses the technical challenges in maritime image processing and machine vision problems for video streams generated by cameras. Even well-documented problems of horizon detection and registration of frames in a video are very challenging in maritime scenarios. More advanced problems of background subtraction and object detection in video streams are also very challenging. The dynamic nature of the background, unavailability of static cues, presence of small objects at distant backgrounds, and illumination effects all contribute to the challenges discussed here.

1.1 | CV | Apr 22, 2016

A Classifier-guided Approach for Top-down Salient Object Detection

Hisham Cholakkal, Jubin Johnson, Deepu Rajan

We propose a framework for top-down salient object detection that incorporates a tightly coupled image classification module. The classifier is trained on novel category-aware sparse codes computed on object dictionaries used for saliency modeling. A misclassification indicates that the corresponding saliency model is inaccurate. Hence, the classifier selects images for which the saliency models need to be updated. The category-aware sparse coding produces better image classification accuracy as compared to conventional sparse coding with a reduced computational complexity. A saliency-weighted max-pooling is proposed to improve image classification, which is further used to refine the saliency maps. Experimental results on Graz-02 and PASCAL VOC-07 datasets demonstrate the effectiveness of salient object detection. Although the role of the classifier is to support salient object detection, we evaluate its performance in image classification and also illustrate the utility of thresholded saliency maps for image segmentation.