Aref Farhadipour

CV
h-index15
6papers
46citations
Novelty27%
AI Score36

6 Papers

ASAug 31, 2024
Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification

Aref Farhadipour, Masoumeh Chapariniya, Teodora Vukovic et al.

Multimodal learning involves integrating information from various modalities to enhance learning and comprehension. We compare three modality fusion strategies in person identification and verification by processing two modalities: voice and face. In this paper, a one-dimensional convolutional neural network is employed for x-vector extraction from voice, while the pre-trained VGGFace2 network and transfer learning are utilized for face modality. In addition, gammatonegram is used as speech representation in engagement with the Darknet19 pre-trained network. The proposed systems are evaluated using the K-fold cross-validation technique on the 118 speakers of the test set of the VoxCeleb2 dataset. The comparative evaluations are done for single-modality and three proposed multimodal strategies in equal situations. Results demonstrate that the feature fusion strategy of gammatonegram and facial features achieves the highest performance, with an accuracy of 98.37% in the person identification task. However, concatenating facial features with the x-vector reaches 0.62% for EER in verification tasks.

ASJul 6, 2023
Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

Aref Farhadipour, Hadi Veisi

Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems can not work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. On the other word, we convert each speech file into an image and propose image recognition system to classify speech in different scenarios. Proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.

CVMar 24
Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge

Masoumeh Chapariniya, Aref Farhadipour, Sarah Ebling et al.

We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and -- for the first time in emotion recognition -- Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6--12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold $β$ varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62\% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.

CVMar 9, 2025
Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts

Aref Farhadipour, Hossein Ranjbar, Masoumeh Chapariniya et al.

Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party, conversational data. This paper presents a multimodal approach to tackle these challenges on a well-known dataset. We propose a system that integrates four key modalities/channels using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system demonstrates superior performance compared to unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.

CVDec 16, 2025
Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Aref Farhadipour, Teodora Vukovic, Volker Dellwo et al.

Person identification systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently present with missing or degraded modalities. To address this challenge, we propose a multimodal person identification framework incorporating upper-body motion, face, and voice. Experimental results demonstrate that body motion outperforms traditional modalities such as face and voice in within-session evaluations, while serving as a complementary cue that enhances performance in multi-session scenarios. Our model employs a unified hybrid fusion strategy, fusing both feature-level and score-level information to maximize representational richness and decision accuracy. Specifically, it leverages multi-task learning to process modalities independently, followed by cross-attention and gated fusion mechanisms to exploit both unimodal information and cross-modal interactions. Finally, a confidence-weighted strategy and mistake-correction mechanism dynamically adapt to missing data, ensuring that our single classification head achieves optimal performance even in unimodal and bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark in this work for the first time. Our results demonstrate that the proposed trimodal system achieves 99.51% Top-1 accuracy on person identification tasks. In addition, we evaluate our model on the VoxCeleb1 dataset as a widely used evaluation protocol and reach 99.92% accuracy in bimodal mode, outperforming conventional approaches. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.

CVDec 3, 2023
Facial Emotion Recognition Under Mask Coverage Using a Data Augmentation Technique

Aref Farhadipour, Pouya Taghipour

Identifying human emotions using AI-based computer vision systems, when individuals wear face masks, presents a new challenge in the current Covid-19 pandemic. In this study, we propose a facial emotion recognition system capable of recognizing emotions from individuals wearing different face masks. A novel data augmentation technique was utilized to improve the performance of our model using four mask types for each face image. We evaluated the effectiveness of four convolutional neural networks, Alexnet, Squeezenet, Resnet50 and VGGFace2 that were trained using transfer learning. The experimental findings revealed that our model works effectively in multi-mask mode compared to single-mask mode. The VGGFace2 network achieved the highest accuracy rate, with 97.82% for the person-dependent mode and 74.21% for the person-independent mode using the JAFFE dataset. However, we evaluated our proposed model using the UIBVFED dataset. The Resnet50 has demonstrated superior performance, with accuracies of 73.68% for the person-dependent mode and 59.57% for the person-independent mode. Moreover, we employed metrics such as precision, sensitivity, specificity, AUC, F1 score, and confusion matrix to measure our system's efficiency in detail. Additionally, the LIME algorithm was used to visualize CNN's decision-making strategy.