SDNov 20, 2024Code
Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait SynthesisPegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita et al.
This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with Open AI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.
IVJun 29, 2021Code
SinGAN-Seg: Synthetic training data generation for medical image segmentationVajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal et al.
Analyzing medical data to find abnormalities is a time-consuming and costly task, particularly for rare abnormalities, requiring tremendous efforts from medical experts. Artificial intelligence has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. However, the machine learning models used to build these tools are highly dependent on the data used to train them. Large amounts of data can be difficult to obtain in medicine due to privacy, expensive and time-consuming annotations, and a general lack of data samples for infrequent lesions. Here, we present a novel synthetic data generation pipeline, called SinGAN-Seg, to produce synthetic medical images with corresponding masks using a single training image. Our method is different from the traditional GANs because our model needs only a single image and the corresponding ground truth to train. Our method produces alternative artificial segmentation datasets with ground truth masks when real datasets are not allowed to share. The pipeline is evaluated using qualitative and quantitative comparisons between real and synthetic data to show that the style transfer technique used in our pipeline significantly improves the quality of the generated data and our method is better than other state-of-the-art GANs to prepare synthetic images when the size of training datasets are limited. By training UNet++ using both real and the synthetic data generated from the SinGAN-Seg pipeline, we show that models trained with synthetic data have very close performances to those trained on real data when the datasets have a considerable amount of data. In contrast, Synthetic data generated from the SinGAN-Seg pipeline can improve the performance of segmentation models when training datasets do not have a considerable amount of data. The code is available on GitHub.
HCJun 16, 2025
Multimodal Integration Challenges in Emotionally Expressive Child Avatars for Training ApplicationsPegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita et al.
Dynamic facial emotion is essential for believable AI-generated avatars, yet most systems remain visually static, limiting their use in simulations like virtual training for investigative interviews with abused children. We present a real-time architecture combining Unreal Engine 5 MetaHuman rendering with NVIDIA Omniverse Audio2Face to generate facial expressions from vocal prosody in photorealistic child avatars. Due to limited TTS options, both avatars were voiced using young adult female models from two systems to better fit character profiles, introducing a voice-age mismatch. This confound may affect audiovisual alignment. We used a two-PC setup to decouple speech generation from GPU-intensive rendering, enabling low-latency interaction in desktop and VR. A between-subjects study (N=70) compared audio+visual vs. visual-only conditions as participants rated emotional clarity, facial realism, and empathy for avatars expressing joy, sadness, and anger. While emotions were generally recognized - especially sadness and joy - anger was harder to detect without audio, highlighting the role of voice in high-arousal expressions. Interestingly, silencing clips improved perceived realism by removing mismatches between voice and animation, especially when tone or age felt incongruent. These results emphasize the importance of audiovisual congruence: mismatched voice undermines expression, while a good match can enhance weaker visuals - posing challenges for emotionally coherent avatars in sensitive contexts.
CVMay 27, 2020
Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent DevelopmentsPegah Salehi, Abdolah Chalechale, Maryam Taghizadeh
One of the most significant challenges in statistical signal processing and machine learning is how to obtain a generative model that can produce samples of large-scale data distribution, such as images and speeches. Generative Adversarial Network (GAN) is an effective method to address this problem. The GANs provide an appropriate way to learn deep representations without widespread use of labeled training data. This approach has attracted the attention of many researchers in computer vision since it can generate a large amount of data without precise modeling of the probability density function (PDF). In GANs, the generative model is estimated via a competitive process where the generator and discriminator networks are trained simultaneously. The generator learns to generate plausible data, and the discriminator learns to distinguish fake data created by the generator from real data samples. Given the rapid growth of GANs over the last few years and their application in various fields, it is necessary to investigate these networks accurately. In this paper, after introducing the main concepts and the theory of GAN, two new deep generative models are compared, the evaluation metrics utilized in the literature and challenges of GANs are also explained. Moreover, the most remarkable GAN architectures are categorized and discussed. Finally, the essential applications in computer vision are examined.
IVFeb 3, 2020
Pix2Pix-based Stain-to-Stain Translation: A Solution for Robust Stain Normalization in Histopathology Images AnalysisPegah Salehi, Abdolah Chalechale
The diagnosis of cancer is mainly performed by visual analysis of the pathologists, through examining the morphology of the tissue slices and the spatial arrangement of the cells. If the microscopic image of a specimen is not stained, it will look colorless and textured. Therefore, chemical staining is required to create contrast and help identify specific tissue components. During tissue preparation due to differences in chemicals, scanners, cutting thicknesses, and laboratory protocols, similar tissues are usually varied significantly in appearance. This diversity in staining, in addition to Interpretive disparity among pathologists more is one of the main challenges in designing robust and flexible systems for automated analysis. To address the staining color variations, several methods for normalizing stain have been proposed. In our proposed method, a Stain-to-Stain Translation (STST) approach is used to stain normalization for Hematoxylin and Eosin (H&E) stained histopathology images, which learns not only the specific color distribution but also the preserves corresponding histopathological pattern. We perform the process of translation based on the pix2pix framework, which uses the conditional generator adversarial networks (cGANs). Our approach showed excellent results, both mathematically and experimentally against the state of the art methods. We have made the source code publicly available.