Sofoklis Kakouros

CL
h-index12
9papers
1,042citations
Novelty32%
AI Score49

9 Papers

ASFeb 25Code
iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis

Sofoklis Kakouros, Fang Kang, Haoyu Chen

This work presents iMiGUE-Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states. The new release focuses on speech and enriches the original dataset with additional metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. Unlike existing emotional speech datasets that rely on acted or laboratory-elicited emotions, iMiGUE-Speech captures spontaneous affect arising naturally from real match outcomes. To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset's ability to capture spontaneous affective states from both acoustic and linguistic modalities. iMiGUE-Speech can also be synchronously paired with micro-gesture annotations from the original iMiGUE dataset, forming a uniquely multimodal resource for studying speech-gesture affective dynamics. The extended dataset is available at https://github.com/CV-AC/imigue-speech.

ASJun 16, 2023
Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody

Sofoklis Kakouros, Juraj Šimko, Martti Vainio et al.

This paper investigates the use of word surprisal, a measure of the predictability of a word in a given context, as a feature to aid speech synthesis prosody. We explore how word surprisal extracted from large language models (LLMs) correlates with word prominence, a signal-based measure of the salience of a word in a given discourse. We also examine how context length and LLM size affect the results, and how a speech synthesizer conditioned with surprisal values compares with a baseline system. To evaluate these factors, we conducted experiments using a large corpus of English text and LLMs of varying sizes. Our results show that word surprisal and word prominence are moderately correlated, suggesting that they capture related but distinct aspects of language use. We find that length of context and size of the LLM impact the correlations, but not in the direction anticipated, with longer contexts and larger LLMs generally underpredicting prominent words in a nearly linear manner. We demonstrate that, in line with these findings, a speech synthesizer conditioned with surprisal values provides a minimal improvement over the baseline with the results suggesting a limited effect of using surprisal values for eliciting appropriate prominence patterns.

CLApr 25, 2023
What does BERT learn about prosody?

Sofoklis Kakouros, Johannah O'Mahony

Language models have become nearly ubiquitous in natural language processing applications achieving state-of-the-art results in many tasks including prosody. As the model design does not define predetermined linguistic targets during training but rather aims at learning generalized representations of the language, analyzing and interpreting the representations that models implicitly capture is important in bridging the gap between interpretability and model performance. Several studies have explored the linguistic information that models capture providing some insights on their representational capacity. However, the current studies have not explored whether prosody is part of the structural information of the language that models learn. In this work, we perform a series of experiments on BERT probing the representations captured at different layers. Our results show that information about prosodic prominence spans across many layers but is mostly focused in middle layers suggesting that BERT relies mostly on syntactic and semantic information.

CVJun 30, 2025
Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization

Md Moinul Islam, Sofoklis Kakouros, Janne Heikkilä et al.

The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive method, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.

CLJun 2, 2025
Sounding Like a Winner? Prosodic Differences in Post-Match Interviews

Sofoklis Kakouros, Haoyu Chen

This study examines the prosodic characteristics associated with winning and losing in post-match tennis interviews. Additionally, this research explores the potential to classify match outcomes solely based on post-match interview recordings using prosodic features and self-supervised learning (SSL) representations. By analyzing prosodic elements such as pitch and intensity, alongside SSL models like Wav2Vec 2.0 and HuBERT, the aim is to determine whether an athlete has won or lost their match. Traditional acoustic features and deep speech representations are extracted from the data, and machine learning classifiers are employed to distinguish between winning and losing players. Results indicate that SSL representations effectively differentiate between winning and losing outcomes, capturing subtle speech patterns linked to emotional states. At the same time, prosodic cues -- such as pitch variability -- remain strong indicators of victory.

CLJun 2, 2025
Investigating the Impact of Word Informativeness on Speech Emotion Recognition

Sofoklis Kakouros

In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.

ASMay 19, 2023
North Sámi Dialect Identification with Self-supervised Speech Models

Sofoklis Kakouros, Katri Hiovain-Asikainen

The North Sámi (NS) language encapsulates four primary dialectal variants that are related but that also have differences in their phonology, morphology, and vocabulary. The unique geopolitical location of NS speakers means that in many cases they are bilingual in Sámi as well as in the dominant state language: Norwegian, Swedish, or Finnish. This enables us to study the NS variants both with respect to the spoken state language and their acoustic characteristics. In this paper, we investigate an extensive set of acoustic features, including MFCCs and prosodic features, as well as state-of-the-art self-supervised representations, namely, XLS-R, WavLM, and HuBERT, for the automatic detection of the four NS variants. In addition, we examine how the majority state language is reflected in the dialects. Our results show that NS dialects are influenced by the state language and that the four dialects are separable, reaching high classification accuracy, especially with the XLS-R model.

ASJun 29, 2020
Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Antti Suni, Sofoklis Kakouros, Martti Vainio et al.

Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech.Despite successes in this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible for end-to-end systems with a single sentence as their synthesis domain. One of the possible solutions might be conditioning the synthesized speech by explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels capturing word-level prominence and phrasal boundary strength can result in more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as an input to a tacotron-like synthesis system alongside textual information. The results of objective evaluation of synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.

CLAug 6, 2019
Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations

Aarne Talman, Antti Suni, Hande Celikkanat et al.

In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models ranging from feature-based classifiers to neural network systems for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models are publicly available.