Alessandro Ragano

AS
h-index8
6papers
31citations
Novelty36%
AI Score34

6 Papers

ASSep 22, 2023
Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

Asad Ullah, Alessandro Ragano, Andrew Hines

Self-supervised representation learning (SSRL) has demonstrated superior performance than supervised models for tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely: pitch variation, noise addition, accented target language and other language speech to pre-train SSRL models in a low resource condition and evaluate phoneme recognition. Our comparisons found that a combined synthetic augmentations (noise/pitch) strategy outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data to achieve equivalent performance to model pre-trained with target domain speech. Our findings suggest that for resource-constrained languages, combined augmentations can be a viable option than other augmentations.

SDSep 14, 2022
Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

Michael Chinen, Jan Skoglund, Chandan K A Reddy et al.

Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.

SDOct 26, 2021Code
AQP: An Open Modular Python Platform for Objective Speech and Audio Quality Metrics

Jack Geraghty, Jiazheng Li, Alessandro Ragano et al.

Audio quality assessment has been widely researched in the signal processing area. Full-reference objective metrics (e.g., POLQA, ViSQOL) have been developed to estimate the audio quality relying only on human rating experiments. To evaluate the audio quality of novel audio processing techniques, researchers constantly need to compare objective quality metrics. Testing different implementations of the same metric and evaluating new datasets are fundamental and ongoing iterative activities. In this paper, we present AQP - an open-source, node-based, light-weight Python pipeline for audio quality assessment. AQP allows researchers to test and compare objective quality metrics helping to improve robustness, reproducibility and development speed. We introduce the platform, explain the motivations, and illustrate with examples how, using AQP, objective quality metrics can be (i) compared and benchmarked; (ii) prototyped and adapted in a modular fashion; (iii) visualised and checked for errors. The code has been shared on GitHub to encourage adoption and contributions from the community.

ASNov 17, 2025
Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization

Davoud Shariat Panah, Alessandro Ragano, Dan Barry et al.

This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.

ASAug 19, 2021
More for Less: Non-Intrusive Speech Quality Assessment with Limited Annotations

Alessandro Ragano, Emmanouil Benetos, Andrew Hines

Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal represent some of the main challenges for designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle the problems above. In the first model, we first learn a feature representation with a degradation classifier on a large dataset. Then we perform MOS prediction and degradation classification simultaneously on a small dataset annotated with MOS. In the second approach, the initial stage consists of learning features with a deep clustering-based unsupervised feature representation on the large dataset. Next, we perform MOS prediction and cluster label classification simultaneously on a small dataset. The results show that the deep clustering-based model outperforms the degradation classifier-based model and the 3 baselines (autoencoder features, P.563, and SRMRnorm) on TCD-VoIP. This paper indicates that multi-task learning combined with feature representations from unlabelled data is a promising approach to deal with the lack of large MOS annotated datasets.

ASMar 22, 2020
Audio Impairment Recognition Using a Correlation-Based Feature Representation

Alessandro Ragano, Emmanouil Benetos, Andrew Hines

Audio impairment recognition is based on finding noise in audio files and categorising the impairment type. Recently, significant performance improvement has been obtained thanks to the usage of advanced deep learning models. However, feature robustness is still an unresolved issue and it is one of the main reasons why we need powerful deep learning architectures. In the presence of a variety of musical styles, hand-crafted features are less efficient in capturing audio degradation characteristics and they are prone to failure when recognising audio impairments and could mistakenly learn musical concepts rather than impairment types. In this paper, we propose a new representation of hand-crafted features that is based on the correlation of feature pairs. We experimentally compare the proposed correlation-based feature representation with a typical raw feature representation used in machine learning and we show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage whilst achieving comparable accuracy.