Andrew Hines

h-index20

16papers

992citations

Novelty28%

AI Score45

Ranked #40,342 of 194,257 authors (top 21%)#167 in AS (top 12%)

16 Papers

1.2ASSep 22, 2023

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

Asad Ullah, Alessandro Ragano, Andrew Hines

Self-supervised representation learning (SSRL) has demonstrated superior performance than supervised models for tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely: pitch variation, noise addition, accented target language and other language speech to pre-train SSRL models in a low resource condition and evaluate phoneme recognition. Our comparisons found that a combined synthetic augmentations (noise/pitch) strategy outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data to achieve equivalent performance to model pre-trained with target domain speech. Our findings suggest that for resource-constrained languages, combined augmentations can be a viable option than other augmentations.

4.1SDSep 14, 2022

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

Michael Chinen, Jan Skoglund, Chandan K A Reddy et al.

Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.

6.3AIApr 27

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Aaryan Shah, Andrew Hines, Alexia Downs et al.

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders lower cost. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.

7.6AIApr 30

End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

Aaryan Shah, Andrew Hines, Alexia Downs et al.

Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.

2.3SDOct 26, 2021Code

AQP: An Open Modular Python Platform for Objective Speech and Audio Quality Metrics

Jack Geraghty, Jiazheng Li, Alessandro Ragano et al.

Audio quality assessment has been widely researched in the signal processing area. Full-reference objective metrics (e.g., POLQA, ViSQOL) have been developed to estimate the audio quality relying only on human rating experiments. To evaluate the audio quality of novel audio processing techniques, researchers constantly need to compare objective quality metrics. Testing different implementations of the same metric and evaluating new datasets are fundamental and ongoing iterative activities. In this paper, we present AQP - an open-source, node-based, light-weight Python pipeline for audio quality assessment. AQP allows researchers to test and compare objective quality metrics helping to improve robustness, reproducibility and development speed. We introduce the platform, explain the motivations, and illustrate with examples how, using AQP, objective quality metrics can be (i) compared and benchmarked; (ii) prototyped and adapted in a modular fashion; (iii) visualised and checked for errors. The code has been shared on GitHub to encourage adoption and contributions from the community.

22.6ASApr 20, 2020Code

ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric

Michael Chinen, Felicia S. C. Lim, Jan Skoglund et al.

Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively,) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. The feedback from internal production teams at Google has helped to improve this new release, and serves to show cases where it is most applicable, as well as to highlight limitations. The new model is benchmarked against real-world data for evaluation purposes. The trends and direction of future work is discussed.

8.6NIDec 5, 2019Code

5G network slicing using SDN and NFV- A survey of taxonomy, architectures and future challenges

Alcardo Alex Barakabitze, Arslan Ahmad, Rashid Mijumbi et al.

In this paper, we provide a comprehensive review and updated solutions related to 5G network slicing using SDN and NFV. Firstly, we present 5G service quality and business requirements followed by a description of 5G network softwarization and slicing paradigms including essential concepts, history and different use cases. Secondly, we provide a tutorial of 5G network slicing technology enablers including SDN, NFV, MEC, cloud/Fog computing, network hypervisors, virtual machines & containers. Thidly, we comprehensively survey different industrial initiatives and projects that are pushing forward the adoption of SDN and NFV in accelerating 5G network slicing. A comparison of various 5G architectural approaches in terms of practical implementations, technology adoptions and deployment strategies is presented. Moreover, we provide a discussion on various open source orchestrators and proof of concepts representing industrial contribution. The work also investigates the standardization efforts in 5G networks regarding network slicing and softwarization. Additionally, the article presents the management and orchestration of network slices in a single domain followed by a comprehensive survey of management and orchestration approaches in 5G network slicing across multiple domains while supporting multiple tenants. Furthermore, we highlight the future challenges and research directions regarding network softwarization and slicing using SDN and NFV in 5G networks.

1.2ASNov 17, 2025

Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization

Davoud Shariat Panah, Alessandro Ragano, Dan Barry et al.

This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.

1.2ASApr 15, 2025

Respiratory Inhaler Sound Event Classification Using Self-Supervised Learning

Davoud Shariat Panah, Alessandro N Franciosi, Cormac McCarthy et al.

Asthma is a chronic respiratory condition that affects millions of people worldwide. While this condition can be managed by administering controller medications through handheld inhalers, clinical studies have shown low adherence to the correct inhaler usage technique. Consequently, many patients may not receive the full benefit of their medication. Automated classification of inhaler sounds has recently been studied to assess medication adherence. However, the existing classification models were typically trained using data from specific inhaler types, and their ability to generalize to sounds from different inhalers remains unexplored. In this study, we adapted the wav2vec 2.0 self-supervised learning model for inhaler sound classification by pre-training and fine-tuning this model on inhaler sounds. The proposed model shows a balanced accuracy of 98% on a dataset collected using a dry powder inhaler and smartwatch device. The results also demonstrate that re-finetuning this model on minimal data from a target inhaler is a promising approach to adapting a generic inhaler sound classification model to a different inhaler device and audio capture hardware. This is the first study in the field to demonstrate the potential of smartwatches as assistive technologies for the personalized monitoring of inhaler adherence using machine learning models.

3.3NIJan 3, 2022

Supervised Learning based QoE Prediction of Video Streaming in Future Networks: A Tutorial with Comparative Study

Arslan Ahmad, Atif Bin Mansoor, Alcardo Alex Barakabitze et al.

The Quality of Experience (QoE) based service management remains key for successful provisioning of multimedia services in next-generation networks such as 5G/6G, which requires proper tools for quality monitoring, prediction and resource management where machine learning (ML) can play a crucial role. In this paper, we provide a tutorial on the development and deployment of the QoE measurement and prediction solutions for video streaming services based on supervised learning ML models. Firstly, we provide a detailed pipeline for developing and deploying supervised learning-based video streaming QoE prediction models which covers several stages including data collection, feature engineering, model optimization and training, testing and prediction and evaluation. Secondly, we discuss the deployment of the ML model for the QoE prediction/measurement in the next generation networks (5G/6G) using network enabling technologies such as Software-Defined Networking (SDN), Network Function Virtualization (NFV) and Mobile Edge Computing (MEC) by proposing reference architecture. Thirdly, we present a comparative study of the state-of-the-art supervised learning ML models for QoE prediction of video streaming applications based on multiple performance metrics.

5.1ASAug 19, 2021

More for Less: Non-Intrusive Speech Quality Assessment with Limited Annotations

Alessandro Ragano, Emmanouil Benetos, Andrew Hines

Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal represent some of the main challenges for designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle the problems above. In the first model, we first learn a feature representation with a degradation classifier on a large dataset. Then we perform MOS prediction and degradation classification simultaneously on a small dataset annotated with MOS. In the second approach, the initial stage consists of learning features with a deep clustering-based unsupervised feature representation on the large dataset. Next, we perform MOS prediction and cluster label classification simultaneously on a small dataset. The results show that the deep clustering-based model outperforms the degradation classifier-based model and the 3 baselines (autoencoder features, P.563, and SRMRnorm) on TCD-VoIP. This paper indicates that multi-task learning combined with feature representations from unlabelled data is a promising approach to deal with the lack of large MOS annotated datasets.

5.1MMJun 10, 2020

QUALINET White Paper on Definitions of Immersive Media Experience (IMEx)

Andrew Perkis, Christian Timmerer, Sabina Baraković et al.

With the coming of age of virtual/augmented reality and interactive media, numerous definitions, frameworks, and models of immersion have emerged across different fields ranging from computer graphics to literary works. Immersion is oftentimes used interchangeably with presence as both concepts are closely related. However, there are noticeable interdisciplinary differences regarding definitions, scope, and constituents that are required to be addressed so that a coherent understanding of the concepts can be achieved. Such consensus is vital for paving the directionality of the future of immersive media experiences (IMEx) and all related matters. The aim of this white paper is to provide a survey of definitions of immersion and presence which leads to a definition of immersive media experience (IMEx). The Quality of Experience (QoE) for immersive media is described by establishing a relationship between the concepts of QoE and IMEx followed by application areas of immersive media experience. Influencing factors on immersive media experience are elaborated as well as the assessment of immersive media experience. Finally, standardization activities related to IMEx are highlighted and the white paper is concluded with an outlook related to future developments.

1.2ASMar 26, 2020

Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders

Wissam A. Jassim, Jan Skoglund, Michael Chinen et al.

This study compares the performances of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are contrasted. Performance analysis of the coded speech is evaluated for different quality aspects: accuracy of pitch periods estimation, the word error rates for automatic speech recognition, and the influence of speaker gender and coding delays. A number of performance metrics of speech samples taken from a publicly available database were compared with subjective scores. Results from subjective quality assessment do not correlate well with existing full reference speech quality metrics. The results provide valuable insights into aspects of the speech signal that will be used to develop a novel metric to accurately predict speech quality from generative-model-based coders.

1.2MMMar 24, 2020

How deep is your encoder: an analysis of features descriptors for an autoencoder-based audio-visual quality metric

Helard Martinez, Andrew Hines, Mylene C. Q. Farias

The development of audio-visual quality assessment models poses a number of challenges in order to obtain accurate predictions. One of these challenges is the modelling of the complex interaction that audio and visual stimuli have and how this interaction is interpreted by human users. The No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd) deals with this problem from a machine learning perspective. The metric receives two sets of audio and video features descriptors and produces a low-dimensional set of features used to predict the audio-visual quality. A basic implementation of NAViDAd was able to produce accurate predictions tested with a range of different audio-visual databases. The current work performs an ablation study on the base architecture of the metric. Several modules are removed or re-trained using different configurations to have a better understanding of the metric functionality. The results presented in this study provided important feedback that allows us to understand the real capacity of the metric's architecture and eventually develop a much better audio-visual quality metric.

2.3ASMar 22, 2020

Audio Impairment Recognition Using a Correlation-Based Feature Representation

Alessandro Ragano, Emmanouil Benetos, Andrew Hines

Audio impairment recognition is based on finding noise in audio files and categorising the impairment type. Recently, significant performance improvement has been obtained thanks to the usage of advanced deep learning models. However, feature robustness is still an unresolved issue and it is one of the main reasons why we need powerful deep learning architectures. In the presence of a variety of musical styles, hand-crafted features are less efficient in capturing audio degradation characteristics and they are prone to failure when recognising audio impairments and could mistakenly learn musical concepts rather than impairment types. In this paper, we propose a new representation of hand-crafted features that is based on the correlation of feature pairs. We experimentally compare the proposed correlation-based feature representation with a typical raw feature representation used in machine learning and we show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage whilst achieving comparable accuracy.

5.9MMJan 30, 2020

NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder

Helard Martinez, M. C. Farias, A. Hines

The development of models for quality prediction of both audio and video signals is a fairly mature field. But, although several multimodal models have been proposed, the area of audio-visual quality prediction is still an emerging area. In fact, despite the reasonable performance obtained by combination and parametric metrics, currently there is no reliable pixel-based audio-visual quality metric. The approach presented in this work is based on the assumption that autoencoders, fed with descriptive audio and video features, might produce a set of features that is able to describe the complex audio and video interactions. Based on this hypothesis, we propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd). The model visual features are natural scene statistics (NSS) and spatial-temporal measures of the video component. Meanwhile, the audio features are obtained by computing the spectrogram representation of the audio component. The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer. These two layers are stacked and trained to build the deep neural network model. The model is trained and tested using a large set of stimuli, containing representative audio and video artifacts. The model performed well when tested against the UnB-AV and the LiveNetflix-II databases. %Results shows that this type of approach produces quality scores that are highly correlated to subjective quality scores.