Yiqi Liu

h-index 31

4 papers

161 citations

Novelty 38%

AI Score 28

Ranked #147,995 of 194,257 authors (top 76%) · #26,084 in CL (top 85%)

4 Papers

25.2 · SD · Jun 18, 2023 · Code

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Ruibin Yuan, Yinghao Ma, Yizhi Li et al. · deepmind, mila

In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research.

19.1 · CL · Nov 16, 2023

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

Yiqi Liu, Nafise Sadat Moosavi, Chenghua Lin

Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics (e.g. BARTScore, T5Score, and GPTScore) demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries. These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality, highlighting the necessity of developing more reliable evaluation protocols in the future.

3.3 · EM · Feb 14, 2024

Inference for an Algorithmic Fairness-Accuracy Frontier

Yiqi Liu, Francesca Molinari

Algorithms are increasingly used to aid with high-stakes decision making. Yet, their predictive ability frequently exhibits systematic variation across population subgroups. To assess the trade-off between fairness and accuracy using finite data, we propose a debiased machine learning estimator for the fairness-accuracy frontier introduced by Liang, Lu, Mu, and Okumura (2024). We derive its asymptotic distribution and propose inference methods to test key hypotheses in the fairness literature, such as (i) whether excluding group identity from use in training the algorithm is optimal and (ii) whether there are less discriminatory alternatives to a given algorithm. In addition, we construct an estimator for the distance between a given algorithm and the fairest point on the frontier, and characterize its asymptotic distribution. Using Monte Carlo simulations, we evaluate the finite-sample performance of our inference methods. We apply our framework to re-evaluate algorithms used in hospital care management and show that our approach yields alternative algorithms that lie on the fairness-accuracy frontier, offering improvements along both dimensions.

8.3 · CL · Apr 2, 2025

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Xiao Wang, Daniil Larionov, Siwei Wu et al.

Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.