Shiran Liu

SE
5 papers
18 citations
Novelty 37%
AI Score 42

5 Papers

57.9 CL Apr 7
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello et al.

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
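The abstract does not say how mixed batching composes its batches. The following is a minimal sketch of one plausible reading, in which every batch holds a fixed number of paired speech-text examples alongside text-only examples. The function name `mixed_batches`, the `paired_per_batch` parameter, and the cycling of the small paired pool are all assumptions made for illustration, not the authors' method.

```python
import random
from itertools import cycle


def mixed_batches(text_only, paired, batch_size=8, paired_per_batch=2, seed=0):
    """Yield batches that mix text-only and paired speech-text examples.

    Assumption (not from the paper): each batch carries a fixed number of
    paired examples, and the small paired pool is cycled so that every
    batch contains some speech signal.
    """
    if not 0 < paired_per_batch < batch_size:
        raise ValueError("paired_per_batch must be between 1 and batch_size - 1")
    text_pool = list(text_only)
    paired_pool = list(paired)
    if not paired_pool:
        raise ValueError("at least one paired example is required")

    rng = random.Random(seed)
    rng.shuffle(text_pool)
    rng.shuffle(paired_pool)
    paired_iter = cycle(paired_pool)

    text_per_batch = batch_size - paired_per_batch
    for start in range(0, len(text_pool), text_per_batch):
        # Tag each example with its modality so the training step can route it.
        batch = [("text", t) for t in text_pool[start:start + text_per_batch]]
        batch += [("paired", next(paired_iter)) for _ in range(paired_per_batch)]
        rng.shuffle(batch)
        yield batch
```

In this sketch the last batch may be smaller than `batch_size` when the text pool does not divide evenly; a real training loop would choose whether to drop or pad it.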

SE Jun 18, 2020 Code
Prioritizing documentation effort: Can we do better?

Shiran Liu, Zhaoqiang Guo, Yanhui Li et al.

Code documentation is essential for software quality assurance, but due to time or economic pressures, developers are often unable to write documentation for all modules in a project. Recently, a supervised artificial neural network (ANN) approach was proposed to prioritize important modules for documentation effort. However, as a supervised approach, it requires labeled training data to train the prediction model, which may not be easy to obtain in practice. Furthermore, it is unclear whether the ANN approach is generalizable, as it has only been evaluated on several small data sets. In this paper, we propose an unsupervised approach based on PageRank to prioritize documentation effort. This approach identifies "important" modules based only on the dependence relationships between modules in a project. As a result, the PageRank approach does not need any training data to build the prediction model. To evaluate the effectiveness of the PageRank approach, we conduct experiments on six additional large data sets, in addition to the data sets collected from open-source projects that were used in prior studies. The experimental results show that the PageRank approach is superior to the state-of-the-art ANN approach in prioritizing important modules for documentation effort. In particular, given its simplicity and effectiveness, we advocate that the PageRank approach be used as an easy-to-implement baseline in future research on documentation effort prioritization, and that any new approach be compared with it to demonstrate its effectiveness.

SE Oct 29, 2019 Code
MAT: A simple yet strong baseline for identifying self-admitted technical debt

Zhaoqiang Guo, Shiran Liu, Jinping Liu et al.

In the process of software evolution, developers often sacrifice long-term code quality to satisfy short-term goals for specific reasons; this is called technical debt. In particular, self-admitted technical debt (SATD) refers to debt that was intentionally introduced and flagged in code comments. Such technical debt reduces the quality of software and increases the cost of subsequent software maintenance. Therefore, it is necessary to find and resolve this debt in time. Recently, many approaches have been proposed to identify SATD. However, those approaches either have low accuracy or are complex to implement in practice. In this paper, we propose a simple unsupervised baseline approach that fuzzily matches task annotation tags (MAT) to identify SATD. MAT does not need any training data to build a prediction model. Instead, MAT only examines whether any of four task tags (i.e., TODO, FIXME, XXX, and HACK) appears in the comments of a target project to identify SATD. In this sense, MAT is a natural and easily understood baseline approach for SATD identification. To evaluate the usefulness of MAT, we conduct experiments on 10 open-source projects. The experimental results reveal that MAT performs surprisingly well for SATD identification compared with the state-of-the-art approaches. As such, we suggest that, in future SATD identification studies, MAT be considered as an easy-to-implement baseline against which any new approach should be compared to demonstrate its usefulness.
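The tag check described above can be sketched in a few lines. The abstract names the four tags but does not spell out the fuzzy matching rule, so the rule used here (lowercase the comment, split on non-letters, and accept a token that starts or ends with a tag) is an assumption, as is the function name `is_satd`.

```python
import re

# The four task annotation tags named in the abstract, lowercased.
TASK_TAGS = ("todo", "fixme", "xxx", "hack")


def is_satd(comment):
    """Return True if a code comment contains a task tag.

    Assumed fuzzy rule (not confirmed by the abstract): matching is
    case-insensitive, and a token counts if it starts or ends with a tag.
    """
    tokens = re.split(r"[^a-z]+", comment.lower())
    return any(
        token.startswith(tag) or token.endswith(tag)
        for token in tokens
        if token
        for tag in TASK_TAGS
    )
```

Under this rule, `is_satd("// TODO: handle null")` and `is_satd("/* FIXME!!! */")` are flagged, while `is_satd("returns the sum")` is not. A looser rule would catch more variants at the cost of false positives on ordinary words.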

34.4 CL Apr 23
Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil et al.

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
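For reference, WER is the word-level edit distance between a hypothesis and the reference, divided by the number of reference words. The sketch below uses the standard dynamic-programming formulation with whitespace tokenization; the tokenization and the function name `wer` are assumptions, and the example sentences are invented for illustration rather than taken from HATS.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative case of WER ignoring meaning: dropping "not" flips the sense
# but costs one edit, while the meaning-preserving contraction costs two.
reference = "i do not like it"
meaning_flipped = "i do like it"
meaning_kept = "i don't like it"
```

With these sentences, `wer(reference, meaning_flipped)` is lower than `wer(reference, meaning_kept)`, so a WER-based selector would prefer the hypothesis that reverses the meaning.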

SE Jan 27, 2021
An extensive empirical study of inconsistent labels in multi-version-project defect data sets

Shiran Liu, Zhaoqiang Guo, Yanhui Li et al.

The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detect instances with inconsistent labels (i.e., the phenomenon of instances having the same source code but different labels over multiple versions of a software project) and to understand their influence on the evaluation and interpretation of defect prediction models. Based on five multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse approaches, we find that: (1) most versions in the investigated defect data sets contain inconsistent labels to varying degrees; (2) the existence of inconsistent labels in a training data set may considerably change the prediction performance of a defect prediction model and can lead to the identification of substantially different true defective modules; and (3) the importance ranking of independent variables in a defect prediction model can be substantially shifted due to the existence of inconsistent labels. These findings reveal that inconsistent labels in defect data sets can profoundly change the prediction ability and interpretation of a defect prediction model. Therefore, we strongly suggest that practitioners detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction models. Moreover, researchers need to improve existing defect label collection approaches to reduce inconsistent labels. Furthermore, there is a need to re-examine the experimental conclusions of previous studies that used multi-version-project defect data sets with a high ratio of inconsistent labels.
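The definition of an inconsistent label (same source code, different labels across versions) suggests a simple grouping check, sketched below. The record layout `(version, module, source, label)`, the use of an exact SHA-256 match on the source text as the notion of "same source code", and the function name `find_inconsistent_labels` are assumptions for illustration; the abstract does not describe the paper's actual detection procedure.

```python
import hashlib
from collections import defaultdict


def find_inconsistent_labels(instances):
    """Find modules whose identical source code carries different labels.

    `instances` is an iterable of (version, module, source, label) tuples.
    Assumption: two instances have "the same source code" when the module
    name and the exact source text both match.
    """
    groups = defaultdict(list)
    for version, module, source, label in instances:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        groups[(module, digest)].append((version, label))

    inconsistent = []
    for (module, _digest), members in groups.items():
        # More than one distinct label for identical code is an inconsistency.
        if len({label for _version, label in members}) > 1:
            inconsistent.append((module, sorted(members)))
    return inconsistent
```

For example, if a hypothetical `Foo.java` is unchanged between versions 1.0 and 1.1 but is labeled `clean` in one and `buggy` in the other, it is reported; a module whose code changed between versions is not, since a label change there may be legitimate.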