Yun Hao

CL
h-index: 20
7 papers
27 citations
Novelty: 40%
AI Score: 37

7 Papers

58.5 CL May 19
Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Yun Hao, Reihaneh Amooie, Wietse de Vries et al.

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
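For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch of word error rate (WER) and of an N-best oracle WER. It assumes the common definitions (word-level edit distance divided by reference length, and the lowest WER among the candidate hypotheses); the paper may define its oracle differently. The function names and toy sentences are illustrative and do not come from the paper.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle_wer(reference, nbest):
    """Assumed N-best oracle: the best WER any single hypothesis achieves."""
    return min(wer(reference, hypothesis) for hypothesis in nbest)


# Toy example (made-up strings, not data from the paper).
reference = "a b c d"
nbest = ["a x c", "a b c e"]
print(wer(reference, nbest[0]))      # 0.5
print(oracle_wer(reference, nbest))  # 0.25
```

Under this reading, a corrected transcript that scores below the N-best oracle must contain words that no single hypothesis got right, which is why surpassing the oracle is notable.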

CV Aug 30, 2024
Hybrid Classification-Regression Adaptive Loss for Dense Object Detection

Yanquan Huang, Liu Wei Zhen, Yun Hao et al.

For object detectors, enhancing model performance hinges on the ability to simultaneously consider inconsistencies across tasks and focus on difficult-to-train samples. Achieving this necessitates incorporating information from both the classification and regression tasks. However, prior work tends to either emphasize difficult-to-train samples within their respective tasks or simply compute classification scores with IoU, often leading to suboptimal model performance. In this paper, we propose a Hybrid Classification-Regression Adaptive Loss, termed HCRAL. Specifically, we introduce the Residual of Classification and IoU (RCI) module for cross-task supervision, addressing task inconsistencies, and the Conditioning Factor (CF) to focus on difficult-to-train samples within each task. Furthermore, we introduce a new strategy named Expanded Adaptive Training Sample Selection (EATSS) to provide additional samples that exhibit classification and regression inconsistencies. To validate the effectiveness of the proposed method, we conduct extensive experiments on COCO test-dev. Experimental evaluations demonstrate the superiority of our approach. Additionally, we designed experiments by separately combining the classification and regression loss with regular loss functions in popular one-stage models, demonstrating improved performance.

MTRL-SCI Feb 25, 2025
Inverse Materials Design by Large Language Model-Assisted Generative Framework

Yun Hao, Che Fan, Beilin Ye et al.

Deep generative models hold great promise for inverse materials design, yet their efficiency and accuracy remain constrained by data scarcity and model architecture. Here, we introduce AlloyGAN, a closed-loop framework that integrates Large Language Model (LLM)-assisted text mining with Conditional Generative Adversarial Networks (CGANs) to enhance data diversity and improve inverse design. Taking alloy discovery as a case study, AlloyGAN systematically refines material candidates through iterative screening and experimental validation. For metallic glasses, the framework predicts thermodynamic properties with discrepancies of less than 8% from experiments, demonstrating its robustness. By bridging generative AI with domain knowledge and validation workflows, AlloyGAN offers a scalable approach to accelerate the discovery of materials with tailored properties, paving the way for broader applications in materials science.

CV Jun 3, 2025
MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection

Juntong Li, Lingwei Dang, Yukun Su et al.

Video Anomaly Detection (VAD) methods based on reconstruction or prediction face two critical challenges: (1) strong generalization capability often results in accurate reconstruction or prediction of abnormal events, making it difficult to distinguish normal from abnormal patterns; (2) reliance only on low-level appearance and motion cues limits their ability to identify high-level semantics of abnormal events in complex scenes. To address these limitations, we propose a novel VAD framework with two key innovations. First, to suppress excessive generalization, we introduce the Sparse Feature Filtering Module (SFFM) that employs bottleneck filters to dynamically and adaptively remove abnormal information from features. Unlike traditional memory modules, it does not need to memorize the normal prototypes across the training dataset. Further, we design the Mixture of Experts (MoE) architecture for SFFM. Each expert is responsible for extracting specialized principal features at run time, and different experts are selectively activated to ensure the diversity of the learned principal features. Second, to overcome the neglect of semantics in existing methods, we integrate a Vision-Language Model (VLM) to generate textual descriptions for video clips, enabling comprehensive joint modeling of semantic, appearance, and motion cues. Additionally, we enforce modality consistency through semantic similarity constraints and motion frame-difference contrastive loss. Extensive experiments on multiple public datasets validate the effectiveness of our multimodal joint modeling framework and sparse feature filtering paradigm. Project page at https://qzfm.github.io/sfn_vad_project_page/.

CL Feb 7, 2025
Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance

Reihaneh Amooie, Wietse de Vries, Yun Hao et al.

Automatic Speech Recognition (ASR) performance for low-resource languages is still far behind that of higher-resource languages such as English, due to a lack of sufficient labeled data. State-of-the-art methods deploy self-supervised transfer learning where a model pre-trained on large amounts of data is fine-tuned using little labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning an SSL-based model in order to improve the performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). We show that Frisian ASR performance can be improved by using multilingual (Frisian, Dutch, English and German) fine-tuning data and an auxiliary language identification task. In addition, our findings show that performance on dialectal speech suffers substantially, and, importantly, that this effect is moderated by the elicitation approach used to collect the dialectal data. Our findings also suggest that relying solely on standard language data for ASR evaluation may underestimate real-world performance, particularly in languages with substantial dialectal variation.

IR Dec 22, 2017
Integrating Knowledge from Latent and Explicit Features for Triple Scoring - Team Radicchio's Triple Scorer at WSDM Cup 2017

Liang-Wei Chen, Bhargav Mangipudi, Jayachandu Bandlamudi et al.

The objective of the triple scoring task in WSDM Cup 2017 is to compute relevance scores for knowledge-base triples of type-like relations. For example, consider Julius Caesar, who had various professions, including Politician and Author. For two given triples (Julius Caesar, profession, Politician) and (Julius Caesar, profession, Author), the former triple is likely to have a higher relevance score (also called "triple score") because Julius Caesar was well known as a politician and not as an author. Accurate prediction of such triple scores greatly benefits real-world applications, such as information retrieval or knowledge base querying. In these scenarios, being able to rank all relations (Profession/Nationality) can help improve the user experience. We propose a triple scoring model which integrates knowledge from both latent features and explicit features via an ensemble approach. The latent features consist of representations for a person learned by using a word2vec model and representations for profession/nationality values extracted from a pre-trained GloVe embedding model. In addition, we extract explicit features for person entities from the Freebase knowledge base. Experimental results show that the proposed method performs competitively at WSDM Cup 2017, ranking in third place with an accuracy of 79.72% for predicting within two places of the ground truth score.
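The accuracy figure quoted at the end of this abstract counts a prediction as correct when it lands within two places of the ground-truth score. A minimal sketch of that tolerance-based accuracy follows, assuming integer relevance scores; the function name and the toy score lists are illustrative and are not taken from the paper or the competition data.

```python
def accuracy_within(predicted, truth, tolerance=2):
    """Fraction of predictions whose score differs from the truth by at most `tolerance`."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one triple is required")
    hits = sum(1 for p, t in zip(predicted, truth) if abs(p - t) <= tolerance)
    return hits / len(truth)


# Toy example with made-up scores: absolute differences are 2, 0, 4 and 1.
predicted_scores = [7, 2, 0, 5]
true_scores = [5, 2, 4, 6]
print(accuracy_within(predicted_scores, true_scores))  # 0.75
```

A tolerance of two makes the metric forgiving of small ranking disagreements while still penalizing predictions that place a marginal relation at the top of the scale.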

CY Feb 13, 2016
Urban sidewalks: visualization and routing for individuals with limited mobility

Nicholas Bolten, Amirhossein Amini, Yun Hao et al.

People with limited mobility in the U.S. (defined as having difficulty or inability to walk a quarter of a mile without help and without the use of special equipment) face a growing informational gap: while pedestrian routing algorithms are getting faster and more informative, planning a route with a wheeled device in urban centers is very difficult due to lack of integrated pertinent information regarding accessibility along the route. Moreover, reduced access to street spaces translates to reduced access to other public information and services that are increasingly made available to the public along urban streets. To adequately plan a commute, a traveler with limited or wheeled mobility must know whether her path may be blocked by construction, whether the sidewalk would be too steep or rendered unusable due to poor conditions, whether the street can be crossed or a highway is blocking the way, or whether there is a sidewalk at all. These details populate different datasets in many modern municipalities, but they are not immediately available in a convenient, integrated format to be useful to people with limited mobility. Our project, AccessMap, in its first phase (v.1) overlaid the information that is most relevant to people with limited mobility on a map, enabling self-planning of routes. Here, we describe the next phase of the project: synthesizing commonly available open data (including streets, sidewalks, curb ramps, elevation data, and construction permit information) to generate a graph of paths to enable variable cost-function accessible routing.
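The idea of variable cost-function routing over a graph of paths can be sketched with Dijkstra's algorithm and a pluggable edge-cost function. This is an illustration only, not AccessMap's implementation: the edge attributes (`length`, `incline`, `construction`), the 8% incline threshold, and the toy graph are all hypothetical.

```python
import heapq


def shortest_path(graph, start, goal, cost_fn):
    """Dijkstra's algorithm; cost_fn(attrs) returns a cost, or None if the edge is impassable."""
    queue = [(0.0, start, [start])]
    best = {start: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, attrs in graph.get(node, []):
            step = cost_fn(attrs)
            if step is None:
                continue  # this traveler cannot use the edge
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return None  # no usable route


def walking_cost(attrs):
    """Hypothetical profile: cost is simply the segment length in meters."""
    return attrs["length"]


def wheelchair_cost(attrs, max_incline=0.08):
    """Hypothetical profile: reject blocked or steep segments, otherwise use length."""
    if attrs.get("construction") or abs(attrs.get("incline", 0.0)) > max_incline:
        return None
    return attrs["length"]


# Toy sidewalk graph (made-up values): the short route A-B-D includes a steep segment.
GRAPH = {
    "A": [("B", {"length": 100, "incline": 0.12}), ("C", {"length": 150, "incline": 0.02})],
    "B": [("D", {"length": 100, "incline": 0.00})],
    "C": [("D", {"length": 150, "incline": 0.03})],
}

print(shortest_path(GRAPH, "A", "D", walking_cost))     # (200.0, ['A', 'B', 'D'])
print(shortest_path(GRAPH, "A", "D", wheelchair_cost))  # (300.0, ['A', 'C', 'D'])
```

Keeping the cost function separate from the search means the same graph can serve travelers with different constraints; only the function passed to `shortest_path` changes.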