AS · May 17
SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
Kuan-Yu Chen, Jeng-Lin Li, De-Yan Lu et al.
With the rapid development of zero-shot text-to-speech technologies, it is possible to generate high-quality speech signals that are indistinguishable from real ones. Speech editing, including speech insertion and replacement, appeals to researchers due to its potential applications. However, existing studies have considered only clean speech scenarios. In real-world applications, environmental noise can significantly degrade generation quality. In this study, we propose SeamlessEdit, a noise-resilient framework for editing noisy speech. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-context refinement strategy. It handles the scenario where the frequency bands of voice and background noise are not separated. The proposed SeamlessEdit framework outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
SP · Mar 21 · Code
The Binding Effect: Analyzing How Multi-Dimensional Cues Form Gender Bias in Instruction TTS
Kuan-Yu Chen, Yi-Cheng Lin, Po-Chung Hsieh et al.
Current bias evaluations in Instruction Text-to-Speech (ITTS) often rely on univariate testing, overlooking the compositional structure of social cues. In this work, we investigate gender bias by modeling prompts as combinations of Social Status, Career stereotypes, and Persona descriptors. Analyzing open-source ITTS models, we uncover systematic interaction effects where social dimensions modulate one another, creating complex bias patterns missed by univariate baselines. Crucially, our findings indicate that these biases extend beyond surface-level artifacts, demonstrating strong associations with the semantic priors of pre-trained text encoders and the skewed distributions inherent in training data. We further demonstrate that generic diversity prompting is insufficient to override these entrenched patterns, underscoring the need for compositional analysis to diagnose latent risks in generative speech.
LG · Sep 19, 2024
Learning Multi-Manifold Embedding for Out-Of-Distribution Detection
Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Detecting out-of-distribution (OOD) samples is crucial for trustworthy AI in real-world applications. Leveraging recent advances in representation learning and latent embeddings, various scoring algorithms estimate distributions beyond the training data. However, a single embedding space falls short in characterizing in-distribution data and defending against diverse OOD conditions. This paper introduces a novel Multi-Manifold Embedding Learning (MMEL) framework, optimizing hypersphere and hyperbolic spaces jointly for enhanced OOD detection. MMEL generates representative embeddings and employs a prototype-aware scoring function to differentiate OOD samples. It operates with very few OOD samples and requires no model retraining. Experiments on six open datasets demonstrate MMEL's significant reduction in FPR while maintaining a high AUC compared to state-of-the-art distance-based OOD detection methods. We analyze the effects of learning multiple manifolds and visualize OOD score distributions across datasets. Notably, enrolling ten OOD samples without retraining achieves comparable FPR and AUC to modern outlier exposure methods using 80 million outlier samples for model training.
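To make the idea of prototype-based, distance-style OOD scoring concrete, here is a minimal sketch. It assumes unit-normalized embeddings on a hypersphere and scores a sample by its cosine similarity to the nearest class prototype. The function names are hypothetical, the hyperbolic branch is omitted, and this is not MMEL's actual scoring function.

```python
import math


def l2_normalize(vec):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def prototype_ood_score(embedding, prototypes):
    """Return 1 - max cosine similarity to any prototype.

    Higher values mean the sample lies farther from every
    in-distribution prototype, i.e. it looks more OOD.
    """
    e = l2_normalize(embedding)
    best = max(
        sum(a * b for a, b in zip(e, l2_normalize(p)))
        for p in prototypes
    )
    return 1.0 - best


# Hypothetical two-class example: one prototype per class.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
print(prototype_ood_score([2.0, 0.0], prototypes))    # aligned with a prototype
print(prototype_ood_score([-1.0, -1.0], prototypes))  # far from both prototypes
```

A sample aligned with a prototype scores near 0, while one pointing away from all prototypes scores above 1; a threshold on this score would separate the two groups.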
CL · Sep 19, 2024
Profiling Patient Transcript Using Large Language Model Reasoning Augmentation for Alzheimer's Disease Detection
Chin-Po Chen, Jeng-Lin Li
Alzheimer's disease (AD) stands as the predominant cause of dementia, characterized by a gradual decline in speech and language capabilities. Recent deep-learning advancements have facilitated automated AD detection through spontaneous speech. However, common transcript-based detection methods directly model text patterns in each utterance without a global view of the patient's linguistic characteristics, resulting in limited discriminability and interpretability. Despite the enhanced reasoning abilities of large language models (LLMs), there remains a gap in fully harnessing the reasoning ability to facilitate AD detection and model interpretation. Therefore, we propose a patient-level transcript profiling framework leveraging LLM-based reasoning augmentation to systematically elicit linguistic deficit attributes. The summarized embeddings of the attributes are integrated into an ALBERT model for AD detection. The framework achieves 8.51% ACC and 8.34% F1 improvements on the ADReSS dataset compared to the baseline without reasoning augmentation. Our further analysis shows the effectiveness of our identified linguistic deficit attributes and the potential to use LLMs for interpreting AD detection.
AI · Nov 11, 2025
Dual-Process Scaffold Reasoning for Enhancing LLM Code Debugging
Po-Chung Hsieh, Chin-Po Chen, Jeng-Lin Li et al.
Recent LLMs have demonstrated sophisticated problem-solving capabilities on various benchmarks through advanced reasoning algorithms. However, the key research question of identifying reasoning steps that balance complexity and computational efficiency remains unsolved. Recent research has increasingly drawn upon psychological theories to explore strategies for optimizing cognitive pathways. The LLM's final outputs and intermediate steps are regarded as System 1 and System 2, respectively. However, an in-depth exploration of System 2 reasoning is still lacking. Therefore, we propose a novel psychologically backed Scaffold Reasoning framework for code debugging, which encompasses the Scaffold Stream, Analytic Stream, and Integration Stream. The reference code constructed within the Scaffold Stream is integrated with the buggy code analysis results produced by the Analytic Stream through the Integration Stream. Our framework achieves an 88.91% pass rate and an average inference time of 5.36 seconds per problem on DebugBench, outperforming other reasoning approaches across various LLMs in both reasoning accuracy and efficiency. Further analyses elucidate the advantages and limitations of various cognitive pathways across varying problem difficulties and bug types. Our findings also corroborate the alignment of the proposed Scaffold Reasoning framework with human cognitive processes.
CV · Nov 10, 2025
How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions
Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Text-to-image generative models often exhibit bias related to sensitive attributes. However, current research tends to focus narrowly on single-object prompts with limited contextual diversity. In reality, each object or attribute within a prompt can contribute to bias. For example, the prompt "an assistant wearing a pink hat" may reflect female-inclined biases associated with a pink hat. The neglected joint effects of the semantic binding in the prompts cause significant failures in current debiasing approaches. This work initiates a preliminary investigation on how bias manifests under semantic binding, where contextual associations between objects and attributes influence generative outcomes. We demonstrate that the underlying bias distribution can be amplified based on these associations. Therefore, we introduce a bias adherence score that quantifies how specific object-attribute bindings activate bias. To delve deeper, we develop a training-free context-bias control framework to explore how token decoupling can facilitate the debiasing of semantic bindings. This framework achieves over 10% debiasing improvement in compositional generation tasks. Our analysis of bias scores across various attribute-object bindings and token decorrelation highlights a fundamental challenge: reducing bias without disrupting essential semantic relationships. These findings expose critical limitations in current debiasing approaches when applied to semantically bound contexts, underscoring the need to reassess prevailing bias mitigation strategies.
CV · Dec 4, 2024
Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
Po-Hsuan Huang, Jeng-Lin Li, Chin-Po Chen et al.
Recent advancements in large vision-language models (LVLMs) have significantly enhanced their ability to comprehend visual inputs alongside natural language. However, a major challenge in their real-world application is hallucination, where LVLMs generate non-existent visual elements, eroding user trust. The underlying mechanism driving this multimodal hallucination is poorly understood. Little research has examined whether contexts such as sky, trees, or a grass field lead an LVLM to hallucinate a frisbee. We hypothesize that hidden factors, such as objects, contexts, and semantic foreground-background structures, induce hallucination. This study proposes a novel causal approach: a hallucination probing system to identify these hidden factors. By analyzing the causality between images, text prompts, and network saliency, we systematically explore interventions to block these factors. Our experimental findings show that a straightforward technique based on our analysis can significantly reduce hallucinations. Additionally, our analyses indicate the potential to edit network internals to minimize hallucinated outputs.
LG · Feb 20, 2024
A Comprehensive Review of Machine Learning Advances on Data Change: A Cross-Field Perspective
Jeng-Lin Li, Chih-Fan Hsu, Ming-Ching Chang et al.
Recent artificial intelligence (AI) technologies show remarkable evolution in various academic fields and industries. However, in the real world, dynamic data pose major challenges for deploying AI models. An unexpected data change brings about severe performance degradation in AI models. We identify two major related research fields, domain shift and concept drift, according to the setting of the data change. Although these two popular research fields aim to solve distribution shift and non-stationary data stream problems, respectively, the underlying properties remain similar, which also encourages similar technical approaches. In this review, we regroup domain shift and concept drift into a single research problem, namely the data change problem, with a systematic overview of state-of-the-art methods in the two research fields. We propose a three-phase problem categorization scheme to link the key ideas in the two technical fields. We thus provide a novel scope for researchers to explore contemporary technical strategies, learn industrial applications, and identify future directions for addressing data change challenges.
CV · Sep 30, 2025
PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang et al.
Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
LG · Aug 24, 2025
Sharpness-Aware Geometric Defense for Robust Out-Of-Distribution Detection
Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Out-of-distribution (OOD) detection ensures safe and reliable model deployment. Contemporary OOD algorithms using geometry projection can detect OOD or adversarial samples from clean in-distribution (ID) samples. However, this setting regards adversarial ID samples as OOD, leading to incorrect OOD predictions. Existing efforts on OOD detection with ID and OOD data under attacks are minimal. In this paper, we develop a robust OOD detection method that distinguishes adversarial ID samples from OOD ones. The sharp loss landscape created by adversarial training hinders model convergence, impacting the latent embedding quality for OOD score calculation. Therefore, we introduce a Sharpness-aware Geometric Defense (SaGD) framework to smooth out the rugged adversarial loss landscape in the projected latent geometry. Enhanced geometric embedding convergence enables accurate ID data characterization, benefiting OOD detection against adversarial attacks. We use Jitter-based perturbation in adversarial training to extend the defense ability against unseen attacks. Our SaGD framework significantly improves FPR and AUC over the state-of-the-art defense approaches in differentiating CIFAR-100 from six other OOD datasets under various attacks. We further examine the effects of perturbations at various adversarial training levels, revealing the relationship between the sharp loss landscape and adversarial OOD detection.
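For readers unfamiliar with the FPR metric reported in OOD detection, the sketch below shows one common way it is computed: the false positive rate at a fixed true positive rate (often 95%). It assumes that a higher score means "more OOD" and that ID samples are the positive class; the abstract does not state the exact operating point, so the 95% default and the function name are assumptions.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when `tpr` of ID samples are kept.

    Scores follow the convention "higher = more OOD". A sample is
    accepted as in-distribution when its score is <= the threshold.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores)
    # Number of ID samples that must be accepted; the small epsilon
    # guards against floating-point error in tpr * len(ranked).
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s <= threshold)
    return accepted_ood / len(ood_scores)


# Hypothetical scores: 20 ID samples and 4 OOD samples.
id_scores = list(range(1, 21))
ood_scores = [5, 19, 25, 30]
print(fpr_at_tpr(id_scores, ood_scores))  # threshold is 19, so 2 of 4 OOD pass: 0.5
```

A lower value is better: it means fewer OOD samples slip through at the chosen ID acceptance rate.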
CV · Jul 22, 2025
LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation
Jyun-Ze Tang, Chih-Fan Hsu, Jeng-Lin Li et al.
Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose Latent Space Scaling Generation (LSSGen), a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating 1024×1024 images at similar speeds, it achieves up to 246% TOPIQ score improvement.
CV · Jan 16, 2024
Improving Limited Supervised Foot Ulcer Segmentation Using Cross-Domain Augmentation
Shang-Jui Kuo, Po-Han Huang, Chia-Ching Lin et al.
Diabetic foot ulcers pose health risks, including higher morbidity, mortality, and amputation rates. Monitoring wound areas is crucial for proper care, but manual segmentation is subjective due to complex wound features and background variation. Expert annotations are costly and time-intensive, thus hampering large dataset creation. Existing segmentation models relying on extensive annotations are impractical in real-world scenarios with limited annotated data. In this paper, we propose a cross-domain augmentation method named TransMix that combines Augmented Global Pre-training (AGP) and Localized CutMix Fine-tuning (LCF) to enrich wound segmentation data for model learning. TransMix can effectively improve foot ulcer segmentation model training by leveraging other dermatology datasets that do not contain ulcers or wounds. AGP effectively increases the overall image variability, while LCF increases the diversity of wound regions. Experimental results show that TransMix increases the variability of wound regions and substantially improves the Dice score for models trained with only 40 annotated images under various proportions.
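The Dice score mentioned above measures the overlap between a predicted mask and a ground-truth mask as 2|A∩B| / (|A| + |B|). A minimal sketch for binary masks stored as nested lists follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two binary masks given as lists of rows."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    total = sum(pred) + sum(true)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 2x2 example: prediction covers two pixels, truth covers one.
pred = [[1, 1],
        [0, 0]]
true = [[1, 0],
        [0, 0]]
print(dice_score(pred, true))  # 2 * 1 / (2 + 1)
```

The score ranges from 0 (no overlap) to 1 (identical masks), so an over-segmented wound region is penalized even when it fully covers the true region.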