Xinyi Wen

CV
h-index: 9
Papers: 4
Citations: 4
Novelty: 65%
AI Score: 50

4 Papers

CV · Jan 30
Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Shiyu Liu, Xinyi Wen, Zhibin Lan et al.

Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in image captioning: models generate descriptions of non-existent objects, which compromises their reliability. Previous work attributes this to LVLMs' over-reliance on language priors and attempts to mitigate it through logits calibration, but it still lacks a thorough analysis of that over-reliance. To gain a deeper understanding, we conduct a series of preliminary experiments, which indicate that as the generation length increases, LVLMs' over-reliance on language priors inflates the probability of hallucinated object tokens and consequently exacerbates object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification, which enables LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates the existence of objects in sampled candidate captions and then mitigates object hallucination via caption selection or aggregation. Experimental results demonstrate that our framework significantly mitigates object hallucination in image captioning (e.g., a 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B), surpassing previous state-of-the-art methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
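The caption-selection step described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `extract_objects`, `score_caption`, `select_caption`, the vocabulary, and the toy confidence table are all hypothetical, and the `verify` callable merely stands in for Language-Prior-Free Verification by returning a confidence that an object exists, queried independently of the caption text.

```python
from typing import Callable, List, Tuple


def extract_objects(caption: str, vocabulary: List[str]) -> List[str]:
    """Hypothetical extractor: vocabulary words that occur in the caption."""
    words = set(caption.lower().replace(".", " ").replace(",", " ").split())
    return [obj for obj in vocabulary if obj in words]


def score_caption(
    caption: str,
    vocabulary: List[str],
    verify: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[float, int]:
    """Return (hallucination rate, number of verified objects) for one caption."""
    objects = extract_objects(caption, vocabulary)
    verified = sum(1 for obj in objects if verify(obj) >= threshold)
    rate = 0.0 if not objects else 1.0 - verified / len(objects)
    return rate, verified


def select_caption(
    captions: List[str],
    vocabulary: List[str],
    verify: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Pick the candidate with the lowest hallucination rate.

    Ties are broken in favour of the caption with more verified objects.
    """
    def key(caption: str) -> Tuple[float, int]:
        rate, verified = score_caption(caption, vocabulary, verify, threshold)
        return (rate, -verified)

    return min(captions, key=key)


# Toy stand-in for the verifier: per-object existence confidence (assumed values).
toy_confidence = {"dog": 0.9, "bench": 0.8, "frisbee": 0.2, "car": 0.1}
vocabulary = list(toy_confidence)
captions = [
    "A dog sits on a bench.",
    "A dog and a frisbee on a bench near a car.",
]

chosen = select_caption(captions, vocabulary, lambda obj: toy_confidence[obj])
print(chosen)
```

In this toy setup the longer caption mentions two objects that fail verification, so the shorter caption is selected. An aggregation variant would instead merge only the verified objects across candidates.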

HC · Feb 4
Adaptive Prompt Elicitation for Text-to-Image Generation

Xinyi Wen, Lena Hegemann, Xiaofu Jin et al.

Aligning text-to-image generation with user intent remains challenging, especially for users who provide ambiguous inputs and struggle with model idiosyncrasies. We propose Adaptive Prompt Elicitation (APE), a technique that adaptively asks visual queries to help users refine prompts without extensive writing. Our technical contribution is a formulation of interactive intent inference under an information-theoretic framework. APE represents latent intent as interpretable feature requirements using language model priors, adaptively generates visual queries, and compiles elicited requirements into effective prompts. Evaluation on IDEA-Bench and DesignBench shows that APE achieves stronger alignment with improved efficiency. A user study with challenging user-defined tasks demonstrates 19.8% higher alignment without workload overhead. Our work contributes a principled approach to prompting that, for general users, offers an effective and efficient complement to the prevailing prompt-based interaction paradigm with text-to-image models.
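One common way to make "adaptively asks queries under an information-theoretic framework" concrete is to pick the query with the highest expected information gain over a set of intent hypotheses. The sketch below assumes a small discrete hypothesis set with deterministic answers; the feature names, intents, and uniform prior are hypothetical, and this is not claimed to be APE's actual formulation.

```python
import math
from typing import Hashable, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(
    prior: Sequence[float], answers: Sequence[Hashable]
) -> float:
    """Prior entropy minus expected posterior entropy for one query.

    answers[i] is the answer that hypothesis i would give to the query.
    """
    gain = entropy(prior)
    for answer in set(answers):
        mass = sum(p for p, a in zip(prior, answers) if a == answer)
        posterior = [p / mass for p, a in zip(prior, answers) if a == answer]
        gain -= mass * entropy(posterior)
    return gain


# Hypothetical intent hypotheses, each a set of interpretable feature requirements.
intents = [
    {"style": "photo", "palette": "warm", "has_text": "no"},
    {"style": "photo", "palette": "cool", "has_text": "no"},
    {"style": "sketch", "palette": "warm", "has_text": "no"},
    {"style": "sketch", "palette": "warm", "has_text": "yes"},
]
prior = [0.25, 0.25, 0.25, 0.25]

# Score one candidate query per feature and ask about the most informative one.
gains = {
    feature: expected_information_gain(prior, [h[feature] for h in intents])
    for feature in ("style", "palette", "has_text")
}
best = max(gains, key=gains.get)
print(best, gains)
```

Here asking about `style` splits the hypotheses evenly and yields one full bit, whereas the other two queries isolate a single hypothesis and yield less, so `style` is asked first.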

SY · May 4
PowerSINDy: Identifying Nonlinear Time-Dependent Dynamics in Power Grid Frequency

Xinyi Wen, Xiao Li, Leonardo Rydin Gorjão et al.

System identification plays a crucial role in physics and machine learning for discovering governing equations directly from data. A powerful approach is the Sparse Identification of Nonlinear Dynamics (SINDy) method, which assumes that only a few dominant terms drive the essential behavior of a nonlinear dynamical system. While SINDy methods have shown excellent results, they are most often illustrated on synthetic or simulated systems, leaving open the question of how well they perform on complex, noisy, real-world data. Power grid frequency dynamics provide a highly relevant and challenging environment for advancing system identification methods. In this work, we propose PowerSINDy as a framework for empirical power system data. We apply this framework to empirical frequency data from the Continental Europe (CE) and South Korea (SK) synchronous grids, two major power systems with distinct dynamical characteristics. PowerSINDy, which also includes time-dependent terms, can identify the dynamics of these complex real-world systems. Furthermore, we benchmark three sparsity-promoting regression strategies, Sequentially Thresholded Least Squares (STLSQ), Least Absolute Shrinkage and Selection Operator (LASSO), and Sparse Relaxed Regularized Regression (SR3), to evaluate trade-offs between accuracy, sparsity, and robustness. Results show that LASSO consistently achieves the lowest stable RMSEs, reaching 0.0101 for the CE grid, while STLSQ provides the best balance between accuracy and stability. SR3 exhibits higher variability and sensitivity to regularization, with its L0 and L1 penalties producing nearly indistinguishable outcomes.
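Of the three regression strategies named above, STLSQ is the simplest to write down: fit by least squares, zero out coefficients below a threshold, and refit on the surviving terms. The following is a minimal NumPy sketch on synthetic data, not the PowerSINDy code. The candidate library (including a time-dependent column `t`), the true coefficients, the noise level, and the threshold are all assumptions chosen for illustration.

```python
import numpy as np


def stlsq(theta: np.ndarray, dxdt: np.ndarray,
          threshold: float = 0.05, max_iter: int = 10) -> np.ndarray:
    """Sequentially thresholded least squares for a single state variable."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                      # prune weak terms
        active = ~small
        if not active.any():
            break
        # Refit using only the terms that survived thresholding.
        xi[active] = np.linalg.lstsq(theta[:, active], dxdt, rcond=None)[0]
    return xi


# Synthetic data (assumed): dx/dt = -0.5 * x + 0.3 * t, plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
t = rng.uniform(0.0, 1.0, size=200)
dxdt = -0.5 * x + 0.3 * t + 0.001 * rng.standard_normal(200)

# Candidate library: [1, x, x^2, x^3, t]; the last column is time-dependent.
theta = np.column_stack([np.ones_like(x), x, x**2, x**3, t])

xi = stlsq(theta, dxdt)
print(dict(zip(["1", "x", "x^2", "x^3", "t"], np.round(xi, 3))))
```

On this toy problem only the `x` and `t` coefficients should survive thresholding. Real frequency recordings would additionally require estimating `dxdt` from noisy samples, which this sketch sidesteps by generating the derivative directly.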

ML · Jun 8, 2025
ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition

Daolang Huang, Xinyi Wen, Ayush Bharti et al.

Many critical applications, from autonomous scientific discovery to personalized medicine, demand systems that can both strategically acquire the most informative data and instantaneously perform inference based upon it. While amortized methods for Bayesian inference and experimental design offer part of the solution, neither approach is optimal in the most general and challenging task, where new data needs to be collected for instant inference. To tackle this issue, we introduce the Amortized Active Learning and Inference Engine (ALINE), a unified framework for amortized Bayesian inference and active data acquisition. ALINE leverages a transformer architecture trained via reinforcement learning with a reward based on self-estimated information gain provided by its own integrated inference component. This allows it to strategically query informative data points while simultaneously refining its predictions. Moreover, ALINE can selectively direct its querying strategy towards specific subsets of model parameters or designated predictive tasks, optimizing for posterior estimation, data prediction, or a mixture thereof. Empirical results on regression-based active learning, classical Bayesian experimental design benchmarks, and a psychometric model with selectively targeted parameters demonstrate that ALINE delivers both instant and accurate inference along with efficient selection of informative points.
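To make the notion of information gain as a data-acquisition criterion concrete, the sketch below uses a case where it has a closed form: Bayesian linear regression with a Gaussian prior and Gaussian noise, where observing a point with features φ gains 0.5·log(1 + φᵀΣφ/σ²) nats about the weights. This is a classical, non-amortized greedy baseline and not ALINE itself, which instead trains a transformer policy with a self-estimated version of this kind of quantity as reward. The candidate pool, prior covariance, and noise variance are assumptions.

```python
import numpy as np


def information_gain(phi: np.ndarray, cov: np.ndarray, noise_var: float) -> float:
    """Exact information gain (in nats) about the weights from observing phi."""
    return 0.5 * np.log(1.0 + phi @ cov @ phi / noise_var)


def update_covariance(phi: np.ndarray, cov: np.ndarray, noise_var: float) -> np.ndarray:
    """Rank-one posterior covariance update after observing a point at phi."""
    cov_phi = cov @ phi
    return cov - np.outer(cov_phi, cov_phi) / (noise_var + phi @ cov_phi)


# Candidate design points and their features [1, x] (assumed toy setup).
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
features = np.stack([np.ones_like(xs), xs], axis=1)

cov = np.eye(2)      # prior covariance over (intercept, slope)
noise_var = 0.25     # observation noise variance

picks = []
for _ in range(2):
    gains = [information_gain(phi, cov, noise_var) for phi in features]
    best = int(np.argmax(gains))   # greedy: query the most informative point
    picks.append(best)
    cov = update_covariance(features[best], cov, noise_var)

print([float(xs[i]) for i in picks])
```

In the linear-Gaussian case the posterior covariance does not depend on the observed outcome, so the loop needs no simulated labels. The greedy rule first picks one extreme of the pool and then the opposite extreme, which is the expected behaviour for identifying a slope and intercept.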