Boyu Xiao

AI
h-index: 7
Papers: 3
Citations: 14
Novelty: 65%
AI Score: 45

3 Papers

22.9 | DB | Mar 16
A New Lower Bounding Paradigm and Tighter Lower Bounds for Elastic Similarity Measures

Zemin Chao, Boyu Xiao, Zitong Li et al.

Elastic similarity measures are fundamental to time series similarity search because they can handle temporal misalignments. These measures are inherently expensive to compute, which makes lower bounds necessary for pruning unnecessary comparisons. This paper proposes a new Bipartite Graph Edge-Cover Paradigm for deriving lower bounds, which applies to a broad class of elastic similarity measures. The paradigm formulates lower bounding as a vertex-weighting problem on a weighted bipartite graph induced from the input time series. Under this paradigm, most existing lower bounds for elastic similarity measures can be viewed as simple instantiations. We further propose BGLB, an instantiation of the paradigm that incorporates an additional augmentation term, yielding provably tighter lower bounds. Theoretical analysis and extensive experiments on 128 real-world datasets demonstrate that BGLB achieves the tightest known lower bounds for six elastic measures (ERP, MSM, TWED, LCSS, EDR, and SWALE). Moreover, BGLB remains highly competitive for DTW, with a favorable trade-off between tightness and computational efficiency. In nearest neighbor search, integrating BGLB into filter pipelines consistently outperforms state-of-the-art methods, achieving speedups ranging from 24.6% to 84.9% across various elastic similarity measures. BGLB also delivers significant acceleration in density-based clustering applications, validating its practical potential in time series similarity search tasks based on elastic similarity measures.
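For readers unfamiliar with how a lower bound prunes comparisons in nearest neighbor search, the following is a minimal sketch. It does not implement BGLB or the edge-cover paradigm from the paper; it assumes banded DTW with squared point costs and uses the classic LB_Keogh envelope bound as a stand-in. The function names (`dtw`, `lb_keogh`, `nn_search`) and the `window` parameter are illustrative choices, not taken from the paper.

```python
import math


def dtw(a, b, window):
    """Squared-cost DTW between equal-length sequences, Sakoe-Chiba band of half-width `window`."""
    n = len(a)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - window), min(n, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]


def lb_keogh(query, candidate, window):
    """LB_Keogh: squared distance from each candidate point to the query's band envelope."""
    n = len(query)
    total = 0.0
    for i in range(n):
        segment = query[max(0, i - window):min(n, i + window + 1)]
        upper, lower = max(segment), min(segment)
        c = candidate[i]
        if c > upper:
            total += (c - upper) ** 2
        elif c < lower:
            total += (c - lower) ** 2
    return total


def nn_search(query, dataset, window):
    """1-NN search that skips the full DTW whenever the lower bound cannot beat the best so far."""
    best_dist, best_idx, pruned = math.inf, -1, 0
    for idx, cand in enumerate(dataset):
        if lb_keogh(query, cand, window) >= best_dist:
            pruned += 1  # the true distance is at least the bound, so this candidate cannot win
            continue
        d = dtw(query, cand, window)
        if d < best_dist:
            best_dist, best_idx = d, idx
    return best_idx, best_dist, pruned
```

Under these assumptions, a tighter bound raises `pruned` and cuts the number of full distance computations, which is the effect the abstract attributes to tighter lower bounds in filter pipelines.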

76.3 | AI | Apr 23
When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

Boyu Xiao, Xiuqi Tian, Xuwen Song et al.

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress-test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose RBED (Role-Based Epistemic Defense), a lightweight inference-time defense, and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

CV | May 15, 2025
UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

Yi Li, Haonan Wang, Qixiang Zhang et al.

Unified multimodal understanding and generation models are rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, these models lack a unified evaluation framework that would enable an elegant, simplified, and overall evaluation. Current models are evaluated on multiple task-specific benchmarks, which have significant limitations: the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (which supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into UniEval's unique values.