Shi Li

h-index20

33papers

128citations

Novelty53%

AI Score55

Ranked #25,981 of 201,326 authors (top 13%)#5,994 in LG (top 14%)

33 Papers

CVApr 1Code

SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Shi Li, Vinkle Srivastav, Nicolas Chanel et al.

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to well navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promises for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy -- Perception, Assessment, and Reasoning -- spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

DSMay 11

Static to Dynamic Correlation Clustering

Nairen Cao, Vincent Cohen-Addad, Euiwoong Lee et al.

Correlation clustering is a well-studied problem, first proposed by Bansal, Blum, and Chawla [Mach. Learn. '04]. The input is an unweighted, undirected graph. The problem is to cluster the vertices so as to minimize the number of edges between vertices in different clusters and missing edges between vertices inside the same cluster. This problem has a wide application in data mining and machine learning. We introduce a general framework that transforms existing static correlation clustering algorithms into fully-dynamic ones that work against an adaptive adversary. We show how to apply our framework to known efficient correlation clustering algorithms, starting from the classic 3-approximate Pivot algorithm from Ailon, Charikar and Newman [JACM'08]. Applied to the most recent sublinear $1.485$-approximation algorithm from Cao, Cohen-Addad, Lee, Li, Lolck, Newman, Thorup, Vogl, Yan and Zhang [STOC'25], we get a $1.485$-approximation fully-dynamic algorithm that works with worst-case constant update time. The original static algorithm gets its approximation factor with constant probability, and we get the same against an adaptive adversary in the sense that for any given update step, not known to our algorithm, our solution is a $1.485$-approximation with constant probability when we reach this update. Most of previous dynamic algorithms, including the celebrated result from Behnezhad, Charikar, Ma and Tan [FOCS'19], had approximation factors around $3$ in expectation, and they could only handle an oblivious adversary. A recent algorithm by Braverman, Dharangutte, Pai, Shah, and Wang [AISTATS'25] could handle an adaptive adversary, but it has a large unspecified constant approximation ratio. This contrasts with our general transformation, which works with all the best approximation factors known for the static case.

DSMay 6

On Tight FPT Time Approximation Algorithms for k-Clustering Problems

Han Dai, Shi Li, Sijin Peng

Following recent advances in combining approximation algorithms with fixed-parameter tractability (FPT), we study FPT-time approximation algorithms for minimum-norm $k$-clustering problems, parameterized by the number $k$ of open facilities. For the capacitated setting, we give a tight $(3+ε)$-approximation for the general-norm capacitated $k$-clustering problem in FPT-time parameterized by $k$ and $ε$. Prior to our work, such a result was only known for the capacitated $k$-median problem [CL, ICALP, 2019]. As a special case, our result yields an FPT-time $3$-approximation for capacitated $k$-center. The problem has not been studied in the FPT-time setting, with the previous best known polynomial-time approximation ratio being 9 [ABCG, MP, 2015]. In the uncapacitated setting, we consider the $top$-$cn$ norm $k$-clustering problem, where the goal of the problem is to minimize the $top$-$cn$ norm of the connection distance vector. Our main result is a tight $\big(1 + \frac 2{ec} + ε\big)$-approximation algorithm for the problem with $c \in \big(\frac1e, 1\big]$. (For the case $c \leq \frac1e$, there is a simple tight $(3+ε)$-approximation.) Our framework can be easily extended to give a tight $\left(3, 1+\frac2e + ε\right)$-bicriteria approximation for the ($k$-center, $k$-median) problem in FPT time, improving the previous best polynomial-time $(4, 8)$ guarantee [AB, WAOA, 2017]. All results are based on a unified framework: computing a $(1+ε)$-approximate solution using $O\left(\frac{k\log n}ε\right)$ facilities $S$ via LP rounding, sampling a few client representatives $R$ based on the solution $S$, guessing a few pivots from $S \cup R$ and some radius information on the pivots, and solving the problem using the guesses. We believe this framework can lead to further results on $k$-clustering problems.

CVJan 18

Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion

Meng Wei, Kun Yuan, Shi Li et al.

Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.

BMMay 22

An accurate nucleic acid-small molecule docking framework via geometric deep learning with large-scale pretraining

Shi Li, Xujun Zhang, Mingquan Liu et al.

Nucleic acids are increasingly recognized as therapeutic targets beyond conventional protein-centered drug discovery, yet accurate and efficient docking of small molecules to nucleic acid structures remains challenging. Physics-based docking methods often show limited accuracy and efficiency, whereas deep learning approaches are constrained by the scarcity of experimentally resolved nucleic acid-ligand complexes. Here, we present NucleoDock, a deep learning framework for nucleic acid-small molecule docking. To address data scarcity, NucleoDock combines physics-guided large-scale pretraining on millions of docking-generated synthetic complexes with fine-tuning on curated experimental co-crystal structures. It further integrates sequence- and structure-informed nucleotide representations with atomistic three-dimensional features to capture both biological context and binding-site geometry. A mixture density network-based geometric scoring head is used to model conditional interaction-distance distributions for pose ranking. On an external benchmark of 125 nucleic acid-ligand complexes, NucleoDock achieved a top-1 success rate of 56 percent at an RMSD cutoff of 2.0 Angstrom, outperforming rDock with 29 percent, while generating 100 poses in approximately 5 seconds per complex. Retrospective virtual screening on the ROBIN benchmark further showed improved early enrichment. NucleoDock represents a step toward bridging the methodological gap between protein- and nucleic acid-directed computational drug discovery.

CVDec 2, 2025

From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan, Min Woo Sun, Zhen Chen et al.

There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

LGMay 18

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

Yuanyun Zhang, Shi Li

Recent healthcare foundation models have achieved strong predictive performance through large scale self supervised learning, yet their latent representations frequently entangle physiologic severity, intervention intensity, observational structure, and institutional workflow into shared embedding directions. While effective for downstream prediction, such representations remain semantically opaque and unstable under contextual shift. We introduce AURORA, Adaptive Uncertainty aware Representations through Orthogonalized Relational Alignment, a new framework for healthcare representation learning based on contextual latent geometry. Rather than optimizing a single unified embedding manifold, AURORA decomposes representations into orthogonal semantic subspaces corresponding to distinct contextual factors and learns relational consistency objectives within each subspace. This induces latent spaces that are both semantically disentangled and geometrically interpretable. Across multiple clinical prediction and retrieval tasks, AURORA consistently outperforms reconstruction, contrastive, and self distillation baselines while substantially improving contextual disentanglement, neighborhood purity, and robustness under institutional distribution shift. Our results suggest that latent geometry itself constitutes an important axis of healthcare foundation model design and that explicitly structuring representation space according to contextual semantics provides a complementary direction beyond conventional predictive compression objectives.

CVJun 25, 2025Code

Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement

Kun Yuan, Tingxuan Chen, Shi Li et al.

The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA

LGMar 21

Discriminative Representation Learning for Clinical Prediction

Yang Zhang, Li Fan, Samuel Lawrence et al.

Foundation models in healthcare have largely adopted self supervised pretraining objectives inherited from natural language processing and computer vision, emphasizing reconstruction and large scale representation learning prior to downstream adaptation. We revisit this paradigm in outcome centric clinical prediction settings and argue that, when high quality supervision is available, direct outcome alignment may provide a stronger inductive bias than generative pretraining. We propose a supervised deep learning framework that explicitly shapes representation geometry by maximizing inter class separation relative to within class variance, thereby concentrating model capacity along clinically meaningful axes. Across multiple longitudinal electronic health record tasks, including mortality and readmission prediction, our approach consistently outperforms masked, autoregressive, and contrastive pretraining baselines under matched model capacity. The proposed method improves discrimination, calibration, and sample efficiency, while simplifying the training pipeline to a single stage optimization. These findings suggest that in low entropy, outcome driven healthcare domains, supervision can act as the statistically optimal driver of representation learning, challenging the assumption that large scale self supervised pretraining is a prerequisite for strong clinical performance.

LGMay 10

WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records

Ruan Dong, Yuanyun Zhang, Shi Li

Representation learning in electronic health records (EHR) has largely followed paradigms inherited from natural language processing, relying on sequence modeling and reconstruction based objectives that treat clinical labels as ground truth. However, real world clinical supervision is inherently weak, arising from heterogeneous, noisy, and institution specific labeling processes such as billing codes, heuristic phenotypes, and incomplete annotations. In this work, we propose WISTERIA, a weakly supervised representation learning framework that models labels as stochastic observations of an underlying latent clinical state. Instead of optimizing against a single supervision signal, WISTERIA constructs multiple weak supervision operators and learns representations by enforcing consistency across their induced label distributions. This multi view formulation induces an implicit denoising mechanism, allowing the model to recover clinically meaningful structure by reconciling disagreement between noisy labelers. We further incorporate ontology aware regularization in the label space to impose semantic structure over supervision signals. Empirically, WISTERIA improves predictive performance across standard EHR benchmarks, demonstrates strong robustness to label noise, and exhibits superior cross institutional generalization compared to sequence based pretraining objectives. These results suggest that explicitly modeling the supervision process rather than treating labels as fixed targets provides a more appropriate inductive bias for learning robust and clinically meaningful representations from EHR data.

LGMay 9

Event Fields: Learning Latent Event Structure for Waveform Foundation Models

Li Na, Yuanyun Zhang, Shi Li

We propose a new class of waveform foundation models that departs from conventional sequence based representations by modeling physiological time series as realizations of latent event processes. Rather than treating signals as collections of local tokens or patches, our approach assumes that clinically meaningful structure arises from temporally extended, interacting events whose boundaries and dynamics are not directly observed. To capture this structure, we introduce a self supervised learning framework that enforces consistency across stochastic segmentations and time frequency projections of the same waveform, encouraging representations that are invariant to signal level perturbations while preserving event level organization. The resulting model combines a segmentation aware encoder with a latent interaction operator that captures dependencies among inferred events, and naturally extends to multimodal settings by aligning modalities through shared event representations. Across a range of physiological benchmarks, including arrhythmia classification, hemodynamic prediction, and waveform retrieval, the proposed method improves performance, robustness, and label efficiency relative to strong sequence based baselines. These results suggest that shifting from signal centric to event centric representations provides a more appropriate inductive bias for modeling physiological dynamics and offers a complementary path to scaling foundation models in healthcare.

MTRL-SCIMay 4

From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Aritra Roy, Kevin Shen, Andrew MacBride et al.

Large language models (LLMs) are rapidly changing how researchers in materials science and chemistry discover, organize, and act on scientific knowledge. This paper analyzes a broad set of community-developed LLM applications in an effort to identify emerging patterns in how these systems can be used across the scientific research lifecycle. We organize the projects into two complementary categories: Knowledge Infrastructure, systems that structure, retrieve, synthesize, and validate scientific information; and Action Systems, systems that execute, coordinate, or automate scientific work across computational and experimental environments. The submissions reveal a shift from single-purpose LLM tools toward integrated, multi-agent workflows that combine retrieval, reasoning, tool use, and domain-specific validation. Prominent themes include retrieval-augmented generation as grounding infrastructure, persistent structured knowledge representations, multimodal and multilingual scientific inputs, and early progress toward laboratory-integrated closed-loop systems. Together, these results suggest that LLMs are evolving from general-purpose assistants into composable infrastructure for scientific reasoning and action. This work provides a community snapshot of that transition and a practical taxonomy for understanding emerging LLM-enabled workflows in materials science and chemistry.

DSApr 27

New Convex Programming Technique for Nash Social Welfare and Scheduling

Yuda Feng, Weijiang Hu, Shi Li

We propose a new convex programming relaxation for the weighted Nash social welfare (NSW) problem that achieves a matching $(e^{1/e}\approx 1.445)$-approximation via the rounding algorithm of Feng and Li. Unlike the exponential-size configuration LP used in prior work, our formulation can be converted into a compact linear program of polynomial size, incurring only an additive loss of $\ln(1+ε)$ in the objective. This allows the program to be solved directly using standard LP solvers, without the ellipsoid method or dual separation oracles. In the unweighted case, we show that our convex program is equivalent to the restricted-spending Fisher market convex program of Cole and Gkatzelis, yielding a constructive proof that its integrality gap is exactly $e^{1/e}$. With a minor modification, our analysis also gives a simple proof of the $e^{1/e}$ EF1 gap for the identical agent setting. Finally, we show that our convex programming technique extends to two unrelated machine scheduling problems, recovering the best-known approximation ratios with simpler analyses.

SPMay 2, 2024

Machine Learning in Short-Reach Optical Systems: A Comprehensive Survey

Chen Shao, Elias Giacoumidis, Syed Moktacim Billah et al.

In recent years, extensive research has been conducted to explore the utilization of machine learning algorithms in various direct-detected and self-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic prediction, and digital signal processing (DSP)-based equalization. As a versatile approach, machine learning demonstrates the ability to address stochastic phenomena in optical systems networks where deterministic methods may fall short. However, when it comes to DSP equalization algorithms, their performance improvements are often marginal, and their complexity is prohibitively high, especially in cost-sensitive short-reach communications scenarios such as passive optical networks (PONs). They excel in capturing temporal dependencies, handling irregular or nonlinear patterns effectively, and accommodating variable time intervals. Within this extensive survey, we outline the application of machine learning techniques in short-reach communications, specifically emphasizing their utilization in high-bandwidth demanding PONs. Notably, we introduce a novel taxonomy for time-series methods employed in machine learning signal processing, providing a structured classification framework. Our taxonomy categorizes current time series methods into four distinct groups: traditional methods, Fourier convolution-based methods, transformer-based models, and time-series convolutional networks. Finally, we highlight prospective research directions within this rapidly evolving field and outline specific solutions to mitigate the complexity associated with hardware implementations. We aim to pave the way for more practical and efficient deployment of machine learning approaches in short-reach optical communication systems by addressing complexity concerns.

LGApr 5

Uncertainty-Aware Foundation Models for Clinical Data

Qian Zhou, Yuanyun Zhang, Shi Li

Healthcare foundation models have largely followed paradigms from natural language processing and computer vision, emphasizing large scale pretraining and deterministic representations over heterogeneous clinical data. However, clinical observations are inherently incomplete, reflecting sparse, irregular, and modality dependent measurements of an underlying physiologic state. In this work, we propose a framework for uncertainty aware foundation modeling that represents each patient not as a point embedding, but as a distribution over plausible latent states. By learning set valued representations and enforcing consistency across partial views of the same patient, the model captures what is invariantly inferable while explicitly encoding epistemic uncertainty. We integrate this formulation with multimodal encoders and scalable self supervised objectives, combining reconstruction, contrastive alignment, and distributional regularization. Across diverse clinical tasks, our approach improves predictive performance, robustness under missing data, and uncertainty calibration relative to strong baselines. These results suggest that modeling what is not observed rather than only what is constitutes a critical inductive bias for healthcare foundation models.

SPApr 25, 2024

A Novel Machine Learning-based Equalizer for a Downstream 100G PAM-4 PON

Chen Shao, Elias Giacoumidis, Shi Li et al.

A frequency-calibrated SCINet (FC-SCINet) equalizer is proposed for down-stream 100G PON with 28.7 dB path loss. At 5 km, FC-SCINet improves the BER by 88.87% compared to FFE and a 3-layer DNN with 10.57% lower complexity.

LGMar 7

Learning Clinical Representations Under Systematic Distribution Shift

Yuanyun Zhang, Shi Li

Clinical machine learning models are increasingly trained using large scale, multimodal foundation paradigms, yet deployment environments often differ systematically from the data generating settings used during training. Such shifts arise from heterogeneous measurement policies, documentation practices, and institutional workflows, leading to representation entanglement between physiologic signal and practice specific artifacts. In this work, we propose a practice invariant representation learning framework for multimodal clinical prediction. We model clinical observations as arising from latent physiologic factors and environment dependent processes, and introduce an objective that jointly optimizes predictive performance while suppressing environment predictive information in the learned embedding. Concretely, we combine supervised risk minimization with adversarial environment regularization and invariant risk penalties across hospitals. Across multiple longitudinal EHR prediction tasks and cross institution evaluations, our method improves out of distribution AUROC by up to 2 to 3 points relative to masked pretraining and standard supervised baselines, while maintaining in distribution performance and improving calibration. These results demonstrate that explicitly accounting for systematic distribution shift during representation learning yields more robust and transferable clinical models, highlighting the importance of structural invariance alongside architectural scale in healthcare AI.

LGMay 4, 2024

Advanced Equalization in 112 Gb/s Upstream PON Using a Novel Fourier Convolution-based Network

Chen Shao, Elias Giacoumidis, Patrick Matalla et al.

We experimentally demonstrate a novel, low-complexity Fourier Convolution-based Network (FConvNet) based equalizer for 112 Gb/s upstream PAM4-PON. At a BER of 0.005, FConvNet enhances the receiver sensitivity by 2 and 1 dB compared to a 51-tap Sato equalizer and benchmark machine learning algorithms respectively.

DSApr 7

$k$-Clustering via Iterative Randomized Rounding

JarosÅaw Byrka, Yuhao Guo, Yang Hu et al.

In this work we propose a single rounding algorithm for the fractional solutions of the standard LP relaxation for $k$-clustering. As a starting point, we obtain an iterative rounding $(\frac{3^p + 1}{2})$-Lagrangian Multiplier-Perserving (LMP) approximation for the $k$-clustering problem with the cost function being the $p$-th power of the distance. Such an algorithm outputs a random solution that opens $k$ facilities \emph{in expectation}, whose cost in expectation is at most $\frac{3^p + 1}{2}$ times the optimum cost. Thus, we recover the $2$-LMP approximation for $k$-median by Jain et al.~[JACM'03], which played a central role in deriving the current best $2$ approximation for $k$-median. Unlike the result of Jain et al., our algorithm is based on LP rounding, and it can be easily adapted to the $L_p^p$-cost setting. For the Euclidean $k$-means problem, the LMP factor we obtain is $\frac{11}{3}$, which is better than the $5$ approximation given by this framework for general metrics. Then, we show how to convert the LMP-approximation algorithms to a true-approximation, with only a $(1+\varepsilon)$ factor loss in the approximation ratio. We obtain a ($\frac{3^p + 1}{2}+\varepsilon$)-approximation algorithm for $k$-clustering with cost function being the $p$-th power of the distance, for $p \geq 1$. This reproduces the best known ($2+\varepsilon$)-approximation for $k$-median by Cohen-Addad et al. [STOC'25], and improves the approximation factor for metric $k$-means from 5.83 by Charikar at al. [FOCS'25] to $5+\varepsilon$ in our framework. Moreover, the same algorithm, but with a specialized analysis, attains ($4+\varepsilon$)-approximation for Euclidean $k$-means matching the recent result by Charikar et al. [STOC'26].

LGApr 10, 2025

ChronoFormer: Time-Aware Transformer Architectures for Structured Clinical Event Modeling

Yuanyun Zhang, Shi Li

The temporal complexity of electronic health record (EHR) data presents significant challenges for predicting clinical outcomes using machine learning. This paper proposes ChronoFormer, an innovative transformer based architecture specifically designed to encode and leverage temporal dependencies in longitudinal patient data. ChronoFormer integrates temporal embeddings, hierarchical attention mechanisms, and domain specific masking techniques. Extensive experiments conducted on three benchmark tasks mortality prediction, readmission prediction, and long term comorbidity onset demonstrate substantial improvements over current state of the art methods. Furthermore, detailed analyses of attention patterns underscore ChronoFormer's capability to capture clinically meaningful long range temporal relationships.

CYFeb 25, 2025

A Collection of Innovations in Medical AI for patient records in 2024

Yuanyun Zhang, Shi Li

The field of Artificial Intelligence in healthcare is evolving at an unprecedented pace, driven by rapid advancements in machine learning and the recent breakthroughs in large language models. While these innovations hold immense potential to transform clinical decision making, diagnostics, and patient care, the accelerating speed of AI development has outpaced traditional academic publishing cycles. As a result, many scholarly contributions quickly become outdated, failing to capture the latest state of the art methodologies and their real world implications. This paper advocates for a new category of academic publications an annualized citation framework that prioritizes the most recent AI driven healthcare innovations. By systematically referencing the breakthroughs of the year, such papers would ensure that research remains current, fostering a more adaptive and informed discourse. This approach not only enhances the relevance of AI research in healthcare but also provides a more accurate reflection of the fields ongoing evolution.

DSOct 12, 2025

Learning-Augmented Streaming Algorithms for Correlation Clustering

Yinhao Dong, Shan Jiang, Shi Li et al.

We study streaming algorithms for Correlation Clustering. Given a graph as an arbitrary-order stream of edges, with each edge labeled as positive or negative, the goal is to partition the vertices into disjoint clusters, such that the number of disagreements is minimized. In this paper, we give the first learning-augmented streaming algorithms for the problem on both complete and general graphs, improving the best-known space-approximation tradeoffs. Based on the works of Cambus et al. (SODA'24) and Ahn et al. (ICML'15), our algorithms use the predictions of pairwise distances between vertices provided by a predictor. For complete graphs, our algorithm achieves a better-than-$3$ approximation under good prediction quality, while using $\tilde{O}(n)$ total space. For general graphs, our algorithm achieves an $O(\log |E^-|)$ approximation under good prediction quality using $\tilde{O}(n)$ total space, improving the best-known non-learning algorithm in terms of space efficiency. Experimental results on synthetic and real-world datasets demonstrate the superiority of our proposed algorithms over their non-learning counterparts.

CLOct 11, 2025

Serialized EHR make for good text representations

Zhirong Chou, Quan Qin, Shi Li

The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.

CVJun 18, 2025

A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds

Di Wang, Shi Li

Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees,depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation,leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.

IRJun 1, 2025

Structured Semantics from Unstructured Notes: Language Model Approaches to EHR-Based Decision Support

Wu Hao Ran, Xi Xi, Furong Li et al.

The advent of large language models (LLMs) has opened new avenues for analyzing complex, unstructured data, particularly within the medical domain. Electronic Health Records (EHRs) contain a wealth of information in various formats, including free text clinical notes, structured lab results, and diagnostic codes. This paper explores the application of advanced language models to leverage these diverse data sources for improved clinical decision support. We will discuss how text-based features, often overlooked in traditional high dimensional EHR analysis, can provide semantically rich representations and aid in harmonizing data across different institutions. Furthermore, we delve into the challenges and opportunities of incorporating medical codes and ensuring the generalizability and fairness of AI models in healthcare.

CLApr 25, 2025

Temporal Entailment Pretraining for Clinical Language Models over EHR Data

Tatsunori Tanaka, Fi Zheng, Kai Sato et al.

Clinical language models have achieved strong performance on downstream tasks by pretraining on domain specific corpora such as discharge summaries and medical notes. However, most approaches treat the electronic health record as a static document, neglecting the temporally-evolving and causally entwined nature of patient trajectories. In this paper, we introduce a novel temporal entailment pretraining objective for language models in the clinical domain. Our method formulates EHR segments as temporally ordered sentence pairs and trains the model to determine whether a later state is entailed by, contradictory to, or neutral with respect to an earlier state. Through this temporally structured pretraining task, models learn to perform latent clinical reasoning over time, improving their ability to generalize across forecasting and diagnosis tasks. We pretrain on a large corpus derived from MIMIC IV and demonstrate state of the art results on temporal clinical QA, early warning prediction, and disease progression modeling.

LGMar 5, 2025

An Optimization Algorithm for Multimodal Data Alignment

Wei Zhang, Xinyue Wang, Lan Yu et al.

In the data era, the integration of multiple data types, known as multimodality, has become a key area of interest in the research community. This interest is driven by the goal to develop cutting edge multimodal models capable of serving as adaptable reasoning engines across a wide range of modalities and domains. Despite the fervent development efforts, the challenge of optimally representing different forms of data within a single unified latent space a crucial step for enabling effective multimodal reasoning has not been fully addressed. To bridge this gap, we introduce AlignXpert, an optimization algorithm inspired by Kernel CCA crafted to maximize the similarities between N modalities while imposing some other constraints. This work demonstrates the impact on improving data representation for a variety of reasoning tasks, such as retrieval and classification, underlining the pivotal importance of data representation.

LGMar 5, 2025

Exploring Neural Ordinary Differential Equations as Interpretable Healthcare classifiers

Shi Li

Deep Learning has emerged as one of the most significant innovations in machine learning. However, a notable limitation of this field lies in the ``black box" decision-making processes, which have led to skepticism within groups like healthcare and scientific communities regarding its applicability. In response, this study introduces a interpretable approach using Neural Ordinary Differential Equations (NODEs), a category of neural network models that exploit the dynamics of differential equations for representation learning. Leveraging their foundation in differential equations, we illustrate the capability of these models to continuously process textual data, marking the first such model of its kind, and thereby proposing a promising direction for future research in this domain. The primary objective of this research is to propose a novel architecture for groups like healthcare that require the predictive capabilities of deep learning while emphasizing the importance of model transparency demonstrated in NODEs.

MLOct 19, 2020

Robust High Dimensional Expectation Maximization Algorithm via Trimmed Hard Thresholding

Di Wang, Xiangyu Guo, Shi Li et al.

In this paper, we study the problem of estimating latent variable models with arbitrarily corrupted samples in high dimensional space ({\em i.e.,} $d\gg n$) where the underlying parameter is assumed to be sparse. Specifically, we propose a method called Trimmed (Gradient) Expectation Maximization which adds a trimming gradients step and a hard thresholding step to the Expectation step (E-step) and the Maximization step (M-step), respectively. We show that under some mild assumptions and with an appropriate initialization, the algorithm is corruption-proofing and converges to the (near) optimal statistical rate geometrically when the fraction of the corrupted samples $ε$ is bounded by $ \tilde{O}(\frac{1}{\sqrt{n}})$. Moreover, we apply our general framework to three canonical models: mixture of Gaussians, mixture of regressions and linear regression with missing covariates. Our theory is supported by thorough numerical results.

LGOct 19, 2020

Estimating Stochastic Linear Combination of Non-linear Regressions Efficiently and Scalably

Di Wang, Xiangyu Guo, Chaowen Guan et al.

Recently, many machine learning and statistical models such as non-linear regressions, the Single Index, Multi-index, Varying Coefficient Index Models and Two-layer Neural Networks can be reduced to or be seen as a special case of a new model which is called the \textit{Stochastic Linear Combination of Non-linear Regressions} model. However, due to the high non-convexity of the problem, there is no previous work study how to estimate the model. In this paper, we provide the first study on how to estimate the model efficiently and scalably. Specifically, we first show that with some mild assumptions, if the variate vector $x$ is multivariate Gaussian, then there is an algorithm whose output vectors have $\ell_2$-norm estimation errors of $O(\sqrt{\frac{p}{n}})$ with high probability, where $p$ is the dimension of $x$ and $n$ is the number of samples. The key idea of the proof is based on an observation motived by the Stein's lemma. Then we extend our result to the case where $x$ is bounded and sub-Gaussian using the zero-bias transformation, which could be seen as a generalization of the classic Stein's lemma. We also show that with some additional assumptions there is an algorithm whose output vectors have $\ell_\infty$-norm estimation errors of $O(\frac{1}{\sqrt{p}}+\sqrt{\frac{p}{n}})$ with high probability. We also provide a concrete example to show that there exists some link function which satisfies the previous assumptions. Finally, for both Gaussian and sub-Gaussian cases we propose a faster sub-sampling based algorithm and show that when the sub-sample sizes are large enough then the estimation errors will not be sacrificed by too much. Experiments for both cases support our theoretical results. To the best of our knowledge, this is the first work that studies and provides theoretical guarantees for the stochastic linear combination of non-linear regressions model.

DSAug 13, 2020

Consistent $k$-Median: Simpler, Better and Robust

Xiangyu Guo, Janardhan Kulkarni, Shi Li et al.

In this paper we introduce and study the online consistent $k$-clustering with outliers problem, generalizing the non-outlier version of the problem studied in [Lattanzi-Vassilvitskii, ICML17]. We show that a simple local-search based online algorithm can give a bicriteria constant approximation for the problem with $O(k^2 \log^2 (nD))$ swaps of medians (recourse) in total, where $D$ is the diameter of the metric. When restricted to the problem without outliers, our algorithm is simpler, deterministic and gives better approximation ratio and recourse, compared to that of [Lattanzi-Vassilvitskii, ICML17].

DSOct 26, 2019

Facility Location Problem in Differential Privacy Model Revisited

Yunus Esencayi, Marco Gaboardi, Shi Li et al.

In this paper we study the uncapacitated facility location problem in the model of differential privacy (DP) with uniform facility cost. Specifically, we first show that, under the hierarchically well-separated tree (HST) metrics and the super-set output setting that was introduced in Gupta et. al., there is an $ε$-DP algorithm that achieves an $O(\frac{1}ε)$(expected multiplicative) approximation ratio; this implies an $O(\frac{\log n}ε)$ approximation ratio for the general metric case, where $n$ is the size of the input metric. These bounds improve the best-known results given by Gupta et. al. In particular, our approximation ratio for HST-metrics is independent of $n$, and the ratio for general metrics is independent of the aspect ratio of the input metric. On the negative side, we show that the approximation ratio of any $ε$-DP algorithm is lower bounded by $Ω(\frac{1}{\sqrtε})$, even for instances on HST metrics with uniform facility cost, under the super-set output setting. The lower bound shows that the dependence of the approximation ratio for HST metrics on $ε$ can not be removed or greatly improved. Our novel methods and techniques for both the upper and lower bound may find additional applications.

DCOct 18, 2018

Distributed $k$-Clustering for Data with Heavy Noise

Xiangyu Guo, Shi Li

In this paper, we consider the $k$-center/median/means clustering with outliers problems (or the $(k, z)$-center/median/means problems) in the distributed setting. Most previous distributed algorithms have their communication costs linearly depending on $z$, the number of outliers. Recently Guha et al. overcame this dependence issue by considering bi-criteria approximation algorithms that output solutions with $2z$ outliers. For the case where $z$ is large, the extra $z$ outliers discarded by the algorithms might be too large, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible $(1+ε)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$. The problems we consider include the $(k, z)$-center problem, and $(k, z)$-median/means problems in Euclidean metrics. Implementation of the our algorithm for $(k, z)$-center shows that it outperforms many previous algorithms, both in terms of the communication cost and quality of the output solution.