LG | Jan 21 | Code
Efficient Imputation for Patch-based Missing Single-cell Data via Cluster-regularized Optimal Transport
Yuyu Liu, Jiannan Yang, Ziyang Yu et al.
Missing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT, an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available at Anomalous Github.
SI | Apr 13 | Code
Identifying Disruptive Models in the Open-Source LLM Community
Xiaoting Wei, Lele Kang, Xuelian Pan et al.
The rapid growth of open-source large language models (LLMs) has created a complex ecosystem of model inheritance and reuse. However, existing research has focused mainly on descriptive analyses of lineage evolution, with limited attention to identifying which models play a disruptive role in shaping subsequent development. Using metadata from 2,556,240 models on Hugging Face, this study reconstructs a large-scale lineage network and introduces the Model Disruption Index (MDI) to distinguish between models that reinforce existing technological trajectories and those that become new bases for later development. The results show that most models in the open-source LLM community are consolidative rather than disruptive, reflecting a highly concentrated and path-dependent evolutionary structure. Further analyses suggest that disruptive positions are more likely to emerge among large-scale models and through finetuning strategies. Overall, this study provides a new perspective for identifying disruptive models and understanding uneven technological development in open-source LLM ecosystems.
SI | Apr 15 | Code
Racing to Release: Priority, Congestion, and Community Recognition in Open-Source LLM Ecosystems
Bin Liu, Lele Kang, Jiannan Yang
Open-source large language models have made platforms such as Hugging Face central hubs for decentralized AI innovation. Yet these ecosystems are shaped not only by collaboration, but also by competition for priority and community attention. Drawing on Hill and Stein's Race-to-the-Bottom framework, this study extends the logic of project potential, maturation, competition, and quality from scientific production to open-source LLM ecosystems, where prominent base models attract concentrated derivative entry under rapid and highly visible platform feedback. Using a large-scale sample of derivative models on Hugging Face, we find that later releases and more crowded competitive environments are both associated with weaker community recognition, even after accounting for differences in model and ecosystem prominence. These findings suggest that competition for priority remains an important organizing force in open-source LLM ecosystems, shaping which derivative innovations receive community recognition.
LG | Dec 8, 2025
Self-Supervised Learning on Molecular Graphs: A Systematic Investigation of Masking Design
Jiannan Yang, Veronika Thost, Tengfei Ma
Self-supervised learning (SSL) plays a central role in molecular representation learning. Yet many recent innovations in masking-based pretraining are introduced as heuristics and lack principled evaluation, obscuring which design choices are genuinely effective. This work casts the entire pretrain-finetune workflow into a unified probabilistic framework, enabling a transparent comparison and deeper understanding of masking strategies. Building on this formalism, we conduct a rigorously controlled study of three core design dimensions: masking distribution, prediction target, and encoder architecture. We further employ information-theoretic measures to assess the informativeness of pretraining signals and connect them to empirically benchmarked downstream performance. Our findings reveal a surprising insight: sophisticated masking distributions offer no consistent benefit over uniform sampling for common node-level prediction tasks. Instead, the choice of prediction target and its synergy with the encoder architecture are far more critical. Specifically, shifting to semantically richer targets yields substantial downstream improvements, particularly when paired with expressive Graph Transformer encoders. These insights offer practical guidance for developing more effective SSL methods for molecular graphs.
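To make the "masking distribution" dimension concrete, the following is a minimal sketch of how masked nodes might be drawn either uniformly or from a non-uniform distribution. It is not the authors' implementation: the function names `sample_mask` and `apply_mask`, the 15% mask rate, the `-1` mask token, and the use of the atom type as the prediction target are all assumptions chosen for illustration.

```python
import random


def sample_mask(num_nodes, mask_rate=0.15, weights=None, seed=None):
    """Choose node indices to mask, without replacement.

    weights=None gives uniform sampling; otherwise nodes are drawn with
    probability related to their (hypothetical) per-node weight.
    The 0.15 default rate is an assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    k = max(1, round(mask_rate * num_nodes))
    if weights is None:
        return sorted(rng.sample(range(num_nodes), k))
    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # each node gets key u ** (1 / w); the k largest keys are selected.
    # Nodes with non-positive weight are never selected, so fewer than
    # k nodes may be returned if too few weights are positive.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


def apply_mask(atom_types, masked, mask_token=-1):
    """Replace masked atoms with a mask token and record their true types.

    The prediction target here is the plain atom type, used only for
    simplicity; richer targets would change just the `targets` mapping.
    """
    masked_set = set(masked)
    inputs = [mask_token if i in masked_set else a for i, a in enumerate(atom_types)]
    targets = {i: atom_types[i] for i in masked}
    return inputs, targets


# Heavy atoms of a toy molecule given by atomic number (C, C, O, N, C, C).
atoms = [6, 6, 8, 7, 6, 6]

uniform = sample_mask(len(atoms), mask_rate=0.5, seed=0)
# Hypothetical non-uniform scheme: heteroatoms are 5x more likely to be masked.
hetero_weights = [5.0 if a != 6 else 1.0 for a in atoms]
biased = sample_mask(len(atoms), mask_rate=0.5, weights=hetero_weights, seed=0)

inputs, targets = apply_mask(atoms, uniform)
```

Under this framing, swapping the masking distribution changes only the `weights` argument, while the prediction target and the encoder are separate choices, which is the separation the abstract's three design dimensions rely on.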