Mar 28, 2025

Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation

arXiv:2504.00020v2 | 2 citations | h-index: 6 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific problem for researchers in genomics and disease studies by providing a novel method for single-cell annotation, though it appears incremental as it builds on existing pre-training models with new components.

The paper tackles the challenge of efficiently annotating large, long-tailed single-cell data for diseases by introducing Celler, a generative pre-training model that incorporates a Gaussian Inflation Loss function and a Hard Data Mining strategy, improving predictive accuracy and the model's ability to learn from rare categories.

Recent breakthroughs in single-cell technology have opened unparalleled opportunities to decode the molecular complexity of intricate biological systems, especially those linked to diseases unique to humans. However, these advances have also introduced new obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To address this challenge, we introduce Celler, a state-of-the-art generative pre-training model designed specifically for the annotation of single-cell data. Celler incorporates two key elements. First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting to common categories. Second, we introduce a Hard Data Mining (HDM) strategy into the training process, specifically targeting hard-to-learn minority samples, which significantly improves the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset, Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at https://github.com/AI4science-ym/HiCeller.
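The abstract does not give the GInf Loss formula or the HDM procedure, so the following is only a rough illustration of the two general ideas it names: reweighting samples so rare classes count more, and selecting the highest-loss samples for extra attention. The function names, the inverse-frequency weighting with a `power` exponent, and the `keep_fraction` parameter are all assumptions made for this sketch. They are stand-ins, not the authors' method.

```python
import math
from collections import Counter


def class_weights(labels, power=0.5):
    """Hypothetical reweighting: weight each class by (1 / count) ** power.

    This is a generic inverse-frequency scheme, NOT the GInf Loss itself.
    Weights are normalised so the average per-sample weight equals 1.
    """
    counts = Counter(labels)
    raw = {c: (1.0 / n) ** power for c, n in counts.items()}
    mean_w = sum(raw[y] for y in labels) / len(labels)
    return {c: w / mean_w for c, w in raw.items()}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_losses(logits_batch, labels, weights):
    """Per-sample cross-entropy, scaled by the weight of the true class."""
    losses = []
    for logits, y in zip(logits_batch, labels):
        p_true = softmax(logits)[y]
        losses.append(-weights[y] * math.log(p_true))
    return losses


def mine_hard_examples(losses, keep_fraction=0.5):
    """Return sorted indices of the highest-loss samples (a simple stand-in
    for hard data mining; the paper's HDM strategy may differ)."""
    k = max(1, int(len(losses) * keep_fraction))
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # Toy long-tailed batch: class 0 is common, class 1 is rare.
    labels = [0, 0, 0, 1]
    logits = [[2.0, 0.0]] * 4  # the model always favours class 0
    w = class_weights(labels)
    losses = weighted_losses(logits, labels, w)
    print("weights:", w)
    print("hard indices:", mine_hard_examples(losses, keep_fraction=0.25))
```

In this toy batch the rare-class sample receives both a larger weight and a larger raw loss, so it is the one picked out as hard. A real pipeline would feed the selected indices back into later training batches.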

Code Implementations: 1 repo