QM | LG | Mar 26, 2025

The cell as a token: high-dimensional geometry in language models and cell embeddings

arXiv:2503.20278v2 | 1 citation | h-index: 3 | Bioinform.
Originality: Synthesis-oriented
AI Analysis

This is an incremental review connecting two fields (language models and cell embeddings) without presenting new experimental results.

This review explores how insights from natural language embedding structures can inform the analysis of single-cell datasets, highlighting how developments in language foundation models could improve the construction of cell atlases and training of virtual cell models.

Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently proposed virtual cell models extend this concept, enriching cells' representations based on patterns learned from pretraining on vast cell atlases. This review explores how advances in understanding the structure of natural language embeddings inform ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space's robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual cell models.
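
The idea of treating a cell like a token, that is, as a point in a vector space that can be compared with other points, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and does not come from the review: the cell names, the four-dimensional toy vectors (real embeddings have hundreds or thousands of dimensions), and the choice of cosine similarity as the comparison, which is common for language embeddings but is only one of several reasonable metrics.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_neighbor(name, embeddings):
    """Return (other_name, similarity) for the most similar other entry."""
    query = embeddings[name]
    scored = (
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != name
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy "cell embeddings": made-up values, not real expression data.
cells = {
    "cell_a": [5.0, 0.0, 1.0, 0.0],
    "cell_b": [4.0, 0.0, 2.0, 0.0],
    "cell_c": [0.0, 3.0, 0.0, 6.0],
}

neighbor, similarity = nearest_neighbor("cell_a", cells)
print(neighbor, round(similarity, 3))
```

The same nearest-neighbor lookup applies unchanged to word or token embeddings, which is the kind of parallel between the two fields that the abstract draws.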
