The cell as a token: high-dimensional geometry in language models and cell embeddings
This is an incremental review connecting two fields (language models and cell embeddings) without presenting new experimental results.
This review explores how insights from natural language embedding structures can inform the analysis of single-cell datasets, highlighting how developments in language foundation models could improve the construction of cell atlases and training of virtual cell models.
Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently proposed virtual cell models extend this concept, enriching cells' representations based on patterns learned from pretraining on vast cell atlases. This review explores how advances in understanding the structure of natural language embeddings inform ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space's robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual cell models.