CL · Jul 3, 2025

GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

arXiv:2507.02221v2 · h-index: 11 · Has Code · Bioinform Adv
Originality: Synthesis-oriented
AI Analysis

This tool addresses the difficulty that cancer genomics researchers, especially new users, face in navigating complex data fields to create cohorts. It is, however, an incremental application of existing methods to a specific domain.

The paper tackles the challenge of curating patient cohorts from the Genomic Data Commons (GDC). It introduces GDC Cohort Copilot, an AI tool that converts natural language descriptions into GDC cohort filters, and reports that the authors' locally served, open-source LLM outperforms GPT-4o prompting at generating cohorts.

The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
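To make the idea of a "GDC cohort filter" concrete, here is a minimal Python sketch of the nested operator/content JSON structure that the GDC API uses for filters. It is an illustration only and not the authors' code: the helper names `in_clause` and `build_cohort_filter` are hypothetical, and the fields and values are example choices. A description such as "female patients in TCGA-BRCA" might correspond to a filter like this one:

```python
import json


def in_clause(field, values):
    """Build one 'in' clause: the field must match one of the given values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_cohort_filter(criteria):
    """Combine per-field criteria into a single 'and' filter.

    `criteria` maps a field name to a list of accepted values.
    This helper is hypothetical and only illustrates the filter shape.
    """
    return {
        "op": "and",
        "content": [in_clause(field, values) for field, values in criteria.items()],
    }


# Example criteria (illustrative field names and values).
criteria = {
    "cases.project.project_id": ["TCGA-BRCA"],
    "cases.demographic.gender": ["female"],
}

cohort_filter = build_cohort_filter(criteria)
print(json.dumps(cohort_filter, indent=2))
```

In the tool described above, the LLM generates a filter of this kind directly from the free-text description. The sketch only shows what the target structure looks like.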

Code Implementations: 1 repo
