CL | AI | Jul 17, 2025

HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Mohammad Shahedur Rahman, Peng Gao, Yuede Ji

arXiv:2507.14240v3 | 13.011 citations | h-index: 1 | CIKM

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for transparency and risk detection in the LLM ecosystem, which is crucial for developers and regulators, though it appears incremental as it builds on existing supply chain concepts.

The researchers tackled the problem of understanding vulnerabilities and biases in large language models by studying their supply chain relationships, creating a directed heterogeneous graph with 402,654 nodes and 462,524 edges to model connections between models and datasets.

Large language models (LLMs) leverage deep learning architectures to process and predict sequences of words, enabling them to perform a wide range of natural language processing tasks, such as translation, summarization, question answering, and content generation. As existing LLMs are often built from base models or other pre-trained models and use external datasets, they can inevitably inherit vulnerabilities, biases, or malicious components that exist in previous models or datasets. Therefore, it is critical to understand these components' origin and development process to detect potential risks, improve model fairness, and ensure compliance with regulatory frameworks. Motivated by that, this project aims to study such relationships between models and datasets, which are the central parts of the LLM supply chain. First, we design a methodology to systematically collect LLMs' supply chain information. Then, we design a new graph to model the relationships between models and datasets, which is a directed heterogeneous graph, having 402,654 nodes and 462,524 edges. Lastly, we perform different types of analysis and make multiple interesting findings.
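To make the graph described in the abstract concrete, here is a minimal sketch of a directed heterogeneous graph with typed nodes (models and datasets) and typed edges, plus a traversal that collects everything a model depends on. It is an illustration only, not the authors' implementation: the class name `SupplyChainGraph`, the relation labels (`trained_on`, `fine_tune`), and the node identifiers are all hypothetical and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SupplyChainGraph:
    """Directed heterogeneous graph: every node and every edge carries a type."""

    # node id -> node type ("model" or "dataset"); labels are assumptions
    node_type: dict = field(default_factory=dict)
    # downstream node id -> list of (upstream node id, relation label)
    in_edges: dict = field(default_factory=dict)

    def add_node(self, node_id, kind):
        self.node_type[node_id] = kind
        self.in_edges.setdefault(node_id, [])

    def add_edge(self, src, dst, relation):
        # Edges point from the upstream component to the one built from it.
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.in_edges[dst].append((src, relation))

    def upstream(self, node_id):
        """Return every node that node_id depends on, directly or transitively."""
        seen = set()
        queue = deque([node_id])
        while queue:
            current = queue.popleft()
            for src, _relation in self.in_edges.get(current, []):
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
        seen.discard(node_id)  # guard against cycles leading back to the start
        return seen


# Hypothetical example data, not drawn from the paper's dataset.
g = SupplyChainGraph()
g.add_node("base-model", "model")
g.add_node("finetuned-model", "model")
g.add_node("corpus-a", "dataset")
g.add_node("instruct-data", "dataset")
g.add_edge("corpus-a", "base-model", "trained_on")
g.add_edge("base-model", "finetuned-model", "fine_tune")
g.add_edge("instruct-data", "finetuned-model", "trained_on")

lineage = g.upstream("finetuned-model")
print(sorted(lineage))
```

Walking the incoming edges this way is one simple means of asking which upstream models and datasets could have passed a vulnerability or bias down to a given model; a graph of the size reported above would more likely be handled with a dedicated graph library or database.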
