Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings
This work provides a practical demonstration for researchers analyzing scientific publications, though it is incremental in applying existing methods to a new dataset.
The researchers tackled the problem of analyzing large text datasets by combining graph structures and LLM embeddings, using the Web of Science dataset with ~56 million publications to reveal a self-structured landscape of texts.
Large text datasets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, which conveys information through its semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
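
To make the two feature types concrete, the following is a minimal toy sketch, not the method proposed in this work, which is not specified here. It assumes each text already has an embedding vector (the three-dimensional values below are invented stand-ins for real LLM embeddings) and a list of references (also invented), and it blends each text's own vector with the mean vector of the texts it references. The names `text_embeddings`, `references`, `neighbor_mean`, `combined`, and the weight `alpha` are all hypothetical.

```python
import math

# Feature type (1): toy stand-ins for LLM text embeddings (invented values).
text_embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.8, 0.2, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}

# Feature type (2): toy citation graph, paper -> papers it references (invented).
references = {
    "paper_a": ["paper_b"],
    "paper_b": [],
    "paper_c": ["paper_a", "paper_b"],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def neighbor_mean(node):
    """Mean embedding of the papers that `node` references.

    Falls back to the paper's own embedding when it has no references.
    """
    nbrs = references[node]
    if not nbrs:
        return list(text_embeddings[node])
    dim = len(text_embeddings[node])
    return [sum(text_embeddings[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]


def combined(node, alpha=0.5):
    """Blend a paper's own embedding with its graph context.

    alpha=1.0 uses only the text; alpha=0.0 uses only the referenced papers.
    """
    own = text_embeddings[node]
    ctx = neighbor_mean(node)
    return [alpha * o + (1 - alpha) * c for o, c in zip(own, ctx)]
```

In this toy data, `paper_a` and `paper_c` have orthogonal text vectors, yet `paper_c` cites both other papers, so `cosine(combined("paper_a"), combined("paper_c"))` is positive while the text-only similarity is zero. This only illustrates how link information can pull related texts together; a real pipeline at the scale of ~56 million publications would need a different, far more efficient design.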