IR | CL | HC | LG | Aug 26, 2024

Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web

arXiv:2408.14636v1 | 2 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This work helps users who discover, use, and share datasets by giving them insight into how datasets relate to one another. The advance is incremental because it builds on existing semantic markup.

The paper tackles the problem of understanding complex relationships between datasets on the Web by developing a taxonomy and machine-learning methods to identify these relationships, achieving 90% multi-class classification accuracy on a large corpus.

The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with schema.org markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.
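To make the idea of metadata-based relationship detection concrete, here is a minimal sketch in Python. It is not the authors' method. It only illustrates how schema.org `Dataset` markup embedded as JSON-LD could be read and turned into simple pairwise features that a classifier might consume. The function names (`extract_datasets`, `pair_features`) and the three features are hypothetical and chosen for illustration; the abstract does not say which features or models the paper uses.

```python
import json
from difflib import SequenceMatcher


def extract_datasets(jsonld_text):
    """Return the schema.org Dataset objects found in a JSON-LD string."""
    data = json.loads(jsonld_text)
    # JSON-LD may be a single object, a list, or an object with an "@graph" list.
    items = data if isinstance(data, list) else data.get("@graph", [data])
    datasets = []
    for item in items:
        if not isinstance(item, dict):
            continue
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "Dataset" in types:
            datasets.append(item)
    return datasets


def _publisher_name(ds):
    """Publisher may be a nested object or a plain string."""
    pub = ds.get("publisher")
    if isinstance(pub, dict):
        pub = pub.get("name")
    return pub if isinstance(pub, str) else ""


def _keywords(ds):
    """Keywords may be a list or a comma-separated string."""
    kw = ds.get("keywords", [])
    if isinstance(kw, str):
        kw = kw.split(",")
    return {k.strip().lower() for k in kw if isinstance(k, str) and k.strip()}


def pair_features(a, b):
    """Illustrative (assumed) features for a pair of dataset metadata records."""
    name_a = str(a.get("name", "")).lower()
    name_b = str(b.get("name", "")).lower()
    pub_a, pub_b = _publisher_name(a), _publisher_name(b)
    return {
        "name_similarity": SequenceMatcher(None, name_a, name_b).ratio(),
        "same_publisher": pub_a != "" and pub_a == pub_b,
        "shared_keywords": len(_keywords(a) & _keywords(b)),
    }
```

Feature dictionaries like these could be passed to any multi-class classifier once labeled pairs are available. The label set would come from the paper's taxonomy, which this summary does not list.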

Code Implementations: 1 repo