IR · Oct 13, 2018

Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories

arXiv:1810.05784v1
Originality: Synthesis-oriented
AI Analysis

This paper addresses the difficulty scientists face in locating and using relevant data in large, complex repositories. It appears incremental, since it builds on existing clustering methods.

The paper tackles disorganization in large scientific data repositories with an automated clustering pipeline that processes heterogeneous filetypes and computes a novel cleanliness score, which the authors report is more consistent than other candidate measures.

As scientific data repositories and filesystems grow in size and complexity, they become increasingly disorganized. The coupling of massive quantities of data with poor organization makes it challenging for scientists to locate and utilize relevant data, thus slowing the process of analyzing data of interest. To address these issues, we explore an automated clustering approach for quantifying the organization of data repositories. Our parallel pipeline processes heterogeneous filetypes (e.g., text and tabular data), automatically clusters files based on content and metadata similarities, and computes a novel "cleanliness" score from the resulting clustering. We demonstrate the generation and accuracy of our cleanliness measure using both synthetic and real datasets, and conclude that it is more consistent than other potential cleanliness measures.
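To make the idea of scoring organization from a clustering concrete, here is a minimal sketch. It is not the paper's cleanliness score, which the abstract does not define. It assumes each file already has a content-based cluster label and swaps in a simple purity-style measure: the fraction of files that belong to the majority cluster of their own directory. The function name `directory_purity` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def directory_purity(assignments):
    """Purity-style organization score in (0, 1].

    `assignments` is an iterable of (directory, cluster_label) pairs,
    one pair per file. This is an illustrative stand-in, NOT the
    cleanliness score defined in the paper.
    """
    # Count how many files of each cluster live in each directory.
    by_dir = defaultdict(Counter)
    for directory, label in assignments:
        by_dir[directory][label] += 1

    total = sum(sum(counts.values()) for counts in by_dir.values())
    if total == 0:
        raise ValueError("no files given")

    # A file counts as "well placed" if it shares its directory's
    # most common cluster label.
    well_placed = sum(max(counts.values()) for counts in by_dir.values())
    return well_placed / total


# Hypothetical inputs: a tidy layout and a single catch-all directory.
tidy = [("genomes", 0), ("genomes", 0), ("logs", 1), ("logs", 1)]
messy = [("misc", 0), ("misc", 1), ("misc", 2), ("misc", 3)]

print(directory_purity(tidy))
print(directory_purity(messy))
```

One known weakness of a purity-style measure is that it reaches its maximum trivially when every file sits in its own directory, so it should be read only as an illustration of the general approach.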

