Nov 23, 2021

Identifying the Units of Measurement in Tabular Data

arXiv:2111.11959v1
1 citation
Originality: Incremental advance
AI Analysis

The work addresses a data-preprocessing challenge faced by researchers and practitioners who work with messy real-world tabular data. Its annotated datasets are intended to enable and accelerate research on units of measurement.

The paper tackles the identification of units of measurement in tabular columns where each entry mixes a numeric value with a unit symbol. It presents PUC, a Probabilistic Unit Canonicalizer that identifies the units and canonicalizes the entries, and it reports better results than existing solutions on newly annotated datasets.

We consider the problem of identifying the units of measurement in a data column that contains both numeric values and unit symbols in each row, e.g., "5.2 l", "7 pints". In this case we seek to identify the dimension of the column (e.g. volume) and relate the unit symbols to valid units (e.g. litre, pint) obtained from a knowledge graph. Below we present PUC, a Probabilistic Unit Canonicalizer that can accurately identify the units of measurement, extract semantic descriptions of quantitative data columns and canonicalize their entries. We present the first messy real-world tabular datasets annotated for units of measurement, which can enable and accelerate the research in this area. Our experiments on these datasets show that PUC achieves better results than existing solutions.
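To make the task concrete, the following is a minimal sketch of a deterministic lookup baseline. It is not PUC and it is not probabilistic. The `UNIT_TABLE` dictionary is a hypothetical, hand-written stand-in for a knowledge graph, and the names `parse_entry` and `canonicalize_column` are introduced here purely for illustration. The sketch splits each entry into a value and a unit symbol, picks the column dimension by majority vote, and converts matching entries to an assumed base unit.

```python
import re
from collections import Counter

# Hypothetical stand-in for a knowledge graph of units (an assumption, not from the paper).
# symbol -> (canonical unit, dimension, factor to the dimension's base unit)
UNIT_TABLE = {
    "l": ("litre", "volume", 1.0),
    "litre": ("litre", "volume", 1.0),
    "ml": ("millilitre", "volume", 0.001),
    "pint": ("pint", "volume", 0.568),   # imperial pint, approximate
    "pints": ("pint", "volume", 0.568),
    "kg": ("kilogram", "mass", 1.0),
    "g": ("gram", "mass", 0.001),
}

# A number followed by an alphabetic unit symbol, e.g. "5.2 l" or "7 pints".
ENTRY = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def parse_entry(text):
    """Return (value, lower-cased symbol), or None if the entry does not match."""
    match = ENTRY.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def canonicalize_column(rows):
    """Return (dimension, entries); each entry is (value, unit, base_value) or None."""
    parsed = [parse_entry(row) for row in rows]

    # Majority vote over the dimensions of all recognised symbols.
    votes = Counter()
    for item in parsed:
        if item is not None and item[1] in UNIT_TABLE:
            votes[UNIT_TABLE[item[1]][1]] += 1
    if not votes:
        return None, [None] * len(rows)
    dimension = votes.most_common(1)[0][0]

    entries = []
    for item in parsed:
        if item is None or item[1] not in UNIT_TABLE:
            entries.append(None)
            continue
        value, symbol = item
        unit, unit_dimension, factor = UNIT_TABLE[symbol]
        if unit_dimension != dimension:
            entries.append(None)  # symbol belongs to a different dimension
        else:
            entries.append((value, unit, value * factor))
    return dimension, entries


dimension, entries = canonicalize_column(["5.2 l", "7 pints", "n/a"])
```

A hard lookup like this cannot weigh competing readings of an ambiguous or misspelled symbol against the rest of the column, which is the kind of uncertainty a probabilistic model is designed to handle.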

Code Implementations: 1 repo