LGDB · Nov 23, 2021

ptype-cat: Inferring the Type and Values of Categorical Variables

arXiv:2111.11956v1
Originality: Incremental advance
AI Analysis

ptype-cat addresses a specific data preprocessing bottleneck for data scientists and analysts by automating the detection of categorical variables. The advance is incremental, since it builds on the existing ptype method.

The paper tackles the problem of type inference for non-Boolean categorical variables in data columns, which existing methods mislabel as integers or strings, and proposes ptype-cat to identify these types and their possible values, achieving better results than current solutions.

Type inference is the task of identifying the type of values in a data column and has been studied extensively in the literature. Most existing type inference methods support data types such as Boolean, date, float, integer and string. However, these methods do not consider non-Boolean categorical variables, where there are more than two possible values encoded by integers or strings. Therefore, such columns are annotated either as integer or string rather than categorical, and need to be transformed into categorical manually by the user. In this paper, we propose a probabilistic type inference method that can identify the general categorical data type (including non-Boolean variables). Additionally, we identify the possible values of each categorical variable by adapting the existing type inference method ptype. Combining these methods, we present ptype-cat which achieves better results than existing applicable solutions.
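To make the problem concrete, here is a minimal sketch of a naive cardinality-based check that flags a column as categorical and lists its values. It is not the probabilistic model of ptype-cat or ptype. The function name `naive_categorical_guess` and the thresholds `max_unique` and `max_ratio` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def naive_categorical_guess(values, max_unique=10, max_ratio=0.2):
    """Hypothetical baseline, NOT ptype-cat: guess 'categorical' from cardinality.

    A column is flagged when it has at most `max_unique` distinct values and
    those values make up at most `max_ratio` of the non-missing entries.
    Both thresholds are arbitrary assumptions.
    """
    # Treat empty strings and None as missing entries.
    non_missing = [v for v in values if v not in ("", None)]
    if not non_missing:
        return False, []

    counts = Counter(non_missing)
    n_unique = len(counts)
    is_categorical = (
        n_unique <= max_unique and n_unique / len(non_missing) <= max_ratio
    )
    # Report the observed values only when the column is flagged.
    categories = sorted(counts, key=str) if is_categorical else []
    return is_categorical, categories


# Integer-coded ratings: a typical column that gets annotated as integer.
ratings = ["1", "3", "2", "3", "1", "2", "3", "3", "1", "2"] * 2
# Row identifiers: integers that are not categorical.
ids = [str(i) for i in range(20)]

print(naive_categorical_guess(ratings))
print(naive_categorical_guess(ids))
```

A fixed threshold like this is brittle: a short column of identifiers or a long column with many rare categories would be misjudged, which is the kind of ambiguity a probabilistic approach is meant to handle.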

Code Implementations: 1 repo