LG, AI, CL, ML | Mar 5, 2021

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

arXiv:2103.03872v1 | 26 citations
Originality: Incremental advance
AI Analysis

The paper provides a theoretically grounded method for analyzing dataset characteristics in NLP, though it appears incremental because it applies existing MDL concepts to new evaluation tasks.

The paper asks whether a specific capability helps to model a dataset accurately. It introduces Rissanen Data Analysis (RDA), which uses minimum description length as a proxy for evaluating dataset characteristics, and demonstrates the method in several NLP settings: analyzing subquestions, rationales, parts of speech, and gender bias.

We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
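The comparison described in the abstract above can be illustrated with a small sketch. It assumes a block-wise prequential (online) code as the MDL estimator and a Laplace-smoothed count model as the learner; neither choice is stated on this page, and the function name, the toy data, and the block boundaries are all hypothetical. The idea is to pay for each block of labels using a model fit only on earlier blocks, then check whether giving the model the output of a capability shortens the total code.

```python
import math
from collections import Counter, defaultdict


def prequential_codelength(features, labels, num_classes, block_ends):
    """Bits needed to send `labels` given `features`, block by block.

    Hypothetical helper: each block is encoded with a Laplace-smoothed
    count model fit on all earlier blocks. No data precedes the first
    block, so it is encoded with a uniform code.
    """
    total_bits = 0.0
    start = 0
    for end in block_ends:
        # Fit the count model on everything transmitted so far.
        counts = defaultdict(Counter)
        for x, y in zip(features[:start], labels[:start]):
            counts[x][y] += 1
        # Pay -log2 p(y | x) for every label in the new block.
        for x, y in zip(features[start:end], labels[start:end]):
            seen = counts[x]
            p = (seen[y] + 1) / (sum(seen.values()) + num_classes)
            total_bits += -math.log2(p)
        start = end
    return total_bits


# Toy data (an assumption for illustration): the label is a parity bit.
labels = [i % 2 for i in range(64)]
block_ends = [8, 16, 32, 64]

# Without the capability the model sees a constant, uninformative input.
bits_without = prequential_codelength([0] * 64, labels, 2, block_ends)

# With the capability the model also sees the output of a subroutine
# that happens to compute the parity bit.
bits_with = prequential_codelength(labels, labels, 2, block_ends)

print(f"without capability: {bits_without:.1f} bits")
print(f"with capability:    {bits_with:.1f} bits")
print("capability helps:", bits_with < bits_without)
```

In this toy setting the capability is judged useful because the labels compress to fewer bits when its output is available. A real analysis would replace the count model with a trained learner and the toy data with the dataset under study.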

Code Implementations: 1 repo
