CL ME
Jan 19, 2022

Data-to-Value: An Evaluation-First Methodology for Natural Language Projects

arXiv:2201.07725v10.31 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses challenges for teams working on natural language projects at scale, but it appears incremental because it builds on existing methodologies such as CRISP-DM.

The paper tackles the problem that big data text analytics projects lack methodologies that account for large-scale processing, unstructured data, and non-technical aspects. It introduces the Data to Value (D2V) methodology, which uses a detailed question catalog to improve project success.

Big data, i.e. collecting, storing and processing of data at scale, has recently become possible due to the arrival of clusters of commodity computers powered by application-level distributed parallel operating systems like HDFS/Hadoop/Spark, and such infrastructures have revolutionized data mining at scale. To help data mining projects succeed more consistently, several methodologies were developed (e.g. CRISP-DM, SEMMA, KDD), but these do not account for (1) very large scales of processing, (2) dealing with textual (unstructured) data, i.e. Natural Language Processing (NLP, "text analytics"), and (3) non-technical considerations (e.g. legal, ethical, and project-management aspects). To address these shortcomings, a new methodology called "Data to Value" (D2V) is introduced. It is guided by a detailed catalog of questions, so that a big data text analytics project team does not lose touch with the topic when faced with the rather abstract box-and-arrow diagrams commonly associated with methodologies.
