CL, DB | Jun 14, 2022

"hasSignification()": a new distance function to support the detection of personal data

arXiv:2206.06836v1 | h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of personal data protection in big data environments for data managers, but it appears incremental as it builds on existing distance functions with modifications.

The paper tackles the problem of automatically detecting personal data in large datasets by proposing a new distance function, hasSignification(), to determine if attribute names are meaningful for storage in a knowledge base. The result is a method that uses an exponential function based on the longest sequence and double dictionary scanning to improve over existing distance functions like N-Gram and Levenshtein, though no concrete performance numbers are provided.
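To make the threshold difficulty concrete, here is a minimal sketch of one of the baseline functions named above, a length-normalized Levenshtein similarity. The helper names (`levenshtein`, `similarity`) and the sample strings are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the edit distance into a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical attribute names compared with the dictionary word "name".
nonsense = similarity("xame", "name")       # one substitution away
compound = similarity("firstname", "name")  # meaningful compound name
```

With these made-up inputs, the meaningless string `xame` scores higher against `name` than the meaningful compound `firstname` does, which hints at why a single acceptance threshold on such a score can be hard to choose.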

Today, with Big Data and data lakes, we face a mass of data that is very difficult to manage manually. Protecting personal data in this context requires automatic analysis for data discovery. Storing the names of attributes that have already been analyzed in a knowledge base could optimize this automatic discovery. To obtain a better knowledge base, we should not store any attribute whose name does not make sense. In this article, to check whether an attribute name has a meaning, we propose a solution that calculates the distances between this name and the words in a dictionary. Our studies of distance functions such as N-Gram, Jaro-Winkler and Levenshtein show their limits when setting an acceptance threshold for an attribute in the knowledge base. To overcome these limitations, our solution strengthens the score calculation by using an exponential function based on the longest sequence. In addition, a double scan of the dictionary is proposed in order to process attributes that have a compound name.
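The abstract does not give the actual formula, so the following is only a rough sketch of the two ideas it names: a score that grows exponentially with the longest shared sequence, and a second dictionary scan for compound names. Everything specific here is an assumption made for illustration: the longest common substring as the "longest sequence", the `steepness` and `threshold` parameters, the separator stripping, and the function name `has_signification` (a Python-style stand-in, not the authors' implementation).

```python
import math


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters shared contiguously by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            length = previous[j - 1] + 1 if ca == cb else 0
            current.append(length)
            best = max(best, length)
        previous = current
    return best


def exp_score(name: str, word: str, steepness: float = 3.0) -> float:
    """Assumed scoring: exponential in the shared-run ratio, scaled to [0, 1]."""
    longest = max(len(name), len(word))
    if longest == 0:
        return 0.0
    ratio = longest_common_substring(name, word) / longest
    # Convex curve: partial matches are pushed down, exact matches reach 1.
    return math.expm1(steepness * ratio) / math.expm1(steepness)


def has_signification(name: str, dictionary: list[str], threshold: float = 0.8) -> bool:
    """Hypothetical double scan: whole name first, then the remainder."""
    name = name.lower()
    words = [w.lower() for w in dictionary]
    # First scan: best-scoring dictionary word for the whole name.
    first = max(words, key=lambda w: exp_score(name, w), default=None)
    if first is None:
        return False
    if exp_score(name, first) >= threshold:
        return True
    # Second scan: only when the first word is fully contained in the name,
    # check whether what is left over is also a dictionary word.
    if first not in name:
        return False
    remainder = name.replace(first, "", 1).strip("_- ")
    if not remainder:
        return False
    return any(exp_score(remainder, w) >= threshold for w in words)


words = ["first", "name", "birth", "date"]  # toy dictionary
results = {n: has_signification(n, words) for n in ["firstname", "birth_date", "xame"]}
```

With this toy dictionary, `firstname` and `birth_date` are accepted through the second scan, while `xame` is rejected because no dictionary word is contained in it and its best score stays under the assumed threshold.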

