CR, AI, CY | Jun 25, 2020

Scalable Data Classification for Security and Privacy

arXiv:2006.14109v5
Originality: Incremental advance
AI Analysis

The paper addresses the problem of managing and securing large, dynamic data assets at organizations such as Facebook. It is an incremental improvement that combines existing techniques into a production system.

The paper tackles scalable content-based data classification for security and privacy at Facebook. It develops an end-to-end system that detects sensitive semantic types and automatically enforces data retention and access controls, achieving an average F2 score above 0.9 across privacy classes while handling a large number of data assets.
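For readers unfamiliar with the metric, the F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below computes it from confusion counts and averages it over classes. The class names, the counts, and the choice of an unweighted (macro) average are assumptions for illustration only; the summary does not say how the paper averages its scores.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts; beta > 1 favors recall over precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives and no predictions at all.
    return (1 + b2) * tp / denom if denom else 0.0


def macro_f_beta(counts: dict, beta: float = 2.0) -> float:
    """Unweighted mean of per-class F-beta (assumed averaging scheme)."""
    scores = [f_beta(tp, fp, fn, beta) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical per-class counts: (true positives, false positives, false negatives).
example_counts = {
    "EMAIL": (90, 30, 10),
    "PHONE": (80, 10, 20),
}
average_f2 = macro_f_beta(example_counts)
```

Because missed sensitive data (a false negative) is usually costlier than a false alarm in a privacy setting, a recall-weighted metric such as F2 is a natural fit.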

Content-based data classification is an open challenge. Traditional Data Loss Prevention (DLP)-like systems solve this problem by fingerprinting the data in question and monitoring endpoints for the fingerprinted data. With a large number of constantly changing data assets in Facebook, this approach is neither scalable nor effective in discovering what data is where. This paper describes an end-to-end system built to detect sensitive semantic types within Facebook at scale and enforce data retention and access controls automatically. The approach described here is our first end-to-end privacy system that attempts to solve this problem by incorporating data signals, machine learning, and traditional fingerprinting techniques to map out and classify all data within Facebook. The described system is in production, achieving an average F2 score of 0.9+ across various privacy classes while handling a large number of data assets across dozens of data stores.
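The limitation of fingerprint-based detection can be illustrated with a minimal sketch. This is not the paper's system: the `FingerprintIndex` class, the normalization rule, and the sample values are all hypothetical. It shows only the general idea of hashing known sensitive values and checking observed data against those hashes.

```python
import hashlib


def fingerprint(value: str) -> str:
    """Hash a value after simple normalization (lowercase, collapsed whitespace)."""
    normalized = " ".join(value.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintIndex:
    """Hypothetical index of fingerprints for values already known to be sensitive."""

    def __init__(self) -> None:
        self._known = {}  # fingerprint -> label

    def register(self, value: str, label: str) -> None:
        self._known[fingerprint(value)] = label

    def scan(self, values):
        """Return (value, label) pairs for observed values that match a known fingerprint."""
        hits = []
        for value in values:
            label = self._known.get(fingerprint(value))
            if label is not None:
                hits.append((value, label))
        return hits


index = FingerprintIndex()
index.register("alice@example.com", "EMAIL")

# Only the previously registered value is found; the unseen address is missed.
hits = index.scan(["Alice@Example.com ", "bob@example.com"])
```

The scan finds only values that were registered in advance, so sensitive data that has never been fingerprinted goes undetected. This is consistent with the abstract's point that fingerprinting alone does not discover what data is where.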

