Probabilistic Reasoning with LLMs for k-anonymity Estimation
This work addresses privacy risk assessment for data protection, though it is incremental as it builds on existing probabilistic reasoning methods.
The paper addresses privacy risk estimation for user-generated documents by introducing a new task in which large language models compute k-anonymity values. The proposed method succeeds 73% of the time, a 13% improvement over o3-mini with chain-of-thought reasoning.
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text, that is, the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator of accuracy, as high-variance predictions are 37.47% less accurate on average.
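To make the factorization idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the joint probability of the disclosed attributes is expanded by the chain rule into conditional factors, and that the k-value is the base population multiplied by their product. The function names, the attribute values, and every number are hypothetical, and the relative-spread measure is only one possible way to quantify the variance of repeated predictions.

```python
from math import prod
from statistics import mean, pstdev


def estimate_k(population: float, factors: list[float]) -> float:
    """Chain-rule estimate: k = N * P(a1) * P(a2 | a1) * ... (assumed form)."""
    for p in factors:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return population * prod(factors)


def relative_spread(k_samples: list[float]) -> float:
    """Standard deviation divided by the mean of repeated k estimates.

    Hypothetical uncertainty measure; the source does not say how
    variance across predictions is computed.
    """
    return pstdev(k_samples) / mean(k_samples)


# Hypothetical example: a city of 1,000,000 people and three disclosed
# attributes, each conditioned on the ones before it.
population = 1_000_000
factors = [
    0.5,   # P(attribute 1), e.g. gender
    0.1,   # P(attribute 2 | attribute 1), e.g. age bracket
    0.02,  # P(attribute 3 | attributes 1 and 2), e.g. occupation
]
k = estimate_k(population, factors)
print(f"estimated k: {k:.0f}")

# Hypothetical repeated estimates of the same document; a large spread
# would flag the prediction as less trustworthy.
samples = [900.0, 1000.0, 1100.0]
print(f"relative spread: {relative_spread(samples):.3f}")
```

With these made-up inputs the estimate is about 1,000 people. A smaller k means fewer people match the disclosed information, and therefore a higher re-identification risk.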