It does what it says on the tin: safe synthetic data from coarsened margins

arXiv:2606.021010.8

AI Analysis

For data custodians and users, this method provides a transparent and safe way to generate synthetic data with known disclosure risk controls, though it is an incremental improvement over existing synthetic data methods.

This paper introduces a method for generating synthetic data that preserves transparency about which variable relationships are maintained and guarantees the data is derived from information already deemed disclosure-free. The approach uses coarsened margins and iterative proportional fitting, demonstrated on 1901 Scottish Census data.

This paper proposes a method of creating synthetic data (SD) that offers the user two important advantages over other methods currently available. The first is transparency: unlike with other methods, the recipient of the SD knows which relationships between variables in the original data are approximately maintained in the SD. The second is a guarantee that the SD is derived from information that has already been judged to be free of disclosure risk. This is achieved by first defining and calculating the margins in which relationships between variables are to be maintained in the SD. Each margin is then subjected to statistical disclosure control (SDC) to the standards defined by the data custodian, e.g. top-coding and bottom-coding, combining small categories, and/or modifying small counts. A further adjustment of the curated margins is advised: coarsening all counts in each table to multiples of the disclosure limit. The adjusted margins are then used to create SD with the Iterative Proportional Fitting (IPF) algorithm. The practical steps involved in creating such SD are illustrated using data from the 1901 Census of Scotland.
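The coarsen-then-fit pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`coarsen`, `project`, `ipf`), the three binary variables, the two margins, the counts, and the disclosure limit of 5 are all hypothetical, and "coarsening" is taken here to mean rounding each count to the nearest multiple of the limit. The margins were chosen so that they still agree on their shared variable after coarsening; when they do not, IPF matches only the margin fitted last exactly.

```python
from itertools import product


def coarsen(margin, limit):
    """Round each count to the nearest multiple of `limit` (halves round up).

    Assumption: this rounding rule is one plausible reading of "coarsening".
    """
    return {cell: limit * ((count + limit // 2) // limit)
            for cell, count in margin.items()}


def project(table, axes):
    """Sum a full table down to the margin over the variables in `axes`."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals


def ipf(levels, margins, iterations=20):
    """Fit a full table to the target margins, starting from a uniform seed.

    levels:  number of categories of each variable, e.g. [2, 2, 2]
    margins: list of (axes, target) pairs, where `axes` is a tuple of
             variable indices and `target` maps margin cells to counts
    """
    table = {cell: 1.0 for cell in product(*(range(n) for n in levels))}
    for _ in range(iterations):
        for axes, target in margins:
            current = project(table, axes)
            for cell in table:
                key = tuple(cell[a] for a in axes)
                if current[key] > 0:
                    # Scale the cell so this margin matches its target.
                    table[cell] *= target.get(key, 0.0) / current[key]
    return table


# Invented counts for variables A, B, C (indices 0, 1, 2); not census data.
LIMIT = 5
raw_ab = {(0, 0): 12, (0, 1): 27, (1, 0): 33, (1, 1): 8}
raw_bc = {(0, 0): 21, (0, 1): 24, (1, 0): 17, (1, 1): 18}

safe_ab = coarsen(raw_ab, LIMIT)
safe_bc = coarsen(raw_bc, LIMIT)

# Only the A-B and B-C relationships are passed in, so only those are kept.
fitted = ipf([2, 2, 2], [((0, 1), safe_ab), ((1, 2), safe_bc)])
```

The fitted table holds expected counts, not individual records; turning it into record-level SD (for example by rounding or sampling) is a further step not shown here. Because only the A-B and B-C margins are supplied, any direct A-C relationship in the original data is not carried over, which is the transparency property the abstract describes.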
