PeGS: Perturbed Gibbs Samplers that Generate Privacy-Compliant Synthetic Data
Yubin Park(a),(*), Joydeep Ghosh(a)
Transactions on Data Privacy 7:3 (2014) 253 - 282
(a) Department of Electrical and Computer Engineering, The University of Texas at Austin, USA.
Abstract
This paper proposes a categorical data synthesizer algorithm that guarantees a quantifiable disclosure risk. Our algorithm, named Perturbed Gibbs Sampler (PeGS), can handle high-dimensional categorical data that are intractable if represented as contingency tables. PeGS involves three intuitive steps: 1) disintegration, 2) noise injection, and 3) synthesis. We first disintegrate the original data into building blocks that (approximately) capture essential statistical characteristics of the original data. This process is efficiently implemented using feature hashing and non-parametric distribution approximation. In the next step, an optimal amount of noise is injected into the estimated statistical building blocks to guarantee differential privacy or l-diversity. Finally, synthetic samples are drawn using a Gibbs sampler approach. California Patient Discharge data are used to demonstrate statistical properties of the proposed synthetic methodology. Marginal and conditional distributions as well as regression coefficients obtained from the synthesized data are compared to those obtained from the original data. Intruder scenarios are simulated to evaluate disclosure risks of the synthesized data from multiple angles. Limitations and extensions of the proposed algorithm are also discussed.
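The three steps named above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: the function names, the CRC32-based hashing of the conditioning context, the additive constant `alpha` used as the "noise", and the uniform random initialization of each chain are all hypothetical choices made for illustration. In particular, the sketch does not reproduce how the paper calibrates the noise to a differential privacy or l-diversity guarantee.

```python
import random
import zlib
from collections import Counter, defaultdict


def context_key(row, i, n_buckets):
    """Hash the values of every column except i into one of n_buckets buckets."""
    rest = tuple(v for j, v in enumerate(row) if j != i)
    return zlib.crc32(repr(rest).encode("utf-8")) % n_buckets


def fit_blocks(rows, n_buckets=64):
    """Step 1 (disintegration): per column, count values within each hashed context."""
    n_cols = len(rows[0])
    domains = [sorted({r[i] for r in rows}) for i in range(n_cols)]
    counts = [defaultdict(Counter) for _ in range(n_cols)]
    for r in rows:
        for i in range(n_cols):
            counts[i][context_key(r, i, n_buckets)][r[i]] += 1
    return domains, counts


def perturbed_conditional(domains, counts, i, key, alpha):
    """Step 2 (noise injection, assumed form): add alpha to every count, then normalize."""
    observed = counts[i].get(key, Counter())
    weights = [observed[v] + alpha for v in domains[i]]
    total = sum(weights)
    return [w / total for w in weights]


def synthesize(rows, n_samples, alpha=1.0, n_sweeps=5, n_buckets=64, seed=0):
    """Step 3 (synthesis): run a short Gibbs chain per synthetic row."""
    rng = random.Random(seed)
    domains, counts = fit_blocks(rows, n_buckets)
    synthetic = []
    for _ in range(n_samples):
        # Assumption: start from a uniformly random point, not from a real record.
        x = [rng.choice(d) for d in domains]
        for _ in range(n_sweeps):
            for i in range(len(x)):
                key = context_key(x, i, n_buckets)
                probs = perturbed_conditional(domains, counts, i, key, alpha)
                x[i] = rng.choices(domains[i], weights=probs)[0]
        synthetic.append(tuple(x))
    return synthetic


if __name__ == "__main__":
    toy = [("F", "flu", "CA"), ("M", "flu", "CA"), ("F", "asthma", "TX"), ("M", "injury", "TX")]
    for row in synthesize(toy, n_samples=3, alpha=0.5):
        print(row)
```

Because `alpha` is strictly positive in this sketch, every category keeps a nonzero probability under every context, so no value can be ruled out from the perturbed conditionals alone. Larger values of `alpha` push the conditionals toward uniform, trading statistical fidelity for lower disclosure risk.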