Random Forests for Generating Partially Synthetic, Categorical Data
Gregory Caiola(a), Jerome P. Reiter(a),(*)
Transactions on Data Privacy 3:1 (2010) 27 - 42
(a) Department of Statistical Science, Duke University, Durham, NC 27708, USA.
e-mail: gregory.caiola@duke.edu; jerry@stat.duke.edu
Abstract
Several national statistical agencies are now releasing partially synthetic, public use microdata.
These comprise the units in the original database with sensitive or identifying values replaced
with values simulated from statistical models. Specifying synthesis models can be daunting
in databases that include many variables of diverse types. These variables may be related in ways that
can be difficult to capture with standard parametric tools. In this article, we describe how random
forests can be adapted to generate partially synthetic data for categorical variables. Using an empirical
study, we illustrate that the random forest synthesizer can preserve relationships reasonably well
while providing low disclosure risks. The random forest synthesizer has some appealing features for
statistical agencies: it can be applied with minimal tuning, easily incorporates numerical, categorical,
and mixed variables as predictors, operates efficiently in high dimensions, and automatically fits
non-linear relationships.
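
The abstract does not spell out the synthesis algorithm, so the following is only a minimal sketch of one way a random forest could be turned into a synthesizer for a single categorical variable. It assumes that each replacement value is drawn from the distribution of per-tree predictions for that record, rather than set to the majority vote. The names `grow_tree`, `tree_vote`, and `synthesize`, the equality-based splits, and the defaults (`n_trees`, `max_depth`, `min_size`) are all illustrative assumptions, not taken from the article.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, labels, n_try, rng, max_depth=6, min_size=5):
    """Grow one classification tree on categorical predictors.

    Each node considers n_try randomly chosen predictors and splits on an
    equality test row[j] == value (a simplifying assumption of this sketch).
    A leaf is a bare label; an internal node is a dict.
    """
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(rows) < min_size or len(set(labels)) == 1:
        return majority
    best = None  # (weighted impurity, predictor index, split value)
    for j in rng.sample(range(len(rows[0])), n_try):
        for v in sorted(set(r[j] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[j] == v]
            right = [y for r, y in zip(rows, labels) if r[j] != v]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, v)
    if best is None:
        return majority
    _, j, v = best
    left = [(r, y) for r, y in zip(rows, labels) if r[j] == v]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] != v]
    return {
        "feature": j,
        "value": v,
        "left": grow_tree([r for r, _ in left], [y for _, y in left],
                          n_try, rng, max_depth - 1, min_size),
        "right": grow_tree([r for r, _ in right], [y for _, y in right],
                           n_try, rng, max_depth - 1, min_size),
    }


def tree_vote(node, row):
    """Run one record down one tree and return the leaf label."""
    while isinstance(node, dict):
        branch = "left" if row[node["feature"]] == node["value"] else "right"
        node = node[branch]
    return node


def synthesize(rows, labels, n_trees=50, seed=0):
    """Return synthetic replacements for `labels`, one per record.

    rows   : list of tuples of categorical predictor values (left unchanged)
    labels : list of original values of the variable to be synthesized
    """
    rng = random.Random(seed)
    n = len(rows)
    n_try = max(1, int(len(rows[0]) ** 0.5))  # common default: sqrt(#predictors)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    synthetic = []
    for row in rows:
        votes = [tree_vote(tree, row) for tree in forest]
        # Picking one tree's vote uniformly at random is the same as sampling
        # a category with probability equal to its share of the votes.
        synthetic.append(rng.choice(votes))
    return synthetic
```

Under these assumptions, `synthesize(rows, labels)` leaves the predictors untouched and returns a list of simulated values that only ever contains categories observed in `labels`. Sampling from the vote proportions, instead of taking the majority class, is what injects the randomness that separates the released values from the originals.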