Abstract
When checking frequency and magnitude tables for disclosure risk, the cell threshold (the minimum number of observations in each cell) is a crucial parameter. In rules-based environments, it is a hard limit on what can and cannot be published. In principles-based environments, it is less important, but it still affects the operational effectiveness of statistical disclosure control (SDC) processes.
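To make the rules-based case concrete, the sketch below applies a minimum-count rule to a small frequency table, suppressing any cell that falls under the threshold. The function name, the example data, and the pandas-based representation are illustrative assumptions rather than the paper's own implementation.

    import pandas as pd

    def check_cell_threshold(table: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
        # Suppress (mask) any cell whose count falls below the threshold;
        # masked cells appear as NaN and would be withheld from publication.
        return table.mask(table < threshold)

    # Hypothetical 2x3 frequency table with two cells below a threshold of 10.
    counts = pd.DataFrame(
        {"A": [12, 3], "B": [25, 40], "C": [7, 18]},
        index=["group1", "group2"],
    )
    print(check_cell_threshold(counts, threshold=10))
    # The cells with counts 3 and 7 are suppressed (shown as NaN).

Under a rules-based regime, the masked cells simply cannot be released; under a principles-based regime, the same check serves only as a prompt for human review.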
Determining the appropriate threshold is an unsolved problem. Ten is a common threshold value for both national statistics and research outputs, but five and twenty are also popular, and some organisations use multiple thresholds for different data sources. These higher thresholds are all entirely subjective. Three is the only threshold with an objective statistical foundation (in a cell of two, each respondent can deduce the other's value), but most organisations argue that this leaves little margin for error. Unfortunately, there is no equivalent statistical case for any number larger than three: ten is popular because it is popular. This is particularly true of research environments, where no formal guidance exists.
This paper provides the first empirical foundation for threshold selection by modelling alternative threshold values on both synthetic and real datasets. The paper demonstrates that this is a complex question. The trade-off between risk and value is well known, but we show that the protection afforded by a higher threshold depends on the risk measure. There is no monotonic relationship between threshold and risk: higher thresholds can increase disclosure risk in particular scenarios, and the blind application of high-threshold rules might mask new risks. There is no unambiguous result, other than the simplistic ones that more observations reduce risk and higher thresholds reduce utility.
Finally, the paper notes that, for some risk scenarios, a reconsideration of disclosure checking practices can reduce risk irrespective of the threshold.