Abstract
A major growth area in social science research this century has been access to highly sensitive confidential microdata, often via restricted-access remote facilities. These allow researchers almost unlimited freedom to manipulate the data, but with checks for disclosure risk before statistical results can be published. Effective output-based statistical disclosure control (OSDC) is therefore central to effective use of confidential microdata for research.
Multiple regression is a key analytical tool for researchers, and so knowing whether multiple regression results are 'safe' for release is essential for research facilities. This is a relatively unexplored field; guidelines used by almost all restricted-access facilities reference an informal document from 2006, but more recent work suggests that problems may exist.
This paper demonstrates that linear regression coefficients show no substantive disclosure risks in realistic environments, and so should be considered as 'safe statistics' in the terminology of this field. Conflicting results in the literature reflect institutional perceptions rather than statistical differences, the confusion of statistical quality with disclosure risk, or the failure to identify the source of risk. The result has important implications for those responsible for providing research access to sensitive data.
The paper explores this result on simple linear regression models; more complex models are shown to be 'safer' subsets. Non-linear models pose slightly different problems, but this paper indicates a way such models may be tackled.