Optimizing Privacy and Data Utility: Metrics and Strategies
Clémence Mauger(a), Gaël Le Mahec(a),(*), Gilles Dequen(a)
Transactions on Data Privacy 16:3 (2023) 153 - 189
(a) Université de Picardie Jules Verne - MIS Laboratory, 33 rue Saint-Leu, Amiens, 80000, France.
e-mail: clemence.mauger@u-picardie.fr; gael.le.mahec@u-picardie.fr; gilles.dequen@u-picardie.fr
Abstract
k-anonymity is a Privacy-Preserving Data Publishing (PPDP) anonymization model that prevents identity disclosure by making each record of a table indistinguishable from at least k − 1 others. To obtain a k-anonymous version of a table, a common technique is to generalize the quasi-identifier attribute values until the records are grouped into equivalence classes of size at least k. The choice of which records to group together influences the amount of generalization required and therefore the quality of the anonymized data (the more a value is generalized, the more precision it loses). The different k-anonymous versions of a table are thus more or less interesting in terms of data utility. Information loss metrics are often used to assess the quality of a k-anonymized table. They can also be used within the k-anonymization process itself, to choose the groupings of records that cause the least data alteration. In this article, we propose a unified modeling of such metrics, facilitating their implementation and their use. We then analyze the behaviors of seven metrics when they are used during k-anonymization to guide the merging of equivalence classes. Our analyses compare these seven metrics on two public tables for 14 values of k. We then turn to the limits of k-anonymity: in a k-anonymous table, the distribution of sensitive values within the equivalence classes can still lead to the disclosure of sensitive information about an individual. The l-diversity and t-closeness anonymization models impose constraints that keep the distribution of sensitive values under control and therefore limit attribute disclosure. We continue our study of k-anonymization by proposing strategies aimed at optimizing the data alteration, the l-diversity, and the t-closeness of the k-anonymous tables produced. Using two information loss metrics, we evaluate the seven optimization strategies on the two public tables, first with the real sensitive value distributions and then with 21 simulated sensitive value distributions. With this large-scale study, we aim to understand how to choose a metric and an optimization strategy so as to produce k-anonymous databases that offer strong privacy guarantees while preserving as much data utility as possible.
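As a minimal illustration of the notions above (not part of the paper itself), the following Python sketch checks whether a table is k-anonymous with respect to a chosen set of quasi-identifier attributes and computes the discernibility metric, a classical information loss measure that sums the squared sizes of the equivalence classes; the toy table, column indices, and function names are hypothetical, and this metric may or may not be among the seven studied in the article.

from collections import Counter

def equivalence_classes(records, qi_indices):
    # Group records by their quasi-identifier signature; return the class sizes.
    return Counter(tuple(r[i] for i in qi_indices) for r in records)

def is_k_anonymous(records, qi_indices, k):
    # A table is k-anonymous if every equivalence class has at least k records.
    return all(size >= k for size in equivalence_classes(records, qi_indices).values())

def discernibility(records, qi_indices):
    # Discernibility metric: sum of squared class sizes (lower means less information loss).
    return sum(size ** 2 for size in equivalence_classes(records, qi_indices).values())

# Toy table with generalized ZIP codes and age ranges as quasi-identifiers
# and a diagnosis as the sensitive attribute.
table = [
    ("130**", "[20-30)", "flu"),
    ("130**", "[20-30)", "cold"),
    ("148**", "[30-40)", "flu"),
    ("148**", "[30-40)", "cancer"),
]
print(is_k_anonymous(table, qi_indices=(0, 1), k=2))  # True
print(discernibility(table, qi_indices=(0, 1)))       # 2**2 + 2**2 = 8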