Differential privacy and k-anonymity for machine learning


User privacy is a rising concern in today's data-driven world. We'll investigate the impact of anonymization techniques on public medical datasets that contain private patient information which could allow re-identification attacks.

We will evaluate a feed-forward neural network on a dataset anonymized with local differential privacy (Laplace noise) and with k-anonymity, using MATLAB.

Data privacy has gained a lot of attention in recent years. Since leakage of user information is a constant issue, companies use different strategies to protect it. Many companies collect user data for internal usage, and they sometimes make this data publicly available through datasets. To protect users' identities, data engineers apply differential privacy and other anonymization techniques before release.

We'll investigate two of the most common data anonymization algorithms for masking the private information of participants in a diabetes study. We propose a feed-forward neural network model that achieves high accuracy even when the data is anonymized using Laplace noise and k-anonymity.

All the code used for this demo is available below:

Data privacy is a concern that both companies and customers care about. Differential privacy allows data providers to share private information publicly in a safe manner: the dataset can be used to describe patterns and statistics of groups, but not of any single individual.

To protect the privacy of individuals, differential privacy adds noise to the data to mask the real values, thus making them private. By doing this, we hide each individual's identity with little to no impact on the utility of the data: the statistical outcomes from the dataset should not be influenced by any single individual's contribution, since the data represents the characteristics of an entire population. Let D and D' represent two neighbouring datasets that differ in only one record. Differential privacy states that, once the private attributes are protected with noise calibrated to a privacy budget 𝜖, an observer cannot tell whether a particular entry exists in the database or not [1].

Let 𝑃 be a randomized algorithm, 𝑅 its set of possible outcomes, and 𝑆 any subset of 𝑅. We say 𝑃 is 𝜖-differentially private on D and D' if equation 1 holds for every such 𝑆 [1].

$$\Pr[P(D) \in S] \le e^{\epsilon} \, \Pr[P(D') \in S] \qquad (1)$$

The most common noise mechanisms for differential privacy are the Laplace, exponential, and Gaussian mechanisms. They work by adding noise to the original data entry and can be applied to both real-valued and categorical features. The Laplace distribution is a symmetric version of the exponential distribution, and the Laplace mechanism adds noise drawn from this symmetric continuous distribution to the true answer, according to equation 2 [1].

$$\mathcal{M}_L(x, f, \epsilon) = f(x) + \mathrm{Lap}\!\left(0, \frac{\Delta f}{\epsilon}\right) \qquad (2)$$

where $\mathrm{Lap}(0, b)$ denotes Laplace noise with mean 0 and scale $b$, and $\Delta f$ is the sensitivity of the query $f$.

The exponential mechanism, on the other hand, selects and outputs an element 𝑟 ∈ 𝑅 with probability proportional to equation 3.

$$\Pr\big[\mathcal{M}_E(x, u, R) = r\big] \propto \exp\!\left(\frac{\epsilon \, u(x, r)}{2\Delta u}\right) \qquad (3)$$

where 𝑥 is an input and 𝑢 is a utility function with sensitivity Δ𝑢.

K-anonymity was first proposed in [4]. A released dataset satisfies k-anonymity if the information for each person contained in it cannot be distinguished from that of at least 𝑘 − 1 other individuals whose information also appears in the released dataset.

There are two methods for achieving k-anonymity: suppression and generalization. The former replaces some of the entries with an asterisk '*', while the latter groups the entries into broader categories.

The dataset we'll use is the Early Stage Diabetes dataset [2], which is publicly available on the UCI Machine Learning Repository. It comprises 520 records collected in 2020 through direct questionnaires from patients of the Sylhet Diabetes Hospital in Sylhet, Bangladesh, and contains binary and real-valued features describing the patients' medical conditions and characteristics.

It is a balanced dataset with 16 features in total, including the age and gender of the participants. For our analysis, however, the only attribute we treat as private is the age.

We will train our model using a feed-forward neural network (FFNN). This model was chosen for its simplicity and efficacy on categorical classification tasks. We first shuffle the data to eliminate any ordering bias, then split it into 80% for training and 20% for testing, and further hold out 20% of the training set for validation. Let's create a neural network with two hidden layers of 48 and 32 neurons, respectively, using the code below in MATLAB.
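A minimal sketch of how this could look with MATLAB's Deep Learning Toolbox (variable names such as Xtrain are assumptions, not taken from the original code):

% Pattern-recognition feed-forward network with two hidden layers (48 and 32 neurons)
net = patternnet([48 32]);

% Hold out 20% of the training portion for validation; the test set is kept separate
net.divideParam.trainRatio = 0.8;
net.divideParam.valRatio   = 0.2;
net.divideParam.testRatio  = 0;

view(net)   % visualize the network architecture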

When we run the code above, we can visualize our network.

Figure: the resulting feed-forward network with two hidden layers of 48 and 32 neurons.

For the transfer function, we'll use the symmetric sigmoid transfer function in combination with the cross-entropy loss (logistic loss). Let's set the regularization parameter 𝜆 of the loss function to 0.01 to prevent overfitting. The algorithm used for training is scaled conjugate gradient backpropagation.
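A hedged sketch of this configuration using the standard toolbox properties (the training call and the Xtrain/Ytrain names are assumptions):

net.layers{1}.transferFcn = 'tansig';       % symmetric sigmoid in both hidden layers
net.layers{2}.transferFcn = 'tansig';
net.performFcn = 'crossentropy';            % cross-entropy (logistic) loss
net.performParam.regularization = 0.01;     % lambda = 0.01 to limit overfitting
net.trainFcn = 'trainscg';                  % scaled conjugate gradient backpropagation

[net, tr] = train(net, Xtrain', Ytrain');   % features and one-hot targets as columns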

The FFNN achieved an accuracy of 97.1% and an F1-score of 97.6%.
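These metrics can be computed from the confusion matrix on the held-out test set; a sketch, with variable names assumed:

scores = net(Xtest');                        % network outputs for the test set
[~, predClass] = max(scores, [], 1);         % predicted class per sample
[~, trueClass] = max(Ytest', [], 1);         % true class from the one-hot targets
C = confusionmat(trueClass, predClass);      % rows: true class, columns: predicted
accuracy  = sum(diag(C)) / sum(C(:));
precision = C(2,2) / (C(2,2) + C(1,2));      % treating class 2 as the positive class
recall    = C(2,2) / (C(2,2) + C(2,1));
f1        = 2 * precision * recall / (precision + recall);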


This performance was slightly better than the one reported by the original study on this dataset [2], which evaluated Naïve Bayes, Logistic Regression, and Random Forest (RF) models. Several other algorithms were evaluated in [3], including Support Vector Machine, Decision Tree, K-Nearest Neighbor, Naïve Bayes, Random Forest (RF), and Logistic Regression, and none of them matched the accuracy of our model. The same train/test split was used in [2, 3]. The only approach with performance similar to ours was the Adaptive Particle Grey Wolf Optimization (APGWO) model used in [3], with 97% accuracy.

Our first experiment consists of adding Laplace noise to the private attribute to make it anonymous; by doing this, we perform local differential privacy on our dataset. For each entry in the age column, we set the sensitivity Δ𝑢 to 1 and 𝜖 to 0.1. We then generated Laplace noise with mean 0 and scale Δ𝑢/𝜖 and added it to the original values, as shown in equation 4.

$$\tilde{x} = x + \mathrm{Lap}\!\left(0, \frac{\Delta u}{\epsilon}\right) \qquad (4)$$

We first need to implement the Laplace noise in MATLAB. There are several ways to do so, and here is one:
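One option is inverse-CDF sampling from the Laplace distribution; a minimal sketch (the function name laplace_noise is ours; save it as laplace_noise.m):

function noise = laplace_noise(n, mu, b)
% LAPLACE_NOISE  Draw n samples from a Laplace(mu, b) distribution
% using inverse-CDF sampling.
    u = rand(n, 1) - 0.5;                                % U ~ Uniform(-0.5, 0.5)
    noise = mu - b .* sign(u) .* log(1 - 2 .* abs(u));
end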

Now let's add the noise to the dataset.
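A hedged sketch of this step, assuming the dataset is loaded into a table called data with an Age column:

epsilon     = 0.1;                          % privacy budget
sensitivity = 1;                            % Delta u for the age attribute
b           = sensitivity / epsilon;        % Laplace scale Delta u / epsilon
originalAge = data.Age;                     % keep a copy for comparison
data.Age    = round(originalAge + laplace_noise(height(data), 0, b));  % noisy ages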

The new dataset had an age distribution similar to the original data. The histogram for the new age distribution is shown in the figure below.

Figure: histogram of the age distribution after adding Laplace noise.
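One way to reproduce such a comparison plot (variable names follow the sketch above; bin width is an assumption):

histogram(originalAge, 'BinWidth', 5); hold on
histogram(data.Age, 'BinWidth', 5)           % ages after adding Laplace noise
legend('Original age', 'Age with Laplace noise')
xlabel('Age'); ylabel('Count'); hold off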

The second approach applies the k-anonymity strategy to anonymize the participants' ages. We'll use the generalization technique, since the age feature can be grouped into ranges. We created 9 distinct groups and classified this attribute according to the figure below.

Figure: the nine age ranges used to generalize the age attribute.
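A minimal sketch of this generalization using discretize (the band edges below are assumptions; the ranges actually used are those in the figure above):

edges   = [0 20 30 40 50 60 70 80 90 Inf];  % nine bands: <20, 20-29, ..., 80-89, 90+
ageBand = discretize(originalAge, edges);   % band index (1..9) for each participant
dataKAnon     = data;                       % separate copy for the k-anonymity experiment
dataKAnon.Age = ageBand;                    % the generalized value replaces the raw age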

Comparison

We ran our model against the two anonymization algorithms and recorded the results. The table below compiles the performance of our model with and without data anonymization.

Table: accuracy and F1-score of the FFNN on the original and anonymized datasets (Laplace noise and k-anonymity).

As can be seen, our FFNN performed remarkably well on the anonymized dataset. The best performance was reached when using Laplace noise to anonymize the dataset; this configuration reduced the accuracy by only 2.9%. The generalization algorithm used for k-anonymity also preserved the utility of the dataset well, reducing it by only 3.8%. Therefore, the feed-forward neural network performed well on this dataset with both Laplace noise and k-anonymity.

Data utility and privacy are of utmost importance for preserving sensitive information when releasing a new dataset. Differential privacy and k-anonymity are two of the strategies used for data anonymization, and several solutions have been developed around them. We presented a feed-forward neural network (FFNN) model for prediction on an anonymized dataset.

We also evaluated the performance of two anonymization approaches, the Laplace noise and k-anonymity, for anonymizing private attributes in the Early Stage Diabetes dataset.

That's all folks! I hope you liked this small demo about data anonymization.

I'm an M.A.Sc. student at York University and a software engineer at heart. Over the past decade, I've worked in several industries in areas such as software development, cloud computing, and systems engineering. Currently, I'm doing research on cloud computing and distributed systems at the PACS Lab.

[1] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (2014), 211–407. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

[2] M. M. Faniqul Islam, Rahatara Ferdousi, Sadikur Rahman, and Humayra Yasmin Bushra. 2020. Likelihood Prediction of Diabetes at Early Stage Using Data Mining Techniques. In Computer Vision and Machine Intelligence in Medical Image Analysis, Mousumi Gupta, Debanjan Konar, Siddhartha Bhattacharyya, and Sambhunath Biswas (Eds.). Springer Singapore, Singapore, 113–125.


[3] Tuan Minh Le, Thanh Minh Vo, Tan Nhat Pham, and Son Vu Truong Dao. 2021. A Novel Wrapper-Based Feature Selection for Early Diabetes Prediction Enhanced With a Metaheuristic. IEEE Access 9 (2021), 7869–7884. https://doi.org/10.1109/ACCESS.2020.3047942

[4] Pierangela Samarati and Latanya Sweeney. 1998. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. (1998). https://epic.org/privacy/reidentification/Samarati_Sweeney_paper.pdf
