Differences
This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
| dcc:pdpsol:de-identification [2026/02/17 09:32] – add section marlon | dcc:pdpsol:de-identification [2026/02/17 10:09] (current) – marlon | ||
|---|---|---|---|
| Line 2: | Line 2: | ||
| ===== De-identification ===== | ===== De-identification ===== | ||
| ==== Introduction ==== | ==== Introduction ==== | ||
| - | De-identification is a data protection method that can effectively protect your research participants. It is especially important | + | De-identification is the masking, manipulation or removal of personal data with the aim of making individuals in a dataset harder to identify. It is especially important when you want to share, publish or archive | ||
| - | ==== Remove | + | ==== Anonymization versus pseudonymization ==== |
| + | |||
| + | === Pseudonymization === | ||
| + | Pseudonymization is a de-identification procedure that is often implemented during data collection. During pseudonymization, personally identifiable information is replaced by a unique alias or code (pseudonym). In general, the names and/or contact details of data subjects are stored with this pseudonym in a so-called keyfile. The keyfile enables the re-identification of individuals in the dataset. Keyfiles should be stored separately from the rest of the data, with restricted access. In contrast to an anonymized dataset, a pseudonymized dataset in principle still allows for the re-identification of data subjects. | ||
| + | |||
| + | [[pseudonymization|→ Refer to our page on pseudonymization for practical advice on its implementation.]] | ||
| + | |||
| + | === Anonymization === | ||
| + | Anonymization is a de-identification procedure during which “personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.” | ||
| + | |||
| + | **Warning: | ||
| + | |||
| + | ==== De-identification techniques ==== | ||
| + | There are several techniques that can make your dataset less identifiable. The techniques below can help you de-identify your data, but be aware that they often reduce the analytical value of the dataset. | ||
| + | |||
| + | === Removing or suppressing | ||
| Consider whether you can remove or suppress sensitive elements. | Consider whether you can remove or suppress sensitive elements. | ||
| * Remove variables that reveal rare personal attributes. | * Remove variables that reveal rare personal attributes. | ||
| Line 10: | Line 25: | ||
| * Use restricted access to your data and only provide those variables to researchers that are necessary to answer their research question. | * Use restricted access to your data and only provide those variables to researchers that are necessary to answer their research question. | ||
| - | ==== Replace | + | === Replacing |
| A practice in which you replace sensitive personal data with values or codes that are not sensitive: | A practice in which you replace sensitive personal data with values or codes that are not sensitive: | ||
| * Replace direct identifiers (‘name’) with a pseudonym (‘X’). | * Replace direct identifiers (‘name’) with a pseudonym (‘X’). | ||
| * Make numerical values less precise. | * Make numerical values less precise. | ||
| * Replace identifiable text with ‘[redacted]’. | * Replace identifiable text with ‘[redacted]’. | ||
| - | Masking is typically partial, i.e. applied only to some characters in the attribute. For example, in the case of a postal code: change 9746DC into 97****. | + | Masking is typically partial, i.e. applied only to some characters in the attribute. For example, in the case of a postal code: change 9746DC into 97∗∗∗∗. |
| - | ==== Aggregation & generalization | + | === Aggregation & generalization === |
| Reduce the level of detail of your dataset by generalizing variables, which makes it harder to identify individual subjects. This can be applied to both quantitative and qualitative datasets. For example, change addresses into a neighborhood or city, and change birth date or age into an age group. | Reduce the level of detail of your dataset by generalizing variables, which makes it harder to identify individual subjects. This can be applied to both quantitative and qualitative datasets. For example, change addresses into a neighborhood or city, and change birth date or age into an age group. | ||
| - | ==== Bottom- and top-coding | + | === Bottom- and top-coding === |
| Bottom- and top-coding can be applied to datasets with unique extreme values. Set a minimum or maximum and recode all values below the minimum or above the maximum to that threshold. For instance, top-code the variable ‘income’ by setting all incomes over €100.000 to €100.000. This distorts the distribution, | Bottom- and top-coding can be applied to datasets with unique extreme values. Set a minimum or maximum and recode all values below the minimum or above the maximum to that threshold. For instance, top-code the variable ‘income’ by setting all incomes over €100.000 to €100.000. This distorts the distribution, | ||
| - | ==== Adding noise ==== | + | === Adding noise === |
| Noise addition is usually combined with other anonymization techniques and is mostly (but not always) applied to quantitative datasets: | Noise addition is usually combined with other anonymization techniques and is mostly (but not always) applied to quantitative datasets: | ||
| * Add half a standard deviation to a variable. | * Add half a standard deviation to a variable. | ||
| Line 29: | Line 44: | ||
| * Blur photos and videos or alter voices in audio recordings. | * Blur photos and videos or alter voices in audio recordings. | ||
| - | ==== Permutation | + | === Permutation === |
| Permutation is applied to quantitative datasets. Shuffle the values of attributes in a table so that some of them are artificially linked to different data subjects. This retains the exact distribution of each attribute, but makes identification of data subjects more difficult. | Permutation is applied to quantitative datasets. Shuffle the values of attributes in a table so that some of them are artificially linked to different data subjects. This retains the exact distribution of each attribute, but makes identification of data subjects more difficult. | ||
| - | ==== Synthetic data ==== | + | === Synthetic data === |
| Synthetic data are artificially generated rather than collected from real-world events (e.g., flight simulators or audio synthesizers). In research, synthetic datasets can be designed to replicate the statistical patterns of real datasets that are too sensitive to share openly. Creating a synthetic version of your dataset allows researchers to: | Synthetic data are artificially generated rather than collected from real-world events (e.g., flight simulators or audio synthesizers). In research, synthetic datasets can be designed to replicate the statistical patterns of real datasets that are too sensitive to share openly. Creating a synthetic version of your dataset allows researchers to: | ||
| * Access relevant data without compromising the privacy or safety of data subjects. | * Access relevant data without compromising the privacy or safety of data subjects. | ||
| Line 42: | Line 57: | ||
| For more in-depth information on these techniques, including guarantees, common mistakes, and potential failures, please refer to (Chapter 3 of) the Opinion 05/2014 on Anonymisation Techniques ([[https:// | For more in-depth information on these techniques, including guarantees, common mistakes, and potential failures, please refer to (Chapter 3 of) the Opinion 05/2014 on Anonymisation Techniques ([[https:// | ||
| - | |||
| - | This page is under construction! | ||
| - | |||
| - | [[dcc:start | → Go back to DCC home page]] | ||
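
Several of the techniques described in the current revision (partial masking, generalization into age groups, top-coding and permutation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a vetted anonymization tool: the function names, parameters and default values are hypothetical and do not come from the page itself.

```python
import random


def mask_postal_code(code: str, keep: int = 2) -> str:
    """Partial masking: keep the first `keep` characters, replace the rest with '*'."""
    return code[:keep] + "*" * max(len(code) - keep, 0)


def age_group(age: int, width: int = 10) -> str:
    """Generalization: map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def top_code(values, ceiling):
    """Top-coding: recode every value above `ceiling` to `ceiling`."""
    return [min(v, ceiling) for v in values]


def permute_column(values, seed=None):
    """Permutation: shuffle one attribute so values are linked to different subjects.

    The set of values (and so the attribute's distribution) is unchanged.
    """
    rng = random.Random(seed)  # hypothetical seed parameter, for reproducibility only
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled
```

For example, `mask_postal_code("9746DC")` gives `97****`, `age_group(37)` gives `30-39`, and `top_code([50000, 120000], 100000)` gives `[50000, 100000]`. As noted on the page, each of these steps lowers identifiability at the cost of analytical value, so the thresholds and band widths would need to be chosen per dataset.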