Differences

This shows you the differences between two versions of the page.

--- dcc:pdpsol:de-identification [2026/04/29 14:50] – add synthetic data examples marlon
+++ dcc:pdpsol:de-identification [2026/05/13 13:23] (current) – alba
@@ Line 1: / Line 1: @@
 {{indexmenu_n>2}}
-===== De-identification, Anonymization and Pseudonymization =====
+===== De-identification, anonymization and pseudonymization =====
 ==== Introduction ====
 De-identification is the masking, manipulation or removal of personal data with the aim of making individuals in a dataset less easy to identify. It is especially important when you want to share, publish or archive your dataset, but it can also help protect your participants' privacy in case of a [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/data-protection/data-leak|data leak]] during your research. During the different phases of your research, you should determine whether it is possible to de-identify your dataset while also keeping in mind its usability.
@@ Line 18: / Line 18: @@
 **Warning:** de-identification does not equal anonymization. Even if all [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/gdpr-research/essential-concepts|direct identifiers]] and your pseudonymization key have been replaced or removed, it might still be possible to re-identify some data subjects in your data because, in combination, certain attributes (e.g., the combination of height, occupation and location of data collection) may single out an individual.
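The warning above can be illustrated with a minimal sketch that counts how many records share each combination of indirect identifiers. The function name ''singled_out'', the threshold ''k'' and the toy records are hypothetical and only serve as an illustration; a real assessment also has to consider the context of your research.

```python
from collections import Counter


def singled_out(records, quasi_identifiers, k=2):
    """Return the attribute combinations shared by fewer than k records."""
    combos = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in combos.items() if n < k}


# Hypothetical toy data: no names or IDs, yet one row is unique.
records = [
    {"height_cm": 180, "occupation": "teacher", "location": "Groningen"},
    {"height_cm": 180, "occupation": "teacher", "location": "Groningen"},
    {"height_cm": 201, "occupation": "judge", "location": "Leeuwarden"},
]

risky = singled_out(records, ["height_cm", "occupation", "location"])
print(risky)  # {(201, 'judge', 'Leeuwarden'): 1}
```

Any combination returned here occurs only once in the dataset, so the corresponding participant could be singled out even though no direct identifier is present.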
 ----
-**Table 1:** Five levels of Pseudonymization and Anonymization. Adapted from [[https://lcrdm.nl/wp-content/uploads/2023/03/LCRDM-Risk-management-for-research-data-about-people.pdf|LCRDM (2019)]] by [[https://journals.sagepub.com/doi/10.1177/25152459251336130| Van Ravenzwaaij et al, 2025]].
+**Table 1:** De-identification matrix adapted from [[https://lcrdm.nl/wp-content/uploads/2023/03/LCRDM-Risk-management-for-research-data-about-people.pdf|LCRDM (2019)]]. This matrix is an example of what de-identification and anonymization could look like in research. The identifiability of your data largely depends on the context of your research and only partly on the variables you collected. For example, knowing that someone is a judge makes a person living in Leeuwarden easier to identify than a person living in Amsterdam, because more judges live in Amsterdam.
 ----
-{{:dcc:pdpsol:de-identification:pseudmatrix.png?direct&800|}}
+{{:dcc:pdpsol:de-identification:de-identification_matrix.png?direct&800|}}
 ==== General de-identification techniques ====
-There are several techniques that can help you make your dataset less identifiable. You can apply these techniques during different phases of your research:
+Use the de-identification techniques outlined below to reduce the identifiability of your dataset. Be aware that these techniques often affect its analytical value. Therefore, always make sure to document the way you transformed your data.
+You can apply these techniques during different phases of your research:
-  * After data collection to protect participants when analyzing their data
+  * After data collection, to protect participants when analyzing their data
   * Before sharing data with collaborators or other third parties
   * Before archiving data
   * Before publishing data (with access restrictions)
-Be aware that these techniques often affect its analytical value. Therefore, always make sure to document the way you transformed your data.
@@ Line 78: / Line 81: @@
   * Evaluate whether the dataset suits their research needs and begin developing code, refining models, and testing hypotheses.
   * Educate students on how to preprocess and analyze sensitive data without exposing information about real individuals.
+Two examples of synthetic data publications:
+  * [[https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HNA8QQ|How Do Psychological and Physiological Performance Determinants Interact Within Individual Athletes? An Analytical Network Approach]] (Neumann and colleagues, 2024)
+  * [[https://doi.org/10.17605/osf.io/eqbd3|De-identification when making datasets FAIR: Two worked examples from the behavioral and social sciences.]] (van Ravenzwaaij et al., 2025)
 [[https://www.youtube.com/watch?v=Im0jqBVRJgI&t=10s|Watch this video for an accessible introduction to synthetic data]]
-Two synthetic data examples:
-  - [[https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HNA8QQ|How Do Psychological and Physiological Performance Determinants Interact Within Individual Athletes? An Analytical Network Approach (Neumann and colleagues, 2024)]]
-  - [[https://doi.org/10.17605/osf.io/eqbd3|De-identification when making datasets FAIR: Two worked examples from the behavioral and social sciences.(van Ravenzwaaij et al., 2025)]]
 ----
@@ Line 121: / Line 123: @@
 ++++
 ++++ (Click) Acoustic de-identification |
-If the research extends beyond the textual content and transcript analysis alone is insufficient, additional de-identification measures may be considered. In such cases, parts of the audio data can be modified to protect the identity of your participants. For audio recordings, editing software such as [[https://nl.wikipedia.org/wiki/Audacity|Audacity]] can be used to alter or distort voices or to mask personal information (e.g., by muting or inserting bleeps). Be aware that applying these techniques can be time-consuming and can also heavily impact the usability of the data.
+If the research extends beyond the textual content and transcript analysis alone is insufficient, additional de-identification measures may be considered. In such cases, parts of the audio data can be modified to protect the identity of your participants. For audio recordings, editing software such as [[https://nl.wikipedia.org/wiki/Audacity|Audacity]] and [[https://www.fon.hum.uva.nl/praat/|Praat]] can be used to alter or distort voices or to mask personal information (e.g., by muting or inserting bleeps). Be aware that applying these techniques can be time-consuming and can also heavily impact the usability of the data if you are interested in speech features.
 ++++
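As a rough illustration of the muting approach mentioned under acoustic de-identification, the sketch below silences a time range in an uncompressed WAV file using only Python's standard ''wave'' module. It is not how Audacity or Praat work internally. The function names ''mute_segment'' and ''mute_wav'' are hypothetical, and the code assumes signed PCM audio with a sample width of at least 2 bytes, where silence corresponds to zero bytes.

```python
import wave


def mute_segment(frames, framerate, start_s, end_s, sample_width=2, n_channels=1):
    """Replace the audio between start_s and end_s (in seconds) with silence.

    Assumes signed PCM data, where silence is encoded as zero bytes.
    """
    frame_size = sample_width * n_channels
    total_frames = len(frames) // frame_size
    # Clamp the requested range to the available frames.
    first = max(0, min(total_frames, int(start_s * framerate)))
    last = max(first, min(total_frames, int(end_s * framerate)))
    silence = bytes((last - first) * frame_size)
    return frames[: first * frame_size] + silence + frames[last * frame_size:]


def mute_wav(src_path, dst_path, start_s, end_s):
    """Write a copy of src_path to dst_path with one time range silenced."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    muted = mute_segment(
        frames, params.framerate, start_s, end_s,
        params.sampwidth, params.nchannels,
    )
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(muted)
```

For example, ''mute_wav("interview.wav", "interview_muted.wav", 12.0, 14.5)'' would silence the passage between 12 and 14.5 seconds; the file names are placeholders. Always keep the original recording in a secure location, because muting cannot be undone in the copy.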