Differences

This shows you the differences between two versions of the page.

dcc:pdpsol:de-identification [2026/04/28 12:32] marlon dcc:pdpsol:de-identification [2026/05/13 13:23] (current) alba
Line 1: Line 1:
 {{indexmenu_n>2}} {{indexmenu_n>2}}
-===== De-identification =====+===== De-identification, anonymization and pseudonymization =====
 ==== Introduction ==== ==== Introduction ====
 De-identification is the masking, manipulation or removal of personal data with the aim of making individuals in a dataset less easy to identify. It is especially important when you want to share, publish or archive your dataset, but it can also help protect your participants' privacy in case of a [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/data-protection/data-leak|data leak]] during your research. During the different phases of your research, you should determine whether it is possible to de-identify your dataset while also keeping in mind its usability.  De-identification is the masking, manipulation or removal of personal data with the aim of making individuals in a dataset less easy to identify. It is especially important when you want to share, publish or archive your dataset, but it can also help protect your participants' privacy in case of a [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/data-protection/data-leak|data leak]] during your research. During the different phases of your research, you should determine whether it is possible to de-identify your dataset while also keeping in mind its usability. 
Line 18: Line 18:
 **Warning:** de-identification does not equal anonymization. Even if all [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/gdpr-research/essential-concepts|direct identifiers]] and your pseudonymization key have been replaced or removed, it might still be possible to re-identify some data subjects in your data because, in combination, certain attributes (e.g., combination of height, job occupation and location of data collection) may single out an individual. **Warning:** de-identification does not equal anonymization. Even if all [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/gdpr-research/essential-concepts|direct identifiers]] and your pseudonymization key have been replaced or removed, it might still be possible to re-identify some data subjects in your data because, in combination, certain attributes (e.g., combination of height, job occupation and location of data collection) may single out an individual.
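The warning above can be illustrated with a small check for records that are singled out by a combination of attributes. This is a minimal sketch, not a complete re-identification risk assessment: the function name `singled_out`, the threshold `k` and the sample records are all hypothetical and only serve as an illustration.

```python
from collections import Counter


def singled_out(records, quasi_identifiers, k=2):
    """Return the records whose combination of quasi-identifier
    values occurs fewer than k times in the dataset."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


# Hypothetical, already pseudonymized records (no direct identifiers).
participants = [
    {"id": "P01", "height_cm": 180, "occupation": "teacher", "location": "Groningen"},
    {"id": "P02", "height_cm": 180, "occupation": "teacher", "location": "Groningen"},
    {"id": "P03", "height_cm": 201, "occupation": "judge", "location": "Leeuwarden"},
]

at_risk = singled_out(participants, ["height_cm", "occupation", "location"])
print([r["id"] for r in at_risk])  # → ['P03']
```

Here P03 has no direct identifiers, yet the combination of height, occupation and location is unique in the dataset, so this participant may still be re-identifiable.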
 ---- ----
-Table 1: Example of five levels of Pseudonymization/Anonymization +**Table 1:** De-identification matrix adapted from [[https://lcrdm.nl/wp-content/uploads/2023/03/LCRDM-Risk-management-for-research-data-about-people.pdf|LCRDM (2019)]]. This matrix is an example of what de-identification and anonymization could look like in research. The identifiability of your data largely depends on the context of your research and only partly on the variables you collected. For example, the variable judge could be more identifiable for a person living in Leeuwarden than for a person living in Amsterdam, because more judges live in Amsterdam. 
 ---- ----
-{{:dcc:pdpsol:de-identification:pseudmatrix.png?direct&400|}} +{{:dcc:pdpsol:de-identification:de-identification_matrix.png?direct&800|}} 
----- + 
-Adapted from the DUtch coordinatoion point for research data management ([[https://lcrdm.nl/wp-content/uploads/2023/03/LCRDM-Risk-management-for-research-data-about-people.pdf|LCRDM, 2019]]) by [[https://journals.sagepub.com/doi/10.1177/25152459251336130| Van Ravenzwaaij et al, 2025]]. + 
 ==== General de-identification techniques ==== ==== General de-identification techniques ====
-There are several techniques that can help you make your dataset less identifiable. You can apply these techniques during different phases of your research:+Use the de-identification techniques outlined below to reduce the identifiability of your dataset. Be aware that these techniques often affect its analytical value. Therefore, always make sure to document the way you transformed your data.
 + 
 +You can apply these techniques during different phases of your research:
  
-  * After data collection to protect participants when analyzing their data+  * After data collection, to protect participants when analyzing their data
   * Before sharing data with collaborators or other third parties   * Before sharing data with collaborators or other third parties
   * Before archiving data   * Before archiving data
   * Before publishing data (with access restrictions)   * Before publishing data (with access restrictions)
  
-Be aware that these techniques often affect its analytical value. Therefore, always make sure to document the way you transformed your data.  +
  
  
Line 54: Line 56:
  
 === Aggregation & generalization === === Aggregation & generalization ===
-Reduce the level of detail of your dataset by generalizing variables, which makes it harder to identify individual subjects. This can be applied to both quantitative and qualitative datasets. For example, changing addresses in the neighbourhood or city, and changing birth date or age into an age group. +Reduce the level of detail of your dataset by generalizing variables, which makes it harder to identify individual subjects. This can be applied to both quantitative and qualitative datasets. For example, changing addresses into neighbourhood or city, and changing birth date or age into an age group. 
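The generalization described above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions: the helper names `age_to_group` and `generalize`, the ten-year bin width and the record layout are hypothetical, and the right level of detail depends on your own dataset and research context.

```python
def age_to_group(age, width=10):
    """Map an exact age to an age group label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize(record):
    """Keep only the generalized variables: age group and city.
    The exact age and street address are dropped."""
    return {
        "age_group": age_to_group(record["age"]),
        "city": record["city"],
    }


# Hypothetical record with a detailed address and exact age.
raw = {"age": 34, "street": "Example Street 1", "city": "Groningen"}
print(generalize(raw))  # → {'age_group': '30-39', 'city': 'Groningen'}
```

Wider age groups or larger geographical units reduce identifiability further, at the cost of analytical detail, so document the bin width and the variables you dropped.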
  
 ---- ----
Line 80: Line 82:
   * Educate students on how to preprocess and analyze sensitive data without exposing information about real individuals.   * Educate students on how to preprocess and analyze sensitive data without exposing information about real individuals.
  
-[[https://www.youtube.com/watch?v=Im0jqBVRJgI&t=10s|Watch this video for an accessible introduction to synthetic data]]+Two examples of synthetic data publications: 
 +  * [[https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HNA8QQ|How Do Psychological and Physiological Performance Determinants Interact Within Individual Athletes? An Analytical Network Approach]] (Neumann et al., 2024)
 +  * [[https://doi.org/10.17605/osf.io/eqbd3|De-identification when making datasets FAIR: Two worked examples from the behavioral and social sciences.]] (van Ravenzwaaij et al., 2025)
  
 +[[https://www.youtube.com/watch?v=Im0jqBVRJgI&t=10s|Watch this video for an accessible introduction to synthetic data]]
 ---- ----
  
Line 118: Line 123:
 ++++ ++++
 ++++ (Click) Acoustic de-identification | ++++ (Click) Acoustic de-identification |
-If the research extends beyond the textual content and transcript analysis alone is insufficient, additional de-identification measures may be considered. In such cases, parts of the audio data can be modified to protect the identity of your participants. For audio recordings, editing software such as [[https://nl.wikipedia.org/wiki/Audacity|Audacity]] can be used to alter or distort voices or to mask personal information (e.g., by muting or inserting bleeps). Be aware that applying these techniques can be time-consuming and can also heavily impact the usability of the data. +If the research extends beyond the textual content and transcript analysis alone is insufficient, additional de-identification measures may be considered. In such cases, parts of the audio data can be modified to protect the identity of your participants. For audio recordings, editing software such as [[https://nl.wikipedia.org/wiki/Audacity|Audacity]] and [[https://www.fon.hum.uva.nl/praat/|Praat]] can be used to alter or distort voices or to mask personal information (e.g., by muting or inserting bleeps). Be aware that applying these techniques can be time-consuming and can also heavily impact the usability of the data if you are interested in speech features.
 ++++ ++++