Differences

This shows you the differences between two versions of the page.

dcc:pdpsol:de-identification [2026/03/12 07:32] marlon
dcc:pdpsol:de-identification [2026/03/23 14:09] (current) – add go back to P&DP home page marlon
Line 2: Line 2:
 ===== De-identification ===== ===== De-identification =====
 ==== Introduction ==== ==== Introduction ====
-De-identification is the masking, manipulation or removal of personal data with the aim to make individuals in a dataset less easy to identify. It is especially important when you want to share, publish or archive your dataset. Before sharing, publishing or archiving your data, you should determine whether it is possible to de-identify your dataset, while also keeping in mind its usability.
+De-identification is the masking, manipulation or removal of personal data with the aim to make individuals in a dataset less easy to identify. It is especially important when you want to share, publish or archive your dataset, but it can also help protect your participants' privacy in case of a [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/data-protection/data-leak|data leak]] during your research. During the different phases of your research, you should determine whether it is possible to de-identify your dataset, while also keeping in mind its usability.
  
 ==== Anonymization versus pseudonymization ==== ==== Anonymization versus pseudonymization ====
Line 19: Line 19:
  
 ==== General de-identification techniques ==== ==== General de-identification techniques ====
-There are several techniques that can help you make your dataset less identifiable. You can apply these techniques during different phases in your research:
+There are several techniques that can help you make your dataset less identifiable. You can apply these techniques during different phases of your research:
  
   * After data collection to protect participants when analyzing their data   * After data collection to protect participants when analyzing their data
Line 35: Line 35:
  
 === Removing or suppressing === === Removing or suppressing ===
-Consider whether you can remove or suppress sensitive elements.
+When personal data is highly sensitive or not particularly relevant for your research, you can consider removing or suppressing these elements.
   * Remove variables that reveal rare personal attributes.   * Remove variables that reveal rare personal attributes.
-  * Remove direct identifiers, such as Patient ID.
+  * Remove direct identifiers, such as names or Patient ID.
   * Use restricted access to your data and only provide those variables to researchers that are necessary to answer their research question.   * Use restricted access to your data and only provide those variables to researchers that are necessary to answer their research question.
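
As a minimal sketch of removing direct identifiers, the Python example below copies a CSV file while leaving out selected columns. The function name and the column names (name, patient_id) are hypothetical and only serve as an illustration; adapt them to your own dataset.

```python
import csv
import io


def deidentify_csv(src, dst, columns_to_drop):
    """Copy CSV data from src to dst (file-like objects) without the listed columns."""
    drop = set(columns_to_drop)
    reader = csv.DictReader(src)
    # Keep the original column order, minus the identifier columns.
    kept = [name for name in reader.fieldnames if name not in drop]
    # extrasaction="ignore" makes the writer skip the dropped columns in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Usage with in-memory data; real files opened with open(..., newline="") work the same way.
source = io.StringIO("name,patient_id,age\nAnna,P001,34\n")
target = io.StringIO()
deidentify_csv(source, target, ["name", "patient_id"])
```

Dropping columns only removes direct identifiers. Combinations of the remaining variables can still single out a participant, so the other techniques on this page may still be needed.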
  
Line 76: Line 76:
   * Access relevant data without compromising the privacy or safety of data subjects.   * Access relevant data without compromising the privacy or safety of data subjects.
   * Evaluate whether the dataset suits their research needs and begin developing code, refining models, and testing hypotheses.   * Evaluate whether the dataset suits their research needs and begin developing code, refining models, and testing hypotheses.
-  * Educate students on how to preprocess and analyze sensitive data, without exposing information about real individuals.
+  * Educate students on how to preprocess and analyze sensitive data without exposing information about real individuals.
  
 [[https://www.youtube.com/watch?v=Im0jqBVRJgI&t=10s|Watch this video for an accessible introduction to synthetic data]] [[https://www.youtube.com/watch?v=Im0jqBVRJgI&t=10s|Watch this video for an accessible introduction to synthetic data]]
Line 88: Line 88:
 ==== Research specific de-identification techniques ====  ==== Research specific de-identification techniques ==== 
 === Video data === === Video data ===
 +Researchers use video to record real-world behavior, interactions, or experiments in detail, for example, tracking how people move, communicate, or perform tasks over time. It is important to de-identify this type of data, because videos can easily reveal faces, voices, or surroundings, and leaving those visible can reveal participants’ identities.
  
 ++++ Face and body masking |[[https://github.com/MaskAnyone/MaskAnyone|MaskAnyone]] is a de-identification toolbox for videos that allows you to remove personally identifiable information from videos, while at the same time preserving utility. It provides a variety of algorithms that allow you to de-identify or even anonymize videos (video & audio).  ++++ Face and body masking |[[https://github.com/MaskAnyone/MaskAnyone|MaskAnyone]] is a de-identification toolbox for videos that allows you to remove personally identifiable information from videos, while at the same time preserving utility. It provides a variety of algorithms that allow you to de-identify or even anonymize videos (video & audio). 
 ++++  ++++ 
 ++++ Metadata de-identification | ++++ Metadata de-identification |
-Even after de-identifying video data to the extent that it's unrecognizable to people or machines, metadata of the file, such as timestamps or location tags, can still indirectly reveal participants’ identities.
+Even after de-identifying video data so it's unrecognizable to people or machines, metadata, such as timestamps or location tags, can still indirectly reveal participants’ identities.
 To protect participant privacy, always remove or mask the following metadata:  To protect participant privacy, always remove or mask the following metadata: 
-  * location data (e.g. GPS coordinates)
+  * Location data (e.g. GPS coordinates)
   * Network identifiers (e.g. IP addresses)   * Network identifiers (e.g. IP addresses)
   * Device or user IDs (e.g. serial numbers, or account IDs)   * Device or user IDs (e.g. serial numbers, or account IDs)
- 
  
 ++++  ++++ 
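
One possible way to strip file-level metadata from a recording is sketched below, assuming the command-line tool ffmpeg is installed on your machine. The helper names are hypothetical. The ffmpeg options used are ''-map_metadata -1'' (do not copy global metadata to the output) and ''-c copy'' (copy the audio and video streams without re-encoding). This is an illustration rather than a complete solution: some containers keep stream-level tags, so always inspect the output file before sharing it.

```python
import subprocess


def build_strip_metadata_command(src_path, dst_path):
    """Build an ffmpeg command that copies the streams but drops global metadata."""
    return [
        "ffmpeg",
        "-i", src_path,          # input recording
        "-map_metadata", "-1",   # do not carry over global metadata (e.g. creation time, location tags)
        "-c", "copy",            # copy streams as-is, no re-encoding
        dst_path,
    ]


def strip_metadata(src_path, dst_path):
    """Run the command; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_strip_metadata_command(src_path, dst_path), check=True)
```

The same command can be applied to audio files, since ffmpeg handles both.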
Line 104: Line 104:
  
 === Audio data === === Audio data ===
 +Audio recordings are typically collected to capture exactly what participants say during interviews or focus groups, or to study voice patterns. Audio data itself can contain identifying information: participants may be recognizable from their voice by other people, and modern speech recognition technologies can also be used to identify participants. For this reason, audio data should be de-identified before further use or sharing.
  
 ++++ Transcription | ++++ Transcription |
-Audio recordings in research are typically collected to capture exactly what participants say during interviews or focus groups. However, audio data itself can contain identifying information. Participants may be recognizable from their voice by other people, and modern speech recognition technologies can also be used to identify participants. For this reason, audio data should be de-identified before further use or sharing. A common approach is to convert the recordings into written transcripts and then work only with the text data. Transcription removes the direct voice signal that could reveal the speaker’s identity.
+A common step in research to make audio data suitable for analysis is to convert the recordings into written transcripts and then work only with the text data. Transcription removes the direct voice signal that could reveal the speaker’s identity.
  
 ---- ----
Line 114: Line 115:
   * If an automated tool is preferred, researchers can use the [[..:itsol:whisper:|institutional instance of Whisper]] for transcription. **Warning:** Do not use commercial software for transcription without the right [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/data-protection/protocols-agreements|legal agreements]] in place.    * If an automated tool is preferred, researchers can use the [[..:itsol:whisper:|institutional instance of Whisper]] for transcription. **Warning:** Do not use commercial software for transcription without the right [[https://www.rug.nl/digital-competence-centre/privacy-and-data-protection/data-protection/protocols-agreements|legal agreements]] in place.
  
-After transcription, the resulting text should still be reviewed to remove any remaining identifying information (such as names, locations, or other personal details) before the data is used for analysis or distribution.
+After transcription, the resulting text should still be reviewed to remove any remaining identifying information (such as names, locations, or other personal details) before the data is used for analysis or distribution. Refer to [[https://dmeg.cessda.eu/Data-Management-Expert-Guide/5.-Protect/Anonymisation|the Data Management Expert Guide of CESSDA]] for guidance on how to de-identify your transcript, including a practical case study example.
 ++++ ++++
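
Part of the transcript review can be supported by a script. The sketch below replaces a list of known names and places with placeholders. The function name and the placeholder labels are hypothetical, and the approach only catches the terms you list yourself, so a manual read-through of the transcript remains necessary.

```python
import re


def pseudonymize(text, replacements):
    """Replace each listed term with its placeholder, matching whole words only."""
    # Handle longer terms first, so a full name is replaced before a first name alone.
    for original in sorted(replacements, key=len, reverse=True):
        pattern = r"\b" + re.escape(original) + r"\b"
        placeholder = replacements[original]
        # A function is used as replacement so the placeholder is inserted literally.
        text = re.sub(pattern, lambda _match, p=placeholder: p, text, flags=re.IGNORECASE)
    return text


transcript = "Anna de Vries lives in Groningen. Anna said so."
cleaned = pseudonymize(
    transcript,
    {"Anna de Vries": "[PERSON_1]", "Anna": "[PERSON_1]", "Groningen": "[CITY]"},
)
```

Using the same placeholder for every mention of one participant keeps the transcript readable for analysis while hiding the name itself.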
 ++++ Acoustic de-identification | ++++ Acoustic de-identification |
 +If the research extends beyond the textual content and transcript analysis alone is insufficient, additional de-identification measures may be considered. In such cases, parts of the audio data can be modified to protect the identity of your participants. For audio recordings, editing software such as [[https://nl.wikipedia.org/wiki/Audacity|Audacity]] can be used to alter or distort voices or to mask personal information (e.g., by muting or inserting bleeps). Be aware that applying these techniques can be time-consuming and can also heavily impact the usability of the data.
 ++++ ++++
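
Muting a passage can also be scripted. The sketch below is not how Audacity works; it is a scripted alternative that uses Python's standard ''wave'' module and assumes an uncompressed PCM WAV file. The function name is hypothetical. It writes a copy of the recording in which the samples between two time points are replaced by silence.

```python
import wave


def mute_segment(src_path, dst_path, start_s, end_s):
    """Write a copy of a PCM WAV file with the audio between start_s and end_s silenced."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = bytearray(src.readframes(params.nframes))

    frame_size = params.sampwidth * params.nchannels  # bytes per frame (all channels)
    total_frames = len(frames) // frame_size
    first = max(0, int(start_s * params.framerate))
    last = min(total_frames, int(end_s * params.framerate))

    # 8-bit WAV samples are unsigned (silence = 128); wider samples are signed (silence = 0).
    silence = b"\x80" if params.sampwidth == 1 else b"\x00"
    if last > first:
        frames[first * frame_size:last * frame_size] = silence * ((last - first) * frame_size)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(bytes(frames))
```

For example, ''mute_segment("interview.wav", "interview_muted.wav", 12.0, 15.5)'' would silence the passage from 12 to 15.5 seconds, where both file names are placeholders. Keep the original recording in secure storage and share only the edited copy.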
  
 ++++ Metadata de-identification | ++++ Metadata de-identification |
-Even after de-identifying audio data to the extent that it's unrecognizable to people or machines, metadata metadata of the file, such as timestamps or location tags, can still indirectly reveal participants’ identities.
+Even after de-identifying audio data so it's unrecognizable to people or machines, metadata, such as timestamps or location tags, can still indirectly reveal participants’ identities.
 To protect participant privacy, always remove or mask the following metadata:  To protect participant privacy, always remove or mask the following metadata: 
   * Location data (e.g. GPS coordinates)   * Location data (e.g. GPS coordinates)
   * Network identifiers (e.g. IP addresses)   * Network identifiers (e.g. IP addresses)
   * Device or user IDs (e.g. serial numbers, or account IDs)   * Device or user IDs (e.g. serial numbers, or account IDs)
- 
 ++++  ++++ 
  
 +----
 +[[dcc:pdpsol:start | → Go back to the Privacy & Data protection home page]]