Differences
This shows you the differences between two versions of the page.
| dcc:pdpsol:de-identification [2026/03/16 12:09] – improve introduction text marlon | dcc:pdpsol:de-identification [2026/03/23 14:09] (current) – add go back to P&DP home page marlon | ||
|---|---|---|---|
| Line 19: | Line 19: | ||
| ==== General de-identification techniques ==== | ==== General de-identification techniques ==== | ||
| - | There are several techniques that can help you make your dataset less identifiable. You can apply these techniques during different phases | + | There are several techniques that can help you make your dataset less identifiable. You can apply these techniques during different phases |
| * After data collection to protect participants when analyzing their data | * After data collection to protect participants when analyzing their data | ||
| Line 35: | Line 35: | ||
| === Removing or suppressing === | === Removing or suppressing === | ||
| - | Consider whether | + | When personal data is not particularly relevant for your research or highly sensitive, |
| * Remove variables that reveal rare personal attributes. | * Remove variables that reveal rare personal attributes. | ||
| - | * Remove direct identifiers, | + | * Remove direct identifiers, |
| * Use restricted access to your data and only provide those variables to researchers that are necessary to answer their research question. | * Use restricted access to your data and only provide those variables to researchers that are necessary to answer their research question. | ||
| Line 76: | Line 76: | ||
| * Access relevant data without compromising the privacy or safety of data subjects. | * Access relevant data without compromising the privacy or safety of data subjects. | ||
| * Evaluate whether the dataset suits their research needs and begin developing code, refining models, and testing hypotheses. | * Evaluate whether the dataset suits their research needs and begin developing code, refining models, and testing hypotheses. | ||
| - | * Educate students on how to preprocess and analyze sensitive data—without exposing information about real individuals. | + | * Educate students on how to preprocess and analyze sensitive data without exposing information about real individuals. |
| [[https:// | [[https:// | ||
| Line 88: | Line 88: | ||
| ==== Research-specific de-identification techniques ==== | ==== Research-specific de-identification techniques ==== | ||
| === Video data === | === Video data === | ||
| + | Researchers use video to record real-world behavior, interactions, | ||
| ++++ Face and body masking |[[https:// | ++++ Face and body masking |[[https:// | ||
| ++++ | ++++ | ||
| ++++ Metadata de-identification | | ++++ Metadata de-identification | | ||
| - | Even after de-identifying video data to the extent that it's unrecognizable to people or machines, metadata | + | Even after de-identifying video data so it's unrecognizable to people or machines, metadata, such as timestamps or location tags, can still indirectly reveal participants’ identities. |
| To protect participant privacy, always remove or mask the following metadata: | To protect participant privacy, always remove or mask the following metadata: | ||
| - | * location | + | * Location |
| * Network identifiers (e.g. IP addresses) | * Network identifiers (e.g. IP addresses) | ||
| * Device or user IDs (e.g. serial numbers or account IDs) | * Device or user IDs (e.g. serial numbers or account IDs) | ||
| - | |||
| ++++ | ++++ | ||
| Line 104: | Line 104: | ||
| === Audio data === | === Audio data === | ||
| + | Audio recordings are typically collected to capture exactly what participants say during interviews or focus groups, or to study voice patterns. Audio data itself can contain identifying information: | ||
| ++++ Transcription | | ++++ Transcription | | ||
| - | Audio recordings | + | A common step in research to make audio data suitable for analysis |
| ---- | ---- | ||
| Line 114: | Line 115: | ||
| * If an automated tool is preferred, researchers can use the [[..: | * If an automated tool is preferred, researchers can use the [[..: | ||
| - | After transcription, | + | After transcription, |
| ++++ | ++++ | ||
| ++++ Acoustic de-identification | | ++++ Acoustic de-identification | | ||
| + | If the research extends beyond the textual content and transcript analysis alone is insufficient, | ||
| ++++ | ++++ | ||
| ++++ Metadata de-identification | | ++++ Metadata de-identification | | ||
| - | Even after de-identifying audio data to the extent that it's unrecognizable to people or machines, metadata | + | Even after de-identifying audio data so it's unrecognizable to people or machines, metadata, such as timestamps or location tags, can still indirectly reveal participants’ identities. |
| To protect participant privacy, always remove or mask the following metadata: | To protect participant privacy, always remove or mask the following metadata: | ||
| * Location data (e.g. GPS coordinates) | * Location data (e.g. GPS coordinates) | ||
| * Network identifiers (e.g. IP addresses) | * Network identifiers (e.g. IP addresses) | ||
| * Device or user IDs (e.g. serial numbers or account IDs) | * Device or user IDs (e.g. serial numbers or account IDs) | ||
| - | |||
| ++++ | ++++ | ||
| + | ---- | ||
| + | [[dcc: | ||
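
The metadata lists in the video and audio sections above (location data, network identifiers, device or user IDs) could be handled along the lines of the following minimal Python sketch. It is an illustration only, not the page's recommended tooling: the field names in `SENSITIVE_KEYS` and the helpers `strip_metadata` and `mask_ip` are hypothetical, and real recordings will store metadata under tool-specific keys that need to be checked per file format.

```python
import ipaddress

# Hypothetical metadata field names (assumption): adapt these to the keys
# your recording device or export tool actually writes.
SENSITIVE_KEYS = {
    "gps_latitude",
    "gps_longitude",
    "ip_address",
    "device_serial",
    "account_id",
}


def strip_metadata(metadata: dict) -> dict:
    """Return a copy of the metadata without the sensitive fields."""
    return {
        key: value
        for key, value in metadata.items()
        if key.lower() not in SENSITIVE_KEYS
    }


def mask_ip(address: str) -> str:
    """Mask an IP address by zeroing its host part.

    Keeps the first 24 bits of an IPv4 address and the first 48 bits of an
    IPv6 address; these prefix lengths are an illustrative choice.
    """
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    record = {
        "duration_s": 312,
        "GPS_Latitude": 52.09,
        "ip_address": "192.168.12.34",
        "device_serial": "SN-0001",
    }
    print(strip_metadata(record))
    print(mask_ip("192.168.12.34"))
```

Removing a field outright is the safer default; masking (as in `mask_ip`) only makes sense when a coarse version of the value is still needed for the analysis.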