Archiving and publishing human subject data [CIT Research Documentation]

Your project is nearing completion. It is time to prepare your data for archiving and publishing in accordance with the FAIR principles, making your data "as open as possible and as closed as necessary". When research involves human participants, there is a tension between protecting their privacy and meeting expectations to archive and publish data so that others can verify and reuse your work. Navigating this tension requires careful planning and thoughtful decisions: you put safeguards in place that protect participants while still allowing responsible access for future research. The sections below guide you through this process.

Select the data you will archive with two goals in mind:

Select and organize the data and other materials that are needed to validate your findings in line with the research data policy of your faculty or institute;
Select and organize the data and other materials that are potentially valuable for further research by you, your team, or fellow researchers.

Often, it is not necessary to keep all collected data:

Limit the (personal) data and materials you archive to those needed to verify your research. Follow the procedures in the destruction protocol(s) that you designed, and add these protocol(s) to your data package, publication package, or archive. For example, anonymized consent forms can be archived, while consent forms containing personal data should be de-identified or destroyed.
De-identify data before publishing, while also keeping in mind the usability of your dataset.

FAIR data does not necessarily mean that all your data and materials need to be openly available. Even after de-identification, there can be good reasons to restrict access to your data. The objective is to have data as open as possible, and as closed and protected as necessary.

Apply a ‘layered’ approach to your (de-identified) files by classifying them according to their level of sensitivity.

Level 1: contains no personal data

Publish your (anonymized) dataset and supporting materials in a recognized data repository such as DataverseNL, on the condition that no other reasons for restricting access apply. Allow for reuse by adding a license (for instance, a Creative Commons license) and use the persistent identifier (e.g., DOI) for data citation. If the data are anonymized human subject data, make sure that the terms of use align with the informed consent.

Level 2: contains personal data in de-identified form (not anonymized)

Publish your de-identified dataset and supporting materials on DataverseNL, under restricted access. Determine the terms of use for external parties that would like to reuse your data. Creative Commons licenses are not suitable for data containing personal data with access restrictions. The UG DCC can assist in developing a procedure for making these data available for reuse under well-defined conditions. Make sure that these conditions align with the informed consent.

Level 3: contains sensitive personal data

When your data still contain highly sensitive information, do not publish them in a data repository, either openly or under access controls. Instead, archive your data in accordance with the research data policy of your faculty or institute. The UG DCC can assist in developing a procedure for making these sensitive data available for reuse under well-defined conditions. Make sure that these conditions are in line with the informed consent.

If your dataset contains sensitive personal data, you can still publish the supporting materials on DataverseNL. Via your DataverseNL page, you can also inform researchers who want to reuse your data about the procedure to request the data.
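The layered approach above can be sketched as a small lookup, which may help when you score many files at once. This is a minimal illustration, not an official UG tool: the names `Sensitivity`, `Route`, `classify`, and `route_for` are hypothetical, and the two yes/no questions are a simplification of the judgment you make per file.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """The three sensitivity levels described above (names are illustrative)."""
    NO_PERSONAL_DATA = 1       # Level 1
    DEIDENTIFIED_PERSONAL = 2  # Level 2: de-identified, not anonymized
    SENSITIVE_PERSONAL = 3     # Level 3


@dataclass(frozen=True)
class Route:
    location: str              # where the files go
    access: str                # how others can obtain them
    creative_commons_ok: bool  # whether a Creative Commons license is suitable


# Summary of the guidance per level. Level 1 is only open when no other
# reasons for restricting access apply; that check is left to the researcher.
ROUTES = {
    Sensitivity.NO_PERSONAL_DATA: Route(
        "data repository (e.g. DataverseNL)", "open", True),
    Sensitivity.DEIDENTIFIED_PERSONAL: Route(
        "data repository (e.g. DataverseNL)", "restricted", False),
    Sensitivity.SENSITIVE_PERSONAL: Route(
        "faculty or institute archive", "on request, via an agreed procedure", False),
}


def classify(contains_personal_data: bool, contains_sensitive_data: bool) -> Sensitivity:
    """Assign a level from two simplified yes/no questions (an assumption)."""
    if contains_sensitive_data:
        return Sensitivity.SENSITIVE_PERSONAL
    if contains_personal_data:
        return Sensitivity.DEIDENTIFIED_PERSONAL
    return Sensitivity.NO_PERSONAL_DATA


def route_for(level: Sensitivity) -> Route:
    """Look up the archiving or publishing route for a level."""
    return ROUTES[level]


if __name__ == "__main__":
    level = classify(contains_personal_data=True, contains_sensitive_data=False)
    print(level.name, route_for(level))
```

In every case, the terms of use attached to a route still have to align with the informed consent; the lookup does not check that for you.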

→ Corpus PINO: A spoken language resource for multiple simultaneous comparisons. (Cristiano et al., 2024)

“Corpus PINO is a resource designed for research on different styles of spoken Italian and Neapolitan dialect. The corpus consists of [de-identified] audio recordings and ELAN time-aligned orthographic transcriptions involving fifty participants (stratified by age, gender, and education level). …. PINO is a contribution to the preservation of the local cultural heritage and of a minority language, i.e., an Italo-Romance dialect. It attests the lives, memories, opinions, traditions, practices, and attitudes of fifty members of this community.”

Score the sensitivity of your data and supporting materials

Corpus PINO contains data and materials that fall under the three different sensitivity levels.

Level 1 (open): Materials used during fieldwork (forms, stimuli, questionnaire, tables, etc.)
Level 2 (restricted): Transcriptions of all the activities the speakers carried out, organized by speaker
Level 3 (restricted on UG premises): De-identified audio data
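The Corpus PINO scoring can also be recorded as a simple manifest, so that it is clear at a glance which materials go where. The sketch below only restates the three entries listed above; the tuple layout and the helper names `group_by_level` and `openly_publishable` are hypothetical.

```python
from collections import defaultdict

# (material, sensitivity level, access) as scored for Corpus PINO above
PINO_MATERIALS = [
    ("Fieldwork materials (forms, stimuli, questionnaire, tables)", 1, "open"),
    ("Transcriptions, organized by speaker", 2, "restricted"),
    ("De-identified audio data", 3, "restricted on UG premises"),
]


def group_by_level(materials):
    """Group (name, access) pairs under their sensitivity level."""
    grouped = defaultdict(list)
    for name, level, access in materials:
        grouped[level].append((name, access))
    return dict(grouped)


def openly_publishable(materials):
    """Return the names of materials scored as level 1."""
    return [name for name, level, _ in materials if level == 1]


if __name__ == "__main__":
    for level, items in sorted(group_by_level(PINO_MATERIALS).items()):
        for name, access in items:
            print(f"Level {level} ({access}): {name}")
```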

Define the terms of use

Given that Creative Commons licenses are not suitable for datasets containing personal data with access restrictions, custom terms of use are needed. These largely depend on the consent given by the participants and the degree of de-identification:

Corpus PINO terms of use: "This data can be accessed and reused by researchers affiliated with universities or non-profit, non-commercial organizations in the fields of linguistics, semiotics, sociology, anthropology, and affiliated fields. Due to their increased re-identification potential, the audio files in the corpus shall be facilitated for linguistic and relevant discipline-specific research where analyzing the audio content is pertinent. In these cases, the signing of a data transfer agreement is necessary."

Data transfer agreement

When an external party requests level 3 or, in some cases, level 2 data, a data transfer agreement needs to be signed. A data transfer agreement is a legal contract that defines the specific purposes for which the data may be used by the requesting party. As such, it is the most comprehensive specification of terms of use. The data transfer agreement also describes the rights and obligations of both parties involved and sets out the measures for data protection. The UG has its own model data transfer agreement that can be tailored for the dataset of your research project.

→ Refer to the DCC website for more information on legal agreements
