Archiving and Publishing Human Subject Data

Your project is nearing completion. It is time to prepare your data for archiving and publishing in accordance with the FAIR principles, making your data “as open as possible and as closed as necessary”. When research involves human participants, there is a tension between protecting the privacy of your participants and meeting expectations to archive and publish data so that others can verify and reuse your work. Balancing these interests requires careful planning and thoughtful decisions: you put safeguards in place that protect participants, while still allowing responsible access for future research. You can use the sections below to guide you in this process.

Check whether you can select a subset of your data, keeping the two goals of archiving in mind:

Select and organize the data and other materials that are needed to validate your findings;
Select and organize the data and other materials that are potentially valuable for further research by you, your team, or fellow researchers.

Often it is not necessary to keep all collected data for the purpose of validating your findings or for researchers to reuse your data.

Limit the (personal) data and materials you archive to those you need for verification of your research. Follow the procedures in the destruction protocol(s) that you designed, and add these protocol(s) to your data package, publication package or archive. For example, anonymised consent forms can be archived, while consent forms containing personal data should be de-identified or destroyed in accordance with the UG protocol.
Determine whether it is possible to de-identify before publishing, while also keeping in mind the usability of your dataset.
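
As a minimal sketch of what de-identifying tabular data could look like, the Python below drops direct identifiers and replaces participant IDs with keyed-hash pseudonyms. The column names, the example records and the key are hypothetical and only serve as illustration; they are not part of any UG protocol. Note that data treated this way is pseudonymized, not anonymized: indirect identifiers (such as age or location) may still need to be generalized or removed, and the key must be stored separately from the data.

```python
import hashlib
import hmac

# Hypothetical column names; adapt these to your own dataset.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonym(participant_id: str, secret_key: bytes) -> str:
    """Return a stable keyed-hash pseudonym for a participant ID."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return "P" + digest.hexdigest()[:10]


def deidentify(records, secret_key):
    """Drop direct identifiers and replace participant IDs with pseudonyms."""
    cleaned = []
    for record in records:
        # Keep every column that is not a direct identifier.
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row["participant_id"] = pseudonym(str(record["participant_id"]), secret_key)
        cleaned.append(row)
    return cleaned


# Illustrative records only; no real participant data.
records = [
    {"participant_id": "001", "name": "A. Example", "email": "a@example.org",
     "age_group": "30-39", "score": 12},
    {"participant_id": "002", "name": "B. Example", "email": "b@example.org",
     "age_group": "40-49", "score": 17},
]

# The key is an assumption of this sketch: keep it outside the archived package.
deidentified = deidentify(records, secret_key=b"store-this-key-separately")
```

Using a keyed hash rather than a plain hash means that someone who knows the original ID format cannot recompute the pseudonyms without the key, while the same participant still receives the same pseudonym across files.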

FAIR data does not necessarily mean that all your data and materials need to be openly available. Even after de-identification, there can be good reasons to restrict access to your data. The objective is to have data as open as possible, and as closed and protected as necessary.

Consider applying a ‘layered’ approach to your (de-identified) files by scoring your files in terms of sensitivity.

Level 1 (contains no personal data): Publish your anonymized dataset and supporting materials in a recognized data repository such as DataverseNL, provided that no other reasons for restricting access apply. Allow for reuse by adding a license (for instance, a Creative Commons license) and use the persistent identifier (e.g., DOI) for data citation.

Level 2 (contains personal data in de-identified form, not anonymized): Publish your de-identified dataset and supporting materials on DataverseNL, under restricted access. Determine the terms of access and use for external parties that would like to reuse your data. Make sure that these terms of access align with the informed consent.

Level 3 (contains sensitive personal data): When your data still contains highly sensitive information, do not publish it in a data repository, whether openly or under access controls. Instead, archive your data in accordance with the research data policy of your faculty or institute. The UG DCC can assist in developing a procedure for making these sensitive data available for reuse under well-defined conditions. Make sure that these conditions are in line with the informed consent.

If your dataset contains sensitive personal data, you can still publish the supporting materials on DataverseNL. Via your DataverseNL page, you can also inform researchers who want to reuse your data about the procedure for requesting it.

→Corpus PINO: A spoken language resource for multiple simultaneous comparisons. (Cristiano et al., 2024)

“Corpus PINO is a resource designed for research on different styles of spoken Italian and Neapolitan dialect. The corpus consists of anonymized audio recordings and ELAN time-aligned orthographic transcriptions involving fifty participants (stratified by age, gender, and education level). …. PINO is a contribution to the preservation of the local cultural heritage and of a minority language, i.e., an italo-romance dialect. It attests the lives, memories, opinions, traditions, practices, attitudes of fifty members of this community.”

Score the sensitivity of your data and supporting materials

Corpus PINO contains data and materials that fall under the three different sensitivity levels.

Level 1 (open): Materials used during fieldwork (forms, stimuli, questionnaire, tables, etc.)
Level 2 (restricted): Transcriptions of all the activities the speakers carried out, organized by speaker
Level 3 (restricted on UG premises): De-identified audio data
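
The scoring step can be sketched in Python as a small inventory of materials, each assigned a level. The sketch assumes that Level 1 means no personal data, Level 2 means de-identified personal data, and Level 3 means sensitive personal data. The two boolean flags and their values are hypothetical simplifications for illustration; in practice, scoring is a judgment made together with a privacy or data steward, not an automated rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    OPEN = 1                 # no personal data
    RESTRICTED = 2           # de-identified personal data
    RESTRICTED_ON_SITE = 3   # sensitive personal data


# Access mode per level, following the Corpus PINO example.
ACCESS = {
    Level.OPEN: "open",
    Level.RESTRICTED: "restricted",
    Level.RESTRICTED_ON_SITE: "restricted on UG premises",
}


@dataclass
class Material:
    name: str
    has_personal_data: bool  # hypothetical flag, set by the researcher
    is_sensitive: bool       # hypothetical flag, e.g. high re-identification risk


def score(material: Material) -> Level:
    """Assign the most restrictive level that applies to a material."""
    if material.is_sensitive:
        return Level.RESTRICTED_ON_SITE
    if material.has_personal_data:
        return Level.RESTRICTED
    return Level.OPEN


# Flag values below are assumptions chosen to mirror the example levels.
inventory = [
    Material("fieldwork materials", has_personal_data=False, is_sensitive=False),
    Material("transcriptions", has_personal_data=True, is_sensitive=False),
    Material("de-identified audio", has_personal_data=True, is_sensitive=True),
]

for material in inventory:
    level = score(material)
    print(f"{material.name}: level {int(level)} ({ACCESS[level]})")
```

Checking the most restrictive condition first ensures that a material that is both personal and sensitive is never scored lower than Level 3.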

Define the terms of use

Creative Commons licenses are not suitable for access-restricted data that contains personal data. Instead, custom terms of use have to be set. These depend largely on the consent given by the participants and on the degree of de-identification, and they must reflect what is allowed according to the informed consent.

Corpus PINO Terms of Use: “This data can be accessed and reused by researchers affiliated with universities or no-profit, non-commercial organizations in the fields of linguistics, semiotics, sociology, anthropology, and affiliated fields. Due to their increased re-identification potential, the audio files in the corpus shall be facilitated to linguistic and relevant discipline-specific research where analyzing the audio content is pertinent. In these cases, the signing of a data transfer agreement is necessary.”

Data sharing agreement

A data transfer agreement is a legal contract that defines the specific purposes for which the data may be used by the requesting party. As such, it is the most comprehensive specification of the terms of use. The agreement also describes the rights and obligations of both parties involved and sets out the measures for data protection. The UG has its own model data transfer agreement that can be tailored to the dataset of your research project.

→Refer to the DCC website for more information on legal agreements
