Differences

This shows you the differences between two versions of the page.

dcc:itsol:whisper:datamanage [2025/09/10 13:44] – Minor text changes alba
dcc:itsol:whisper:datamanage [2025/10/30 13:56] (current) – adjusted index numbering giulio
Line 1: Line 1:
-{{indexmenu_n>4}}
+{{indexmenu_n>3}}
 ===== Data management safety measures =====
  
Line 6: Line 6:
 There are **three main areas** that you need to clear to secure your data:
   * The **audio** that you provided on **input**
-  * The **transcripts** that Whisper created
+  * The **transcripts** that Whisper created in **output**
   * The **SLURM file** created by HPC to record the job details
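
As a rough illustration of the three areas listed above, the following Python sketch reports which files are still present in each of them. It is not part of the Whisper script: the function name ''leftover_files'' is hypothetical, and the sketch assumes that the ''input'' and ''output'' folders sit inside the ''whisper'' folder, next to the SLURM file.

```python
from pathlib import Path


def leftover_files(whisper_dir):
    """Return the files still present in each of the three areas.

    Assumption: `input` and `output` are subfolders of `whisper_dir`,
    and the slurm-<jobID>.out files are stored directly inside it.
    """
    root = Path(whisper_dir)
    areas = {
        "audio": root / "input",
        "transcripts": root / "output",
    }
    found = {}
    for name, folder in areas.items():
        if folder.is_dir():
            found[name] = sorted(p for p in folder.iterdir() if p.is_file())
        else:
            # A missing folder simply counts as empty.
            found[name] = []
    # SLURM writes one slurm-<jobID>.out file per job.
    found["slurm"] = sorted(p for p in root.glob("slurm-*.out") if p.is_file())
    return found
```

If all three lists come back empty, nothing related to your audio is left in these locations.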
  
Line 15: Line 15:
 ==== Input Audio ====
  
-The files contained in the folder ''whisper_audio'' need to be **removed before you can run a new transcription**. This is because the script is designed in such a way that it will transcribe anything that is located in the ''whisper_audio'' folder. **If you do not remove old audio data, Whisper will transcribe that audio again**, potentially running out of time to finish the transcription job.
+The files contained in the folder ''input'' need to be **removed before you can run a new transcription**, because the script transcribes everything located in the ''input'' folder. **If you do not remove old audio data, Whisper will transcribe that audio again**, and the job may run out of time before it finishes. Removing audio that has already been transcribed also reduces the risk of a data leak and is considered good practice.
  
 Before removing the audio files, we advise you to first check if the transcripts are acceptable. Should you have to run the transcription again with a modified script (e.g., to force a language that Whisper did not automatically identify), then having the audio still on HPC will save you time.
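
Once you are satisfied with the transcripts, the audio can be removed. The sketch below is one possible way to do this from Python, not the method used by the Whisper script. The function name ''clear_folder'' and its ''dry_run'' switch are hypothetical and only serve as an illustration.

```python
from pathlib import Path


def clear_folder(folder, dry_run=True):
    """Delete the regular files directly inside `folder`.

    With dry_run=True (the default) nothing is deleted; the function
    only returns the names of the files that would be removed.
    Raises FileNotFoundError if `folder` does not exist.
    """
    targets = sorted(p for p in Path(folder).iterdir() if p.is_file())
    if not dry_run:
        for path in targets:
            path.unlink()
    return [p.name for p in targets]
```

Calling ''clear_folder'' on your ''input'' folder with the default settings lists the audio files without touching them. Calling it again with ''dry_run=False'' deletes them, so run it only after you have checked the transcripts.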
Line 23: Line 23:
 ==== Output Text ====
  
-The files contained in the folder ''whisper_output'' can be removed at any time, even after a new transcription job has been launched. We do advise you, however, to **clean your HPC environment completely, before running a new transcription**.
+The files contained in the folder ''output'' can be removed at any time, even after a new transcription job has been launched. We do advise you, however, to **clean your HPC environment completely before running a new transcription**.
  
 The transcripts created by Whisper come in five different formats:
Line 34: Line 34:
 ==== Job Information File ====
  
-Finally, there is one last file that needs to be removed before you are done cleaning your HPC environment. In your HOME folder (the one you are directed to when you connect to HPC), there is a file called:
+Finally, there is one last file that needs to be removed before you are done cleaning your HPC environment. In your ''whisper'' folder, next to the script the interface created, there is a file called:
   * ''slurm-<jobID>.out''.
  
-This file is created by HPC when you launch a job, and it is tagged with the ''jobID'' displayed when executing the script. It is used to record what happened while the job was running. Apart from the information on the status of the job and how it was completed, HPC also records the actual transcription here. This means that **the transcription of your audio can also be read by displaying this file**. In order to ensure that all data related to your audio is removed from HPC, these files need to be deleted as well.
+This file is created by HPC when you launch a job, and it is tagged with the ''jobID'' displayed when executing the script. It is used to record what happened while the job was running. Apart from the information on the status of the job and how it was completed, the actual transcription is also recorded here. This means that **the transcription of your audio can also be read by displaying this file**. In order to ensure that all data related to your audio is removed from HPC, this file needs to be deleted as well.
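
Because each job leaves its own ''slurm-<jobID>.out'' file, several of them may accumulate over time. The sketch below shows one way to find and delete them from Python. It assumes that the ''jobID'' consists of digits only, and the function name ''remove_slurm_logs'' is hypothetical.

```python
import re
from pathlib import Path

# Assumption: the jobID in slurm-<jobID>.out consists of digits only.
SLURM_LOG = re.compile(r"slurm-(\d+)\.out")


def remove_slurm_logs(whisper_dir, job_id=None):
    """Delete slurm-<jobID>.out files in `whisper_dir`; return their names.

    If `job_id` is given, only the file of that job is deleted.
    Other files in the folder are left untouched.
    """
    removed = []
    for path in sorted(Path(whisper_dir).iterdir()):
        match = SLURM_LOG.fullmatch(path.name)
        if match is None or not path.is_file():
            continue
        if job_id is not None and match.group(1) != str(job_id):
            continue
        path.unlink()
        removed.append(path.name)
    return removed
```

Passing a ''job_id'' limits the clean-up to a single job, which can be useful if another transcription is still running and writing to its own file.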
  
 **Note**: If you were curious, SLURM stands for //Simple Linux Utility for Resource Management//.
  
 [[dcc:itsol:start | → Return to DCC IT Solutions Guides]]