Differences

This shows you the differences between two versions of the page.

dcc:itsol:whisper:datamanage [2024/08/12 08:56] – [Job Information File] added meaning of slurm giulio
dcc:itsol:whisper:datamanage [2025/09/10 13:44] (current) – Minor text changes alba
Line 1: Line 1:
-{{indexmenu_n>5}}
+{{indexmenu_n>4}}
 ===== Data management safety measures =====
  
-Because you are handling data containing the **voices of (multiple) people**, your data is considered one of the most sensitive kinds of data. It is important that you use the **VRW** environment to **properly protect this kind of data**. For the same reason, data uploaded to HPC to perform the transcription should not be left idling on HPC. As soon as you are aware that the transcription has been performed, you should take steps to **download the results from HPC to the VRW and to remove all traces of the data from HPC**.
+Because you are handling data containing the **voices of (multiple) people**, your data is considered one of the most sensitive kinds of data. For this reason, data uploaded to HPC to perform the transcription should not be left idling on HPC. As soon as you are aware that the transcription has been performed, you should take steps to **download the results from HPC to your UG work environment and to remove all traces of the data from HPC**.
  
-There are **three main areas** that you need to clear in order to secure your data:
+There are **three main areas** that you need to clear to secure your data:
   * The **audio** that you provided on **input**
   * The **transcripts** that Whisper created
Line 11: Line 11:
 Once all three of these locations/files are downloaded and cleared from HPC, you are **ready for a new round of transcriptions**.
  
-Last but not least and independently of the sensitivity of your data, **HPC is a computing cluster** and therefore only intended for the **short-term storage of mutable data**. In order to ensure proper performance, data should not be stored long-term in the cluster.
+Last but not least, and independently of the sensitivity of your data, **HPC is a computing cluster** and therefore only intended for the **short-term storage of mutable data**. In order to ensure proper performance, data should not be stored long-term in the cluster.
  
 ==== Input Audio ====
Line 17: Line 17:
 The files contained in the folder ''whisper_audio'' need to be **removed before you can run a new transcription**. This is because the script is designed in such a way that it will transcribe anything that is located in the ''whisper_audio'' folder. **If you do not remove old audio data, Whisper will transcribe that audio again**, potentially running out of time to finish the transcription job.
  
-Before removing the audio files, we advise you to first check if the transcripts are acceptable. Should you have to run the transcription again with a modified script (i.e. to force a language Whisper did not automatically identify), then having the audio still on HPC will save you time.
+Before removing the audio files, we advise you to first check if the transcripts are acceptable. Should you have to run the transcription again with a modified script (i.e., to force a language that Whisper did not automatically identify), then having the audio still on HPC will save you time.
  
-If the transcripts are what you expect them to be, however, then the audio should be removed promptly. Please consider doing a brief check of the transcripts, rather than going through them line by line. You can check the details of the transcription on the VRW directly later on.
+If the transcripts are what you expect them to be, however, then the audio should be removed promptly. Please consider doing a brief check of the transcripts, rather than going through them line by line. You can check the details of the transcription in your work environment directly later on.
  
 ==== Output Text ====
Line 37: Line 37:
   * ''slurm-<jobID>.out''.
  
-This file is created by HPC when you launch a job and it is tagged with the ''jobID'' displayed when executing the script. It is used to record what happened while the job was running. Apart from the information on the status of the job and how it completed, HPC also records the actual transcription here. This means that **the transcription of your audio can also be read by displaying this file**. In order to ensure that all data related to your audio is removed from HPC, this files needs to be deleted as well.
+This file is created by HPC when you launch a job, and it is tagged with the ''jobID'' displayed when executing the script. It is used to record what happened while the job was running. Apart from the information on the status of the job and how it was completed, HPC also records the actual transcription here. This means that **the transcription of your audio can also be read by displaying this file**. In order to ensure that all data related to your audio is removed from HPC, these files need to be deleted as well.
  
 **Note**: If you were curious, SLURM stands for //Simple Linux Utility for Resource Management//.
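
The clean-up described on this page could be sketched in Python roughly as follows. This is only an illustration, not part of the Whisper script itself: the function names, the ''base_dir'' argument (the folder the job was launched from) and the name of the transcripts folder are assumptions, because the text above only names the ''whisper_audio'' folder and the ''slurm-<jobID>.out'' files. The deletion defaults to a dry run so that nothing is removed before the results have been downloaded and checked.

```python
from pathlib import Path


def collect_leftovers(base_dir, transcript_dir_name):
    """List the files that still hold audio-related data on HPC.

    base_dir: folder the job was launched from (assumption).
    transcript_dir_name: name of the transcripts folder (hypothetical;
    this page does not state it, so it must be supplied by the caller).
    """
    base = Path(base_dir)
    leftovers = []
    # Input audio and transcripts: list every file inside both folders.
    for folder in (base / "whisper_audio", base / transcript_dir_name):
        if folder.is_dir():
            leftovers.extend(p for p in sorted(folder.rglob("*")) if p.is_file())
    # Job information files: slurm-<jobID>.out also contain the transcription.
    leftovers.extend(sorted(base.glob("slurm-*.out")))
    return leftovers


def clear_leftovers(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report them."""
    for path in paths:
        if dry_run:
            print(f"would remove {path}")
        else:
            path.unlink()
    return len(paths)
```

Calling ''clear_leftovers(collect_leftovers(".", "<transcripts folder>"))'' first prints what would be removed; passing ''dry_run=False'' performs the deletion. Only do the latter once the results have been downloaded from HPC.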
  
 [[dcc:itsol:start | → Return to DCC IT Solutions Guides]]