Workflow and data storage
In this section we will describe the basic workflow for working on the cluster. This workflow consists of five steps (sketched as commands after the list):
- Copy input data to the system
- Prepare the job script that defines the requirements and tells the system which program you want to run on your data
- Submit the computational task to the job scheduling system
- Check the status and results of your calculations
- Copy the results back to your local system or to archival storage
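As an illustration of steps 2 to 4, here is a minimal sketch of a job script and the commands to submit and monitor it. It assumes the scheduler on Peregrine is SLURM; the module name, program, file names and resource values are hypothetical examples, not actual requirements of the system.

  #!/bin/bash
  #SBATCH --job-name=example   # name shown in the queue
  #SBATCH --time=01:00:00      # maximum run time (1 hour)
  #SBATCH --nodes=1            # run on a single node
  #SBATCH --mem=4G             # memory required for the job

  # Load the software environment (module name is a hypothetical example)
  module load Python

  # Run the program on the input data
  python analyze.py input.dat > results.out

After saving this as, for example, jobscript.sh, the job is submitted and monitored with:

  sbatch jobscript.sh   # step 3: hand the task to the scheduler
  squeue -u $USER       # step 4: check the status of your jobs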
This means that you'll need to know about the following topics:
- Data storage areas
- Data transfers
- Finding information about available software, or getting your software on the system
- Running computations using the job scheduler
- Checking the results of the computations
In this section we will focus on data storage; the next sections delve deeper into the other topics, including the command-line interface, which is implicit in several of the steps.
Data
For most applications users need to work with data. Data can be parameters for a program that needs to be run, for example to set up a simulation. It can be input data that needs to be analyzed. And finally, running simulations or data analyses will produce output data containing the results of the computations.
Peregrine has its own storage system, which is decoupled from the university's desktop storage systems. Although it would be nice to access data from your desktop system directly on Peregrine, this is currently not possible. Technically it would be challenging, and it would also cause performance problems once many people start running heavy workloads against the desktop storage systems.
Since the storage is decoupled, data has to be transferred to and from the system: input data has to be copied onto the system before a computation, and any results that need further analysis or longer-term storage have to be copied off to local or archival storage afterwards.
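As a sketch of such a transfer, the standard scp and rsync tools can be used from your local machine. The login hostname and directory layout below are assumptions for illustration; check the actual hostname in the documentation on connecting to the system.

  # Copy an input file to your home directory on the cluster
  # (hostname and paths are hypothetical examples)
  scp input.dat <username>@peregrine.hpc.rug.nl:/home/<username>/

  # Copy a directory of results back to your local machine;
  # rsync only transfers files that changed, which helps with large result sets
  rsync -av <username>@peregrine.hpc.rug.nl:/data/<username>/results/ ./results/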
Peregrine storage areas
Peregrine currently has four storage areas, with different capabilities. On each storage area limits are enforced on the amount of data stored, to ensure that each user has some space and that the file systems do not suddenly fill up completely.
home
The home area is where users can store settings, programs and small data sets. This area is limited to 20 GB per user, and a tape backup of the data is made daily.
data
For larger data sets each user has access to space on the data file system. By default 250 GB per user is available, which can be increased if required. Because of the size of this file system, no backups are made of this data.
scratch
The last file system is scratch. As the name suggests, this is for temporary data only. It has a large limit of 10 TB, but data is removed automatically after 30 days. This storage should therefore only be used for data during computations.
local disks
The Peregrine nodes also have local disks that can only be used by calculations running on that specific machine. This space is therefore also temporary: any results written there must be copied to one of the shared file systems before the job finishes.
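To keep track of how much of each area you are using, standard tools such as du give a quick overview. The mount points below follow the area names above but are assumptions about the actual directory layout; the cluster may also provide a dedicated quota command.

  # Show the total size of your home directory (20 GB limit)
  du -sh /home/$USER

  # Show usage of your data (250 GB) and scratch (10 TB) areas
  du -sh /data/$USER /scratch/$USER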
Next section: Connecting to the system