Sharing data

We don't allow users to open up their private folders using file system permissions or access control lists. This is because managing these correctly can be complicated, which can easily lead to security problems where users accidentally share data with all other cluster users.

If you need to share data on Hábrók with other users, there are two options. The first is storage for a restricted group, the second is using a publicly accessible storage location.

In addition, we offer /userapps for private or shared software installations.

Note that the second part of this page has sections on how to manage access privileges in order to fix issues with group access to data sets.

A group directory is useful if you need to share data with a group of users, and the other users on the cluster must not have access to that data. In this case we can set up a group on the cluster for this limited set of users, and give the group access to one or more shared folders.

These group directories are created on /scratch, for data that needs to be processed, and on /projects, for data that needs to be stored safely. See the Storage areas page for more details on these filesystems.

For working with this data there are two models:

  1. There is a single group with access to the data, and the files in the shared folder are readable and writable for all group members.
  2. There are one or more data managers that manage the data in the shared folder, and these data managers are the only people with full write access. All other group members can only read the data. In practice this means that the data managers will be in a second group with write access.

If you want to request a group directory, please contact us at hpc@rug.nl and let us know the following things:

  1. The proposed name of the group (this name should not be in use already, and be convenient on the command line). The group name will always be prefixed by hb-.
  2. The amount of space needed on the file systems involved, when more than the default quota is required. Note that for /projects there is a freemium model where you have to pay for storage above a certain threshold. For /scratch a fair use policy is in place.
  3. Who the primary owner of the group is. This person has to approve the requests for joining the group.
  4. A second person who can act as an alternative contact person for the group to approve these requests.
  5. Do all users need full write access or are there data managers? In case there are data managers two groups will be created, one with write access and another with read-only access. The group names will be suffixed with _rw and _ro to distinguish them.
  6. If there are data managers, who will fulfill that role?

Sometimes you need to share non-sensitive, public data with someone else. For this we have set up a directory /scratch/public/tmp. The data in this directory can be read by all users on the cluster. Since we have allocated limited space to this directory a cleanup script will remove data after 30 days.

When you need to share data for a longer period, please let us know. We can then create a persistent directory in /scratch/public. This can either be based on a group, where the data is managed by multiple people, or for a single person. You can request this at hpc@rug.nl, where you need to answer the same questions as for a regular group directory, or tell us that you'll manage the data yourself.

Since /scratch is optimized for large files, storing software (which normally consists of many small files) on /scratch is not recommended. For large or shared software installations an NFS-based file system share has been set up, which is available as /userapps. Since we assume that most software installations use downloads from external sites (e.g. Python virtual environments), we do not make a backup of /userapps!

Please contact hpc@rug.nl if you need additional space on /userapps for your installations, or when you need to share your software stack with multiple users. For the latter you should answer the questions for a group directory above.

In order to fix issues with the file system permissions, one first needs to understand how these work and how they were set up initially. A more thorough explanation of managing Linux file permissions can be found at https://kb.iu.edu/d/abdb. On this page we will focus only on understanding and repairing broken file system permissions for a group directory.

In POSIX-based file systems, as used on Linux, files and directories have an owner and a group. Note that only the owner of a file can change the permissions on that file and fix issues with them. Each file and directory has three sets of permissions: one for the owner, one for the group and one for everybody else. These can be listed using ls -l, e.g.:

-rw-rw-r--. 1 user1 hb-public-courses 168024976 Sep  5  2023 dataset.tar.gz
-rw-rw-r--. 1 user2 hb-public-courses       930 May  2  2023 ex1_mandelbrot.R
drwxrwsr-x. 2 user2 hb-public-courses      4096 Oct 17 10:27 inputfiles
-rw-rw-r--. 1 user1 hb-public-courses      3712 Oct 10 10:50 train.py

In the example some files are owned by user1, while another file and a directory are owned by user2. The group for all files and the directory is hb-public-courses.

The permissions are shown as a string like -rw-rw-r--. The first character is a - for a regular file and a d for a directory, as for inputfiles. It is followed by three sets of three characters, each of which can contain r, w and x. The r denotes read permissions, the w denotes write permissions and the x denotes execute permissions. For files, execute means the ability to run the program or script; for a directory it means the ability to enter the directory using cd. The s shown for inputfiles is the sgid bit, explained below.

The first set of rwx is for the owner (user1 or user2 in the example), the second set is for the group hb-public-courses, the third set is for anybody else on the system.
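The three sets can also be pulled apart on the command line. The following is a minimal sketch, assuming GNU stat (as found on most Linux systems); the temporary file and the variable names are only there for illustration.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
touch "$workdir/example.txt"
chmod 664 "$workdir/example.txt"    # rw for owner and group, r for others

# %A prints the permission string in the same form as ls -l (GNU stat).
perms=$(stat -c '%A' "$workdir/example.txt")

# Split the string into the file type and the three permission sets.
ftype=$(printf '%s\n' "$perms" | cut -c1)
owner_perms=$(printf '%s\n' "$perms" | cut -c2-4)
group_perms=$(printf '%s\n' "$perms" | cut -c5-7)
other_perms=$(printf '%s\n' "$perms" | cut -c8-10)

echo "type:   $ftype"
echo "owner:  $owner_perms"
echo "group:  $group_perms"
echo "others: $other_perms"

rm -rf "$workdir"
```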

Note that all top-level directories for private and group directories on the cluster are set to be unreadable and unwritable by anybody except the group, which means that the files and directories inside can only be accessed by group members. This is the case even though the files inside may have read, write or execute permissions for “others”. The group can be your private group (based on p- or s-number), containing a single user, or a shared group in the case of shared directories.

In normal situations the group attached to a newly created file or directory will be the primary group of the person writing the file. On the cluster this will be the p- or s-number group. This means that other group members that have access to the directory may not be able to read, modify or delete the file.
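To see which group a new file would get, you can look up your own primary group and group memberships. This is a small sketch using the standard id command and GNU stat; it assumes the temporary directory used by mktemp has no sgid bit set.

```shell
# Name of your primary group (on the cluster: your p- or s-number group).
id -gn

# All groups you are a member of, including any shared groups.
id -Gn

# A new file created outside a group directory gets the primary group.
newfile=$(mktemp)
file_gid=$(stat -c '%g' "$newfile")   # numeric group ID of the new file
primary_gid=$(id -g)                  # numeric ID of your primary group
echo "file gid: $file_gid, primary gid: $primary_gid"

rm -f "$newfile"
```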

To make sure files are owned by the shared group instead, we set the sgid bit on the group directories. This is listed as a lowercase s in the directory permissions. This lowercase s includes the x for the group execute permission, which means that group members can enter the directory. An uppercase S will be shown when the execute bit is not set for the group. The effect of this sgid setting is that new files created in the directory will be owned by the group that owns the directory.
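The difference between the lowercase s and the uppercase S can be tried out safely in a temporary directory. This is only an illustrative sketch, assuming GNU stat and a Linux file system; the directory names are made up.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/shared"

# rwx for owner and group, nothing for others, plus the sgid bit.
chmod u=rwx,g=rwxs,o= "$workdir/shared"
with_x=$(stat -c '%A' "$workdir/shared")
echo "sgid with group execute:    $with_x"

# On Linux a new subdirectory inherits the sgid bit from its parent.
mkdir "$workdir/shared/sub"
if [ -g "$workdir/shared/sub" ]; then sub_sgid=yes; else sub_sgid=no; fi
echo "subdirectory has sgid:      $sub_sgid"

# Removing the group execute bit turns the s into an S.
chmod g-x "$workdir/shared"
without_x=$(stat -c '%A' "$workdir/shared")
echo "sgid without group execute: $without_x"

rm -rf "$workdir"
```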

Please be aware that the s can also be set on files. This should be avoided, because it means that an executable file will be run under the group that owns the file.
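Regular files that carry the sgid bit can be located with find and cleaned up. The sketch below works on a temporary directory with made-up file names; on real data you would point find at your own directory and review the list before changing anything.

```shell
workdir=$(mktemp -d)
touch "$workdir/script.sh" "$workdir/data.txt"
chmod 2755 "$workdir/script.sh"     # deliberately set sgid on a file

# -perm -2000 matches entries that have the sgid bit set.
flagged_before=$(find "$workdir" -type f -perm -2000)
echo "files with sgid: $flagged_before"

# Remove the sgid bit from those files only; directories are not touched.
find "$workdir" -type f -perm -2000 -exec chmod g-s {} +

flagged_after=$(find "$workdir" -type f -perm -2000)
rm -rf "$workdir"
```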

The group directories are set up in such a way that files will be readable and writable by the appropriate group(s). Through the sgid bit, newly created files will be owned by the right group.

Archiving and copying tools may, however, override these default permissions, making data unreadable or unwritable for the other group members. This is because archiving and copying tools (rsync, cp and tar) will try to keep permissions, group ownership and file attributes as they were in the source data. This can override the default settings that were made when setting up the group directory structure.

In order to deal with these issues, the person copying the data must take some precautions. Furthermore, you may still need to fix the permissions later on when the original source data has different permissions. First, some hints to prevent files from being owned by the wrong group or having the wrong permissions.

cp

For cp the commonly used options -a or -p will preserve too many file attributes, including group ownership and permissions, overriding the default settings for the group.

Normally the only important attributes to keep when copying data are the original timestamps. This can be achieved using the option --preserve=timestamps instead of -a or -p.

Furthermore the permissions of the files in the destination should be those for the group directories, and not those in the source. This can be achieved using the --no-preserve=mode option.

So copying a directory of data, preserving the time stamps, could be done like:

cp -r --preserve=timestamps --no-preserve=mode $HOME/mydata /scratch/hb-groupdir
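The effect of these options can be checked on throwaway data before copying anything real. The following sketch assumes GNU cp and GNU stat; the umask of 002 and all paths are assumptions made for the demonstration, not cluster settings.

```shell
umask 002    # assumption for this demo: new files are group-writable by default
workdir=$(mktemp -d)
mkdir "$workdir/mydata" "$workdir/groupdir"
touch -t 202305021200 "$workdir/mydata/results.txt"   # give the file an old timestamp
chmod 600 "$workdir/mydata/results.txt"               # private permissions at the source

cp -r --preserve=timestamps --no-preserve=mode "$workdir/mydata" "$workdir/groupdir"

# Compare modification times (seconds since the epoch) and show the new mode.
src_time=$(stat -c '%Y' "$workdir/mydata/results.txt")
dst_time=$(stat -c '%Y' "$workdir/groupdir/mydata/results.txt")
dst_perms=$(stat -c '%A' "$workdir/groupdir/mydata/results.txt")
echo "timestamps equal: $src_time / $dst_time"
echo "permissions at destination: $dst_perms"

rm -rf "$workdir"
```

The copy keeps the timestamp of the source file, while its permissions follow the defaults at the destination instead of the restrictive mode of the source.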

rsync

The commonly used -a (archive) option for rsync implies the options -rlptgoD. This will preserve too many things, including group ownership and permissions, breaking access to the files by other group members.

Leaving out pgo should make sure the default permissions for the group directory are used on the copied files. Since D is a special flag for device files we can leave it out for user data as well.

Furthermore we can tell rsync to create files as if the source has read, write and execute permissions for the user and group by adding --chmod=ug=rwX. The uppercase X indicates that the x bit should only be set on directories and on files that already have the execute bit set.

So the full example for rsync will look like:

rsync -rltv --chmod=ug=rwX  $HOME/mydata /scratch/hb-groupdir
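Before a large copy it can help to preview what rsync would do. The sketch below adds the -n (dry run) and -i (itemize changes) options to the command above; the paths are temporary placeholders, and the example is skipped when rsync is not installed.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/mydata" "$workdir/groupdir"
touch "$workdir/mydata/results.txt"

if command -v rsync >/dev/null 2>&1; then
    # -n: only report what would be transferred, do not copy anything.
    # -i: print one line per item that would be created or changed.
    rsync -rlt -n -i --chmod=ug=rwX "$workdir/mydata" "$workdir/groupdir"
else
    echo "rsync is not installed, skipping the preview"
fi

# Nothing has been copied yet, so the destination is still empty.
if [ -e "$workdir/groupdir/mydata" ]; then copied=yes; else copied=no; fi
echo "copied during dry run: $copied"

rm -rf "$workdir"
```

Removing -n from the command performs the actual transfer.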

tar

In our testing tar created files with the right group ownership. You may still need to fix the group read and write permissions after extraction. See the instructions below for details.
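To judge beforehand whether a fix will be needed, the permissions stored inside an archive can be listed without extracting it. This is a sketch with a made-up archive created on the spot; for an existing archive only the tar -tvf line is needed.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/mydata"
touch "$workdir/mydata/results.txt"
chmod 600 "$workdir/mydata/results.txt"     # no group access in the source data
tar -cf "$workdir/mydata.tar" -C "$workdir" mydata

# -t lists the archive contents; -v adds permissions, owner/group, size and date.
listing=$(tar -tvf "$workdir/mydata.tar")
echo "$listing"

rm -rf "$workdir"
```

Entries that show no r or w in the group position are candidates for the chmod fixes described below.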

When permissions in a group directory are wrong, the person owning the files can fix these using the chmod command. You can use the output of ls -l to find the owner of the file. First we need to fix the read/write/execute permissions. This can be done for a single file or directory using:

chmod g+rwX file_or_directory

The g denotes that we want to change the group permissions (u for user, and o for others are alternative options). The + denotes that we want to add permissions. The rwX will set read, write and execute permissions. Using capital X makes sure the x is only set on files that have the execute bit set for the user and on directories.

If you want to change the permissions for a directory, including all files and subdirectories inside, you can add the -R flag to make the command recursive:

chmod -R g+rwX directory_name
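The behaviour of the capital X is easiest to see on a small test tree. The sketch below uses a temporary directory with made-up names and GNU stat to show the result for a data file, an executable script and a subdirectory.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/project" "$workdir/project/sub"
touch "$workdir/project/data.csv" "$workdir/project/run.sh"

# Start from permissions that only give the owner access.
chmod 700 "$workdir/project" "$workdir/project/sub" "$workdir/project/run.sh"
chmod 600 "$workdir/project/data.csv"

chmod -R g+rwX "$workdir/project"

data_perms=$(stat -c '%A' "$workdir/project/data.csv")
script_perms=$(stat -c '%A' "$workdir/project/run.sh")
dir_perms=$(stat -c '%A' "$workdir/project/sub")
echo "data file:    $data_perms"     # no x added: the file was not executable
echo "script:       $script_perms"   # x added: the owner already had x
echo "subdirectory: $dir_perms"      # x added: directories always get it

rm -rf "$workdir"
```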

To prevent new files from being owned by the private group of the creator the sgid bit must be set on directories. This can be done using:

chmod g+s directory_name

Since this sgid bit should not be used on files, we cannot use the -R option. If many directories must be fixed, we can automate this using the find tool, e.g.:

find . -type d -exec chmod g+s {} \;

This will find all files of type d (directories) and run chmod g+s on these, where each directory name found is represented by {}. The chmod command is completed by the ; character. This character needs to be escaped using \ to make clear to the shell that it belongs to the find command, instead of being the shell command separator.
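As a variation, find can first list only the directories that are still missing the sgid bit, and then fix just those. This sketch uses a temporary tree with made-up names; it ends the -exec with + instead of \; so that chmod is called with many directory names at once.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/a/b" "$workdir/project/c"
touch "$workdir/project/a/file.txt"

# Preview: directories that do not have the sgid bit yet.
find "$workdir/project" -type d ! -perm -2000

# Fix only those directories.
find "$workdir/project" -type d ! -perm -2000 -exec chmod g+s {} +

# Check: no directory is left without sgid, and the file is untouched.
remaining=$(find "$workdir/project" -type d ! -perm -2000 | wc -l)
if [ -g "$workdir/project/a/file.txt" ]; then file_sgid=yes; else file_sgid=no; fi
echo "directories still missing sgid: $remaining"
echo "file has sgid: $file_sgid"

rm -rf "$workdir"
```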

Finally, giving users outside the group (the “others” permission set) read and execute access can be achieved using:

chmod o+rX file_or_directory

This is of course only required in case a “readonly” group has been defined. This setup is described in the section below on access control lists.
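When a read-only group relies on the “others” permissions, it can be useful to check which entries are still closed to them. The following is an illustrative sketch on a temporary directory with made-up file names.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/dataset"
touch "$workdir/dataset/readable.txt" "$workdir/dataset/hidden.txt"
chmod 775 "$workdir/dataset"
chmod 664 "$workdir/dataset/readable.txt"
chmod 660 "$workdir/dataset/hidden.txt"     # others cannot read this file

# Files that lack read permission for others.
unreadable=$(find "$workdir/dataset" -type f ! -perm -o=r)
echo "not readable by others: $unreadable"

# Directories that others cannot list or enter.
closed_dirs=$(find "$workdir/dataset" -type d ! -perm -o=rx)
echo "not accessible by others: $closed_dirs"

rm -rf "$workdir"
```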

The permission system described above can only handle a single user and group. If multiple groups need access to data, file system access control lists (ACLs) must be used. These give an additional set of controls on the access rights of files and directories.

Setting the correct rights on the top level group directory, using an ACL for the read-only group, is sufficient to prevent the other cluster users from accessing the files and directories inside. Because the ACL system is quite complex, it is better to manage the rights for the other read-only group using the standard permissions for “other” users. This prevents data managers from having to understand the complex ACL system.

Since it is important to be able to check the rights on the group folder, the use of getfacl is explained below.

Retrieving the current access control list

Files and directories that have an ACL applied will show an additional + at the end of the permissions overview. E.g.:

drwxrws---+ 7 root hb-acl_testing_rw 20480 Feb 28 08:56 hb-acl_testing_rw/

The current set of ACLs can be obtained using the getfacl command, e.g.:

$ getfacl hb-acl_testing_rw
# file: hb-acl_testing_rw/
# owner: root
# group: hb-acl_testing_rw
# flags: -s-
user::rwx
group::rwx
group:hb-acl_testing_ro:r-x
mask::rwx
other::---

This example shows that the directory is owned by the main system user root. This means that regular users cannot change the access rights to this directory. The directory is owned by the group hb-acl_testing_rw, where the group has full read, write and execute permissions. The sgid bit is also set to make sure that the files and directories created inside will be owned by the group.

The ACL list shows that another group hb-acl_testing_ro has read and execute permissions on the directory.

These settings mean that users from both the groups hb-acl_testing_rw and hb-acl_testing_ro can enter the directory (the execute permission) and read its contents (the read permission). Only members of the group hb-acl_testing_rw have write permissions in the directory and can therefore add and remove files and directories inside it.

All other users on the cluster will not be able to access the data inside the folder at all.

Inside the group folder the regular permission bits for “other” cluster users can be used to grant the hb-acl_testing_ro members access to the files inside. By default this will be read-only access, but care must be taken by the data managers that the access rights are correct, when adding new data.
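The check for the + marker can be wrapped in a small helper, so that getfacl is only called where an ACL is actually present. This is a sketch: has_acl is a hypothetical function name, and getfacl comes from the acl tools, which may not be installed on every system.

```shell
# Hypothetical helper: succeeds when ls shows the trailing + of an ACL.
has_acl() {
    # The first field of ls -ld is the permission string.
    case "$(ls -ld "$1" | cut -d ' ' -f 1)" in
        *+) return 0 ;;
        *)  return 1 ;;
    esac
}

# A fresh temporary directory normally has no ACL applied.
workdir=$(mktemp -d)
if has_acl "$workdir"; then
    acl_state="acl"
    if command -v getfacl >/dev/null 2>&1; then
        getfacl -p "$workdir"    # -p keeps the leading / in the file name
    fi
else
    acl_state="no-acl"
fi
echo "$workdir: $acl_state"

rm -rf "$workdir"
```

On the cluster the helper would be called with the path of the group directory instead of the temporary directory.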