Differences

This shows you the differences between two versions of the page.

habrok:data_management:sharing_data [2023/03/21 13:22] admin
habrok:data_management:sharing_data [2025/03/07 16:10] (current) – Add Fokke's expanded shared dir docs pedro
Line 2: Line 2:
 ====== Sharing data ====== ====== Sharing data ======
  
-If you want to share data on Hábrók with other users, there are two options. Since letting users change file permissions and access control lists was found to lead to security problems we no longer support this. The two remaining alternative options are:
 +We don't allow users to open up their private folders using file system permissions or access control lists. This is because managing these correctly can be complicated and can therefore easily lead to security problems, where users accidentally share data with all other cluster users. 
 + 
 +If you need to share data on Hábrók with other users, there are two options. The first is storage for a restricted group, the second is using a publicly accessible storage location. 
 + 
 +Next to this we also offer ''/userapps'' for private or shared software installations. 
 + 
 +Note that the second part of this page has sections on how to manage access privileges in order to fix issues with group access to data sets. 
  
 ===== Group directory ===== ===== Group directory =====
  
-A group directory is useful if you want to share data with a group of users and the other cluster users should not have access to that data.
-Group directories are usually created on ''/scratch''. If you want to request a group directory, please contact ''hpc@rug.nl'' and let us know how much space you ideally would like to have (there are limits on what we can assign), which users should be part of the group, and what name you would like to assign to the directory/group.
 +A group directory is useful if you need to share data with a group of users, and the other users on the cluster must not have access to that data. In this case we can set up a group on the cluster for this limited set of users, and give the group access to one or more shared folders. 
 + 
 +These group directories are created on ''/scratch'', for data that needs to be processed, and on ''/projects'', for data that needs to be stored safely. See the [[habrok:data_management:storage_areas|]] page for more details on these filesystems. 
 + 
 +For working with this data there are two models: 
 +  - There is a single group with access to the data, and the files in the shared folder are readable and writable for all group members.  
 +  - There are one or more data managers that manage the data in the shared folder, and these data managers are the only people with full write access. All other group members can only read the data. In practice this means that the data managers will be in a second group with write access.  
 + 
 +If you want to request a group directory, please contact us at [[hpc@rug.nl]] and let us know the following things: 
 +  - The proposed name of the group (this name should not be in use already, and should be convenient on the command line). The group name will always be prefixed by ''hb-''. 
 +  - The amount of space needed on the file systems involved, if more than the default quota is required. Note that for ''/projects'' there is a freemium model where you have to pay for storage above a certain threshold. For ''/scratch'' a fair use policy is in place. 
 +  - Who the primary owner of the group is. This person has to approve the requests for joining the group. 
 +  - A second person who can act as an alternative contact person for the group to approve these requests. 
 +  - Do all users need full write access or are there data managers? In case there are data managers two groups will be created, one with write access and another with read-only access. The group names will be suffixed with ''_rw'' and ''_ro'' to distinguish them. 
 +  - If there are data managers, who will fulfill that role?
  
  
 ===== Public directory ===== ===== Public directory =====
  
-Sometimes you need to share non-sensitive, public data with someone else. For this we have set up a directory ''/scratch/public''. The data in this directory can be read by all users on the cluster. +Sometimes you need to share non-sensitive, public data with someone else. For this we have set up a directory ''/scratch/public/tmp''. The data in this directory can be read by all users on the cluster. Since we have allocated limited space to this directory a cleanup script will remove data after 30 days.  
 + 
 +When you need to share data for a longer period, please let us know. We can then create a persistent directory in ''/scratch/public''. This can either be based on a group, where the data is managed by multiple people, or for a single person. You can request this at [[hpc@rug.nl]], where you need to answer the same questions as for a regular group directory, or tell us that you'll manage the data yourself. 
 + 
 + 
 +===== Software directory ===== 
 + 
 +Since ''/scratch'' is optimized for large files, storing software (which normally consists of many small files) on ''/scratch'' is not recommended. For large or shared software installations an NFS-based file system share has been set up, which is available as ''/userapps''. Since we assume that most software installations can be recreated from external downloads (e.g. Python virtual environments), we do not make a backup of ''/userapps''. 
 + 
 +Please contact [[hpc@rug.nl]] if you need additional space on ''/userapps'' for your installations, or when you need to share your software stack with multiple users. For the latter you should answer the questions for a group directory above. 
 + 
 + 
 +===== File system permissions ===== 
 + 
 +==== General description ==== 
 + 
 +In order to be able to fix issues with the file system permissions, one first needs to understand how these work and how they were set up initially.  
 +A more thorough explanation of managing Linux file permissions can be found at https://kb.iu.edu/d/abdb. On this page we will just focus on understanding and repairing broken file system permissions for a group directory. 
 + 
 +In POSIX-based file systems, such as those used on Linux, files and directories have an owner and a group. Note that only the owner of a file can change its permissions and fix issues with them. Each file and directory has three sets of permissions: one for the owner, one for the group and one for everybody else. These can be listed using ''ls -l'', e.g.: 
 + 
 +<code> 
 +-rw-rw-r--. 1 user1 hb-public-courses 168024976 Sep  5  2023 dataset.tar.gz 
 +-rw-rw-r--. 1 user2 hb-public-courses       930 May  2  2023 ex1_mandelbrot.R 
 +drwxrwsr-x. 2 user2 hb-public-courses 4096 Oct 17 10:27 inputfiles 
 +-rw-rw-r--. 1 user1 hb-public-courses      3712 Oct 10 10:50 train.py 
 +</code> 
 + 
 +In the example some files are owned by ''user1'', while another file and a directory are owned by ''user2''. The group for all files and the directory is ''hb-public-courses''.  
 + 
 +The permissions are shown as a string like ''-rw-rw-r--''. The first character is ''-'' for a regular file and ''d'' for a directory, such as ''inputfiles''. It is followed by three sets of three characters, each of which can contain ''r'', ''w'' and ''x''. The ''r'' denotes read permission and the ''w'' denotes write permission. The ''x'' denotes execute permission, which for a file means the ability to run the program or script, and for a directory means the ability to enter it using ''cd''. The ''s'' shown for ''inputfiles'' is the sgid bit, explained below. 
 + 
 +The first set of ''rwx'' is for the owner (''user1'' or ''user2'' in the example), the second set is for the group ''hb-public-courses'', the third set is for anybody else on the system.  
 + 
 +Note that all top level directories for private and group directories on the cluster are set to be unreadable and unwritable by anybody except the group, which means that the files and directories inside can only be accessed by group members. This holds even though the files inside may have read, write or execute permissions for "others". The group can be your private group (based on p- or s-number), containing a single user, or a shared group in the case of shared directories.  
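
As a minimal sketch of how to read these fields for a single file, the GNU ''stat'' command can print them separately. The temporary directory and the file name ''example.txt'' below are only stand-ins for real data, and the format string assumes the GNU version of ''stat'' as found on Linux.

```shell
# Create a throwaway file in a temporary directory (a stand-in for real data).
demo_dir=$(mktemp -d)
touch "$demo_dir/example.txt"
chmod 664 "$demo_dir/example.txt"

# %A = permission string, %a = octal mode, %U = owner, %G = group, %n = name
stat -c '%A %a %U %G %n' "$demo_dir/example.txt"

# Keep only the permission string, e.g. for use in a script.
perms=$(stat -c '%A' "$demo_dir/example.txt")
echo "permission string: $perms"

rm -r "$demo_dir"
```

The permission string printed here is the same one that ''ls -l'' shows in its first column.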
 + 
 + 
 +==== The sgid bit for group directories ==== 
 + 
 +In normal situations the group attached to a newly created file or directory will be the primary group of the person writing the file. On the cluster this will be the p- or s-number group. This means that other group members that have access to the directory may not be able to read, modify or delete the file. 
 + 
 +To make sure files are owned by the shared group instead, we set the sgid bit on the group directories. This is listed as a lowercase ''s'' in the directory permissions. This lowercase ''s'' includes the ''x'' for the group execute permission, which means that group members can enter the directory. An uppercase ''S'' will be shown when the execute bit is not set for the group. The effect of this sgid setting is that new files created in the directory will be owned by the group that owns the directory.  
 + 
 +Please be aware that files can also have the ''s'' bit set, but this should be avoided: on an executable file it means that the program will run with the privileges of the group that owns the file. 
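
The mechanics of the sgid bit can be tried out safely in a temporary directory. This is only a sketch: the directory names ''shared'' and ''sub'' are made up, and in a temporary directory the group is normally your own primary group anyway, so the example mainly shows how the bit appears in the permission string and that new subdirectories inherit it.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/shared"

# Set the sgid bit on the directory and show the resulting permissions.
chmod g+s "$demo_dir/shared"
ls -ld "$demo_dir/shared"

# A file created inside gets the group of the directory.
touch "$demo_dir/shared/newfile"
dir_group=$(stat -c '%G' "$demo_dir/shared")
file_group=$(stat -c '%G' "$demo_dir/shared/newfile")
echo "directory group: $dir_group, new file group: $file_group"

# A new subdirectory inherits the sgid bit itself (s or S in the group set).
mkdir "$demo_dir/shared/sub"
sub_perms=$(stat -c '%A' "$demo_dir/shared/sub")
echo "subdirectory permissions: $sub_perms"

rm -r "$demo_dir"
```

In a real group directory the directory group would be the shared ''hb-'' group rather than your private group.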
 + 
 + 
 +===== Preserving group directory permissions ===== 
 + 
 +The group directories are set up in such a way that files will be readable and writable by the appropriate group(s). Through the sgid bit newly created files will get the right permissions. 
 + 
 +Archiving and copying tools may, however, override these default permissions, making data unreadable or unwritable for the other group members. This is because archiving and copying tools (''rsync'', ''cp'' and ''tar'') will try to keep permissions, group ownership and file attributes as they were in the source data. This can override the default settings that were made when setting up the group directory structure. 
 + 
 +In order to deal with these issues, the person copying the data must take some precautions. Furthermore, you may still need to fix the permissions later on, if the original source data has different permissions. First some hints to prevent files from being owned by the wrong group, or having the wrong permissions. 
 + 
 +=== cp === 
 + 
 +For ''cp'' the commonly used options ''-a'' or ''-p'' will preserve too many file attributes, including group ownership and permissions, overriding the default settings for the group.  
 + 
 +Normally the only important attributes to keep when copying data are the original timestamps. This can be achieved using the option ''%%--%%preserve=timestamps'' instead of ''-a'' or ''-p''.  
 + 
 +Furthermore the permissions of the files in the destination should be those for the group directories, and not those in the source. This can be achieved using the ''%%--%%no-preserve=mode'' option. 
 + 
 +So copying a directory of data, preserving the time stamps, could be done like: 
 +<code> 
 +cp -r --preserve=timestamps --no-preserve=mode $HOME/mydata /scratch/hb-groupdir 
 +</code> 
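
The effect of these options can be checked on a small test copy. The sketch below uses a temporary directory with made-up names (''src'', ''groupdir'', ''file.txt'') instead of a real group directory, and assumes GNU ''cp'', ''touch'' and ''stat''.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/src" "$demo_dir/groupdir"

# A private source file with an old timestamp.
echo "data" > "$demo_dir/src/file.txt"
chmod 600 "$demo_dir/src/file.txt"
touch -d '2023-01-01 12:00:00' "$demo_dir/src/file.txt"

# Copy while keeping the timestamps but not the restrictive source mode.
cp -r --preserve=timestamps --no-preserve=mode "$demo_dir/src" "$demo_dir/groupdir"

# Compare modification times (seconds since the epoch) of source and copy.
src_mtime=$(stat -c '%Y' "$demo_dir/src/file.txt")
dst_mtime=$(stat -c '%Y' "$demo_dir/groupdir/src/file.txt")
stat -c '%A %y %n' "$demo_dir/src/file.txt" "$demo_dir/groupdir/src/file.txt"

rm -r "$demo_dir"
```

The mode of the copy is derived from your umask rather than from the source file, so depending on the umask a later ''chmod -R g+rwX'' may still be needed.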
 + 
 + 
 +=== rsync === 
 + 
 +The commonly used ''-a'' (archive) option for rsync implies the options ''-rlptgoD''. This will preserve too many things, including group ownership and permissions, breaking access to the files by other group members. 
 + 
 +Leaving out ''pgo'' should make sure the default permissions for the group directory are used on the copied files. Since ''D'' is a special flag for device files we can leave it out for user data as well. 
 + 
 +Furthermore we can tell ''rsync'' to create files as if the source has read, write and execute permissions for the user and group by adding ''%%--%%chmod=ug=rwX''. The uppercase ''X'' indicates that the ''x'' bit should only be set on directories and on files that already have an execute bit set. 
 + 
 +So the full example for ''rsync'' will look like: 
 +<code> 
 +rsync -rltv --chmod=ug=rwX $HOME/mydata /scratch/hb-groupdir 
 +</code> 
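
Before copying a large data set it can help to preview the transfer. The following sketch assumes ''rsync'' is installed and uses its ''-n'' (dry run) flag on a temporary directory with made-up names; nothing is written during the preview.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/src" "$demo_dir/groupdir"
echo "data" > "$demo_dir/src/file.txt"
chmod 600 "$demo_dir/src/file.txt"

# Preview: -n lists what would be copied without changing the destination.
rsync -rltvn --chmod=ug=rwX "$demo_dir/src" "$demo_dir/groupdir"

# The real copy, with the same options minus -n.
rsync -rlt --chmod=ug=rwX "$demo_dir/src" "$demo_dir/groupdir"

# Inspect the permissions of the copied file.
dst_perms=$(stat -c '%A' "$demo_dir/groupdir/src/file.txt")
echo "copied file permissions: $dst_perms"

rm -r "$demo_dir"
```

Because ''-p'' is left out, the permissions of newly created files are still limited by your umask, so the copy may lack group write access and need a ''chmod'' afterwards.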
 + 
 + 
 +=== tar === 
 + 
 +In our testing tar created files with the right group ownership. You may still need to fix the group read and write permissions after extraction. See the instructions below for details. 
 + 
 + 
 +==== Fixing file and directory permissions ==== 
 + 
 +When permissions in a group directory are wrong, the person owning the files can fix these using the ''chmod'' command. You can use the output of ''ls -l'' to find the owner of the file. First we need to fix the read/write/execute permissions. This can be done for a single file or directory using: 
 +<code> 
 +chmod g+rwX file_or_directory 
 +</code> 
 +The ''g'' denotes that we want to change the group permissions (''u'' for user and ''o'' for others are the alternatives). The ''+'' denotes that we want to add permissions. The ''rwX'' will set read, write and execute permissions. Using capital ''X'' makes sure the ''x'' is only set on directories and on files that already have an execute bit set. 
 + 
 +If you want to change the permissions for a directory, including all files and subdirectories inside, you can add the ''-R'' flag to make the command recursive: 
 +<code> 
 +chmod -R g+rwX directory_name 
 +</code> 
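
The difference between ''x'' and ''X'' is easiest to see on a small test tree. This sketch uses a temporary directory with made-up file names: ''run.sh'' stands for an executable script and ''notes.txt'' for an ordinary data file.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/data/sub"
touch "$demo_dir/data/run.sh" "$demo_dir/data/sub/notes.txt"

# Start from private permissions: only the owner has access.
chmod 700 "$demo_dir/data" "$demo_dir/data/sub" "$demo_dir/data/run.sh"
chmod 600 "$demo_dir/data/sub/notes.txt"

# Give the group read and write access everywhere, and execute access
# only on directories and on files that are already executable.
chmod -R g+rwX "$demo_dir/data"

stat -c '%A %n' "$demo_dir/data" "$demo_dir/data/sub" \
    "$demo_dir/data/run.sh" "$demo_dir/data/sub/notes.txt"

script_perms=$(stat -c '%A' "$demo_dir/data/run.sh")
notes_perms=$(stat -c '%A' "$demo_dir/data/sub/notes.txt")

rm -r "$demo_dir"
```

The script ends up as ''-rwxrwx---'' while the data file becomes ''-rw-rw----'', so ordinary files do not accidentally become executable.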
 + 
 +To prevent new files from being owned by the private group of the creator the sgid bit must be set on directories. This can be done using: 
 +<code> 
 +chmod g+s directory_name 
 +</code> 
 + 
 +Since this sgid bit should not be used on files, we cannot use the ''-R'' option. If many directories must be fixed, we can automate this using the ''find'' tool, e.g.: 
 +<code> 
 +find . -type d -exec chmod g+s {} \; 
 +</code> 
 +This will find all files of type ''d'' (directories) and run ''chmod g+s'' on these, where each directory name found is represented by ''{}''. The ''chmod'' command is terminated by the '';'' character. This character needs to be escaped using ''\'' to make clear to the shell that it belongs to the find command, instead of being the shell command separator. 
 + 
 +Finally, read and execute access for "others" can be granted using: 
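
To see first which directories actually need fixing, ''find'' can also select on the permission bits. The sketch below assumes GNU ''find'' and uses the octal value ''2000'' for the sgid bit; the directory names ''group'', ''a'' and ''b'' are made up for the demonstration.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/group/a" "$demo_dir/group/b"
chmod g+s "$demo_dir/group" "$demo_dir/group/a"
chmod g-s "$demo_dir/group/b"

# List directories that do NOT have the sgid bit (octal 2000) set.
missing=$(find "$demo_dir/group" -type d ! -perm -2000)
echo "directories without sgid: $missing"

# Fix only those directories.
find "$demo_dir/group" -type d ! -perm -2000 -exec chmod g+s {} \;

# Count what is left; this should now be zero.
remaining=$(find "$demo_dir/group" -type d ! -perm -2000 | wc -l)
echo "directories still without sgid: $remaining"

# List regular files that wrongly carry the sgid bit (none in this demo).
find "$demo_dir/group" -type f -perm -2000

rm -r "$demo_dir"
```

Running the listing commands without ''-exec'' first is a cautious way to check what a fix would touch.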
 +<code> 
 +chmod o+rX file_or_directory 
 +</code> 
 +This is of course only required in case a "readonly" group has been defined. This setup is described in the section below on access control lists. 
 + 
 + 
 +==== File system access control lists ==== 
 + 
 +The permission system described above can only handle a single owner and a single group. If multiple groups need access to data, file system access control lists (ACLs) must be used. These give an additional set of controls on the access rights of files and directories.  
 + 
 +Setting the correct rights on the top level group directory, using an ACL for the read-only group, is sufficient to prevent the other cluster users from accessing the files and directories inside. Because the ACL system is quite complex, it is better to manage the rights for the other read-only group using the standard permissions for "other" users. This prevents data managers from having to understand the complex ACL system. 
 + 
 +Since it is important to be able to check the rights on the group folder, the use of ''getfacl'' is explained below. 
 + 
 +=== Retrieving the current access control list === 
 + 
 +Files and directories that have an ACL applied will show an additional ''+'' at the end of the permissions overview. E.g.: 
 +<code> 
 +drwxrws---+ 7 root hb-acl_testing_rw 20480 Feb 28 08:56 hb-acl_testing_rw/ 
 +</code> 
 + 
 +The current set of ACLs can be obtained using the ''getfacl'' command, e.g.: 
 +<code> 
 +$ getfacl hb-acl_testing_rw 
 +# file: hb-acl_testing_rw/ 
 +# owner: root 
 +# group: hb-acl_testing_rw 
 +# flags: -s- 
 +user::rwx 
 +group::rwx 
 +group:hb-acl_testing_ro:r-x 
 +mask::rwx 
 +other::--- 
 +</code> 
 +This example shows that the directory is owned by the main system user ''root''. This means that regular users cannot change the access rights to this directory. The directory is owned by the group ''hb-acl_testing_rw'', where the group has full read, write and execute permissions. The sgid bit is also set to make sure that the files and directories created inside will be owned by the group. 
 + 
 +The ACL list shows that another group ''hb-acl_testing_ro'' has read and execute permissions on the directory.  
 + 
 +These settings mean that users from both the groups ''hb-acl_testing_rw'' and ''hb-acl_testing_ro'' can enter the directory (the execute permission) and read its contents (the read permission). Only members of the group ''hb-acl_testing_rw'' have write permissions in the directory and can therefore add and remove files and directories inside it. 
 + 
 +All other users on the cluster will not be able to access the data inside the folder at all.
  
-Since we have allocated limited space to this directory a cleanup script will remove data after 30 days. Please let us know if you need to share data for a longer period. We can then create a group directory or move the data to a more permanent public location.
 +Inside the group folder the regular permission bits for "other" cluster users can be used to grant the ''hb-acl_testing_ro'' members access to the files inside. By default this will be read-only access, but care must be taken by the data managers that the access rights are correct, when adding new data.
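
When access to a group folder does not work as expected, a first check is whether the account is actually a member of the read-write or read-only group. A minimal sketch, where ''hb-acl_testing_ro'' is just the example group name used above:

```shell
# List the groups of the current account, one per line.
my_groups=$(id -nG | tr ' ' '\n')
echo "$my_groups"

# Check membership of one specific group (example name, adjust as needed).
group_name="hb-acl_testing_ro"
if echo "$my_groups" | grep -qxF -- "$group_name"; then
    echo "member of $group_name"
else
    echo "not a member of $group_name"
fi
```

Note that a newly granted group membership normally only becomes active after logging in again.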