rdms:bestpractices [2025/02/03 13:12] – Adjusted menu numbering giulio | rdms:bestpractices [2025/07/02 12:08] (current) – [Bundling of Data Sets] removed space jelte |
---|
===== Bundling of Data Sets ===== | ===== Bundling of Data Sets ===== |
| |
To improve the performance of the RDMS, it is recommended to store data sets in numerous small files in a structured format like ''*.tar'', ''*.tar.gz'', ''*.tar.bz'', or ''*.zip''. This significantly improves transfer rates as the system engages in multi-threaded transfers after reaching a minimal file size threshold (32 MB). Transferring multiple smaller files furthermore results in big overhead, diminishing performance. | To improve the performance of the RDMS, it is recommended to store data sets in a bundled format like ''*.tar'', ''*.tar.gz'', ''*.zip'', or similar (see below for more information about data compression) instead of as individual files and folders. This significantly improves transfer rates, because the system switches to multi-threaded transfers once a file reaches a minimum size threshold (32 MB). Transferring many small files, by contrast, incurs considerable overhead, which reduces performance. |
| |
Best practices to handle such cases are: | Best practices to handle such cases are: |
For RDMS web interface users, the "Uncompress tar" function, accessible via right-click on a ''*.tar'' file, enables extraction. Currently, this function supports only ''*.tar'' formats. | For RDMS web interface users, the "Uncompress tar" function, accessible via right-click on a ''*.tar'' file, enables extraction. Currently, this function supports only ''*.tar'' formats. |
| |
| **Note:** The ''ibun'' command does not support symlinks. It is therefore recommended to dereference symlinks upon local creation of the archives. For the ''tar'' command, this can be achieved via the additional ''-h'' flag. |
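The note above can be sketched in Python for users who prefer scripting over the ''tar'' command line. This is a minimal illustration, not an RDMS tool: the function name ''bundle_directory'' and the paths in the usage comment are hypothetical. It relies on the ''dereference'' option of Python's standard ''tarfile'' module, which stores the content a symlink points to instead of the link itself, similar in effect to ''tar -h''.

```python
import os
import tarfile


def bundle_directory(source_dir, archive_path):
    """Bundle source_dir into an uncompressed tar archive, following symlinks.

    Hypothetical helper for illustration only.
    """
    # Store the directory under its own name instead of its full local path.
    top_level = os.path.basename(os.path.normpath(source_dir))
    # dereference=True archives the target of each symlink as a regular
    # file or folder, so the resulting archive contains no symlinks.
    with tarfile.open(archive_path, mode="w", dereference=True) as archive:
        archive.add(source_dir, arcname=top_level)


# Example call (hypothetical paths):
# bundle_directory("/data/experiment_01", "/data/experiment_01.tar")
```

The archive should be written to a location outside the folder being bundled, so that it does not end up inside itself.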
| |
| ==== Choosing a Data Compression Format ==== |
| |
| While bundling data without extra compression (''*.tar'') already helps considerably to increase the performance of data transfers, additional compression is often useful because it can reduce the data size substantially. Several compression formats are available, for example: |
| * ''*.tar.gz'' |
| * ''*.tar.bz2'' |
| * ''*.tar.xz'' |
| * ''*.tar.zst'' |
| * ''*.zip'' |
| * ''*.7z'' |
| |
| In our experience, ''*.tar.zst'', which uses [[https://en.wikipedia.org/wiki/Zstd|Zstandard compression]], offers a very good compromise between compression ratio and compression time. |
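Which format works best depends on the data, so it can help to try a few on a representative subset before bundling everything. The following is a rough sketch using only Python's standard ''tarfile'' module; the function name ''compare_formats'' is hypothetical. Zstandard is left out here because standard-library support for it depends on the Python version, and timing is not measured.

```python
import os
import tarfile

# Archive suffix -> tarfile write mode (standard-library compressors only).
MODES = {
    ".tar": "w",
    ".tar.gz": "w:gz",
    ".tar.bz2": "w:bz2",
    ".tar.xz": "w:xz",
}


def compare_formats(source_dir, out_dir):
    """Bundle source_dir once per format and return {suffix: size in bytes}.

    Hypothetical helper for illustration only; out_dir should lie outside
    source_dir so the archives are not bundled into each other.
    """
    name = os.path.basename(os.path.normpath(source_dir))
    sizes = {}
    for suffix, mode in MODES.items():
        archive_path = os.path.join(out_dir, name + suffix)
        with tarfile.open(archive_path, mode=mode) as archive:
            archive.add(source_dir, arcname=name)
        sizes[suffix] = os.path.getsize(archive_path)
    return sizes
```

The returned sizes only show how well each format compresses the sample; compression time has to be weighed separately.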
| |
| **Notes:** |
| * Not all compression types can be extracted via ''ibun'' on the RDMS side. Of the formats listed above, ''*.7z'' does not work. In this case, the file needs to be downloaded before it can be extracted. |
| * In general, for archived data sets, it is recommended not to extract them on the RDMS, but rather to keep them in their bundled (and compressed) format for long-term storage. |
| * In certain cases, it makes sense not to bundle the whole data set into one package, but rather into suitable sub-packages, for example when the data set consists of well-defined subsets. |
| * Also note that for bundled and compressed formats, it is not easy to see the content of the archives directly (exception: the content of ''*.tar'' files can be previewed in the [[rdms:webapp:databrowser|data browser of the web interface]]). If the bundled, and potentially compressed, data set is still large, it is recommended to **create a list of the files/folders in the archive locally before bundling and then upload this list with the bundled data set.** The text file, which is much smaller than the data set, can then be downloaded first and used to check whether the respective data set contains the data being searched for. How these lists of files/folders are created depends on your system. Linux users can, for example, use the ''find'' or ''tree'' commands, while Windows users can achieve similar results via the ''dir'' command (Windows command prompt) or ''Get-ChildItem'' (Windows PowerShell). |
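As a platform-independent alternative to the commands named in the last note, the file list can also be produced with a short Python script. This is a minimal sketch under stated assumptions: the function name ''write_manifest'' and the one-path-per-line layout are illustrative choices, not an RDMS convention.

```python
import os


def write_manifest(source_dir, manifest_path):
    """Write one relative path per line for every folder and file below source_dir.

    Hypothetical helper for illustration only. Folders are marked with a
    trailing path separator. Returns the sorted list of entries.
    """
    entries = []
    for root, dirs, files in os.walk(source_dir):
        for name in dirs:
            rel = os.path.relpath(os.path.join(root, name), source_dir)
            entries.append(rel + os.sep)
        for name in files:
            entries.append(os.path.relpath(os.path.join(root, name), source_dir))
    entries.sort()
    with open(manifest_path, "w", encoding="utf-8") as manifest:
        manifest.write("\n".join(entries) + "\n")
    return entries


# Example call (hypothetical paths):
# write_manifest("/data/experiment_01", "/data/experiment_01_contents.txt")
```

Writing the list to a path outside the data set folder keeps the list itself out of the listing; it can then be uploaded next to the archive.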
| |
| Please contact [[rdms-support@rug.nl|rdms-support@rug.nl]] if you are not sure how to bundle/compress your data sets for long-term storage. |
===== Locked Files (HIERARCHY_ERROR) ===== | ===== Locked Files (HIERARCHY_ERROR) ===== |
| |