Differences
This shows you the differences between two versions of the page.
rdms:data:integrity [2025/03/14 13:23] – [Checksums in the RDMS] added this section with general info about checksums in RDMS jelte | rdms:data:integrity [2025/03/14 14:57] (current) – [Via Command-Line Interface] Adjusted script jelte | ||
---|---|---|---|
Line 21: | Line 21: | ||
==== Checksums in the RDMS ==== | ==== Checksums in the RDMS ==== | ||
- | One of the unique features of the RDMS is that it is not a simple storage solution, but also that it has a database running in the background that can be used to annotate data with user-defined [[rdms: | + | One of the unique features of the RDMS is that it is not a simple storage solution, but also that it has a database running in the background |
In the case of the RDMS, we also store a checksum for every file that is stored in the RDMS. This is by default done automatically upon data ingestion via <color # | In the case of the RDMS, we also store a checksum for every file that is stored in the RDMS. This is by default done automatically upon data ingestion via <color # | ||
Line 31: | Line 31: | ||
**Note**: If you use Windows, either via native [[rdms: | **Note**: If you use Windows, either via native [[rdms: | ||
==== Data Replica Status Explained ==== | ==== Data Replica Status Explained ==== | ||
- | ==== Via the Web Interface ==== | + | |
+ | Every file (not folder) in the RDMS also has a replica status associated with it. This replica status is assigned automatically when the data enters the system. The replica status definitions come from iRODS, the data management system that is the backbone of the RDMS. As of now, iRODS knows 4 different replica statuses of which three are used: | ||
+ | |||
+ | ^ Numeric Value ^ Symbolic Value ^ Name ^ Definition ^ | ||
+ | | 0 | '' | ||
+ | | 1 | ''&'' | ||
+ | | 2 | ''?'' | ||
+ | | 3 | ''?'' | ||
+ | | 4 | ''?'' | ||
+ | |||
+ | As seen in the above table, the replica status to aim for is the " | ||
+ | |||
+ | If you check your file statuses often, it is not uncommon to see files in the intermediate state. This usually happens while the data that you are checking is still being transferred. | ||
+ | |||
+ | If, on the other hand, your data is not actively being written to and you see a file in either an intermediate or a locked state, this usually indicates a form of data corruption, or that the data did not arrive and was not registered properly in the system. | ||
+ | |||
+ | This usually happens if your data transfer fails during active transfer, for example when the client or connection crashed. If you experience this, the data **can' | ||
+ | |||
+ | For a stale replica, the situation can indicate corrupted data, but does not have to. In these cases, it is best to compare the registered checksums with the checksums of the original files if they still exist on the source. If they do not exist anymore, you can also download the files and check locally whether they have the expected content. If unsure, please get in contact with [[rdms-support@rug.nl|rdms-support@rug.nl]] and we will look at your specific case. | ||
+ | |||
+ | **Note:** While not directly related to the replica status information, | ||
+ | |||
==== Via Command-Line Interface ==== | ==== Via Command-Line Interface ==== | ||
+ | The most convenient way to check the status and integrity of your data in the RDMS is via the [[rdms: | ||
+ | |||
+ | === Checking Integrity during Data Ingestion === | ||
+ | The commands that are used for uploading data to the RDMS, namely '' | ||
+ | |||
+ | < | ||
+ | -K verify checksum - calculate and verify the checksum on the data, both | ||
+ | | ||
+ | </ | ||
+ | |||
+ | This computes the checksums for your data locally as well as on the RDMS side. In the process, the iCommands verify the checksums for you and also store them directly in the iCAT catalog/ | ||
+ | |||
+ | **Note**: Even without using the '' | ||
+ | |||
+ | === Checking Integrity after Data Ingestion === | ||
+ | |||
+ | For data that is already in the RDMS, there are different ways to check the integrity of the data. | ||
+ | |||
+ | **Comparing Checksums manually** | ||
+ | |||
+ | First, to see the status of a replica (file) in general, the '' | ||
+ | < | ||
+ | $ ils -L test.json | ||
+ | j.p.nimoth@r | ||
+ | sha2: | ||
+ | </ | ||
+ | |||
+ | In the above example of the '' | ||
+ | |||
+ | The good replica status already gives a good indication that the data is not corrupted. To be sure, you can also compute the checksum of the data locally and then compare it to the one that you see in the RDMS. | ||
+ | |||
+ | **For Linux operating system**, checksums can be calculated locally via: | ||
+ | < | ||
+ | sha256sum < | ||
+ | </ | ||
+ | |||
+ | which would compute for the shown example file: | ||
+ | |||
+ | < | ||
+ | $ sha256sum test.json | awk ' | ||
+ | p4K6fv/ | ||
+ | </ | ||
+ | |||
+ | As can be seen, both checksums, the one registered in the RDMS as well as the one computed for the same file locally, are the same. Therefore, it can be guaranteed that the file in the RDMS is the same as the one that was uploaded to it. | ||
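The manual comparison described above can also be scripted in a way that works on Linux, Mac, and Windows alike. The following is a minimal Python sketch, not an official RDMS tool. It assumes that the registered checksum is the base64-encoded SHA-256 digest shown after the `sha2:` prefix in the `ils -L` output. The function names `rdms_style_checksum` and `matches_registered` are hypothetical and only used for illustration.

```python
import base64
import hashlib


def rdms_style_checksum(path, chunk_size=1024 * 1024):
    """Return the base64-encoded SHA-256 digest of a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_registered(path, registered):
    """Compare a local file with a checksum string copied from `ils -L`."""
    value = registered.strip()
    # Assumption: the copied value may still carry the "sha2:" prefix.
    if value.startswith("sha2:"):
        value = value[len("sha2:"):]
    return rdms_style_checksum(path) == value
```

To use it, copy the checksum value from the `ils -L` output and pass it together with the path of the local file to `matches_registered`. A return value of `True` means that both checksums are identical.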
+ | |||
+ | As a further tip, it is also possible to adjust the command a little so that it does not just calculate the checksum for a single file, but for all files in a folder. An example command to do so (assuming Bash shell): | ||
+ | |||
+ | < | ||
+ | for file in / | ||
+ | if [ -f " | ||
+ | checksum=$(sha256sum "$file" | awk ' | ||
+ | echo "File: $(basename " | ||
+ | fi | ||
+ | done | ||
+ | </ | ||
+ | |||
+ | which will iterate over all files in the specified local folder and display each file name together with its computed checksum. This info can then be compared with the data in the RDMS. | ||
+ | |||
+ | |||
+ | **For Mac**, the respective commands have to be slightly adjusted. To compute a checksum on Mac in the format that is stored in the RDMS for a single file, you can use: | ||
+ | < | ||
+ | shasum -a 256 < | ||
+ | </ | ||
+ | |||
+ | Or for all files in a certain folder: | ||
+ | < | ||
+ | for file in / | ||
+ | if [ -f " | ||
+ | checksum=$(shasum -a 256 " | ||
+ | echo "File: $(basename " | ||
+ | fi | ||
+ | done | ||
+ | </ | ||
+ | |||
+ | **For Windows operating system**, checksums can be calculated locally via PowerShell. | ||
+ | |||
+ | To compute the base64-encoded checksum that is used in the RDMS via PowerShell for a single file, you can use: | ||
+ | |||
+ | < | ||
+ | [System.Convert]:: | ||
+ | </ | ||
+ | |||
+ | Or to compute directly for all files in a certain directory: | ||
+ | < | ||
+ | Get-ChildItem -Path " | ||
+ | $file = $_.FullName | ||
+ | $checksum = [System.Convert]:: | ||
+ | [PSCustomObject]@{ | ||
+ | FileName = $_.Name | ||
+ | Checksum | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | **Note: | ||
+ | * The '' | ||
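As an alternative to maintaining separate folder loops per operating system, the same listing can be sketched once in Python. This is an illustrative sketch under the assumption that the RDMS checksum is the base64-encoded SHA-256 digest of the file content. The function name `folder_checksums` is hypothetical. Like the loops above, it only looks at files directly inside the given folder and skips subfolders.

```python
import base64
import hashlib
from pathlib import Path


def folder_checksums(folder, chunk_size=1024 * 1024):
    """Map each regular file directly inside `folder` to its base64 SHA-256."""
    result = {}
    for entry in sorted(Path(folder).iterdir()):
        if not entry.is_file():
            continue  # skip subfolders and other non-file entries
        digest = hashlib.sha256()
        with entry.open("rb") as handle:
            # Chunked reading keeps memory use low for large files.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        result[entry.name] = base64.b64encode(digest.digest()).decode("ascii")
    return result
```

Calling `folder_checksums` with the path of a local folder returns a dictionary of file names and checksums, which can be printed or compared with the values registered in the RDMS.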
+ | |||
+ | **For finding non-good replicas** it is best to use the '' | ||
+ | |||
+ | < | ||
+ | $ iquest " | ||
+ | </ | ||
+ | |||
+ | which will check the location ''/ | ||
+ | |||
+ | |||
+ | ==== Via the Web Interface ==== | ||
+ | |||
+ | The RDMS web interface can also be used to display the data checksum and the replica status, so that the integrity of the data can be confirmed. | ||
+ | This info is visible via the object view which can be opened via the '' | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | In the object view, the relevant information is shown and can be used to check if the data is good. | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | **Notes**: | ||
+ | * While the RDMS web interface does show the checksum, you will still need to compute this value locally if you want to compare it. For that, please see the section above that describes how to do this on different operating systems. | ||
+ | * As of now, the [[rdms: |