Differences

This shows you the differences between two versions of the page.

rdms:data:integrity [2025/03/14 13:23] – [Checksums in the RDMS] added this section with general info about checksums in RDMS jelte
rdms:data:integrity [2025/03/14 14:57] (current) – [Via Command-Line Interface] Adjusted script jelte
Line 21: Line 21:
 ==== Checksums in the RDMS ====
  
-One of the unique features of the RDMS is that it is not a simple storage solution, but also that it has a database running in the background that can be used to annotate data with user-defined [[rdms:metadata:|metadata]], but which also is used to store other information about the data in the system.
+One of the unique features of the RDMS is that it is not just a simple storage solution: it also has a database running in the background (the iCAT catalog). This database can be used to annotate data with user-defined [[rdms:metadata:|metadata]], and it also stores other information about the data in the system.
  
 In the case of the RDMS, we also store a checksum for every file that is stored in the RDMS. This is by default done automatically upon data ingestion via <color #ed1c24>delayed rules/processes</color>, but the calculation of checksums can also be enforced manually when using the [[rdms:access:linux:icommands|iCommands]].
Line 31: Line 31:
 **Note**: If you use Windows, either via native [[rdms:access:windows:|WebDAV in MS File Explorer]], [[rdms:access:windows:cyberduck|Cyberduck]], or [[rdms:access:windows:winscp|WinSCP]], the information about data checksums is not available. The same also applies for Mac users that use [[rdms:access:mac:cyberduck|Cyberduck]] or [[rdms:access:mac:finder|Finder]].
 ==== Data Replica Status Explained ====
-==== Via the Web Interface ====+ 
 +Every file (not folder) in the RDMS also has a replica status associated with it. This replica status is assigned automatically when the data enters the system. The replica status definitions come from iRODS, the data management system that is the backbone of the RDMS. As of now, iRODS knows five different replica statuses, of which four are in use:
 + 
 +^ Numeric Value     ^ Symbolic Value      ^ Name  ^ Definition ^ 
 +| 0    | ''X'' | stale        | A stale replica is no longer marked as good. This usually results from another replica of the same file having been written to more recently. A stale file does not necessarily mean that the data is corrupted. |
 +| 1    | ''&'' | good         | A good replica is the desired state. It marks that the data as well as the associated metadata were correctly written to the system and are understood to have arrived in a good state. |
 +| 2    | ''?'' | intermediate | An intermediate replica state usually occurs during an active data transfer. In this state, the replica (file) cannot be opened, written to, or renamed. |
 +| 3    | ''?'' | read-locked  | Currently unused state |
 +| 4    | ''?'' | write-locked | A write-locked replica is currently locked as well (no read/write/rename) |
 + 
 +As seen in the above table, the replica status to aim for is the "good" replica status.  
 + 
 +If you check your file statuses often, it is not uncommon to see files in the intermediate state. This usually happens when you are currently transferring the data that you are checking.
 + 
 +If, on the other hand, your data is not being actively written to by you and you see a file in either an intermediate or a locked state, this usually indicates a form of data corruption, or that the data did not arrive or was not registered properly in the system.
 + 
 +This usually happens if your data transfer fails while it is in progress, for example when the client or the connection crashes. If you experience this, the data **can't be used** and the transfer should be re-attempted. As replicas in this state often also cause errors upon renewed transfers, please have a look at our [[rdms:bestpractices|Best Practices wiki section]], which explains under //Locked Files (HIERARCHY_ERROR)// how you can recover from these cases.
 + 
 +A stale replica can indicate corrupted data, but does not have to. In these cases, it is best to compare the registered checksums with the checksums of the original files if they still exist on the source. If they do not exist anymore, you can also check that the files are okay by downloading them and verifying locally that they have the expected content. If unsure, please get in contact with [[rdms-support@rug.nl|rdms-support@rug.nl]] and we will look at your specific case.
 + 
 +**Note:** While not directly related to the replica status, the size of a file as shown in the RDMS can also indicate whether the data is good or corrupted. If you see replicas (files) with a size of 0 bytes, this can be an indication that the data is not good!
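Such empty files could be located with the ''iquest'' command from the iCommands. The following is only a minimal sketch: the function name ''find_empty_replicas'' is hypothetical, the path is a placeholder, and it assumes that the iCommands are installed and that you are logged in.

```shell
# Hypothetical helper: list all data objects with a size of 0 bytes
# in a collection and everything below it.
# Assumption: the iCommands are installed and a session exists (iinit).
find_empty_replicas() {
  coll=$1   # placeholder, e.g. /rug/home/path/to/folder
  iquest "%s/%s" "SELECT COLL_NAME, DATA_NAME WHERE COLL_NAME LIKE '${coll}%' AND DATA_SIZE = '0'"
}
```

A call such as ''find_empty_replicas /rug/home/path/to/folder'' would print one full RDMS path per empty file. Files that are legitimately empty on the source are listed as well, so the result should be read as a list of candidates to inspect, not as proof of corruption.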
 + 
 ==== Via Command-Line Interface ====
 +The most convenient way to check the status and integrity of your data in the RDMS is via the [[rdms:access:linux:icommands|iCommands]] command-line tool. 
 +
 +=== Checking Integrity during Data Ingestion ===
 +The commands that are used for uploading data to the RDMS, namely ''iput'' and ''irsync'', both have an option to enforce checksum calculation and comparison via the additional ''-K'' flag. From the user documentation of both commands:
 +
 +<code>
 +-K  verify checksum - calculate and verify the checksum on the data, both
 +       client-side and server-side, and store it in the catalog.
 +</code>
 +
 +With this flag, the checksums are computed both locally for your data and on the RDMS side. In the process, the iCommands verify the checksums for you and store them directly in the iCAT catalog/database.
 +
 +**Note**: Even without using the ''-K'' flag, your uploaded data will get a checksum eventually due to the defined <color #ed1c24>delayed rules/processes</color>, but using ''-K'' does that directly during data upload and also does the comparison for you. 
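As an illustration of the ''-K'' flag in practice, the sketch below wraps ''iput'' and ''irsync'' in two small helper functions. The function names and the example paths are hypothetical; only the ''-K'' and ''-r'' (recursive) flags and the ''i:'' prefix that ''irsync'' uses for the RDMS side are taken from the iCommands themselves.

```shell
# Hypothetical wrapper: upload a local file or folder with checksum verification.
upload_verified() {
  src=$1    # local file or folder
  dest=$2   # target path in the RDMS
  if [ -d "$src" ]; then
    # -r recurses into the folder, -K verifies and stores the checksums
    iput -K -r "$src" "$dest"
  else
    iput -K "$src" "$dest"
  fi
}

# Hypothetical wrapper: synchronise a local folder to the RDMS with verification.
# The "i:" prefix tells irsync that the target is on the RDMS side.
sync_verified() {
  irsync -K -r "$1" "i:$2"
}
```

For example, ''upload_verified ./results /rug/home/path/to/folder/results'' would upload the folder and fail with a non-zero exit status if ''iput'' reports a problem.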
 +
 +=== Checking Integrity after Data Ingestion ===
 +
 +For data that is already in the RDMS, there are different ways to check the integrity of the data.
 +
 +**Comparing Checksums manually**
 +
 +First, to see the status of a replica (file) in general, the ''ils'' command with the additional ''-L'' flag can be used which will have an output similar to: 
 +<code>
 +$ ils -L test.json
 +  j.p.nimoth@r      0 rootResc;randy;pt0;mnt_nfsirods0         2629 2025-01-06.14:02 & test.json
 +    sha2:p4K6fv/5EVqpG1gugrXQrrk2Vqky72AVxcTDSW16W38=    generic    /mnt/nfsirods/home/j.p.nimoth@rug.nl/test.json
 +</code>
 +
 +In the above example of the ''test.json'' file, the output of the command shows us that the status of the file is good, as seen by the ''&'' replica status next to the file name. Moreover, we can see that the base64-encoded sha256 checksum for this file is ''p4K6fv/5EVqpG1gugrXQrrk2Vqky72AVxcTDSW16W38=''.
 +
 +The good replica status already gives a good indication that the data is not corrupted. To be sure, you can also compute the checksum of the data locally and then compare it to the one that you see in the RDMS.
 +
 +**For Linux operating system**, checksums can be calculated locally via:
 +<code>
 +sha256sum <filename> | awk '{print $1}' | xxd -r -p | base64
 +</code>
 +
 +which would compute for the shown example file:
 +
 +<code>
 +$ sha256sum test.json | awk '{print $1}' | xxd -r -p | base64
 +p4K6fv/5EVqpG1gugrXQrrk2Vqky72AVxcTDSW16W38=
 +</code>
 +
 +As can be seen, both checksums, the one registered in the RDMS as well as the one computed locally for the same file, are the same. Therefore, we can conclude that the file in the RDMS has the same content as the local file that was uploaded.
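The manual comparison can also be scripted. The following is a minimal sketch under a few assumptions: the function names are hypothetical, ''xxd'' is installed locally, and only the first checksum found in the ''ils -L'' output is used (a file with several replicas shows one checksum per replica).

```shell
# Extract the first checksum from `ils -L` output read on stdin,
# without the "sha2:" prefix.
rdms_checksum_from_ils() {
  grep -o 'sha2:[A-Za-z0-9+/=]*' | head -n 1 | sed 's/^sha2://'
}

# Base64-encoded sha256 checksum of a local file (Linux pipeline from above).
local_checksum() {
  sha256sum "$1" | awk '{print $1}' | xxd -r -p | base64
}

# Compare a local file with its counterpart in the RDMS.
verify_against_rdms() {
  local_file=$1
  rdms_path=$2
  expected=$(ils -L "$rdms_path" | rdms_checksum_from_ils)
  actual=$(local_checksum "$local_file")
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    echo "MATCH $local_file"
  else
    echo "MISMATCH $local_file (RDMS: ${expected:-none}, local: $actual)"
    return 1
  fi
}
```

A call such as ''verify_against_rdms test.json test.json'' prints ''MATCH test.json'' when both checksums agree and returns a non-zero exit status otherwise, including the case where no checksum is registered yet.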
 +
 +As a further tip, it is also possible to adjust the command a little so that it does not just calculate the checksum for a single file, but for all files in a folder. An example command to do so (assuming Bash shell):
 +
 +<code>
 +for file in /path/to/folder/*; do
 +  if [ -f "$file" ]; then
 +    checksum=$(sha256sum "$file" | awk '{print $1}' | xxd -r -p | base64)
 +    echo "File: $(basename "$file"), Checksum: $checksum"
 +  fi
 +done
 +</code>
 +
 +which will iterate over all files in the specified local folder and display each file found together with its computed checksum. This info can then be compared to the checksums registered in the RDMS.
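For larger folders, the comparison itself could be automated as well. The sketch below is one possible approach, not an official RDMS tool: all function names are hypothetical, and it assumes that the registered checksums can be queried through the ''DATA_CHECKSUM'' attribute with ''iquest''. Both manifests use the line format ''<checksum> <file name>'', so file names containing spaces are handled.

```shell
# Local manifest: one "<checksum> <file name>" line per file in a folder.
local_manifest() {
  for file in "$1"/*; do
    [ -f "$file" ] || continue
    printf '%s %s\n' \
      "$(sha256sum "$file" | awk '{print $1}' | xxd -r -p | base64)" \
      "$(basename "$file")"
  done
}

# RDMS manifest for one collection (subcollections are not included).
# Assumption: DATA_CHECKSUM is available as a query attribute.
rdms_manifest() {
  iquest "%s %s" "SELECT DATA_CHECKSUM, DATA_NAME WHERE COLL_NAME = '$1'" |
    sed 's/^sha2://'
}

# Compare two manifest files: $1 = local manifest, $2 = RDMS manifest.
compare_manifests() {
  awk '
    { sum = $1; name = $0; sub(/^[^ ]+ /, "", name) }  # checksum first, rest is the name
    NR == FNR { local_sum[name] = sum; next }          # first file: remember local sums
    {
      seen[name] = 1
      if (!(name in local_sum)) print "ONLY_IN_RDMS " name
      else if (local_sum[name] != sum) print "MISMATCH " name
    }
    END { for (n in local_sum) if (!(n in seen)) print "ONLY_LOCAL " n }
  ' "$1" "$2"
}
```

A possible workflow is to write both manifests to temporary files, for example ''local_manifest /path/to/folder > local.txt'' and ''rdms_manifest /rug/home/path/to/folder > rdms.txt'', and then run ''compare_manifests local.txt rdms.txt''. No output means that every file exists on both sides with the same checksum.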
 +
 +
 +**For Mac**, the respective commands have to be slightly adjusted. To compute the same kind of checksum that is stored in the RDMS for a single file on a Mac, you can use:
 +<code>
 +shasum -a 256 <filename> | awk '{print $1}' | xxd -r -p | base64
 +</code>
 +
 +Or for all files in a certain folder:
 +<code>
 +for file in /path/to/folder/*; do
 +  if [ -f "$file" ]; then
 +    checksum=$(shasum -a 256 "$file" | awk '{print $1}' | xxd -r -p | base64)
 +    echo "File: $(basename "$file"), Checksum: $checksum"
 +  fi
 +done
 +</code>
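Since the Linux and Mac commands differ only in the hashing tool, a single helper could cover both. This is a sketch with hypothetical function names: it picks ''sha256sum'' when available and falls back to ''shasum -a 256'' otherwise, and it assumes that ''xxd'' and ''base64'' are installed.

```shell
# Hex-encoded sha256 of a file, using whichever tool is installed.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Base64-encoded sha256, in the same form as shown by the RDMS
# (without the "sha2:" prefix).
rdms_style_checksum() {
  sha256_hex "$1" | xxd -r -p | base64
}
```

With this helper, ''rdms_style_checksum test.json'' can be used unchanged on both operating systems.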
 +
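 +**For Windows operating system**, checksums can be calculated locally via PowerShell.
 +
 +To compute the base64-encoded checksum that is used in the RDMS via PowerShell for a single file, you can use the command below. Note that ''[System.Convert]::FromHexString'' is only available from .NET 5 onwards, so this requires a recent PowerShell 7 release rather than Windows PowerShell 5.1: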
 +
 +<code>
 +[System.Convert]::ToBase64String((Get-FileHash -Algorithm SHA256 -Path "\path\to\example_file" | Select-Object -ExpandProperty Hash | ForEach-Object { [System.Convert]::FromHexString($_) }))
 +</code>
 +
 +Or to compute directly for all files in a certain directory:
 +<code>
 +Get-ChildItem -Path "C:\path\to\folder" -File | ForEach-Object {
 +    $file = $_.FullName
 +    $checksum = [System.Convert]::ToBase64String((Get-FileHash -Algorithm SHA256 -Path $file | Select-Object -ExpandProperty Hash | ForEach-Object { [System.Convert]::FromHexString($_) }))
 +    [PSCustomObject]@{
 +        FileName = $_.Name
 +        Checksum  = $checksum
 +    }
 +}
 +</code>
 +
 +**Note:**
 +  * The ''sha2:'' prefix in front of the checksum shown in the RDMS only indicates the checksum algorithm that was used. It is not part of the checksum. This is important when comparing checksums seen in the system with the ones computed locally!
 +
 +**For finding non-good replicas**, it is best to use the ''iquest'' command from the iCommands package. This command can query the RDMS database, and as the database also stores the replica statuses, we can use it to find all files that are not marked as good. An example query for that would be:
 +
 +<code>
 +$ iquest "status: %s, name: %s/%s" "SELECT DATA_REPL_STATUS, COLL_NAME, DATA_NAME WHERE COLL_NAME LIKE '/rug/home/path/to/folder%' AND DATA_REPL_STATUS <> '1'"
 +</code>
 +
 +which will check the location ''/rug/home/path/to/folder'', including all its files and all files in its subdirectories, for replicas with a status not equal to 1 (good state). The trailing ''%'' in the query is a wildcard that makes the ''LIKE'' condition match the subdirectories as well. As shown here, the command will output a list with the numeric status value of each file found, together with its full RDMS path.
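Because the query prints numeric status values, a small filter could translate them back into the names from the replica status table. This is only a sketch: the function names are hypothetical and the filter assumes the exact output format ''status: N, name: PATH'' produced by the query above.

```shell
# Map a numeric replica status to its name (see the replica status table).
replica_status_name() {
  case "$1" in
    0) echo "stale" ;;
    1) echo "good" ;;
    2) echo "intermediate" ;;
    3) echo "read-locked" ;;
    4) echo "write-locked" ;;
    *) echo "unknown" ;;
  esac
}

# Read "status: N, name: PATH" lines on stdin and print "<status name> | PATH".
annotate_statuses() {
  while IFS= read -r line; do
    num=${line#status: }   # drop the leading "status: "
    num=${num%%,*}         # keep only the number before the first comma
    echo "$(replica_status_name "$num") | ${line#*name: }"
  done
}
```

The filter is meant to be appended to the query with a pipe, that is ''iquest ... | annotate_statuses''. Lines that do not follow the expected format, such as a message that nothing was found, are reported as ''unknown''.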
 +
 +
 +==== Via the Web Interface ====
 +
 +The RDMS web interface can also be used to display the data checksum and the replica status, so that the integrity of the data can be confirmed.
 +This info is visible in the object view, which can be opened via the ''i'' button after selecting a file in the interface.
 +
 +{{ :rdms:data:rdms_integrity_1.png?direct&800 |}}
 +
 +In the object view, the relevant information is shown and can be used to check if the data is good.
 +
 +{{ :rdms:data:rdms_integrity_2.png?direct&800 |}}
 +
 +**Notes**:
 +  * While the RDMS web interface shows the checksum, you will still need to compute this value locally if you want to compare it. For that, please see the section above, which describes how to do this on different operating systems.
 +  * As of now, the [[rdms:webapp:search|RDMS search]] does not support searching for files with a specific checksum or for all files with a certain replica status, for example to find all non-good replicas in a certain RDMS location as can be done via the iCommands. We are working on introducing this feature in the future.