Differences

This shows you the differences between two versions of the page.

rdms:data:integrity [2025/03/19 12:47] – old revision restored (2025/03/18 10:55) added john's changes jelte
rdms:data:integrity [2025/03/24 08:12] (current) – [Data Safety and Integrity] burcu
Line 1: Line 1:
 ====== Data Safety and Integrity ====== ====== Data Safety and Integrity ======
  
-Compared to other UG storage solutions, the RDMS archive, is unique as it provided you with different means to check the integrity of the stored data. +Compared to other UG storage solutions, the RDMS archive is unique in that it provides you with different means to check the integrity of the stored data.
  
-This section will explain the important concepts in the RDMS that relate to the save long-term storage of your data. It will also explain how you can check the integrity of your data yourself. +This section explains the key concepts in the RDMS related to long-term storage and how you can verify your data's integrity yourself.
 +In short, the key concepts are:
  
-In short, the important concepts are: +  * **Data Replication**: Data in the RDMS is stored at two different physical locations. The versions at both physical locations are called replicas, as the file is identical, meaning replicated, in both locations.
- +  * **Checksum**: A checksum is a value generated by running a checksum function on data. Because different data practically always produces a different value, checksums can be used to check the integrity of the data.
-  * **Data Replication**: Data in the RDMS is stored at two different physical locations. The versions at both physical locations are called //replicas// as the file is the same, meaning replicated, in both locations. +
-  * **Checksum**: A checksum is a certain value that is produced by running a checksum algorithm/function on certain data. The uniqueness of these values allow to check the integrity of the data. +
  
 ===== Data Replication ===== ===== Data Replication =====
-As mentioned in the introduction, all data that is stored in the RDMS is replicated to two physical location. This is done automatically in the background.  +As mentioned in the introduction, all data that is stored in the RDMS is automatically replicated to two separate physical locations.  
-While the replication does not guarantee the integrity of the data, as also corrupted data will get replicated, it is a safeguard mechanism in case of any harm to the data center. Due to the replication to two different physical location, the chances of both locations being affected is limited. +While replication does not guarantee the integrity of the data, since corrupted data will also be replicated, it is a safeguard in case of hardware failure or damage to a data center. Because the data exists in two independent locations, the likelihood of both locations being affected is minimal.
  
-**Note:** The replication in the RDMS functions on a hardware level. For you as a user, this is not directly visible with the tools discussed in this wiki section. For example, the iCommands CLI can be used to check data integrity and, as will be described below, also shows the status of the replica, but will still just show one replica. +**Note:** The replication in the RDMS operates at the hardware level. As a user, you cannot see it directly with the tools discussed in this wiki section. For example, the ''iCommands'' CLI can be used to check data integrity and, as described below, also shows the status of the replica, but it will still show only one replica.
  
 ===== Checking Data Integrity ===== ===== Checking Data Integrity =====
-Here you will learn what steps you can take to check the integrity of your data in the RDMS yourself.  +This section explains how you can verify the integrity of your data in the RDMS yourself: how the RDMS uses checksums to verify integrity, what the different replica statuses mean, and how you can use this information to check your data, either using the [[https://research.web.rug.nl/rdmswebapp|RDMS web interface]] or using the [[rdms:access:linux:icommands|iCommands]] CLI tool.
-The section will with explaining the use of data checksums in the RDMSas well as describe the different //replica statuses// and what they mean. It will then describe how you can use this info to check your data, either using the [[https://research.web.rug.nl/rdmswebapp|RDMS web interface]] or using the [[rdms:access:linux:icommands|iCommands]] CLI tool. +
 ==== Checksums in the RDMS ==== ==== Checksums in the RDMS ====
  
-One of the unique features of the RDMS is that it is not a simple storage solution, but also that it has a database running in the background (iCAT catalog) that can be used to annotate data with user-defined [[rdms:metadata:|metadata]], but which also is used to store other information about the data in the system. +One of the unique features of the RDMS is that it is not just a simple storage solution: it also has a database running in the background (iCAT catalog). This database allows users to annotate data with user-defined [[rdms:metadata:|metadata]], and it also stores other information about the data in the system.
  
-In the case of the RDMS, we also store a checksum for every file that is stored in the RDMSThis is by default done automatically upon data ingestion via [[rdms:webapp:processes|delayed rules]], but the calculation of checksums can also be enforced manually when using the [[rdms:access:linux:icommands|iCommands]]. +In the RDMS, a checksum is stored for every file. By default, this happens automatically upon data ingestion via [[rdms:webapp:processes|delayed rules]]. Checksum calculation can also be enforced manually when using the [[rdms:access:linux:icommands|iCommands]]. Both methods ensure that data integrity is verifiable. 
 +   
 +==== Data Replica Status ====
  
-The checksum of your data files can be checked via the already mentioned iCommands, but the information about the file checksum is also visible via the web interface. +Every file (not folder) in the RDMS also has a replica status associated with it. This replica status is assigned automatically when the data enters the system. The replica status definitions come from iRODS, the data management system that is the backbone of the RDMS. As of now, iRODS knows five different replica statuses, of which four are used:
- +
-The checksum that are stored in the RDMS are base64-encoded [[https://en.wikipedia.org/wiki/SHA-2|SHA256 checksums]] which is important to know when trying to reproduce the checksum in the RDMS locally (see below).  +
- +
-**Note**: If you use Windows, either via native [[rdms:access:windows:|WebDAV in MS File Explorer]], [[rdms:access:windows:cyberduck|Cyberduck]], or [[rdms:access:windows:winscp|WinSCP]], the information about data checksums is not available. The same also applies for Mac users that use [[rdms:access:mac:cyberduck|Cyberduck]] or [[rdms:access:mac:finder|Finder]].   +
-==== Data Replica Status Explained ==== +
- +
-Every file (not folder) in the RDMS also has a replica status associated with it. This replica status gets automatically assigned when the data enters the system. The replica status definitions result from iRODS the data management system that is the backbone of the RDMS. As of now, iRODS knows five different replica statuses of which four are used:+
  
 ^ Numeric Value     ^ Symbolic Value      ^ Name  ^ Definition ^ ^ Numeric Value     ^ Symbolic Value      ^ Name  ^ Definition ^
Line 52: Line 45:
  
 **Note:** While not directly related to the replica status information, the size of the file that is visible in the RDMS can be also an indication if the data is good or somehow corrupted. If you see 0 byte size replica (files), this can be an indication of data not being good! **Note:** While not directly related to the replica status information, the size of the file that is visible in the RDMS can be also an indication if the data is good or somehow corrupted. If you see 0 byte size replica (files), this can be an indication of data not being good!
- + 
 +===== How to Check Your File's Checksum =====
 + 
 +  * ''iCommands'' CLI: Use the command-line interface to verify a file's checksum.
 +  * Web interface: Checksum information is also visible via the RDMS web interface.
 + 
 +The checksums stored in the RDMS are base64-encoded [[https://en.wikipedia.org/wiki/SHA-2|SHA256 checksums]], which is important to know when trying to reproduce a checksum from the RDMS locally (see below).
 + 
 +**Note**: If you use Windows, either via native [[rdms:access:windows:|WebDAV in MS File Explorer]], [[rdms:access:windows:cyberduck|Cyberduck]], or [[rdms:access:windows:winscp|WinSCP]], the information about data checksums is not available. The same also applies for Mac users that use [[rdms:access:mac:cyberduck|Cyberduck]] or [[rdms:access:mac:finder|Finder]]. 
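
To make the "base64-encoded SHA256" format concrete, here is a minimal sketch of how such a value could be computed locally on Linux. It assumes that ''sha256sum'' and ''base64'' (GNU coreutils) are available; the function name ''rdms_style_checksum'' is hypothetical and not part of the iCommands. It only illustrates the encoding and does not replace the commands documented in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of iCommands): prints the base64-encoded
# SHA-256 digest of a file, using only sha256sum, cut, printf and base64.
rdms_style_checksum() {
    # Hex digest of the file contents (64 hex characters).
    hex=$(sha256sum < "$1" | cut -d' ' -f1)
    # Turn every pair of hex digits into one raw byte, then base64-encode
    # the resulting 32 bytes.
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}   # first two hex digits
        hex=${hex#??}             # remaining digits
        # Emit the byte via an octal escape, e.g. \343 for hex e3.
        printf "\\$(printf '%03o' "0x$byte")"
    done | base64
}
```

Calling ''rdms_style_checksum example_file'' (where ''example_file'' is a placeholder) prints a 44-character base64 string. The key point is that the raw digest bytes are encoded, not the hexadecimal text that ''sha256sum'' prints, which is why the hex output of ''sha256sum'' cannot be compared directly with the value stored in the RDMS.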
 ==== Via Command-Line Interface ==== ==== Via Command-Line Interface ====
 The most convenient way to check the status and integrity of your data in the RDMS is via the [[rdms:access:linux:icommands|iCommands]] command-line tool.  The most convenient way to check the status and integrity of your data in the RDMS is via the [[rdms:access:linux:icommands|iCommands]] command-line tool. 
  
 === Checking Integrity during Data Ingestion === === Checking Integrity during Data Ingestion ===
-The commands that are used for uploading data to the RDMS, namely ''iput'' and ''irsync'', both have an option to enforce checksum calculation and comparison via the additional ''-K'' flag. From the user documentation of both commands:+The commands that are used for uploading data to the RDMS, namely ''iput'' and ''irsync'', both have an option to enforce checksum calculation and comparison via the additional ''-K'' flag. From the user documentation of these commands:
  
 <code> <code>
Line 64: Line 66:
 </code> </code>
  
-Which will compute the checksums for you locally, but also on the RDMS side. In the process the checksums are verified by the iCommands for you and also directly stored in the iCAT catalog/database. +This computes the checksums both locally and on the RDMS side. In the process, the ''iCommands'' verify the checksums for you and store them directly in the iCAT catalog/database.
  
 **Note**: Even without using the ''-K'' flag, your uploaded data will get a checksum eventually due to the defined [[rdms:webapp:processes|delayed rules]], but using ''-K'' does that directly during data upload and also does the comparison for you.  **Note**: Even without using the ''-K'' flag, your uploaded data will get a checksum eventually due to the defined [[rdms:webapp:processes|delayed rules]], but using ''-K'' does that directly during data upload and also does the comparison for you. 
Line 97: Line 99:
 </code> </code>
  
-As can be seen, both checksums, the one registered in the RDMS as well as the one computed for the same file locally, are the same. Therefore, it can be guaranteed that the file in the RDMS is the same as the one that was uploaded to it. +As can be seen, both checksums, the one registered in the RDMS and the one computed locally for the same file, are identical. This confirms that the file stored in the RDMS matches the originally uploaded version.
  
-As a further tip, it is also possible to adjust the command a little so that it does not just calculate the checksum for a single file, but for all files in a folder. An example command to do so (assuming Bash shell):+**Tip**: It is also possible to adjust the command a little so that it does not just calculate the checksum for a single file, but for all files in a folder. An example command to do so (assuming Bash shell):
  
 <code> <code>
Line 133: Line 135:
  
 <code> <code>
-[System.Convert]::ToBase64String((Get-FileHash -Algorithm SHA256 -Path "\path\to\example_file" | Select-Object -ExpandProperty Hash | ForEach-Object { [System.Convert]::FromHexString($_) })) +[System.Convert]::ToBase64String((Get-FileHash -Algorithm SHA256 -Path "C:\path\to\file" | ForEach-Object { [byte[]]($_.Hash -split '(..)' -ne '' | ForEach-Object { [Convert]::ToByte($_, 16) }) }))
 </code> </code>
  
Line 140: Line 142:
 Get-ChildItem -Path "C:\path\to\folder" -File | ForEach-Object { Get-ChildItem -Path "C:\path\to\folder" -File | ForEach-Object {
     $file = $_.FullName     $file = $_.FullName
-    $checksum = [System.Convert]::ToBase64String((Get-FileHash -Algorithm SHA256 -Path $file | Select-Object -ExpandProperty Hash | ForEach-Object { [System.Convert]::FromHexString($_) }))+    $checksum = [System.Convert]::ToBase64String((Get-FileHash -Algorithm SHA256 -Path $file | ForEach-Object { [byte[]]($_.Hash -split '(..)' -ne '' | ForEach-Object { [Convert]::ToByte($_, 16) }) }))
     [PSCustomObject]@{     [PSCustomObject]@{
         FileName = $_.Name         FileName = $_.Name
Line 172: Line 174:
  
 **Notes**: **Notes**:
-  * While RDMS web interface does shows the checksum, you will still need to compute this value locally if you want to compare it. For that, please see the section above that described how to do this in different operating systems. +  * While the RDMS web interface displays the checksum, you will still need to compute this value locally if you want to compare it. For that, please see the section above that describes how to do this in different operating systems.
-  * As of now, the [[rdms:webapp:search|RDMS search]] does not allow to search for files with a specific checksum or for searching for all files with a certain replica status, for example to search for all non-good replica statuses in a certain RDMS location as can be done via iCommands. We are working on introducing this feature in the future. +  * Currently, the [[rdms:webapp:search|RDMS search]] does not support searching for files by a specific checksum or searching for all files with a specific replica status (e.g. finding all files with a non-good replica status in a given RDMS location). We are working on introducing this feature in the future.