Reference: StrongLink verify jobs#
file version: 01 Dec 2023
current software version: slk_helpers version 1.10.2
under re-construction (2023-12-01). Please also have a look into the section Verify file size
Introduction#
StrongLink can run so-called verify jobs, which check the integrity of files. Via the command slk_helpers submit_verify_job we offer an interface to run a simple verify job for files in the HSM cache. A verify job targets a set of files and compares the actual size of each file with the expected size. Only files that are currently stored in the HSM cache are verified. Files that are stored only on tape will fail the verification.
The verify jobs run in the same queue as retrieval/recall jobs. This means that, if many recall jobs are waiting in the queue, new verify jobs are put at the end of the queue and have to wait. Please do not submit more than ten recall jobs at once.
When a verify job is finished, a so-called verify report can be fetched from StrongLink. This report contains information on each file that failed the verification.
We plan to set up a nice interface and automatic evaluation of the verify report. However, this has low priority compared to other tasks. Therefore, it is not clear if and when it will come.
Run time#
A verify job targeting a few thousand files runs for approximately one minute. Please do not verify more than 100 000 files at once. Run times that we experienced for larger numbers of files are:
10 000 files: 2.5 minutes
50 000 files: 6 minutes
Submitting a verify job#
When you submit a verify job you receive a job id via which you can check the job status and fetch a verify report.
Submit a verify job for a namespace (folder):
$ slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a
156548
Submit a verify job for a list of files:
$ slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a/file_001gb_d.nc /dkrz_test/netcdf/20230925a/file_001gb_i.nc
157303
Submit a verify job for a list of resource IDs:
$ slk_helpers submit_verify_job --resource-ids 74481514010 74481518010
157304
Submit a verify job for all *.nc files in a namespace using a search and its search id. Please note that StrongLink understands regular expressions in file names (not in the folder names of the path!) but no bash globs / wildcards. Therefore, the wildcard * corresponds to the regular expression .* in a StrongLink search. There are two possibilities:
$ slk_helpers gen_file_query -R /arch/bm0146/k204221/iow/.*nc --cached-only
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/iow"}},{"smart_pool":"slpstor"}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/iow"}},{"smart_pool":"slpstor"}]}'
Search continuing... ...
Search ID: 529841
$ slk_helpers submit_verify_job --search-id 529841
157306
OR
$ slk_helpers gen_file_query -R /arch/bm0146/k204221/iow --cached-only
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/iow"}},{"smart_pool":"slpstor"}]}
$ slk_helpers submit_verify_job --search-query '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/iow"}},{"smart_pool":"slpstor"}]}'
157306
Note
If you run a verify job for more than 100 files, please run the command with -v (verbose). The command might take a while, and the verbose output shows that it has not hung. The verbose output is printed to stderr, not to stdout.
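The submission variants above can also be scripted. The following is a minimal sketch, not an official interface: it only assembles the command lines shown in this section and reads the job id from stdout. The function names build_submit_command and submit_verify_job (the Python function) are hypothetical, and placing -v directly after the subcommand is an assumption.

```python
import subprocess


def build_submit_command(paths=(), resource_ids=(), search_id=None,
                         search_query=None, verbose=False):
    """Assemble the argument list for `slk_helpers submit_verify_job`.

    Exactly one kind of target must be given: paths, resource ids,
    a search id, or a search query (the variants shown in this section).
    """
    selected = [bool(paths), bool(resource_ids),
                search_id is not None, search_query is not None]
    if sum(selected) != 1:
        raise ValueError(
            "give exactly one of: paths, resource_ids, search_id, search_query")

    cmd = ["slk_helpers", "submit_verify_job"]
    if verbose:
        # Assumption: -v is accepted directly after the subcommand.
        cmd.append("-v")
    if paths:
        cmd.extend(paths)
    elif resource_ids:
        cmd.append("--resource-ids")
        cmd.extend(str(rid) for rid in resource_ids)
    elif search_id is not None:
        cmd.extend(["--search-id", str(search_id)])
    else:
        cmd.extend(["--search-query", search_query])
    return cmd


def submit_verify_job(**kwargs):
    """Run the command and return the job id printed on stdout."""
    done = subprocess.run(build_submit_command(**kwargs),
                          capture_output=True, text=True, check=True)
    # Verbose output goes to stderr, so stdout should hold only the job id.
    return int(done.stdout.strip().splitlines()[-1])
```

For example, submit_verify_job(resource_ids=[74481514010, 74481518010], verbose=True) would run the resource-id variant shown above with verbose output enabled.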
Check the status of a verify job#
The results of a verify job are provided as a verify report. A verify report should not be fetched before the verify job has finished. slk_helpers job_status can be used to check the processing state of a verify job. You can get the report if the state is COMPLETED or SUCCESSFUL. Otherwise, the report might not exist (e.g. in state QUEUED) or might be incomplete (e.g. in state PROCESSING).
$ slk_helpers job_status 156548
COMPLETED
$ slk_helpers job_status 157303
QUEUED (21)
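Waiting for a job to finish can be automated by polling the job status. The sketch below assumes that the first word of the slk_helpers job_status output is the state name, as in the two examples above. The function names, the polling interval and the maximum number of polls are hypothetical choices. States other than the four named in this section are not handled specially, so such a job would simply run into the timeout.

```python
import subprocess
import time

# States in which the verify report can be fetched (see text above).
READY_STATES = {"COMPLETED", "SUCCESSFUL"}


def parse_job_state(output):
    """Return the state name, e.g. 'QUEUED (21)' -> 'QUEUED'."""
    parts = output.strip().split()
    if not parts:
        raise ValueError("empty job_status output")
    return parts[0]


def report_is_ready(output):
    """True if the job_status output indicates a finished job."""
    return parse_job_state(output) in READY_STATES


def wait_for_verify_job(job_id, poll_seconds=60, max_polls=120):
    """Poll `slk_helpers job_status` until the report can be fetched."""
    for _ in range(max_polls):
        done = subprocess.run(["slk_helpers", "job_status", str(job_id)],
                              capture_output=True, text=True, check=True)
        if report_is_ready(done.stdout):
            return parse_job_state(done.stdout)
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

A long polling interval is deliberate here: verify jobs share the queue with recall jobs and may wait for a while before they start.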
Getting results of a verify job#
Run this command to fetch the results of the verify job:
$ slk_helpers result_verify_job 156548
Errors:
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_b.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_c.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_a.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_f.nc
Erroneous files: 4
Four size-mismatch errors were detected. In this case, these files should be re-archived or deleted from the archive.
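If the affected files are to be processed further, for example to re-archive them, the output above can be parsed. The following sketch assumes that the output always has the layout of the example: an "Errors:" line, one "message: path" line per file and a final "Erroneous files: N" line. The function name parse_verify_result is hypothetical.

```python
def parse_verify_result(text):
    """Split result_verify_job output into (message, path) pairs.

    Returns the list of pairs and the count from the 'Erroneous files:'
    line (None if that line is missing).
    """
    errors = []
    reported = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "Errors:":
            continue
        if line.startswith("Erroneous files:"):
            reported = int(line.split(":", 1)[1])
            continue
        # Error lines look like '<message>: <path>'.
        message, sep, path = line.partition(": ")
        if sep:
            errors.append((message, path))
    return errors, reported
```

Comparing the length of the returned list with the reported count is a simple sanity check that no line was missed.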
Getting raw results of a verify job#
When the verify job is completed, the raw report can be fetched via slk_helpers job_report. It can be written into a file …
$ slk_helpers job_report 156548 --outfile verify_report_job_156548.txt
… or printed to the terminal
$ slk_helpers job_report 156548
StrongLink_Version,UNKNOWN
job_id,156548
job_name,slk_helpers Verify Job by user k204221
policy_type,VERIFY
job_status,COMPLETED
source_namespace,/dkrz_test/netcdf/20230925a
destination_pools,[37]
quick_check,true
update_deleted_files,false
email_job_log,false
job_start,Wed Oct 04 21:13:09 UTC 2023
job_end,Wed Oct 04 21:14:33 UTC 2023
scan_start,Wed Oct 04 20:56:18 UTC 2023
scan_end,Wed Oct 04 21:13:57 UTC 2023
io_start,Wed Oct 04 20:56:18 UTC 2023
io_end,Wed Oct 04 21:14:33 UTC 2023
scanned_directories,2
scanned_file_size,21104159996
scanned_files,13
scanned_skipped_no_copies,2
io_bytes,18400035132
io_file_not_found,1
io_files,5
io_verify_size_failed,5
create date,status,message,description
Wed Oct 04 21:14:33 UTC 2023,INFO,IO completed,The IO for job 156548 has completed
Wed Oct 04 21:13:57 UTC 2023,INFO,scanner completed,The scanner for job 156548 has completed
Wed Oct 04 21:13:11 UTC 2023,ERROR,File verification failed: Resource content size does not match record,Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_b.nc
Wed Oct 04 21:13:11 UTC 2023,ERROR,File verification failed: Resource content size does not match record,Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_a.nc
Wed Oct 04 21:13:11 UTC 2023,ERROR,File verification failed: Resource content size does not match record,Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_c.nc
Wed Oct 04 21:13:11 UTC 2023,ERROR,File verification failed: Resource content size does not match record,Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_f.nc
Wed Oct 04 21:13:11 UTC 2023,ERROR,File verification failed: File not found,File not found: /dkrz_test/netcdf/20230925a/file_001gb_g.nc
Wed Oct 04 21:13:11 UTC 2023,ERROR,File verification failed: Resource content size does not match record,Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_d.nc
Wed Oct 04 21:13:10 UTC 2023,INFO,Skipped Resource without RCR,Resource ID: 74481521010 / path: /dkrz_test/netcdf/20230925a/file_001gb_h.nc
Wed Oct 04 21:13:10 UTC 2023,INFO,Skipped Resource without RCR,Resource ID: 74481678010 / path: /dkrz_test/netcdf/20230925a/empty_file.txt
Wed Oct 04 21:13:10 UTC 2023,INFO,GNS Scan Started,GNS Scan for job 156548 is starting
Wed Oct 04 21:13:10 UTC 2023,INFO,GNS Scan Started,GNS Scan for job 156548 is starting
Wed Oct 04 21:13:10 UTC 2023,INFO,GNS Scan Started,GNS Scan for job 156548 is starting
Wed Oct 04 21:13:10 UTC 2023,INFO,GNS Scan Started,GNS Scan for job 156548 is starting
Wed Oct 04 21:13:10 UTC 2023,INFO,GNS Scan Started,GNS Scan for job 156548 is starting
Wed Oct 04 21:13:10 UTC 2023,INFO,GNS Scan Started,GNS Scan for job 156548 is starting
Wed Oct 04 20:56:18 UTC 2023,INFO,New job starting,Job 156548 is starting
Evaluating a verification report#
An example verification report is printed in the previous section.
The lines below create date,status,message,description are of interest. They are in CSV format and you can load them into the spreadsheet software of your choice or into a Python pandas.DataFrame. Below we explain what certain status,message combinations mean and what should be done:
ERROR,File verification failed: Resource not stored in pool(s): [37]: please ignore this error; it indicates that the file has already been written to tape and deleted from the HSM cache
ERROR,File verification failed: Resource content size does not match record: please archive the file again
ERROR,File verification failed: File not found: directly contact support@dkrz.de
INFO,Skipped Resource without RCR: printed in different situations; please check which of the following cases applies
    nothing to do if the target file has 0 byte size and this is intended
    please archive the file again if the target file has 0 byte size but should be larger
    directly contact support@dkrz.de if a size greater than 0 byte is printed
INFO,Skipped Soft-Deleted Resource: please ignore this information; the target file has been marked for deletion (deleted from user perspective) but has not been cleaned up yet
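The evaluation described above can be sketched with the Python standard library. The example below is an illustration only: it assumes that the first three fields of each log line contain no commas (a message listing several pools might violate this), and the action labels and function names are hypothetical shorthand for the recommendations listed above.

```python
HEADER = "create date,status,message,description"

# (status, message prefix, recommended action); labels are shorthand.
ACTIONS = [
    ("ERROR", "File verification failed: Resource not stored in pool(s)",
     "ignore"),
    ("ERROR", "File verification failed: Resource content size does not "
              "match record", "archive again"),
    ("ERROR", "File verification failed: File not found", "contact support"),
    ("INFO", "Skipped Resource without RCR", "check file size"),
    ("INFO", "Skipped Soft-Deleted Resource", "ignore"),
]


def parse_report_log(report_text):
    """Return the log lines below the CSV header as a list of dicts."""
    lines = [line.strip() for line in report_text.splitlines()]
    start = lines.index(HEADER) + 1  # raises ValueError if header is missing
    rows = []
    for line in lines[start:]:
        # Split into at most four fields so commas in the description survive.
        fields = line.split(",", 3)
        if len(fields) == 4:
            keys = ("create_date", "status", "message", "description")
            rows.append(dict(zip(keys, fields)))
    return rows


def recommended_action(row):
    """Map a log row to an action label, or None if nothing is listed."""
    for status, prefix, action in ACTIONS:
        if row["status"] == status and row["message"].startswith(prefix):
            return action
    return None
```

Rows for which recommended_action returns None, such as the "IO completed" line, are purely informational in the example report.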