slk pitfalls (read this!)

file version: 29 September 2021

current slk version: 3.3.6

Run only one slk archive/retrieve per node and user

slk is fast but memory hungry, and it runs many threads in parallel. Running many slk calls in parallel on one node as one user (a) uses up a lot of memory and (b) causes issues in the thread management. How many slk commands can safely run in parallel per node and user depends on the amount of data that is archived or retrieved and on the available memory. Therefore, we suggest running only one slk command per user and node.

Running slk archive in one terminal window on mistralpp1 and doing slk list in another terminal window on the same node will not kill the slk archive. But, doing ten slk list -R NAMESPACE_WITH_MANY_SUBNAMESPACES on the same node might cause issues.
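If several of your scripts might start slk on the same node, a lock file is one way to keep them from running at the same time. The following is only a sketch: it assumes that the flock utility (util-linux) is installed and that /tmp is local to the node. The function name run_exclusive and the lock file name are made up for this example and are not part of slk.

```shell
# Sketch: run at most one wrapped command per user on the current node.
# Assumptions: flock (util-linux) is available and /tmp is node-local.
lock_file="/tmp/slk_$(id -un).lock"

# run_exclusive CMD [ARGS...]
# Waits until no other run_exclusive call of this user holds the lock on
# this node, then runs CMD and returns the exit code of CMD.
run_exclusive() {
    flock "${lock_file}" "$@"
}

# usage (commented out so that the sketch can be sourced without slk):
# run_exclusive slk archive /work/project/user/data /ex/am/ple/blub
# echo "exit code: $?"
```

Calls that are not wrapped in run_exclusive are not affected by the lock, so a quick slk list in another terminal still works.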

Running commands without -R

Non-recursiveness is interpreted differently in StrongLink than in POSIX. If a namespace/directory (not a file) is given as input to the commands slk archive, slk retrieve and slk tag, all files in this namespace/directory are affected. In contrast, cp and rm would throw an error that -r is missing. When -R is set, all sub-namespaces are also affected.

slk writes no output in non-interactive mode

All slk commands except slk list print no output to the stdout and stderr streams (i.e. the command line output) when they run in non-interactive mode, e.g. in SLURM jobs. Please capture the exit code of each slk archive call and check whether it equals 0. If not, an error occurred. Details on the error can be found in the slk log file ~/.slk/slk-cli.log. However, when you run many slk commands in parallel, the slk log becomes hard to read. Please print the time stamp (e.g. via date) when the error occurred so that you can find the details in the slk log later on. The batch script example below shows how to do this.

The exit code of the previous program call is stored in $?. Example:

$ slk archive /work/project/user/data /ex/am/ple/blub
...
$ echo $?
0
# or 1 or higher

In a bash/batch script it could look like this:

# ...
slk archive /work/project/user/data /ex/am/ple/blub
exit_code=$?

# print exit code with prefix so that it is easy to `grep`
echo "exit code: ${exit_code}"
if [ ${exit_code} -ne 0 ]; then
    #  print date
    date
fi
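If a script contains several slk calls, the same check can be wrapped into a small helper. This is a minimal sketch; the function name run_logged and the format of the printed lines are our own choice and not something slk provides.

```shell
# run_logged CMD [ARGS...]
# Runs CMD, prints its exit code in a form that is easy to grep and, on
# failure, a time stamp that can be looked up in ~/.slk/slk-cli.log.
# Returns the exit code of CMD.
run_logged() {
    local exit_code=0
    "$@" || exit_code=$?
    echo "exit code: ${exit_code} (command: $*)"
    if [ "${exit_code}" -ne 0 ]; then
        echo "failed at: $(date '+%Y-%m-%d %H:%M:%S')"
    fi
    return "${exit_code}"
}

# usage (commented out so that the sketch can be sourced without slk):
# run_logged slk archive /work/project/user/data /ex/am/ple/blub
```

Printing the command next to the exit code helps to tell the calls apart when several of them write into the same SLURM output file.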

slk never writes to stderr

Error output of slk is written to the stdout stream instead of the stderr stream. If slk output in non-interactive mode were activated (it is not!), you would find all error output in the SLURM stdout file (not the stderr file) when running jobs on mistral.

difference: slk move and slk rename

The Linux mv can both move and rename files. In contrast, slk move can only move files/namespaces from one namespace to another namespace. Renaming can only be performed by slk rename. Both commands can only target one file/namespace at a time. Wildcards are not supported.

slk archive compares file size and timestamp prior to overwriting files

slk archive compares file size and timestamp to decide whether to overwrite a file or not. rsync uses the same criteria by default. There might be rare situations in which an archived file should be overwritten by another file with the same name, size and timestamp: in this case, the archived file would not be overwritten.
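To see whether two local files would look identical under such a criterion, size and modification time can be compared with stat. The sketch below only illustrates the rule on local files; it is not how slk implements the comparison. It assumes GNU stat (Linux) and that "timestamp" means the modification time with a resolution of one second. The function name same_size_and_mtime is made up for this example.

```shell
# same_size_and_mtime FILE_A FILE_B
# Returns 0 if both local files have the same size (bytes) and the same
# modification time (seconds since the epoch), 1 otherwise.
# Assumption: GNU stat, which supports the -c FORMAT option.
same_size_and_mtime() {
    [ "$(stat -c '%s %Y' -- "$1")" = "$(stat -c '%s %Y' -- "$2")" ]
}

# usage with two hypothetical local files:
# if same_size_and_mtime old/data.nc new/data.nc; then
#     echo "same size and timestamp: archived copy might not be replaced"
# fi
```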

Availability of archived data and modified metadata might be delayed by a few seconds

StrongLink is a distributed system. Metadata is stored in a distributed metadata database. Some operations might take a few seconds until their results are visible because they have to be synchronized amongst different nodes.

Please wait a few seconds before you retrieve a file that was just archived.

A file listed by slk list is not necessarily available for retrieval yet

The location, name and size of a file are metadata. These metadata are written into the StrongLink metadata database when an archival process starts. slk list only prints metadata. Hence, if slk list lists a file that is, for example, part of a file set currently being uploaded in a batch job, this file is not necessarily fully uploaded yet. Similarly, aborted slk archive calls can produce a file’s metadata entry without correct data. Such a file can be retrieved without error. Please see the section "failed or canceled slk archive and slk retrieve calls leave file fragments" below for details on file fragments.

failed or canceled slk archive and slk retrieve calls leave file fragments

issues during archival

A file fragment remains in StrongLink if slk archive did not terminate properly during an archival process. Metadata is available for this file fragment and it can be retrieved. It has no checksum. The latter is due to the fact that some metadata – like checksums – will be written after the archival process has finished successfully. The existence of checksums can be checked via slk_helpers checksum GNS_PATH. In the case of netCDF files, the header section might be copied properly. Thus, an ncdump -h might be successfully applied on a file fragment.

These fragments might occur when a user aborts slk archive (CTRL + C), an SSH connection breaks or a SLURM job is killed due to a timeout. More than one file might be affected because multiple files can be archived in parallel.

issues during retrieval

If slk retrieve does not terminate properly during a retrieval process, a file fragment might be created. These file fragments have temporary file names that contain the original FILENAME: ~FILENAME14620203101828317173.slkretrieve. The reasons for improper termination of slk retrieve are the same as for slk archive. More than one file might be affected because multiple files can be retrieved in parallel.
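Leftover fragments of this kind can be searched for with find. The sketch below assumes that the temporary names always start with ~ and end with .slkretrieve, as in the example name above; the function name list_retrieve_fragments is made up for this example.

```shell
# list_retrieve_fragments DIR
# Prints all regular files below DIR whose names look like temporary files
# of slk retrieve (~<FILENAME><digits>.slkretrieve).
list_retrieve_fragments() {
    find "$1" -type f -name '~*.slkretrieve'
}

# usage with a hypothetical target directory of a retrieval:
# list_retrieve_fragments /work/project/user/retrieved
```

Please make sure that no slk retrieve is still running in the directory before you delete any of the listed files.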

Commonly, a file was correctly retrieved when it has its original filename and when the exit code of slk retrieve is 0 (echo $? directly after retrieval). To be 100% sure that the file was correctly retrieved, you can compare the checksum of the retrieved file with the checksum stored in StrongLink. If there is no checksum stored in StrongLink, the archived file itself is already incomplete.
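The comparison of checksums can be scripted as sketched below. Two assumptions are made here which you should check for your files: the checksum stored in StrongLink is a SHA-512 hex digest, and you have copied it by hand from the output of slk_helpers checksum GNS_PATH. If a different checksum type is stored, replace sha512sum by the matching tool. The function name checksum_matches is made up for this example.

```shell
# checksum_matches FILE EXPECTED
# Returns 0 if the SHA-512 hex digest of the local FILE equals EXPECTED,
# 1 otherwise.
# Assumption: EXPECTED is a SHA-512 hex digest taken from StrongLink.
checksum_matches() {
    local actual
    actual=$(sha512sum -- "$1" | cut -d ' ' -f 1)
    [ "${actual}" = "$2" ]
}

# usage with a hypothetical file; stored_checksum has to be filled in by hand:
# if ! checksum_matches data.nc "${stored_checksum}"; then
#     echo "checksum mismatch: data.nc"
# fi
```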

Pagination mode of slk list

When slk list is used in interactive mode without piping its output into another command, it will print its output in “pagination mode”. This means that only 25 results are printed “per page” and the user has to “turn the page” manually by pressing Return/Enter. Turning a page back is not possible. Even if there are fewer than 25 results, pagination mode is entered and the user has to press Return/Enter to leave it. When a user leaves pagination mode in the regular way, the terminal is cleared as CTRL + L does. This behaviour is by design and cannot be changed. If you want to prevent the terminal from being cleared or do not want to browse through 30 pages, abort slk list with CTRL + C. We recommend using slk list in combination with cat, less, more or similar tools in order to avoid pagination mode. Below you will find an example.

Please note that the output of slk list NAMESPACE and slk list NAMESPACE | cat differs in the last line. This might be important when you create scripts around slk list.

slk list in pagination mode:

$ slk list /k204221_test
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  20210624_test
drwxrwxrwx- k204221     bm0146                 25 Jun 2021  20210625_test
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  abc
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  blubber
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  defg
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  memory_issue_testing
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data_b
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  test_20210617
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test_20210622
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  testing
Files 1-12 of 12

Avoid pagination mode of slk list:

$ slk list /k204221_test | cat
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  20210624_test
drwxrwxrwx- k204221     bm0146                 25 Jun 2021  20210625_test
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  abc
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  blubber
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  defg
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  memory_issue_testing
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data_b
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  test_20210617
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test_20210622
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  testing
Files: 12
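In scripts, the summary line usually has to be removed before the entries are processed. The sketch below handles both forms of the summary line shown above. It assumes that the output has exactly the layout of these examples; the function names strip_list_summary and list_names are made up for this example.

```shell
# strip_list_summary
# Reads slk list output from stdin and drops empty lines and the summary
# line ("Files: N" or "Files X-Y of N"). Entry lines start with the
# permission string and are therefore never removed.
strip_list_summary() {
    awk 'NF > 0 && $0 !~ /^Files[: ]/'
}

# list_names
# Prints only the last whitespace-separated column (the name) of each entry.
# Limitation: a name that contains spaces is cut down to its last word.
list_names() {
    strip_list_summary | awk '{ print $NF }'
}

# usage (commented out so that the sketch can be sourced without slk):
# slk list /k204221_test | list_names
```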

slk tag cannot be applied on individual files

slk tag cannot be applied to individual files but only to namespaces. If it is applied to a namespace, all files in this namespace are assigned the metadata provided in the slk tag call. The namespace itself does not get any metadata assigned. If -R is set, all files in sub-namespaces are assigned the metadata as well.

slk does not have a --version flag

Instead, it has a version command: slk version

Update interval of progress bars (slk archive, group, retrieve, tag)

Progress bars are updated per file or per block of n files. If you archive a folder with three files of 99 GB, 550 MB and 450 MB size, the progress bar will not change for about 99% of the archival time while the large 99 GB file is archived, and it will then jump from 0% to 99%. If you tag a few files, the progress bar will remain at 0% for a long time and suddenly jump to 100%.

Using slk list to print search results

slk list prints only the file names, regardless of whether it prints the content of a namespace or the result of a search. However, a search might find files in arbitrary namespaces. Thus, it would be helpful to print the path/namespace of each file when search results are listed. This is not the case. Currently, you cannot find out in which namespace(s) your search results are located.

slk performance on different node types

We suggest running slk archive and slk retrieve on the mistralpp and compute/compute2 nodes. The run time on the mistralpp nodes considerably depends on the activity of other users on these nodes.

Please do not run slk archive and slk retrieve on the mistral login nodes (mlogin10X) when you archive or retrieve large amounts of data because slk causes high CPU load and uses a lot of memory.

The available memory per job on the shared nodes is very low. Therefore, slk archive and slk retrieve are slower than on other nodes. The run time can be expected to be two to four times as long as on the mistralpp and compute/compute2 nodes.

group memberships of user updated on login

If a user is added to a new group/project, this information is not automatically passed to StrongLink. Instead, the user has to run slk login again. Background: StrongLink caches LDAP data of each user and only updates its cache on a new login.

slk retrieve does not overwrite files but creates duplicates

When a file already exists in the target location, slk retrieve retrieves a copy and inserts .DUPLICATE_FILENAME.[ID].[VERSION] between the name and the extension of the file. However, slk retrieve will overwrite these DUPLICATE files without warning. Consecutive retrievals will overwrite such a DUPLICATE file even if it has been modified in the meantime.

VERSION indicates the file version in StrongLink. If you modify a file and archive it a second time, the version will be incremented by one. Commonly, the version is not visible to you. Old file versions are not kept. Metadata of old versions is partly kept.

Do not archive such a DUPLICATE file because it might overwrite itself during retrieval.
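Before archiving a directory, you can check whether it contains such DUPLICATE files. The sketch below assumes that the string .DUPLICATE_FILENAME. appears literally in the file names, as described above; the function name list_duplicate_files is made up for this example.

```shell
# list_duplicate_files DIR
# Prints all regular files below DIR whose names contain the marker that
# slk retrieve inserts when the target file already exists.
# Assumption: the marker appears literally as ".DUPLICATE_FILENAME.".
list_duplicate_files() {
    find "$1" -type f -name '*.DUPLICATE_FILENAME.*'
}

# usage with the source directory of the archival example:
# if [ -n "$(list_duplicate_files /work/project/user/data)" ]; then
#     echo "DUPLICATE files found; rename or remove them before archiving"
# fi
```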