Intake catalogs

To make our data pool supply more FAIR, we support the Python package intake-esm, which allows you to work with collections of climate data quickly and easily. We fully agree with the developers’ self-description:

“intake-esm is a data cataloging utility built on top of intake, pandas, and xarray, and it’s pretty awesome!”

We provide a tutorial here: https://gitlab.dkrz.de/mipdata/intake-esm/-/blob/master/intake-esm_tutorial.rst

The official intake-esm page: https://intake-esm.readthedocs.io/

Features

  • display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation

import intake
col = intake.open_esm_datastore("/pool/data/Catalogs/mistral-cmip6.json")
col.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label dcpp_init_year version time_range path opendap_url
0 AerChemMIP BCC BCC-ESM1 hist-piAer r1i1p1f1 AERmon c2h6 gn NaN v20200511 185001-201412 /mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/Aer... http://esgf3.dkrz.de/thredds/dodsC/cmip6/AerCh...
1 AerChemMIP BCC BCC-ESM1 hist-piAer r1i1p1f1 AERmon c3h6 gn NaN v20200511 185001-201412 /mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/Aer... http://esgf3.dkrz.de/thredds/dodsC/cmip6/AerCh...
2 AerChemMIP BCC BCC-ESM1 hist-piAer r1i1p1f1 AERmon c3h8 gn NaN v20200511 185001-201412 /mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/Aer... http://esgf3.dkrz.de/thredds/dodsC/cmip6/AerCh...
3 AerChemMIP BCC BCC-ESM1 hist-piAer r1i1p1f1 AERmon cdnc gn NaN v20200522 185001-201412 /mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/Aer... http://esgf3.dkrz.de/thredds/dodsC/cmip6/AerCh...
4 AerChemMIP BCC BCC-ESM1 hist-piAer r1i1p1f1 AERmon ch3coch3 gn NaN v20200511 185001-201412 /mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/Aer... http://esgf3.dkrz.de/thredds/dodsC/cmip6/AerCh...

Features

  • browse through the catalog and select your data without being on the pool file system

⇨ A pythonic, reproducible alternative to complex find commands or GUI searches. No need to know the file system layout or file names.

tas = col.search(experiment_id="historical", source_id="MPI-ESM1-2-HR", variable_id="tas", table_id="Amon", member_id="r1i1p1f1")
tas

mistral-cmip6 catalog with 1 dataset(s) from 33 asset(s):

unique
activity_id 1
institution_id 1
source_id 1
experiment_id 1
member_id 1
table_id 1
variable_id 1
grid_label 1
dcpp_init_year 0
version 1
time_range 33
path 33
opendap_url 33
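
Under the hood, search is essentially a row filter on the catalog table: a row is kept only if it matches all requested facet values. The following self-contained sketch reproduces that idea with plain Python on a toy two-row table (the table content is illustrative, not real catalog data):

```python
# Toy stand-in for the catalog table: one dict per file (illustrative values)
catalog = [
    {"source_id": "MPI-ESM1-2-HR", "experiment_id": "historical",
     "variable_id": "tas"},
    {"source_id": "BCC-ESM1", "experiment_id": "hist-piAer",
     "variable_id": "c2h6"},
]

# col.search(...) keeps only the rows matching all given facet values
query = {"experiment_id": "historical", "source_id": "MPI-ESM1-2-HR",
         "variable_id": "tas"}
subset = [row for row in catalog
          if all(row[key] == value for key, value in query.items())]
# subset now holds the single MPI-ESM1-2-HR row
```

In the real catalog, the same filter runs on the pandas DataFrame behind col.df, so a query over millions of rows stays fast.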

Features

  • open climate data in an analysis ready dictionary of xarray datasets

Forget about annoying temporary merging and reformatting steps!

tas.to_dataset_dict()
--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
{'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Amon.gn': <xarray.Dataset>
 Dimensions:    (bnds: 2, lat: 192, lon: 384, member_id: 1, time: 1980)
 Coordinates:
   * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
   * lat        (lat) float64 -89.28 -88.36 -87.42 -86.49 ... 87.42 88.36 89.28
   * lon        (lon) float64 0.0 0.9375 1.875 2.812 ... 356.2 357.2 358.1 359.1
     height     float64 ...
   * member_id  (member_id) <U8 'r1i1p1f1'
 Dimensions without coordinates: bnds
 Data variables:
     time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(60, 2), meta=np.ndarray>
     lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(60, 192, 2), meta=np.ndarray>
     lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(60, 384, 2), meta=np.ndarray>
     tas        (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 60, 192, 384), meta=np.ndarray>
 Attributes:
     source:                  MPI-ESM1.2-HR (2017): naerosol: none, prescribe...
     activity_id:             CMIP
     license:                 CMIP6 model data produced by MPI-M is licensed u...
     initialization_index:    1
     sub_experiment_id:       none
     further_info_url:        https://furtherinfo.es-doc.org/CMIP6.MPI-M.MPI-E...
     history:                 2019-08-25T06:42:07Z ; CMOR rewrote data to be c...
     parent_mip_era:          CMIP6
     data_specs_version:      01.00.30
     parent_activity_id:      CMIP
     experiment:              all-forcing simulation of the recent past
     branch_time_in_parent:   0.0
     physics_index:           1
     external_variables:      areacella
     tracking_id:             hdl:21.14100/68498095-cf29-4fb6-981a-ac9e3541d2c...
     institution:             Max Planck Institute for Meteorology, Hamburg 20...
     realization_index:       1
     project_id:              CMIP6
     intake_esm_varname:      tas
     grid_label:              gn
     parent_variant_label:    r1i1p1f1
     realm:                   atmos
     forcing_index:           1
     mip_era:                 CMIP6
     parent_source_id:        MPI-ESM1-2-HR
     title:                   MPI-ESM1-2-HR output prepared for CMIP6
     institution_id:          MPI-M
     table_info:              Creation Date:(09 May 2019) MD5:e6ef8ececc8f3386...
     cmor_version:            3.5.0
     frequency:               mon
     source_type:             AOGCM
     product:                 model-output
     parent_time_units:       days since 1850-1-1 00:00:00
     table_id:                Amon
     nominal_resolution:      100 km
     variable_id:             tas
     parent_experiment_id:    piControl
     variant_label:           r1i1p1f1
     source_id:               MPI-ESM1-2-HR
     grid:                    gn
     branch_time_in_child:    0.0
     sub_experiment:          none
     branch_method:           standard
     Conventions:             CF-1.7 CMIP-6.2
     experiment_id:           historical
     creation_date:           2019-08-25T11:10:34Z
     contact:                 cmip6-mpi-esm@dkrz.de
     references:              MPI-ESM: Mauritsen, T. et al. (2019), Developmen...
     intake_esm_dataset_key:  CMIP.MPI-M.MPI-ESM1-2-HR.historical.Amon.gn}
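
The dictionary keys follow the pattern printed above: the values of the groupby attributes joined by dots. A minimal sketch of how such a key is assembled, with attribute values taken from the output above:

```python
# The attributes used to group files into one dataset (CMIP6 example)
groupby_attrs = ["activity_id", "institution_id", "source_id",
                 "experiment_id", "table_id", "grid_label"]

# One catalog row's values for those attributes
row = {"activity_id": "CMIP", "institution_id": "MPI-M",
       "source_id": "MPI-ESM1-2-HR", "experiment_id": "historical",
       "table_id": "Amon", "grid_label": "gn"}

# Join the values with dots to form the dataset key
key = ".".join(row[attr] for attr in groupby_attrs)
# key == 'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Amon.gn'
```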

Features

  • display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation

  • browse through the catalog and select your data without being on the pool file system

  • open climate data in an analysis ready dictionary of xarray datasets

intake-esm reduces the data access and data preparation workload on the analyst’s side.

Catalog content

The catalog is a combination of

  • a list of files (at DKRZ compressed as .csv.gz) where each line contains a file path as an index and column values describing that file

    • The columns of the catalog should be selected such that a dataset in the project’s data repository can be uniquely identified, i.e. all elements of the project’s Data Reference Syntax should be covered (see the project’s documentation for more information about the DRS).

  • a .json formatted descriptor file for the list which contains additional settings telling intake how to interpret the data.

According to our policy, both files have the same name and are available in the same directory.

print("What is this catalog about? \n" + col.esmcol_data["description"])
#
print("The path to the list of files: "+ col.esmcol_data["catalog_file"])
What is this catalog about?
This is an ESM collection for CMIP6 data accessible on the DKRZ's MISTRAL disk storage system in /work/ik1017/CMIP6/data/CMIP6
The path to the list of files: /mnt/lustre02/work/ik1017/Catalogs/mistral-cmip6.csv.gz

Creation of the ``.csv.gz`` list:

  1. A file list is created based on a find shell command on the project directory in the data pool.

  2. For the column values, file names and paths are parsed according to the project’s path_template and filename_template. These templates are constructed from the attributes requested and required by the project.

    • Filenames that cannot be parsed are discarded

  3. Depending on the project, additional columns can be created by adding project’s specifications.

    • E.g., for CMIP6, we added an OpenDAP column which allows users to access the data remotely via HTTP
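
The parsing in step 2 can be sketched as follows. The template and helper function here are illustrative stand-ins, not the actual DKRZ scripts; the CMIP6 DRS orders the path components as shown:

```python
# Illustrative CMIP6 file path as found by the find command (step 1)
path = ("/work/ik1017/CMIP6/data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/"
        "r1i1p1f1/Amon/tas/gn/v20190710/"
        "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_185001-185412.nc")

# A path template listing the DRS attributes in directory order (sketch)
path_template = ["mip_era", "activity_id", "institution_id", "source_id",
                 "experiment_id", "member_id", "table_id", "variable_id",
                 "grid_label", "version"]

def parse_path(path, template, root="/work/ik1017/CMIP6/data/"):
    """Map directory components to catalog columns (step 2 sketch)."""
    parts = path[len(root):].split("/")
    row = dict(zip(template, parts[:len(template)]))
    # the filename carries the time range as its last underscore-separated field
    filename = parts[-1]
    row["time_range"] = filename.rsplit("_", 1)[-1].replace(".nc", "")
    row["path"] = path
    return row

row = parse_path(path, path_template)
# row["source_id"] == "MPI-ESM1-2-HR", row["time_range"] == "185001-185412"
```

A path that does not match the template would raise or produce incomplete columns here, which is why unparsable filenames are discarded in the real pipeline.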

Configuration of the ``.json`` descriptor:

The descriptor makes the catalog self-descriptive by defining all information necessary to understand the .csv.gz file:

  • Specifications for the headers of the columns - in the case of CMIP6, each column is linked to a Controlled Vocabulary.

col.esmcol_data["attributes"][0]
{'column_name': 'activity_id',
 'vocabulary': 'https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json'}
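
Such a vocabulary file maps each allowed value to a description, so catalog entries can be validated against it. A toy sketch, where the vocabulary content is a small illustrative excerpt rather than the full CMIP6 CV:

```python
# Illustrative excerpt in the shape of a CMIP6 CV file: the column name maps
# to a dict of allowed values and their descriptions.
activity_id_cv = {
    "activity_id": {
        "CMIP": "CMIP DECK and historical experiments",
        "AerChemMIP": "Aerosols and Chemistry Model Intercomparison Project",
    }
}

def is_valid(value, cv, column="activity_id"):
    """Check a catalog value against the controlled vocabulary."""
    return value in cv[column]
```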

It defines how to open the data in as analysis-ready a form as possible with the underlying xarray tool:

  • which column of the .csv.gz file contains the path or link to the files

  • what is the data format

  • how to aggregate files to a dataset

    • set a column to be used as a new dimension of the xarray dataset via merge

    • along which dimension files are concatenated when they are opened

    • additional options for the open function
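
Put together, a descriptor contains roughly the following fields. This is a hedged sketch following the ESM collection specification used by intake-esm; all values are illustrative:

```python
import json

# Illustrative descriptor in the shape of an intake-esm collection file
descriptor = json.loads("""
{
  "esmcat_version": "0.1.0",
  "id": "mistral-cmip6",
  "description": "CMIP6 data on DKRZ disk storage (illustrative)",
  "catalog_file": "mistral-cmip6.csv.gz",
  "assets": {"column_name": "path", "format": "netcdf"},
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": ["activity_id", "institution_id", "source_id",
                      "experiment_id", "table_id", "grid_label"],
    "aggregations": [
      {"type": "join_new", "attribute_name": "member_id",
       "options": {"coords": "minimal", "compat": "override"}},
      {"type": "join_existing", "attribute_name": "time_range",
       "options": {"dim": "time", "coords": "minimal", "compat": "override"}}
    ]
  }
}
""")
```

Here "assets" names the column holding the file path and the data format, while "aggregation_control" encodes the merge/concatenation rules described above (member_id becomes a new dimension; time_range chunks are concatenated along time).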

Jobs we do for you

  • We make all catalogs available under /pool/data/Catalogs/

  • We create and update the content of the projects’ catalogs regularly via automatically executed scripts (cron jobs). We set the update frequency so that the project’s data is reflected sufficiently quickly.

    • The updated catalog replaces the outdated one.

    • The updated catalog is uploaded to the DKRZ Swift cloud

    • We plan to provide a catalog that tracks data which is removed by the update.

import pandas as pd
#pd.options.display.max_colwidth = 100
services = pd.DataFrame.from_dict({"CMIP6" : {
    "Update Frequency" : "Daily",
    "On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip6.json",
    "Path to catalog" : "/pool/data/Catalogs/mistral-cmip6.json",
    "OpenDAP" : "Yes",
    "Retraction Tracking" : "",
    "Minimum required Memory" : "10GB",
}, "CMIP5": {
    "Update Frequency" : "Monthly",
    "On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip5.json",
    "Path to catalog" : "/pool/data/Catalogs/mistral-cmip5.json",
    "OpenDAP" : "",
    "Retraction Tracking" : "",
    "Minimum required Memory" : "5GB",
}, "CORDEX": {
    "Update Frequency" : "Monthly",
    "On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cordex.json",
    "Path to catalog" : "/pool/data/Catalogs/mistral-cordex.json",
    "OpenDAP" : "",
    "Retraction Tracking" : "",
    "Minimum required Memory" : "5GB",
}, "ERA5": {
    "Update Frequency" : "On demand",
    "On cloud" : "",
    "Path to catalog" : "/pool/data/Catalogs/mistral-era5.json",
    "OpenDAP" : "--",
    "Retraction Tracking" : "--",
    "Minimum required Memory" : "5GB",
}, "MPI-GE": {
    "Update Frequency" : "On demand",
    "On cloud" : "Yes",# "https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-MPI-GE.json
    "Path to catalog" : "/pool/data/Catalogs/mistral-mpige.json",
    "OpenDAP" : "",
    "Retraction Tracking" : "--",
    "Minimum required Memory" : "No minimum",
}}, orient="index")
servicestb=services.style.set_properties(**{
    'font-size': '14pt',
})

servicestb
Update Frequency On cloud Path to catalog OpenDAP Retraction Tracking Minimum required Memory
CMIP6 Daily Yes /pool/data/Catalogs/mistral-cmip6.json Yes 10GB
CMIP5 Monthly Yes /pool/data/Catalogs/mistral-cmip5.json 5GB
CORDEX Monthly Yes /pool/data/Catalogs/mistral-cordex.json 5GB
ERA5 On demand /pool/data/Catalogs/mistral-era5.json -- -- 5GB
MPI-GE On demand Yes /pool/data/Catalogs/mistral-mpige.json -- No minimum

Best practices and recommendations:

  • Intake can make your scripts reusable.

    • Instead of working with local copies or edited versions of files, always start from a globally defined catalog which everyone can access.

    • Save the subset of the catalog which you work on as a new catalog instead of a subset of files. It can be hard to find out why data is no longer included in recent catalog versions, especially if retraction tracking is not enabled.

  • Intake helps you avoid downloading data by reducing intermediate processing steps that would otherwise produce temporary output.

  • Check for new ingests by just repeating your script - it will open the most recent catalog.

  • Only load datasets into xarray with to_dataset_dict if they do not exceed your memory limits
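
For saving a subset as a new catalog, intake-esm catalog objects offer a serialize method (check the behaviour against your installed version). The dependency-free sketch below shows the underlying idea by hand: write the filtered rows as .csv.gz plus a .json descriptor with the same basename, mirroring the DKRZ naming policy. All names and values here are illustrative:

```python
import csv
import gzip
import json
import os
import tempfile

# Illustrative subset: one catalog row with a hypothetical file path
rows = [{"variable_id": "tas", "path": "/pool/data/example/tas.nc"}]
columns = ["variable_id", "path"]

outdir = tempfile.mkdtemp()
name = "my-tas-catalog"

# Write the file list as .csv.gz
csv_path = os.path.join(outdir, name + ".csv.gz")
with gzip.open(csv_path, "wt", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)

# Write the matching .json descriptor with the same basename
json_path = os.path.join(outdir, name + ".json")
descriptor = {
    "esmcat_version": "0.1.0",
    "id": name,
    "description": "Subset catalog for a reproducible analysis",
    "catalog_file": csv_path,
    "attributes": [{"column_name": c, "vocabulary": ""} for c in columns],
    "assets": {"column_name": "path", "format": "netcdf"},
}
with open(json_path, "w") as f:
    json.dump(descriptor, f, indent=2)
```

Anyone with access to the two files can then reopen the subset with intake.open_esm_datastore on the .json path.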

Technical requirements for usage

  • Memory:

    • Depending on the project’s volume, the catalogs can be large. If you need to work with the entire catalog, you require at least 10 GB of memory.

    • On jupyterhub.dkrz.de, start the notebook server with matching resources.

  • Software:

    • Intake works on the basis of xarray and pandas.

    • On jupyterhub.dkrz.de, use one of the recent kernels:

      • bleeding edge

Load the catalog

import intake
# open the CMIP6 catalog via its path from the services table defined above
collection = intake.open_esm_datastore(services.loc["CMIP6", "Path to catalog"])