CMIP6 storage

with pandas and hvplot

The primary publication of national Earth System Model data at DKRZ takes up the largest part of the CMIP6 Data Pool. Most of the data were produced within the national CMIP6 project DICAD and in the compute project RZ988.

DKRZ supports modeling groups in all steps of the data workflow, from preparation to publication. To track and display the effort that goes into this workflow, we run automated scripts (cronjobs) that capture the extent of the final product, namely the disk space these groups use in the data pool, and update it daily. The resulting statistics are uploaded to a public, freely accessible Swift storage.

In the following, we create responsive bar plots of the statistics with pandas and hvplot. We plot the data from the largest disk space usage to the lowest, so we sort the input by size.
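
The sorting step can be tried without network access. The following is a minimal sketch on made-up values; the names `model-a`, `model-b` and `model-c` are placeholders, and only the column name `size` is taken from the statistics files used in this notebook.

```python
import pandas as pd

# Made-up illustrative values, not real pool statistics.
usage = pd.DataFrame(
    {
        "source_id": ["model-a", "model-b", "model-c"],
        "size": [120.0, 950.0, 430.0],
    }
)

# Largest allocation first, so the biggest contributors lead the bar plot.
usage_sorted = usage.sort_values("size", ascending=False).reset_index(drop=True)
print(usage_sorted)
```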

Timeseries

Starting in 2019, a replication infrastructure based on the package synda was deployed to synchronize CMIP6 data from other repositories at partner ESGF nodes. For more information, see our ingest service.

from IPython.display import IFrame

IFrame('https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/pool-timeseries-hvplot.html',
       800, 550)

The timeseries of the CMIP6 data pool shows that:

  • The growth rate was about 2 PB per year for 2019 and 2020.

  • An average CMIP6 dataset contains about 5 files and covers 4 GB.
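
Numbers of this kind can be derived from a timeseries table with a few divisions. The sketch below assumes a table with the column names of the timeseries file read later in this notebook; the two rows are made up and were chosen only so that the results come out round, they are not real pool statistics.

```python
import pandas as pd

# Made-up values; only the column names follow the timeseries CSV of this notebook.
ts = pd.DataFrame(
    {
        "Disk Allocation [GB]": [1_000_000.0, 3_000_000.0],
        "Number of Datasets": [250_000, 750_000],
        "Number of Files": [1_250_000, 3_750_000],
    },
    index=pd.to_datetime(["2019-07-01", "2020-07-01"]),
)

# Averages per dataset at the most recent date.
latest = ts.iloc[-1]
files_per_dataset = latest["Number of Files"] / latest["Number of Datasets"]
gb_per_dataset = latest["Disk Allocation [GB]"] / latest["Number of Datasets"]

# Growth rate in PB per year, assuming decimal units (1 PB = 1,000,000 GB).
years = (ts.index[-1] - ts.index[0]).days / 365.25
growth_gb = ts["Disk Allocation [GB]"].iloc[-1] - ts["Disk Allocation [GB]"].iloc[0]
growth_pb_per_year = growth_gb / 1_000_000 / years

print(files_per_dataset, gb_per_dataset, round(growth_pb_per_year, 2))
```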

German contribution and publication

As of March 2021, we group the CMIP6 data by the following keys and values:

As soon as CMIP6 data from other ESMs like ICON-ESM-LR or EMAC-2-53 are available, the lists will be expanded accordingly. The file mistral-cmip6-allocation-by-rz988.csv.gz contains the results per source, additionally classified by experiment, and covers only experiments conducted in the DKRZ compute project rz988. We sum over all experiments as follows:

import pandas as pd
# Allocation per source_id, largest first
sourcesum = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-source.csv.gz").sort_values("size", ascending=False)
# Allocation per source_id and experiment for the compute project rz988
expdf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-rz988.csv.gz").sort_values("size", ascending=False)
# Sum the numeric columns over all experiments of each source
sourcesumrz = expdf.groupby('source_id').sum(numeric_only=True).sort_values("size", ascending=False).reset_index()
sourcelistrz = sourcesumrz["source_id"].to_list()
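
The summation over experiments can be checked on a small table. This is a minimal sketch with made-up rows; the column names mirror those used in this notebook, while the source and experiment values are placeholders.

```python
import pandas as pd

# Made-up rows, one per (source_id, experiment_id) pair.
per_experiment = pd.DataFrame(
    {
        "source_id": ["model-a", "model-a", "model-b"],
        "experiment_id": ["historical", "piControl", "historical"],
        "size": [10.0, 5.0, 40.0],
        "datasets": [100, 50, 400],
        "filenumber": [500, 250, 2000],
    }
)

# Collapse the experiment dimension: one row per source, largest first.
per_source = (
    per_experiment.groupby("source_id")[["size", "datasets", "filenumber"]]
    .sum()
    .sort_values("size", ascending=False)
    .reset_index()
)
print(per_source)
```

Selecting the numeric columns before calling `sum` keeps the text column `experiment_id` out of the aggregation.
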
  • institution_ids:

    • MPI-M

    • AWI

    • DKRZ and DWD

      • These institutions have performed additional simulations with the MPI-ESM model for ScenarioMIP experiments.

allinstdf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-dicad-institutes.csv.gz").sort_values("size", ascending=False)
  • publication type:

    • published originals: Data which were first published at the ESGF node at DKRZ and are still valid and available.

    • retracted originals: Data which were first published at the ESGF node at DKRZ but have been retracted afterwards.

    • published replicas: Data which were copied to and published at DKRZ and are still valid and available.

    • retracted replicas: Data which were copied to and published at DKRZ but have been retracted afterwards.

    • not yet published data.

allreplica = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-publicationType.csv.gz").sort_values("size", ascending=False)

At this point, we skip the creation of the plots and directly embed the resulting HTML in the notebook. If you are interested in creating the plots, scroll to the end. The dataframes need a few conversion steps first.

IFrame('https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/pool-statistics-hvplot.html', 800, 500)

The German contribution to CMIP6 by the four sources of MPI-M and AWI comprises:

  • 1.1 PB of data with primary publication at DKRZ

  • more than 25% of the CMIP6 data pool

  • 1.4 million files or 180 000 datasets
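
The per-dataset averages quoted for the whole pool can be worked out in the same way for the German contribution. The sketch below uses only the rounded figures listed above and assumes that they all refer to the same set of data and that decimal units apply (1 PB = 1,000,000 GB); the results are therefore rough estimates.

```python
# Rounded figures quoted above for the German contribution.
size_pb = 1.1
n_files = 1_400_000
n_datasets = 180_000

# Assumption: decimal units, 1 PB = 1,000,000 GB.
size_gb = size_pb * 1_000_000

files_per_dataset = n_files / n_datasets
gb_per_dataset = size_gb / n_datasets

print(f"{files_per_dataset:.1f} files per dataset")
print(f"{gb_per_dataset:.1f} GB per dataset")
```

Under these assumptions, a dataset of the German contribution holds roughly 8 files and 6 GB.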

Plot creation

# hvplot.pandas registers the .hvplot accessor on pandas dataframes
import hvplot.pandas
# Tick formatter for thousands separators on the y axis
from bokeh.models import NumeralTickFormatter
# Label each dataframe with its plot group, its grouping key and its legend entries
sourcesum["Group"]="By source_id"
sourcesum["Key"]="source_id"
sourcesum["Legend"]=sourcesum["source_id"]
allinstdf["Group"]="By institution_id"
allinstdf["Key"]="institution_id"
allinstdf["Legend"]=allinstdf["institution_id"]
sourcesumrz["Group"]="By Source (RZ988)"
sourcesumrz["Key"]="source_id"
sourcesumrz["Legend"]=sourcesumrz["source_id"]
allreplica["Group"]="By Publication Status"
allreplica["Key"]="publicationType"
allreplica["Legend"]=allreplica["publicationType"]
expdf["Group"]="By experiment_id"
expdf["Key"]="experiment_id"
expdf["Legend"]=expdf["experiment_id"]

# Index each dataframe by its Group label
sourcesum=sourcesum.set_index("Group")
allinstdf=allinstdf.set_index("Group")
sourcesumrz=sourcesumrz.set_index("Group")
allreplica=allreplica.set_index("Group")
expdf=expdf.set_index("Group")
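# Stack the labelled dataframes into one table for plotting
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
plotdf = pd.concat([sourcesumrz, allinstdf, sourcesum, allreplica])  # expdf is left out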
plot11 = plotdf.hvplot.bar(
    y="datasets",
    ylabel="Sum of Datasets in the CMIP6 Data Pool",
    xlabel="Group",
    by="Legend",
    stacked=True,
    groupby="Key",
    grid=True,
    yformatter=NumeralTickFormatter(format='0,0'),
    title="",
    legend="top_left",
    fontsize={'legend': "10%"},
    width=300,
    height=500,
    muted_alpha=0,
)
plot12 = plotdf.hvplot.bar(
    y="filenumber",
    ylabel="Sum of Files in the CMIP6 Data Pool",
    xlabel="Group",
    by="Legend",
    stacked=True,
    groupby="Key",
    grid=True,
    yformatter=NumeralTickFormatter(format='0,0'),
    title="",
    legend=False,
    yaxis="right",
    width=300,
    height=500,
    muted_alpha=0,
)
#
plot2 = plotdf.hvplot.bar(
    y="size",
    ylabel="Allocation in the CMIP6 Data Pool [TB]",
    xlabel="Group",
    by="Legend",
    stacked=True,
    groupby="Key",
    legend=False,
    width=300,
    height=500,
    grid=True,
    yformatter=NumeralTickFormatter(format='0,0'),
    title="",
)
plot3 = (plot11 + plot12 + plot2).cols(2).opts(show_title=False)
hvplot.save(plot3, "pool-statistics-hvplot.html")
# Line plot of disk allocation, datasets and files between 2019-02-18 and 2020-12-31
timeseries = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-timeseries.csv.gz", index_col="Date", parse_dates=True)
tmplot = timeseries.loc["2019-02-18":"2020-12-31"].hvplot.line(
    y=["Disk Allocation [GB]", "Number of Datasets", "Number of Files"],
    x="Date",
    shared_axes=False,
    grid=True,
    yformatter=NumeralTickFormatter(format='0,0'),
    width=600,
    height=500,
    legend="top_left",
).opts(axiswise=True)
hvplot.save(tmplot, "pool-timeseries-hvplot.html")
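
The repeated Group, Key and Legend assignments in the plot preparation could be folded into a small helper. The following is a sketch on made-up rows; the function name `label_frame` and the values `model-a`, `model-b` and `inst-x` are hypothetical and not part of the original workflow.

```python
import pandas as pd


def label_frame(df, group, key):
    """Return a copy of df with Group, Key and Legend columns, indexed by Group."""
    labelled = df.copy()
    labelled["Group"] = group
    labelled["Key"] = key
    # The legend entries are the values of the key column itself.
    labelled["Legend"] = labelled[key]
    return labelled.set_index("Group")


# Made-up rows standing in for the downloaded statistics.
by_source = pd.DataFrame({"source_id": ["model-a", "model-b"], "size": [15.0, 40.0]})
by_institution = pd.DataFrame({"institution_id": ["inst-x"], "size": [55.0]})

# pd.concat stacks the labelled frames into one table for plotting.
combined = pd.concat(
    [
        label_frame(by_source, "By source_id", "source_id"),
        label_frame(by_institution, "By institution_id", "institution_id"),
    ]
)
print(combined)
```

Working on a copy leaves the input dataframes unchanged, so a cell that calls the helper can be re-run without side effects. `pd.concat` is the supported way to stack dataframes in pandas 2.0 and later, where `DataFrame.append` no longer exists.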

Cloud upload

We use the swiftclient for the upload.

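from swiftclient import client
# swiftenvbk0988 is a local, non-public module expected to define the storage URL and the auth token
from swiftenvbk0988 import OS_STORAGE_URL, OS_AUTH_TOKEN
# Upload both HTML files to the container "Pool-Statistics"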
with open("pool-statistics-hvplot.html", 'rb') as f:
    client.put_object(OS_STORAGE_URL, OS_AUTH_TOKEN, "Pool-Statistics", "pool-statistics-hvplot.html", f)
with open("pool-timeseries-hvplot.html", 'rb') as f:
    client.put_object(OS_STORAGE_URL, OS_AUTH_TOKEN, "Pool-Statistics", "pool-timeseries-hvplot.html", f)
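
The two upload calls follow the same pattern, so they could be wrapped in a loop. The sketch below is one possible arrangement, not the script used at DKRZ: the function name `upload_reports` is hypothetical, and the uploader is passed in as an argument so that the logic can be dry-run with a stand-in instead of a real Swift connection.

```python
import tempfile
from pathlib import Path


def upload_reports(put_object, storage_url, token, container, paths):
    """Upload each file with put_object and return the object names used."""
    uploaded = []
    for path in map(Path, paths):
        with path.open("rb") as handle:
            # Positional arguments follow the call pattern used above:
            # url, token, container, object name, file contents.
            put_object(storage_url, token, container, path.name, handle)
        uploaded.append(path.name)
    return uploaded


# Dry run with a stand-in uploader that only records what it receives.
calls = []


def fake_put_object(url, token, container, name, contents):
    calls.append((container, name, contents.read()))


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "pool-statistics-hvplot.html"
    report.write_bytes(b"<html></html>")
    names = upload_reports(
        fake_put_object,
        "https://example.invalid/v1/account",  # placeholder URL
        "dummy-token",  # placeholder token
        "Pool-Statistics",
        [report],
    )

print(names)
```

With the real client, the first argument would be `client.put_object`, and the URL and token would be `OS_STORAGE_URL` and `OS_AUTH_TOKEN`.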