Combining data + CLI

Here we show how to combine multiple datasets, export them to recipes, and run these from the command line.

Combining datasets

In the previous section we saw that one of springtime's main goals is to harmonize datasets from different sources so that they can be combined easily. Here, we walk through an example with data from PEP725 and E-OBS to show how this is done.

We start with basic observations from PEP725.

In [2]:

from springtime.datasets import PEP725Phenor
from springtime.utils import germany

pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=[2000, 2002],
    area=germany,
)

df_pep725 = pep725.load()

Next, we want to find matching meteo data from E-OBS:

In [3]:

from springtime.datasets import EOBS
from springtime.utils import PointsFromOther

eobs = EOBS(
    area=germany,
    years=["2000", "2002"],
    variables=["mean_temperature", "minimum_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=PointsFromOther(source="pep725"),
)

Notice the special object PointsFromOther. This helper takes the points from the records of our pep725 dataset and uses them to subselect the E-OBS data. To make that happen, we call its get_points method with the pep725 dataframe as input. This may seem convoluted, but as we will see later, it allows us to write very succinct recipes.

In [4]:

eobs.points.get_points(df_pep725)
df_eobs = eobs.load()

INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)
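The idea behind PointsFromOther can be illustrated with a small stand-alone sketch. This is not springtime's implementation: DeferredPoints and its attributes are hypothetical names, and plain coordinate tuples stand in for geometries. The object only records the name of the source dataset when it is created, and resolves the actual locations later, once that dataset has been loaded.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredPoints:
    """Hypothetical stand-in for a 'points from another dataset' placeholder."""

    source: str  # name of the dataset that will provide the locations
    points: list[tuple[float, float]] = field(default_factory=list)

    def get_points(self, records: list[dict]) -> None:
        """Collect the unique (x, y) locations from already loaded records."""
        self.points = sorted({(r["x"], r["y"]) for r in records})


# Toy records standing in for loaded phenology observations.
observations = [
    {"year": 2000, "x": 10.0, "y": 49.4833, "day": 129},
    {"year": 2002, "x": 10.0, "y": 49.4833, "day": 120},
    {"year": 2000, "x": 10.0, "y": 50.85, "day": 120},
]

placeholder = DeferredPoints(source="pep725")  # nothing is resolved yet
placeholder.get_points(observations)  # now the locations are known
print(placeholder.points)
```

Because the placeholder only needs the name of the other dataset up front, it can be written down before any data has been loaded, which is what makes it convenient in a recipe.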

Now, we're ready to join our dataframes.

In [5]:

from springtime.utils import join_dataframes

join_dataframes([df_pep725, df_eobs])

Out[5]:

		day	mean_temperature|1	mean_temperature|2	mean_temperature|3	mean_temperature|4	mean_temperature|5	mean_temperature|6	mean_temperature|7	mean_temperature|8	mean_temperature|9	...	minimum_temperature|3	minimum_temperature|4	minimum_temperature|5	minimum_temperature|6	minimum_temperature|7	minimum_temperature|8	minimum_temperature|9	minimum_temperature|10	minimum_temperature|11	minimum_temperature|12
year	geometry
2000	POINT (10.00000 49.48330)	129	0.323548	3.664483	5.358709	9.966999	14.801293	17.880665	15.267097	18.310001	13.865000	...	2.293871	4.694333	8.915162	10.588665	11.211936	12.718388	9.747333	7.286451	2.840333	0.369677
	POINT (10.00000 50.85000)	120	0.943226	3.795517	5.660645	10.001336	14.421612	16.608667	14.690001	17.148067	13.630667	...	2.781935	4.193333	7.606451	9.388335	10.653547	11.400322	9.931665	6.439678	2.687333	-0.179032
	POINT (10.00000 51.71670)	116	1.694194	4.053448	5.399354	10.563000	14.321937	16.427999	14.901935	17.539680	14.346666	...	2.525161	4.929334	8.507420	10.715000	11.435804	12.211291	10.457333	6.996452	3.725999	1.026129
	POINT (10.00000 52.10000)	120	2.531935	4.937242	5.771289	10.993333	14.817741	16.890333	15.336773	17.556454	14.623999	...	2.947742	5.701666	9.234193	11.745667	11.983226	12.660645	11.131998	8.224839	5.002000	2.268064
	POINT (10.00000 53.08330)	121	2.119677	3.988276	4.812258	9.906999	14.663547	16.165335	15.199677	16.573227	13.515334	...	2.233871	5.040667	8.331290	11.057999	11.471289	11.897419	10.317667	7.165806	3.831666	0.932581
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2002	POINT (9.96667 50.15000)	120	0.033871	5.121071	5.777419	8.711332	13.648706	17.796333	17.810965	18.760643	12.582668	...	1.030000	3.688000	8.583226	11.736998	12.528385	14.005805	7.406667	4.896451	3.722333	-1.399677
	POINT (9.96667 50.95000)	131	0.667097	5.050714	4.759677	7.534667	13.160967	16.805666	16.781612	18.105484	11.913998	...	0.093871	2.317333	8.173548	11.191999	11.810968	13.421612	6.840666	4.556774	3.135667	-2.176774
	POINT (9.96667 52.81670)	131	2.531290	4.775357	4.818065	7.735334	14.055484	16.351336	17.142578	19.122257	13.324332	...	0.981290	3.607333	9.721291	12.178999	13.313224	15.073547	9.269666	4.023226	2.149333	-3.318065
	POINT (9.98333 49.76670)	118	0.221613	5.803572	6.415483	9.338666	14.178065	18.913002	18.554195	19.333548	13.619668	...	1.756452	4.493666	9.242903	13.312331	13.494192	14.762579	8.864333	5.864517	4.532332	-0.294839
	POINT (9.98333 53.36670)	127	3.214839	5.342857	5.029356	8.083667	13.778385	16.390997	17.183872	19.513546	14.155000	...	1.511613	3.936667	9.845483	12.234335	13.443872	15.924195	10.457666	4.634194	2.433000	-3.009355

4729 rows × 25 columns
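The layout of this table follows from how the two inputs line up. The sketch below mimics the idea with plain Python dictionaries; it does not use join_dataframes, and to_wide and join_on_key are hypothetical helpers. It assumes that both inputs are keyed by (year, geometry) and that the number after the | in a column name is the month, which matches the 12 columns per variable in the output above. The outer join is a choice made for the sketch, not a statement about springtime's behaviour.

```python
def to_wide(long_rows, variable):
    """Pivot (year, geometry, month, value) rows into one row per (year, geometry)."""
    wide = {}
    for year, geometry, month, value in long_rows:
        wide.setdefault((year, geometry), {})[f"{variable}|{month}"] = value
    return wide


def join_on_key(left, right):
    """Outer join of two dicts keyed by (year, geometry); columns are merged."""
    joined = {}
    for key in sorted(left.keys() | right.keys()):
        joined[key] = {**left.get(key, {}), **right.get(key, {})}
    return joined


# Toy inputs: one phenology record and two monthly temperature values.
phenology = {(2000, "POINT (10 49.4833)"): {"day": 129}}
monthly_temperature = [
    (2000, "POINT (10 49.4833)", 1, 0.32),
    (2000, "POINT (10 49.4833)", 2, 3.66),
]

table = join_on_key(phenology, to_wide(monthly_temperature, "mean_temperature"))
print(table)
```

Each (year, geometry) pair ends up as a single row that holds the phenology observation next to one column per variable and month.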

From datasets to workflow

We've already had a sneak preview of YAML for individual datasets. We can also combine the two datasets into what we call a "workflow".

In [6]:

from springtime.main import Workflow, Session

workflow = Workflow(datasets={"pep725": pep725, "eobs": eobs})
print(workflow)

Workflow(
    datasets={
        'pep725': PEP725Phenor(
            dataset='PEP725Phenor',
            years=YearRange(start=2000, end=2002),
            credential_file=PosixPath('/home/peter/.config/springtime/pep725_credentials.txt'),
            species='Syringa vulgaris',
            phenophase=None,
            include_cols=['year', 'geometry', 'day'],
            area=NamedArea(
                name='Germany',
                bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153)
            )
        ),
        'eobs': EOBS(
            dataset='E-OBS',
            years=YearRange(start=2000, end=2002),
            product_type='ensemble_mean',
            variables=['mean_temperature', 'minimum_temperature'],
            grid_resolution='0.1deg',
            version='26.0e',
            points=PointsFromOther(source='pep725'),
            keep_grid_location=False,
            area=NamedArea(
                name='Germany',
                bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153)
            ),
            minimize_cache=False,
            resample={}
        )
    }
)

To execute the workflow, we first create a session and set the log level to info. The log messages report the progress of the run, and the resulting data is automatically stored in a dedicated output folder.

In [7]:

session = Session()
workflow.execute(session)

INFO:springtime.main:Dataset pep725 loaded with 4723 rows
INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)
INFO:springtime.main:Dataset eobs loaded with 4723 rows
INFO:springtime.main:Datasets joined to shape: (4729, 25)
INFO:springtime.main:Data saved to: /tmp/output/data.csv

Workflows can also be represented as recipes.

In [8]:

recipe = workflow.to_recipe()
print(recipe)

datasets:
  pep725:
    dataset: PEP725Phenor
    years:
    - 2000
    - 2002
    credential_file: /home/peter/.config/springtime/pep725_credentials.txt
    species: Syringa vulgaris
    phenophase: null
    include_cols:
    - year
    - geometry
    - day
    area:
      name: Germany
      bbox:
      - 5.98865807458
      - 47.3024876979
      - 15.0169958839
      - 54.983104153
  eobs:
    dataset: E-OBS
    years:
    - 2000
    - 2002
    product_type: ensemble_mean
    variables:
    - mean_temperature
    - minimum_temperature
    grid_resolution: 0.1deg
    version: 26.0e
    points:
      source: pep725
    keep_grid_location: false
    area:
      name: Germany
      bbox:
      - 5.98865807458
      - 47.3024876979
      - 15.0169958839
      - 54.983104153
    minimize_cache: false
    resample: {}
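A recipe like this can be written to disk for later use. The following is a minimal sketch, assuming that workflow.to_recipe() returns the YAML as a plain string (as the print output suggests). A shortened stand-in string is used here so that the example runs without springtime, and the file name recipe_pep_eobs.yaml is only an illustration.

```python
from pathlib import Path
import tempfile

# Stand-in for the text returned by workflow.to_recipe(); shortened for brevity.
recipe_text = "datasets:\n  pep725:\n    dataset: PEP725Phenor\n"

# A temporary directory keeps the sketch from touching the working directory.
output_dir = Path(tempfile.mkdtemp())
recipe_path = output_dir / "recipe_pep_eobs.yaml"
recipe_path.write_text(recipe_text, encoding="utf-8")

print(recipe_path.read_text(encoding="utf-8"))
```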

Springtime's command line interface

Springtime recipes can also be executed from the command line. If we saved the recipe above as recipe_pep_eobs.yaml, we could execute it as follows:

springtime recipe_pep_eobs.yaml

The springtime command becomes available after installing springtime (for example with pip) in your Python environment.

Executing recipes from the command line makes it easy to automate tasks or submit them as long-running compute jobs.
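As an illustration of such automation, the sketch below loops over a list of recipe files and calls the springtime command shown above for each of them. The helper run_recipes and its runner parameter are hypothetical and not part of springtime; the runner argument only exists so that the function can be tried out without actually launching a process.

```python
import subprocess
from pathlib import Path


def run_recipes(recipe_paths, runner=subprocess.run):
    """Run `springtime <recipe>` for each recipe and return the commands used."""
    commands = []
    for path in recipe_paths:
        command = ["springtime", str(Path(path))]
        # check=True raises CalledProcessError if the command exits non-zero.
        runner(command, check=True)
        commands.append(command)
    return commands


# Example call (requires springtime to be installed, so it is left commented out):
# run_recipes(["recipe_pep_eobs.yaml"])
```

The same loop could be placed in a job script for a scheduler, with one recipe per job.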