Combining data + CLI
Here we show how to combine multiple datasets, export them to recipes, and run these from the command line.
Combining datasets
In the previous section, we saw that one of the main goals of springtime is to harmonize datasets from different sources so that they can be combined easily. Here, we walk through an example with data from PEP725 and E-OBS to show how this is done.
We start with basic observations from PEP725.
from springtime.datasets import PEP725Phenor
from springtime.utils import germany
pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=[2000, 2002],
    area=germany,
)
df_pep725 = pep725.load()
Next, we want to find matching meteo data from E-OBS:
from springtime.datasets import EOBS
from springtime.utils import PointsFromOther
eobs = EOBS(
    area=germany,
    years=["2000", "2002"],
    variables=["mean_temperature", "minimum_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=PointsFromOther(source="pep725"),
)
Notice that we're using a special object called PointsFromOther. This helper object retrieves the records from our pep725 dataset and uses them to subselect the E-OBS data. To do so, we call its get_points method with the pep725 dataframe as input. This may seem convoluted, but as we will see later, it helps to write very succinct recipes.
eobs.points.get_points(df_pep725)
df_eobs = eobs.load()
INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)
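To make the role of such a helper more concrete, here is a minimal, standard-library-only sketch of the idea. The class `PointsFromOtherSketch`, its `points` attribute, and the `x`/`y` record keys are hypothetical stand-ins chosen for illustration; this is not springtime's actual implementation, and the day value for 2002 is made up.

```python
from dataclasses import dataclass, field


@dataclass
class PointsFromOtherSketch:
    """Hypothetical stand-in; not springtime's actual PointsFromOther."""

    source: str  # name of the dataset that supplies the locations
    points: list[tuple[float, float]] = field(default_factory=list)

    def get_points(self, records: list[dict]) -> None:
        """Store the unique (x, y) locations found in the other dataset."""
        unique = {(row["x"], row["y"]) for row in records}
        self.points = sorted(unique)


# Toy phenology records (illustration only; not real PEP725 data).
observations = [
    {"year": 2000, "x": 10.0, "y": 49.4833, "day": 129},
    {"year": 2002, "x": 10.0, "y": 49.4833, "day": 125},
    {"year": 2000, "x": 10.0, "y": 50.85, "day": 120},
]

helper = PointsFromOtherSketch(source="pep725")
helper.get_points(observations)
print(helper.points)  # [(10.0, 49.4833), (10.0, 50.85)]
```

The point of the design is that the helper only needs to know the name of the other dataset up front; the actual locations are filled in later, once that dataset has been loaded.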
Now, we're ready to join our dataframes.
from springtime.utils import join_dataframes
join_dataframes([df_pep725, df_eobs])
day | mean_temperature|1 | mean_temperature|2 | mean_temperature|3 | mean_temperature|4 | mean_temperature|5 | mean_temperature|6 | mean_temperature|7 | mean_temperature|8 | mean_temperature|9 | ... | minimum_temperature|3 | minimum_temperature|4 | minimum_temperature|5 | minimum_temperature|6 | minimum_temperature|7 | minimum_temperature|8 | minimum_temperature|9 | minimum_temperature|10 | minimum_temperature|11 | minimum_temperature|12 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | geometry | |||||||||||||||||||||
2000 | POINT (10.00000 49.48330) | 129 | 0.323548 | 3.664483 | 5.358709 | 9.966999 | 14.801293 | 17.880665 | 15.267097 | 18.310001 | 13.865000 | ... | 2.293871 | 4.694333 | 8.915162 | 10.588665 | 11.211936 | 12.718388 | 9.747333 | 7.286451 | 2.840333 | 0.369677 |
POINT (10.00000 50.85000) | 120 | 0.943226 | 3.795517 | 5.660645 | 10.001336 | 14.421612 | 16.608667 | 14.690001 | 17.148067 | 13.630667 | ... | 2.781935 | 4.193333 | 7.606451 | 9.388335 | 10.653547 | 11.400322 | 9.931665 | 6.439678 | 2.687333 | -0.179032 | |
POINT (10.00000 51.71670) | 116 | 1.694194 | 4.053448 | 5.399354 | 10.563000 | 14.321937 | 16.427999 | 14.901935 | 17.539680 | 14.346666 | ... | 2.525161 | 4.929334 | 8.507420 | 10.715000 | 11.435804 | 12.211291 | 10.457333 | 6.996452 | 3.725999 | 1.026129 | |
POINT (10.00000 52.10000) | 120 | 2.531935 | 4.937242 | 5.771289 | 10.993333 | 14.817741 | 16.890333 | 15.336773 | 17.556454 | 14.623999 | ... | 2.947742 | 5.701666 | 9.234193 | 11.745667 | 11.983226 | 12.660645 | 11.131998 | 8.224839 | 5.002000 | 2.268064 | |
POINT (10.00000 53.08330) | 121 | 2.119677 | 3.988276 | 4.812258 | 9.906999 | 14.663547 | 16.165335 | 15.199677 | 16.573227 | 13.515334 | ... | 2.233871 | 5.040667 | 8.331290 | 11.057999 | 11.471289 | 11.897419 | 10.317667 | 7.165806 | 3.831666 | 0.932581 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2002 | POINT (9.96667 50.15000) | 120 | 0.033871 | 5.121071 | 5.777419 | 8.711332 | 13.648706 | 17.796333 | 17.810965 | 18.760643 | 12.582668 | ... | 1.030000 | 3.688000 | 8.583226 | 11.736998 | 12.528385 | 14.005805 | 7.406667 | 4.896451 | 3.722333 | -1.399677 |
POINT (9.96667 50.95000) | 131 | 0.667097 | 5.050714 | 4.759677 | 7.534667 | 13.160967 | 16.805666 | 16.781612 | 18.105484 | 11.913998 | ... | 0.093871 | 2.317333 | 8.173548 | 11.191999 | 11.810968 | 13.421612 | 6.840666 | 4.556774 | 3.135667 | -2.176774 | |
POINT (9.96667 52.81670) | 131 | 2.531290 | 4.775357 | 4.818065 | 7.735334 | 14.055484 | 16.351336 | 17.142578 | 19.122257 | 13.324332 | ... | 0.981290 | 3.607333 | 9.721291 | 12.178999 | 13.313224 | 15.073547 | 9.269666 | 4.023226 | 2.149333 | -3.318065 | |
POINT (9.98333 49.76670) | 118 | 0.221613 | 5.803572 | 6.415483 | 9.338666 | 14.178065 | 18.913002 | 18.554195 | 19.333548 | 13.619668 | ... | 1.756452 | 4.493666 | 9.242903 | 13.312331 | 13.494192 | 14.762579 | 8.864333 | 5.864517 | 4.532332 | -0.294839 | |
POINT (9.98333 53.36670) | 127 | 3.214839 | 5.342857 | 5.029356 | 8.083667 | 13.778385 | 16.390997 | 17.183872 | 19.513546 | 14.155000 | ... | 1.511613 | 3.936667 | 9.845483 | 12.234335 | 13.443872 | 15.924195 | 10.457666 | 4.634194 | 2.433000 | -3.009355 |
4729 rows × 25 columns
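The result has one row per (year, geometry) pair, with the monthly values spread over columns named like `mean_temperature|1`. As a rough sketch of what such an index-based join does, the following standard-library example merges two toy tables on their shared keys. The function `join_on_index` is hypothetical, the values are rounded from the output above, and the sketch assumes an outer join; the text does not say which join strategy join_dataframes actually uses.

```python
def join_on_index(tables):
    """Outer-join dict tables keyed by (year, location) index tuples."""
    joined = {}
    for table in tables:
        for key, columns in table.items():
            # Create the row on first sight, then add this table's columns.
            joined.setdefault(key, {}).update(columns)
    return joined


# Toy tables (illustration only); keys are (year, location) tuples.
phenology = {
    (2000, "POINT (10 49.4833)"): {"day": 129},
    (2000, "POINT (10 50.85)"): {"day": 120},
}
weather = {
    (2000, "POINT (10 49.4833)"): {"mean_temperature|1": 0.32, "mean_temperature|2": 3.66},
    (2000, "POINT (10 50.85)"): {"mean_temperature|1": 0.94, "mean_temperature|2": 3.80},
}

combined = join_on_index([phenology, weather])
for key, row in sorted(combined.items()):
    print(key, row)
```

Because both tables share the same index, each output row simply collects the columns from every input table.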
From datasets to workflow
We've already had a sneak preview of YAML for individual datasets. We can also combine the two datasets into what we call a "workflow".
from springtime.main import Workflow, Session
workflow = Workflow(datasets={"pep725": pep725, "eobs": eobs})
print(workflow)
Workflow(
    datasets={
        'pep725': PEP725Phenor(
            dataset='PEP725Phenor',
            years=YearRange(start=2000, end=2002),
            credential_file=PosixPath('/home/peter/.config/springtime/pep725_credentials.txt'),
            species='Syringa vulgaris',
            phenophase=None,
            include_cols=['year', 'geometry', 'day'],
            area=NamedArea(
                name='Germany',
                bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153)
            )
        ),
        'eobs': EOBS(
            dataset='E-OBS',
            years=YearRange(start=2000, end=2002),
            product_type='ensemble_mean',
            variables=['mean_temperature', 'minimum_temperature'],
            grid_resolution='0.1deg',
            version='26.0e',
            points=PointsFromOther(source='pep725'),
            keep_grid_location=False,
            area=NamedArea(
                name='Germany',
                bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153)
            ),
            minimize_cache=False,
            resample={}
        )
    }
)
To execute the workflow, we first create a session and set the log level to info. This reports a bit more about the progress; the resulting data is automatically stored in a dedicated output folder.
session = Session()
workflow.execute(session)
INFO:springtime.main:Dataset pep725 loaded with 4723 rows
INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)
INFO:springtime.main:Dataset eobs loaded with 4723 rows
INFO:springtime.main:Datasets joined to shape: (4729, 25)
INFO:springtime.main:Data saved to: /tmp/output/data.csv
Workflows can also be represented in recipes.
recipe = workflow.to_recipe()
print(recipe)
datasets:
  pep725:
    dataset: PEP725Phenor
    years:
    - 2000
    - 2002
    credential_file: /home/peter/.config/springtime/pep725_credentials.txt
    species: Syringa vulgaris
    phenophase: null
    include_cols:
    - year
    - geometry
    - day
    area:
      name: Germany
      bbox:
      - 5.98865807458
      - 47.3024876979
      - 15.0169958839
      - 54.983104153
  eobs:
    dataset: E-OBS
    years:
    - 2000
    - 2002
    product_type: ensemble_mean
    variables:
    - mean_temperature
    - minimum_temperature
    grid_resolution: 0.1deg
    version: 26.0e
    points:
      source: pep725
    keep_grid_location: false
    area:
      name: Germany
      bbox:
      - 5.98865807458
      - 47.3024876979
      - 15.0169958839
      - 54.983104153
    minimize_cache: false
    resample: {}
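To reuse a recipe later, it can be written to a file. The sketch below uses only the standard library and assumes that the recipe is available as a plain string (as the print output above suggests). The helper `save_recipe` and the shortened placeholder text are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path

# Placeholder recipe text; in the session above this would be the value
# returned by workflow.to_recipe() (assumed here to be a plain str).
recipe_text = "datasets:\n  pep725:\n    dataset: PEP725Phenor\n"


def save_recipe(text: str, path: Path) -> Path:
    """Write the recipe text to `path` and return the path."""
    path.write_text(text, encoding="utf-8")
    return path


# A temporary directory keeps the example from cluttering the working directory.
output_dir = Path(tempfile.mkdtemp())
recipe_path = save_recipe(recipe_text, output_dir / "recipe_pep_eobs.yaml")
print(recipe_path.read_text(encoding="utf-8"))
```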
Springtime's command line interface
Springtime recipes can also be executed from the command line. If we saved the recipe above as recipe_pep_eobs.yaml, we could execute it as follows:
springtime recipe_pep_eobs.yaml
The springtime command becomes available after installing springtime into your Python environment (for example, with pip).
Executing recipes from the command line makes it easy to automate tasks or submit them as long-running compute jobs.
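As an illustration of such automation, the following sketch runs several recipes in sequence from a Python script. It relies only on the `springtime <recipe>` invocation shown above. The helper names, the `dry_run` switch, and the second recipe file name are assumptions made for this example, not part of springtime.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(recipe: Path) -> list[str]:
    """Return the CLI invocation shown above for one recipe file."""
    return ["springtime", str(recipe)]


def run_recipes(recipes: list[Path], dry_run: bool = True) -> list[list[str]]:
    """Run each recipe in turn; with dry_run=True only collect the commands."""
    commands = [build_command(recipe) for recipe in recipes]
    if not dry_run:
        if shutil.which("springtime") is None:
            raise RuntimeError("the springtime command is not on PATH")
        for command in commands:
            subprocess.run(command, check=True)  # stop at the first failure
    return commands


# Hypothetical recipe file names, used for illustration only.
planned = run_recipes([Path("recipe_pep_eobs.yaml"), Path("recipe_other.yaml")])
for command in planned:
    print(" ".join(command))
```

With `dry_run=True` the script only prints the commands it would run, which is a convenient check before submitting a long-running job; passing `dry_run=False` executes them one after another.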