from springtime.datasets import EOBS
from springtime.utils import germany
ds_eobs = EOBS(
years=["2000", "2002"], # pyright: ignore (https://t.ly/gukmj)
variables=[
"mean_temperature",
"minimum_temperature",
],
area=germany,
)
print(ds_eobs)
ds_eobs.download()
EOBS( dataset='E-OBS', years=YearRange(start=2000, end=2002), product_type='ensemble_mean', variables=['mean_temperature', 'minimum_temperature'], grid_resolution='0.1deg', version='26.0e', points=None, keep_grid_location=False, area=NamedArea( name='Germany', bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153) ), minimize_cache=False, resample=None )
INFO:springtime.datasets.eobs:Locating data INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002... INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002... INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
[PosixPath('/home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc'), PosixPath('/home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc')]
The data comes in netCDF format, so we represent the raw data as an xarray object.
ds = ds_eobs.raw_load()
ds
INFO:springtime.datasets.eobs:Locating data INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002... INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002... INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
<xarray.Dataset> Dimensions: (latitude: 465, longitude: 705, time: 5844) Coordinates: * latitude (latitude) float64 25.05 25.15 25.25 ... 71.35 71.45 * longitude (longitude) float64 -24.95 -24.85 ... 45.35 45.45 * time (time) datetime64[ns] 1995-01-01 ... 2010-12-31 Data variables: mean_temperature (time, latitude, longitude) float32 dask.array<chunksize=(1525, 120, 183), meta=np.ndarray> minimum_temperature (time, latitude, longitude) float32 dask.array<chunksize=(1525, 120, 183), meta=np.ndarray> Attributes: E-OBS_version: 26.0e Conventions: CF-1.4 References: http://surfobs.climate.copernicus.eu/dataaccess/access_eo... history: Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222... NCO: netCDF Operators version 4.7.5 (Homepage = http://nco.sf....
Minimizing cache¶
As you can see, the raw EOBS data span a larger domain and longer time period than we specified. The servers don't offer more fine-grained downloads. Thus, the first thing that the load
function will do is extract the years and area specified in the dataset definition.
Normally, all the raw data will be stored in your springtime cache directory. This makes it easy to load other years or areas without re-downloading. However, if you want to save on disk space, you can set minimize_cache
to true.
Additional options for load
¶
Clearly, we need to do some more tweaking to reformat and extract the relevant data, in order to match our standardized data format.
Firstly, notice that eobs has a time dimension that spans more than one record per year, whereas phenological datasets typically have only one unique row for each year/location. Thus, we need to reshape and/or aggregate the data.
Secondly, we need to extract only those points that are of interest. Typically, we will first download observations (e.g. pep725) and then the corresponding grid points from E-OBS.
Dealing with time¶
We start with the time dimension. While it is not impossible to work with daily data, for this example we are first going to resample it to monthly sums instead. Then, we'll split the time dimension in two: year and day of year.
# TODO: move to easier path?
from springtime.datasets.eobs import split_time
ds = ds.resample(time="M").mean() # [1]
ds = split_time(ds)
ds
# [1] see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a full list
/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self. values_as_series = pd.Series(values.ravel(), copy=False)
<xarray.Dataset> Dimensions: (year: 16, timeinyear: 23, latitude: 465, longitude: 705) Coordinates: * year (year) int64 1995 1996 1997 1998 ... 2008 2009 2010 * timeinyear (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366 * latitude (latitude) float64 25.05 25.15 25.25 ... 71.35 71.45 * longitude (longitude) float64 -24.95 -24.85 ... 45.35 45.45 Data variables: mean_temperature (latitude, longitude, year, timeinyear) float32 dask.array<chunksize=(120, 183, 1, 23), meta=np.ndarray> minimum_temperature (latitude, longitude, year, timeinyear) float32 dask.array<chunksize=(120, 183, 1, 23), meta=np.ndarray> Attributes: E-OBS_version: 26.0e Conventions: CF-1.4 References: http://surfobs.climate.copernicus.eu/dataaccess/access_eo... history: Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222... NCO: netCDF Operators version 4.7.5 (Homepage = http://nco.sf....
Extracing points / alignment with observations¶
Next, we noted that e-obs is a gridded dataset, but we want to retrieve only those points for which we have observations, so let's extract those. Two utility functions are available for this: extract points, or extract records. The difference is that extract records also takes the year index into account.
Let's illustrate this starting with a few points:
import geopandas as gpd
from springtime.utils import extract_points
points_pep725 = gpd.GeoSeries(gpd.points_from_xy(x=[0, 5, 7], y=[5, 10, 12]))
extract_points(ds, points_pep725)
<xarray.Dataset> Dimensions: (year: 16, timeinyear: 23, geometry: 3) Coordinates: * year (year) int64 1995 1996 1997 1998 ... 2008 2009 2010 * timeinyear (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366 * geometry (geometry) object POINT (0 5) POINT (5 10) POINT (7 12) Data variables: mean_temperature (geometry, year, timeinyear) float32 dask.array<chunksize=(3, 1, 23), meta=np.ndarray> minimum_temperature (geometry, year, timeinyear) float32 dask.array<chunksize=(3, 1, 23), meta=np.ndarray> Attributes: E-OBS_version: 26.0e Conventions: CF-1.4 References: http://surfobs.climate.copernicus.eu/dataaccess/access_eo... history: Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222... NCO: netCDF Operators version 4.7.5 (Homepage = http://nco.sf....
We've received 3 points, as expected. Notice that we've made a little effort to pass our points as a geopandas array. This makes it very easy to reuse points from other datasets. For example:
from springtime.datasets import PEP725Phenor
df_pep725 = PEP725Phenor(
species="Syringa vulgaris",
years=[2000, 2002],
area=germany,
).load()
# Use points from pep725
extract_points(ds, df_pep725.geometry)
<xarray.Dataset> Dimensions: (year: 16, timeinyear: 23, geometry: 1780) Coordinates: * year (year) int64 1995 1996 1997 1998 ... 2008 2009 2010 * timeinyear (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366 * geometry (geometry) object POINT (13.2333 47.7833) ... POINT ... Data variables: mean_temperature (geometry, year, timeinyear) float32 dask.array<chunksize=(1780, 1, 23), meta=np.ndarray> minimum_temperature (geometry, year, timeinyear) float32 dask.array<chunksize=(1780, 1, 23), meta=np.ndarray> Attributes: E-OBS_version: 26.0e Conventions: CF-1.4 References: http://surfobs.climate.copernicus.eu/dataaccess/access_eo... history: Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222... NCO: netCDF Operators version 4.7.5 (Homepage = http://nco.sf....
That's very convenient! However, we ended up with 1780 unique locations * 16 years = 28480 records, much more than the 4723 observation dataframe! That's because the observations are not taken at the same location each year. Instead, we want to make sure we have collocated pep725 and eobs data. To this end, we can use the extract_records
method:
from springtime.utils import extract_records
ds = extract_records(ds, df_pep725)
ds
<xarray.Dataset> Dimensions: (index: 4723, timeinyear: 23) Coordinates: year (index) int64 2001 2000 2002 2002 ... 2001 2002 2000 * timeinyear (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366 * index (index) int64 0 1 2 3 4 5 ... 4718 4719 4720 4721 4722 geometry (index) object POINT (13.2333 47.7833) ... POINT (11... Data variables: mean_temperature (index, timeinyear) float32 dask.array<chunksize=(4723, 23), meta=np.ndarray> minimum_temperature (index, timeinyear) float32 dask.array<chunksize=(4723, 23), meta=np.ndarray> Attributes: E-OBS_version: 26.0e Conventions: CF-1.4 References: http://surfobs.climate.copernicus.eu/dataaccess/access_eo... history: Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222... NCO: netCDF Operators version 4.7.5 (Homepage = http://nco.sf....
In this process, we choose the eobs grid cell that closest to the observations, recognizing that it might not be the exact same point. However, in order to join the datasets later on, the final dataframe retains the input coordinates.
At this stage, most of the heavy lifting is done, and the size of the total dataset is substantially reduced. Now, we can convert our data to a dataframe.
df_eobs = ds.to_dataframe()
df_eobs
year | mean_temperature | minimum_temperature | geometry | ||
---|---|---|---|---|---|
index | timeinyear | ||||
0 | 31 | 2001 | -0.027742 | -3.352580 | POINT (13.2333 47.7833) |
59 | 2001 | 1.162500 | -1.963571 | POINT (13.2333 47.7833) | |
60 | 2001 | NaN | NaN | POINT (13.2333 47.7833) | |
90 | 2001 | 5.345483 | 1.442258 | POINT (13.2333 47.7833) | |
91 | 2001 | NaN | NaN | POINT (13.2333 47.7833) | |
... | ... | ... | ... | ... | ... |
4722 | 305 | 2000 | 10.458709 | 7.613226 | POINT (11.9 50.65) |
334 | 2000 | NaN | NaN | POINT (11.9 50.65) | |
335 | 2000 | 5.600334 | 2.783000 | POINT (11.9 50.65) | |
365 | 2000 | NaN | NaN | POINT (11.9 50.65) | |
366 | 2000 | 2.445484 | 0.127742 | POINT (11.9 50.65) |
108629 rows × 4 columns
Notice that the DOY is still an index column. Since we want only one record per location/year, we can stack the DOY column and combine it with the variable name. Effectively, it means we treat the cumulative temperature for each month as a separate predictor.
The EOBS loader has this build in under the hood, such that we can do:
df_eobs = ds_eobs._to_dataframe(ds)
df_eobs
year | geometry | mean_temperature|31 | mean_temperature|59 | mean_temperature|60 | mean_temperature|90 | mean_temperature|91 | mean_temperature|120 | mean_temperature|121 | mean_temperature|151 | ... | minimum_temperature|243 | minimum_temperature|244 | minimum_temperature|273 | minimum_temperature|274 | minimum_temperature|304 | minimum_temperature|305 | minimum_temperature|334 | minimum_temperature|335 | minimum_temperature|365 | minimum_temperature|366 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2001 | POINT (13.23330 47.78330) | -0.027742 | 1.162500 | NaN | 5.345483 | NaN | 5.289666 | NaN | 14.228063 | ... | 13.442905 | NaN | 6.644666 | NaN | 9.290646 | NaN | -1.518667 | NaN | -6.672903 | NaN |
1 | 2000 | POINT (13.23330 47.78330) | -2.406774 | NaN | 1.763793 | NaN | 2.866452 | NaN | 9.612332 | NaN | ... | NaN | 13.460967 | NaN | 9.547999 | NaN | 6.900968 | NaN | 1.890333 | NaN | -0.024516 |
2 | 2002 | POINT (13.23330 47.78330) | -0.748709 | 3.706428 | NaN | 4.863225 | NaN | 6.443333 | NaN | 13.560322 | ... | 12.848707 | NaN | 7.639001 | NaN | 4.716774 | NaN | 2.709000 | NaN | -1.445806 | NaN |
3 | 2002 | POINT (14.88330 48.68330) | -2.181936 | 3.039643 | NaN | 3.765162 | NaN | 6.581666 | NaN | 14.010002 | ... | 12.012580 | NaN | 6.130666 | NaN | 2.722903 | NaN | 0.613667 | NaN | -4.385161 | NaN |
4 | 2000 | POINT (14.88330 48.68330) | -4.247742 | NaN | 1.692069 | NaN | 2.593548 | NaN | 9.577000 | NaN | ... | NaN | 10.741290 | NaN | 7.121333 | NaN | 5.788710 | NaN | -0.049667 | NaN | -3.313871 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4718 | 2002 | POINT (11.98330 50.70000) | 0.469032 | 4.536786 | NaN | 4.693226 | NaN | 6.893667 | NaN | 13.538388 | ... | 14.494839 | NaN | 8.478333 | NaN | 4.596774 | NaN | 2.489667 | NaN | -3.375161 | NaN |
4719 | 2000 | POINT (11.98330 50.70000) | 0.211290 | NaN | 3.563103 | NaN | 4.662581 | NaN | 10.020000 | NaN | ... | NaN | 12.347098 | NaN | 10.091666 | NaN | 7.811936 | NaN | 2.919333 | NaN | 0.433226 |
4720 | 2001 | POINT (11.98330 50.70000) | 0.044839 | 2.016071 | NaN | 3.662258 | NaN | 6.862000 | NaN | 13.639998 | ... | 13.560323 | NaN | 8.660666 | NaN | 8.577096 | NaN | 0.671333 | NaN | -3.940323 | NaN |
4721 | 2002 | POINT (11.90000 50.65000) | 0.109032 | 4.131786 | NaN | 4.365160 | NaN | 6.540333 | NaN | 13.248710 | ... | 14.109676 | NaN | 8.086999 | NaN | 4.349354 | NaN | 2.244334 | NaN | -3.562258 | NaN |
4722 | 2000 | POINT (11.90000 50.65000) | -0.186129 | NaN | 3.163793 | NaN | 4.261935 | NaN | 9.729666 | NaN | ... | NaN | 12.146775 | NaN | 9.836000 | NaN | 7.613226 | NaN | 2.783000 | NaN | 0.127742 |
4723 rows × 48 columns
Summary¶
We started with a raw_load
of the E-OBS data. After going through all the nitty-gritty details, we can appreciate all the work that happens under the hood when we call load directly:
eobs = EOBS(
area=germany,
years=["2000", "2002"],
variables=["mean_temperature", "minimum_temperature"],
resample={"frequency": "M", "operator": "mean"},
points=[(5, 10), (10, 12)],
)
eobs.load()
INFO:springtime.datasets.eobs:Locating data INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002... INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002... INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc /home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self. values_as_series = pd.Series(values.ravel(), copy=False)
year | geometry | mean_temperature|1 | mean_temperature|2 | mean_temperature|3 | mean_temperature|4 | mean_temperature|5 | mean_temperature|6 | mean_temperature|7 | mean_temperature|8 | ... | minimum_temperature|3 | minimum_temperature|4 | minimum_temperature|5 | minimum_temperature|6 | minimum_temperature|7 | minimum_temperature|8 | minimum_temperature|9 | minimum_temperature|10 | minimum_temperature|11 | minimum_temperature|12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | POINT (5.00000 10.00000) | 2.223871 | 5.508276 | 7.324193 | 11.011667 | 15.914517 | 18.658998 | 17.424194 | 20.047421 | ... | 3.017419 | 6.017999 | 10.400967 | 12.795667 | 12.645484 | 14.570322 | 11.682666 | 8.179032 | 4.663000 | 3.367419 |
1 | 2000 | POINT (10.00000 12.00000) | -4.589999 | -1.344483 | -0.575161 | 4.090333 | 9.328388 | 12.574332 | 10.460644 | 14.224192 | ... | -3.975806 | -0.612000 | 4.156128 | 7.323998 | 6.108064 | 9.272257 | 5.804999 | 3.015162 | -2.132333 | -2.679032 |
2 | 2001 | POINT (5.00000 10.00000) | 4.325484 | 4.837501 | 8.908065 | 8.607333 | 16.301291 | 16.412333 | 19.994516 | 20.144192 | ... | 5.522581 | 4.552334 | 11.228709 | 11.025332 | 14.656127 | 14.453549 | 8.990667 | 9.785806 | 0.957333 | -1.507742 |
3 | 2001 | POINT (10.00000 12.00000) | -2.454194 | -1.650714 | 2.299355 | 0.978000 | 10.014839 | 9.487666 | 13.453225 | 14.326451 | ... | -1.291936 | -2.628334 | 4.744194 | 4.663000 | 8.377419 | 9.411936 | 2.642333 | 5.125806 | -4.241666 | -9.656453 |
4 | 2002 | POINT (5.00000 10.00000) | 2.929678 | 6.716786 | 8.146128 | 10.153666 | 12.974515 | 19.411001 | 18.343225 | 18.767416 | ... | 3.210645 | 4.024333 | 8.091290 | 13.437666 | 12.823548 | 13.938711 | 9.912333 | 7.014516 | 5.327667 | 3.429355 |
5 | 2002 | POINT (10.00000 12.00000) | -2.547742 | 0.474286 | 1.406774 | 3.338666 | 8.437097 | 13.611666 | 12.912580 | 12.243550 | ... | -3.157097 | -0.982000 | 3.709032 | 8.148667 | 8.189032 | 8.149031 | 4.204667 | 1.624516 | -1.128333 | -3.478387 |
6 rows × 26 columns
We can also represent this dataset as a recipe for easy sharing and reproducability.
print(eobs.to_recipe())
dataset: E-OBS years: - 2000 - 2002 product_type: ensemble_mean variables: - mean_temperature - minimum_temperature grid_resolution: 0.1deg version: 26.0e points: - - 5.0 - 10.0 - - 10.0 - 12.0 keep_grid_location: false area: name: Germany bbox: - 5.98865807458 - 47.3024876979 - 15.0169958839 - 54.983104153 minimize_cache: false resample: {}