E-OBS¶

Gridded meteorological observations over Europe from E-OBS

Basic download & raw data¶

For downloading e-obs we don't make use of an existing library. Instead, we simply download the data files directly from the source.

In [2]:

Copied!





from springtime.datasets import EOBS
from springtime.utils import germany

ds_eobs = EOBS(
    years=["2000", "2002"],  # pyright: ignore (https://t.ly/gukmj)
    variables=[
        "mean_temperature",
        "minimum_temperature",
    ],
    area=germany,
)
print(ds_eobs)
ds_eobs.download()
from springtime.datasets import EOBS
from springtime.utils import germany

ds_eobs = EOBS(
    years=["2000", "2002"],  # pyright: ignore (https://t.ly/gukmj)
    variables=[
        "mean_temperature",
        "minimum_temperature",
    ],
    area=germany,
)
print(ds_eobs)
ds_eobs.download()

EOBS(
    dataset='E-OBS',
    years=YearRange(start=2000, end=2002),
    product_type='ensemble_mean',
    variables=['mean_temperature', 'minimum_temperature'],
    grid_resolution='0.1deg',
    version='26.0e',
    points=None,
    keep_grid_location=False,
    area=NamedArea(
        name='Germany',
        bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153)
    ),
    minimize_cache=False,
    resample=None
)

INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc

Out[2]:

[PosixPath('/home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc'),
 PosixPath('/home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc')]

The data comes in netCDF format, so we represent the raw data as an xarray object.

In [3]:

Copied!

ds = ds_eobs.raw_load()
ds
ds = ds_eobs.raw_load()
ds

INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc

Out[3]:

<xarray.Dataset>
Dimensions:              (latitude: 465, longitude: 705, time: 5844)
Coordinates:
  * latitude             (latitude) float64 25.05 25.15 25.25 ... 71.35 71.45
  * longitude            (longitude) float64 -24.95 -24.85 ... 45.35 45.45
  * time                 (time) datetime64[ns] 1995-01-01 ... 2010-12-31
Data variables:
    mean_temperature     (time, latitude, longitude) float32 dask.array<chunksize=(1525, 120, 183), meta=np.ndarray>
    minimum_temperature  (time, latitude, longitude) float32 dask.array<chunksize=(1525, 120, 183), meta=np.ndarray>
Attributes:
    E-OBS_version:  26.0e
    Conventions:    CF-1.4
    References:     http://surfobs.climate.copernicus.eu/dataaccess/access_eo...
    history:        Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222...
    NCO:            netCDF Operators version 4.7.5 (Homepage = http://nco.sf....

Minimizing cache¶

As you can see, the raw EOBS data span a larger domain and longer time period than we specified. The servers don't offer more fine-grained downloads. Thus, the first thing that the load function will do is extract the years and area specified in the dataset definition.

Normally, all the raw data will be stored in your springtime cache directory. This makes it easy to load other years or areas without re-downloading. However, if you want to save on disk space, you can set minimize_cache to true.

Additional options for `load`¶

Clearly, we need to do some more tweaking to reformat and extract the relevant data, in order to match our standardized data format.

Firstly, notice that eobs has a time dimension that spans more than one record per year, whereas phenological datasets typically have only one unique row for each year/location. Thus, we need to reshape and/or aggregate the data.

Secondly, we need to extract only those points that are of interest. Typically, we will first download observations (e.g. pep725) and then the corresponding grid points from E-OBS.

Dealing with time¶

We start with the time dimension. While it is not impossible to work with daily data, for this example we are first going to resample it to monthly sums instead. Then, we'll split the time dimension in two: year and day of year.

In [4]:

Copied!





# TODO: move to easier path?
from springtime.datasets.eobs import split_time

ds = ds.resample(time="M").mean()  # [1]
ds = split_time(ds)
ds

# [1] see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a full list
# TODO: move to easier path?
from springtime.datasets.eobs import split_time

ds = ds.resample(time="M").mean()  # [1]
ds = split_time(ds)
ds

# [1] see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a full list

/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)

Out[4]:

<xarray.Dataset>
Dimensions:              (year: 16, timeinyear: 23, latitude: 465,
                          longitude: 705)
Coordinates:
  * year                 (year) int64 1995 1996 1997 1998 ... 2008 2009 2010
  * timeinyear           (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366
  * latitude             (latitude) float64 25.05 25.15 25.25 ... 71.35 71.45
  * longitude            (longitude) float64 -24.95 -24.85 ... 45.35 45.45
Data variables:
    mean_temperature     (latitude, longitude, year, timeinyear) float32 dask.array<chunksize=(120, 183, 1, 23), meta=np.ndarray>
    minimum_temperature  (latitude, longitude, year, timeinyear) float32 dask.array<chunksize=(120, 183, 1, 23), meta=np.ndarray>
Attributes:
    E-OBS_version:  26.0e
    Conventions:    CF-1.4
    References:     http://surfobs.climate.copernicus.eu/dataaccess/access_eo...
    history:        Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222...
    NCO:            netCDF Operators version 4.7.5 (Homepage = http://nco.sf....

Extracing points / alignment with observations¶

Next, we noted that e-obs is a gridded dataset, but we want to retrieve only those points for which we have observations, so let's extract those. Two utility functions are available for this: extract points, or extract records. The difference is that extract records also takes the year index into account.

Let's illustrate this starting with a few points:

In [5]:

Copied!

import geopandas as gpd
from springtime.utils import extract_points

points_pep725 = gpd.GeoSeries(gpd.points_from_xy(x=[0, 5, 7], y=[5, 10, 12]))
extract_points(ds, points_pep725)
import geopandas as gpd
from springtime.utils import extract_points

points_pep725 = gpd.GeoSeries(gpd.points_from_xy(x=[0, 5, 7], y=[5, 10, 12]))
extract_points(ds, points_pep725)

Out[5]:

<xarray.Dataset>
Dimensions:              (year: 16, timeinyear: 23, geometry: 3)
Coordinates:
  * year                 (year) int64 1995 1996 1997 1998 ... 2008 2009 2010
  * timeinyear           (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366
  * geometry             (geometry) object POINT (0 5) POINT (5 10) POINT (7 12)
Data variables:
    mean_temperature     (geometry, year, timeinyear) float32 dask.array<chunksize=(3, 1, 23), meta=np.ndarray>
    minimum_temperature  (geometry, year, timeinyear) float32 dask.array<chunksize=(3, 1, 23), meta=np.ndarray>
Attributes:
    E-OBS_version:  26.0e
    Conventions:    CF-1.4
    References:     http://surfobs.climate.copernicus.eu/dataaccess/access_eo...
    history:        Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222...
    NCO:            netCDF Operators version 4.7.5 (Homepage = http://nco.sf....

We've received 3 points, as expected. Notice that we've made a little effort to pass our points as a geopandas array. This makes it very easy to reuse points from other datasets. For example:

In [6]:

Copied!





from springtime.datasets import PEP725Phenor

df_pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=[2000, 2002],
    area=germany,
).load()

# Use points from pep725
extract_points(ds, df_pep725.geometry)
from springtime.datasets import PEP725Phenor

df_pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=[2000, 2002],
    area=germany,
).load()

# Use points from pep725
extract_points(ds, df_pep725.geometry)

Out[6]:

<xarray.Dataset>
Dimensions:              (year: 16, timeinyear: 23, geometry: 1780)
Coordinates:
  * year                 (year) int64 1995 1996 1997 1998 ... 2008 2009 2010
  * timeinyear           (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366
  * geometry             (geometry) object POINT (13.2333 47.7833) ... POINT ...
Data variables:
    mean_temperature     (geometry, year, timeinyear) float32 dask.array<chunksize=(1780, 1, 23), meta=np.ndarray>
    minimum_temperature  (geometry, year, timeinyear) float32 dask.array<chunksize=(1780, 1, 23), meta=np.ndarray>
Attributes:
    E-OBS_version:  26.0e
    Conventions:    CF-1.4
    References:     http://surfobs.climate.copernicus.eu/dataaccess/access_eo...
    history:        Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222...
    NCO:            netCDF Operators version 4.7.5 (Homepage = http://nco.sf....

That's very convenient! However, we ended up with 1780 unique locations * 16 years = 28480 records, much more than the 4723 observation dataframe! That's because the observations are not taken at the same location each year. Instead, we want to make sure we have collocated pep725 and eobs data. To this end, we can use the extract_records method:

In [7]:

Copied!

from springtime.utils import extract_records

ds = extract_records(ds, df_pep725)
ds
from springtime.utils import extract_records

ds = extract_records(ds, df_pep725)
ds

Out[7]:

<xarray.Dataset>
Dimensions:              (index: 4723, timeinyear: 23)
Coordinates:
    year                 (index) int64 2001 2000 2002 2002 ... 2001 2002 2000
  * timeinyear           (timeinyear) int64 31 59 60 90 91 ... 334 335 365 366
  * index                (index) int64 0 1 2 3 4 5 ... 4718 4719 4720 4721 4722
    geometry             (index) object POINT (13.2333 47.7833) ... POINT (11...
Data variables:
    mean_temperature     (index, timeinyear) float32 dask.array<chunksize=(4723, 23), meta=np.ndarray>
    minimum_temperature  (index, timeinyear) float32 dask.array<chunksize=(4723, 23), meta=np.ndarray>
Attributes:
    E-OBS_version:  26.0e
    Conventions:    CF-1.4
    References:     http://surfobs.climate.copernicus.eu/dataaccess/access_eo...
    history:        Thu Sep 29 13:17:18 2022: ncks --no-abc -d time,16436,222...
    NCO:            netCDF Operators version 4.7.5 (Homepage = http://nco.sf....

In this process, we choose the eobs grid cell that closest to the observations, recognizing that it might not be the exact same point. However, in order to join the datasets later on, the final dataframe retains the input coordinates.

At this stage, most of the heavy lifting is done, and the size of the total dataset is substantially reduced. Now, we can convert our data to a dataframe.

In [8]:

Copied!

df_eobs = ds.to_dataframe()
df_eobs
df_eobs = ds.to_dataframe()
df_eobs

Out[8]:

		year	mean_temperature	minimum_temperature	geometry
index	timeinyear
0	31	2001	-0.027742	-3.352580	POINT (13.2333 47.7833)
	59	2001	1.162500	-1.963571	POINT (13.2333 47.7833)
	60	2001	NaN	NaN	POINT (13.2333 47.7833)
	90	2001	5.345483	1.442258	POINT (13.2333 47.7833)
	91	2001	NaN	NaN	POINT (13.2333 47.7833)
...	...	...	...	...	...
4722	305	2000	10.458709	7.613226	POINT (11.9 50.65)
	334	2000	NaN	NaN	POINT (11.9 50.65)
	335	2000	5.600334	2.783000	POINT (11.9 50.65)
	365	2000	NaN	NaN	POINT (11.9 50.65)
	366	2000	2.445484	0.127742	POINT (11.9 50.65)

108629 rows × 4 columns

Notice that the DOY is still an index column. Since we want only one record per location/year, we can stack the DOY column and combine it with the variable name. Effectively, it means we treat the cumulative temperature for each month as a separate predictor.

The EOBS loader has this build in under the hood, such that we can do:

In [9]:

Copied!

df_eobs = ds_eobs._to_dataframe(ds)
df_eobs
df_eobs = ds_eobs._to_dataframe(ds)
df_eobs

Out[9]:

	year	geometry	mean_temperature\|31	mean_temperature\|59	mean_temperature\|60	mean_temperature\|90	mean_temperature\|91	mean_temperature\|120	mean_temperature\|121	mean_temperature\|151	...	minimum_temperature\|243	minimum_temperature\|244	minimum_temperature\|273	minimum_temperature\|274	minimum_temperature\|304	minimum_temperature\|305	minimum_temperature\|334	minimum_temperature\|335	minimum_temperature\|365	minimum_temperature\|366
0	2001	POINT (13.23330 47.78330)	-0.027742	1.162500	NaN	5.345483	NaN	5.289666	NaN	14.228063	...	13.442905	NaN	6.644666	NaN	9.290646	NaN	-1.518667	NaN	-6.672903	NaN
1	2000	POINT (13.23330 47.78330)	-2.406774	NaN	1.763793	NaN	2.866452	NaN	9.612332	NaN	...	NaN	13.460967	NaN	9.547999	NaN	6.900968	NaN	1.890333	NaN	-0.024516
2	2002	POINT (13.23330 47.78330)	-0.748709	3.706428	NaN	4.863225	NaN	6.443333	NaN	13.560322	...	12.848707	NaN	7.639001	NaN	4.716774	NaN	2.709000	NaN	-1.445806	NaN
3	2002	POINT (14.88330 48.68330)	-2.181936	3.039643	NaN	3.765162	NaN	6.581666	NaN	14.010002	...	12.012580	NaN	6.130666	NaN	2.722903	NaN	0.613667	NaN	-4.385161	NaN
4	2000	POINT (14.88330 48.68330)	-4.247742	NaN	1.692069	NaN	2.593548	NaN	9.577000	NaN	...	NaN	10.741290	NaN	7.121333	NaN	5.788710	NaN	-0.049667	NaN	-3.313871
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4718	2002	POINT (11.98330 50.70000)	0.469032	4.536786	NaN	4.693226	NaN	6.893667	NaN	13.538388	...	14.494839	NaN	8.478333	NaN	4.596774	NaN	2.489667	NaN	-3.375161	NaN
4719	2000	POINT (11.98330 50.70000)	0.211290	NaN	3.563103	NaN	4.662581	NaN	10.020000	NaN	...	NaN	12.347098	NaN	10.091666	NaN	7.811936	NaN	2.919333	NaN	0.433226
4720	2001	POINT (11.98330 50.70000)	0.044839	2.016071	NaN	3.662258	NaN	6.862000	NaN	13.639998	...	13.560323	NaN	8.660666	NaN	8.577096	NaN	0.671333	NaN	-3.940323	NaN
4721	2002	POINT (11.90000 50.65000)	0.109032	4.131786	NaN	4.365160	NaN	6.540333	NaN	13.248710	...	14.109676	NaN	8.086999	NaN	4.349354	NaN	2.244334	NaN	-3.562258	NaN
4722	2000	POINT (11.90000 50.65000)	-0.186129	NaN	3.163793	NaN	4.261935	NaN	9.729666	NaN	...	NaN	12.146775	NaN	9.836000	NaN	7.613226	NaN	2.783000	NaN	0.127742

4723 rows × 48 columns

Summary¶

We started with a raw_load of the E-OBS data. After going through all the nitty-gritty details, we can appreciate all the work that happens under the hood when we call load directly:

In [10]:

Copied!





eobs = EOBS(
    area=germany,
    years=["2000", "2002"],
    variables=["mean_temperature", "minimum_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=[(5, 10), (10, 12)],
)
eobs.load()
eobs = EOBS(
    area=germany,
    years=["2000", "2002"],
    variables=["mean_temperature", "minimum_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=[(5, 10), (10, 12)],
)
eobs.load()

INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
INFO:springtime.datasets.eobs:Looking for variable minimum_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc
/home/peter/mambaforge/envs/springtime/lib/python3.10/site-packages/xarray/core/accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)

Out[10]:

	year	geometry	mean_temperature\|1	mean_temperature\|2	mean_temperature\|3	mean_temperature\|4	mean_temperature\|5	mean_temperature\|6	mean_temperature\|7	mean_temperature\|8	...	minimum_temperature\|3	minimum_temperature\|4	minimum_temperature\|5	minimum_temperature\|6	minimum_temperature\|7	minimum_temperature\|8	minimum_temperature\|9	minimum_temperature\|10	minimum_temperature\|11	minimum_temperature\|12
0	2000	POINT (5.00000 10.00000)	2.223871	5.508276	7.324193	11.011667	15.914517	18.658998	17.424194	20.047421	...	3.017419	6.017999	10.400967	12.795667	12.645484	14.570322	11.682666	8.179032	4.663000	3.367419
1	2000	POINT (10.00000 12.00000)	-4.589999	-1.344483	-0.575161	4.090333	9.328388	12.574332	10.460644	14.224192	...	-3.975806	-0.612000	4.156128	7.323998	6.108064	9.272257	5.804999	3.015162	-2.132333	-2.679032
2	2001	POINT (5.00000 10.00000)	4.325484	4.837501	8.908065	8.607333	16.301291	16.412333	19.994516	20.144192	...	5.522581	4.552334	11.228709	11.025332	14.656127	14.453549	8.990667	9.785806	0.957333	-1.507742
3	2001	POINT (10.00000 12.00000)	-2.454194	-1.650714	2.299355	0.978000	10.014839	9.487666	13.453225	14.326451	...	-1.291936	-2.628334	4.744194	4.663000	8.377419	9.411936	2.642333	5.125806	-4.241666	-9.656453
4	2002	POINT (5.00000 10.00000)	2.929678	6.716786	8.146128	10.153666	12.974515	19.411001	18.343225	18.767416	...	3.210645	4.024333	8.091290	13.437666	12.823548	13.938711	9.912333	7.014516	5.327667	3.429355
5	2002	POINT (10.00000 12.00000)	-2.547742	0.474286	1.406774	3.338666	8.437097	13.611666	12.912580	12.243550	...	-3.157097	-0.982000	3.709032	8.148667	8.189032	8.149031	4.204667	1.624516	-1.128333	-3.478387

6 rows × 26 columns

We can also represent this dataset as a recipe for easy sharing and reproducability.

In [11]:

Copied!

print(eobs.to_recipe())
print(eobs.to_recipe())

dataset: E-OBS
years:
- 2000
- 2002
product_type: ensemble_mean
variables:
- mean_temperature
- minimum_temperature
grid_resolution: 0.1deg
version: 26.0e
points:
- - 5.0
  - 10.0
- - 10.0
  - 12.0
keep_grid_location: false
area:
  name: Germany
  bbox:
  - 5.98865807458
  - 47.3024876979
  - 15.0169958839
  - 54.983104153
minimize_cache: false
resample: {}