# PEP725

Retrieve observations from the Pan European Phenological database (PEP725) using a pre-existing R library called phenor.
## Prerequisites

### Phenor

Phenor is written in R, so you need to have R installed together with the phenor package:

```r
devtools::install_github("bluegreen-labs/phenor@v1.3.1")
```

See the phenor installation instructions for more details.
### Credentials

To authenticate with the PEP725 data servers, you need an account, and you need
to store your credentials in a file called
`~/.config/springtime/pep725_credentials.txt`: email address on the first line,
password on the second. This path can be modified in the springtime
configuration, but the default is usually fine.
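As a sketch, the credentials file can be created from Python like this (the email address and password below are placeholders, not real credentials):

```python
from pathlib import Path

# Default location expected by springtime
credential_file = Path.home() / ".config" / "springtime" / "pep725_credentials.txt"
credential_file.parent.mkdir(parents=True, exist_ok=True)

# First line: email address, second line: password (placeholders)
credential_file.write_text("you@example.com\nyour-password\n")

print(credential_file.read_text().splitlines())
```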
```python
from springtime.datasets import PEP725Phenor

dataset = PEP725Phenor(species="Syringa vulgaris", phenophase=11)
print(dataset)
```

```
PEP725Phenor( dataset='PEP725Phenor', years=None, credential_file=PosixPath('/home/peter/.config/springtime/pep725_credentials.txt'), species='Syringa vulgaris', phenophase=11, include_cols=['year', 'geometry', 'day'], area=None )
```
Notice that the `credential_file` has been configured automatically, and that there are some other fields that we can set. Before we dive into details about what those options mean, we will need to retrieve the data. We can do this with the `download` method.
```python
dataset.download()
```

```
File already exists: /home/peter/.cache/springtime/PEP725/Syringa vulgaris.csv
```
If everything went well, the data should have been downloaded to a location
like `/home/username/.cache/springtime`. Springtime will skip the download if
the data is already present.
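This skip-if-present behaviour boils down to a simple existence check. A minimal sketch of the pattern (the helper name and cache path here are illustrative, not springtime's actual internals):

```python
from pathlib import Path

def fetch_if_missing(cache_file: Path, download) -> Path:
    """Call ``download`` only when ``cache_file`` does not exist yet."""
    if cache_file.exists():
        print(f"File already exists: {cache_file}")
    else:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(download())
    return cache_file

cache = Path("/tmp/springtime-demo/Syringa vulgaris.csv")
fetch_if_missing(cache, lambda: "pep_id,bbch,year,day\n")
fetch_if_missing(cache, lambda: "pep_id,bbch,year,day\n")  # second call skips the download
```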
You can inspect the file on disk, but for transparency springtime provides a
`raw_load` method that loads the data more or less without modification.
```python
dataset.raw_load()
```

| | pep_id | bbch | year | day | country | species | national_id | lon | lat | alt | name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6446 | 60 | 1991 | 130 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 1 | 6446 | 60 | 1984 | 137 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 2 | 6446 | 60 | 1969 | 124 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 3 | 6446 | 60 | 1989 | 107 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 4 | 6446 | 60 | 1990 | 112 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173752 | 19283 | 60 | 2004 | 125 | UK | Syringa vulgaris | 964298 | -3.7330 | 58.5670 | -1 | 964298 |
| 173753 | 19283 | 60 | 2002 | 130 | UK | Syringa vulgaris | 964298 | -3.7330 | 58.5670 | -1 | 964298 |
| 173754 | 19283 | 60 | 2003 | 113 | UK | Syringa vulgaris | 964298 | -3.7330 | 58.5670 | -1 | 964298 |
| 173755 | 19285 | 60 | 2002 | 125 | UK | Syringa vulgaris | 968311 | -3.5170 | 58.6000 | -1 | 968311 |
| 173756 | 19285 | 60 | 2003 | 113 | UK | Syringa vulgaris | 968311 | -3.5170 | 58.6000 | -1 | 968311 |

173757 rows × 11 columns
As you can see, there are various columns in the data, only a few of which are relevant for us. The "day" column contains the day of year of the event. The event, in this case, is given in the "bbch" column, which contains phenophases according to the BBCH scale. For example, phenophase 60 means "beginning of flowering". To see all possible options, have a look at http://www.pep725.eu/pep725_phase.php.
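Since "day" is a day-of-year value, it can be turned into a calendar date with the standard library. A small sketch (the year and day value are taken from the first row of the table above):

```python
from datetime import date, timedelta

def doy_to_date(year: int, day_of_year: int) -> date:
    """Convert a 1-based day-of-year value to a calendar date."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

print(doy_to_date(1991, 130))  # → 1991-05-10
```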
## The magic `load` method
While the raw data is interesting, it doesn't completely conform to our standard yet. The `load` method, as opposed to `raw_load`, does some additional work to parse the data into a format that we can easily combine with other datasets.
```python
dataset.load().reset_index(drop=True)
```

| | year | geometry | day |
|---|---|---|---|
| 0 | 1988 | POINT (15.86660 44.80000) | 85 |
| 1 | 1981 | POINT (15.86660 44.80000) | 83 |
| 2 | 1989 | POINT (15.86660 44.80000) | 80 |
| 3 | 1985 | POINT (15.86660 44.80000) | 94 |
| 4 | 2014 | POINT (15.86660 44.80000) | 77 |
| ... | ... | ... | ... |
| 1426 | 2000 | POINT (18.25000 49.11670) | 105 |
| 1427 | 2010 | POINT (18.25000 49.11670) | 106 |
| 1428 | 2004 | POINT (18.25000 49.11670) | 110 |
| 1429 | 2001 | POINT (18.25000 49.11670) | 116 |
| 1430 | 2009 | POINT (18.25000 49.11670) | 98 |

1431 rows × 3 columns
Notice that the year and geometry have been converted to index columns; only the "day" column is retained, as this is the variable that we are trying to predict. The latitude and longitude have been combined into a "geometry" column in geopandas format.
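Conceptually, combining longitude and latitude amounts to building point geometries. A minimal sketch using plain WKT strings to show the idea (springtime itself uses geopandas for this step):

```python
def to_wkt_point(lon: float, lat: float) -> str:
    """Format a lon/lat pair as a WKT point, matching the table above."""
    return f"POINT ({lon:.5f} {lat:.5f})"

print(to_wkt_point(15.8666, 44.8))  # → POINT (15.86660 44.80000)
```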
We can influence the behaviour of the `load` method to select an area and years of interest, for example. To this end, we need to modify the dataset.
```python
germany = {
    "name": "Germany",
    "bbox": [
        5.98865807458,
        47.3024876979,
        15.0169958839,
        54.983104153,
    ],
}
dataset = PEP725Phenor(species="Syringa vulgaris", years=[2000, 2002], area=germany)
print(dataset)
df_pep725 = dataset.load()
df_pep725
```

```
PEP725Phenor( dataset='PEP725Phenor', years=YearRange(start=2000, end=2002), credential_file=PosixPath('/home/peter/.config/springtime/pep725_credentials.txt'), species='Syringa vulgaris', phenophase=None, include_cols=['year', 'geometry', 'day'], area=NamedArea( name='Germany', bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153) ) )
```
| | year | geometry | day |
|---|---|---|---|
| 0 | 2001 | POINT (13.23330 47.78330) | 130 |
| 1 | 2000 | POINT (13.23330 47.78330) | 131 |
| 2 | 2002 | POINT (13.23330 47.78330) | 132 |
| 3 | 2002 | POINT (14.88330 48.68330) | 122 |
| 4 | 2000 | POINT (14.88330 48.68330) | 123 |
| ... | ... | ... | ... |
| 4718 | 2002 | POINT (11.98330 50.70000) | 130 |
| 4719 | 2000 | POINT (11.98330 50.70000) | 121 |
| 4720 | 2001 | POINT (11.98330 50.70000) | 133 |
| 4721 | 2002 | POINT (11.90000 50.65000) | 138 |
| 4722 | 2000 | POINT (11.90000 50.65000) | 120 |

4723 rows × 3 columns
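Under the hood, the area selection is essentially a bounding-box test on the station coordinates. A hedged sketch of that check (the function name is illustrative, not springtime's actual API):

```python
def in_bbox(lon: float, lat: float, bbox: list) -> bool:
    """Check whether a point falls inside a [xmin, ymin, xmax, ymax] box."""
    xmin, ymin, xmax, ymax = bbox
    return xmin <= lon <= xmax and ymin <= lat <= ymax

germany_bbox = [5.98865807458, 47.3024876979, 15.0169958839, 54.983104153]
print(in_bbox(13.2333, 47.7833, germany_bbox))  # station from the table above → True
print(in_bbox(-3.733, 58.567, germany_bbox))    # UK station from earlier → False
```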
## Dataset as recipe
You may wonder why we pass these additional arguments to the dataset itself; why not pass them directly to the load function? Part of the reason is standardization: most datasets already need to know the area and time period in order to download anything. By making it part of the dataset definition, datasets from several sources become more alike.
Another advantage of this model is that it allows us to export springtime datasets as "recipes".
```python
recipe = dataset.to_recipe()
print(recipe)
```

```yaml
dataset: PEP725Phenor
years:
- 2000
- 2002
species: Syringa vulgaris
include_cols:
- year
- geometry
- day
area:
  name: Germany
  bbox:
  - 5.98865807458
  - 47.3024876979
  - 15.0169958839
  - 54.983104153
```
These recipes are a `yaml` representation of the dataset definition. With their succinct and readable format, they can be stored and shared in a standardized way. We can easily load them again:
```python
from springtime.datasets import load_dataset

reloaded_ds = load_dataset(recipe)
reloaded_ds == dataset
```

```
True
```
Moreover, springtime can read and execute these recipes from the command line as well (see "command line interface"). The idea is that recipes can help to make data loading more reproducible and easier to automate.
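The equality check above works because springtime datasets appear to be value objects: they are defined entirely by their fields, so two instances built from the same recipe compare equal. A standard-library sketch of the same idea using a dataclass (the class and field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """Value object: equality is derived from the fields, not object identity."""
    dataset: str
    species: str
    years: tuple

a = Dataset("PEP725Phenor", "Syringa vulgaris", (2000, 2002))
b = Dataset("PEP725Phenor", "Syringa vulgaris", (2000, 2002))
print(a == b)  # → True
print(a is b)  # → False
```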