# PEP725

Retrieve observations from the Pan European Phenological database (PEP725) using a pre-existing R library called phenor.
## Prerequisites

### Phenor

Phenor is written in R, so you need to have R installed together with the phenor package:

```r
devtools::install_github("bluegreen-labs/phenor@v1.3.1")
```

See the phenor installation instructions for more details.
### Credentials

To authenticate with the PEP725 data servers, you need an account, and you need
to store your credentials in a file called
`~/.config/springtime/pep725_credentials.txt`: email address on the first line,
password on the second. This path can be modified in the springtime
configuration, but the default is usually fine.
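As a sketch, the credentials file can be created from Python like this (the email address and password below are placeholders, not real credentials):

```python
from pathlib import Path

# Default location expected by springtime
credential_file = Path.home() / ".config" / "springtime" / "pep725_credentials.txt"
credential_file.parent.mkdir(parents=True, exist_ok=True)

# First line: email address, second line: password (placeholders)
credential_file.write_text("you@example.com\nyour-password\n")

print(credential_file.read_text().splitlines())
```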
```python
from springtime.datasets import PEP725Phenor

dataset = PEP725Phenor(species="Syringa vulgaris", phenophase=11)
print(dataset)
```

```
PEP725Phenor( dataset='PEP725Phenor', years=None, credential_file=PosixPath('/home/peter/.config/springtime/pep725_credentials.txt'), species='Syringa vulgaris', phenophase=11, include_cols=['year', 'geometry', 'day'], area=None )
```
Notice that the `credential_file` has been configured automatically, and that there are some other fields that we can set. Before we dive into details about what those options mean, we will need to retrieve the data. We can do this with the `download` method.
```python
dataset.download()
```

```
File already exists: /home/peter/.cache/springtime/PEP725/Syringa vulgaris.csv
```
If everything went well, the data should have been downloaded to a location
like `/home/username/.cache/springtime`. Springtime will skip the download if
the data is already present.
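This skip-if-present behaviour boils down to a simple existence check. A minimal sketch of the pattern (the helper name and cache path here are illustrative, not springtime's actual internals):

```python
from pathlib import Path

def fetch_if_missing(cache_file: Path, download) -> Path:
    """Call ``download`` only when ``cache_file`` does not exist yet."""
    if cache_file.exists():
        print(f"File already exists: {cache_file}")
    else:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(download())
    return cache_file

cache = Path("/tmp/springtime-demo/Syringa vulgaris.csv")
fetch_if_missing(cache, lambda: "pep_id,bbch,year,day\n")
fetch_if_missing(cache, lambda: "pep_id,bbch,year,day\n")  # second call skips the download
```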
You can inspect the file on disk, but for transparency springtime provides a
`raw_load` method that loads the data more or less without modification.
```python
dataset.raw_load()
```

| | pep_id | bbch | year | day | country | species | national_id | lon | lat | alt | name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6446 | 60 | 1991 | 130 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 1 | 6446 | 60 | 1984 | 137 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 2 | 6446 | 60 | 1969 | 124 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 3 | 6446 | 60 | 1989 | 107 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| 4 | 6446 | 60 | 1990 | 112 | AT | Syringa vulgaris | 5120 | 14.4167 | 48.2167 | 225 | ASTEN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173752 | 19283 | 60 | 2004 | 125 | UK | Syringa vulgaris | 964298 | -3.7330 | 58.5670 | -1 | 964298 |
| 173753 | 19283 | 60 | 2002 | 130 | UK | Syringa vulgaris | 964298 | -3.7330 | 58.5670 | -1 | 964298 |
| 173754 | 19283 | 60 | 2003 | 113 | UK | Syringa vulgaris | 964298 | -3.7330 | 58.5670 | -1 | 964298 |
| 173755 | 19285 | 60 | 2002 | 125 | UK | Syringa vulgaris | 968311 | -3.5170 | 58.6000 | -1 | 968311 |
| 173756 | 19285 | 60 | 2003 | 113 | UK | Syringa vulgaris | 968311 | -3.5170 | 58.6000 | -1 | 968311 |

173757 rows × 11 columns
As you can see, there are various columns in the data, only a few of which are relevant for us. The "day" column contains the day of year of the event. The event, in this case, is given in the "bbch" column, which contains phenophases according to the BBCH scale. For example, phenophase 60 means "beginning of flowering". To see all possible options, have a look at http://www.pep725.eu/pep725_phase.php.
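Since "day" is a day-of-year value, it can be turned into a calendar date with the standard library. A small sketch (the year and day value are taken from the first row of the table above):

```python
from datetime import date, timedelta

def doy_to_date(year: int, day_of_year: int) -> date:
    """Convert a 1-based day-of-year value to a calendar date."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

print(doy_to_date(1991, 130))  # → 1991-05-10
```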
## The magic `load` method
While the raw data is interesting, it doesn't completely conform to our standard yet. The `load` method, as opposed to `raw_load`, does some additional work to parse the data into a format that we can easily combine with other datasets.
```python
dataset.load().reset_index(drop=True)
```

| | year | geometry | day |
|---|---|---|---|
| 0 | 1988 | POINT (15.86660 44.80000) | 85 |
| 1 | 1981 | POINT (15.86660 44.80000) | 83 |
| 2 | 1989 | POINT (15.86660 44.80000) | 80 |
| 3 | 1985 | POINT (15.86660 44.80000) | 94 |
| 4 | 2014 | POINT (15.86660 44.80000) | 77 |
| ... | ... | ... | ... |
| 1426 | 2000 | POINT (18.25000 49.11670) | 105 |
| 1427 | 2010 | POINT (18.25000 49.11670) | 106 |
| 1428 | 2004 | POINT (18.25000 49.11670) | 110 |
| 1429 | 2001 | POINT (18.25000 49.11670) | 116 |
| 1430 | 2009 | POINT (18.25000 49.11670) | 98 |

1431 rows × 3 columns
Notice that the year and geometry have been converted to index columns; only the "day" column is retained, as this is the variable that we are trying to predict. The latitude and longitude have been combined into a "geometry" column in geopandas format.
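Conceptually, combining longitude and latitude amounts to building point geometries. A minimal sketch using plain WKT strings to show the idea (springtime itself uses geopandas for this step):

```python
def to_wkt_point(lon: float, lat: float) -> str:
    """Format a lon/lat pair as a WKT point, matching the table above."""
    return f"POINT ({lon:.5f} {lat:.5f})"

print(to_wkt_point(15.8666, 44.8))  # → POINT (15.86660 44.80000)
```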
We can influence the behaviour of the `load` method to select an area and years of interest, for example. To this end, we need to modify the dataset.
```python
germany = {
    "name": "Germany",
    "bbox": [
        5.98865807458,
        47.3024876979,
        15.0169958839,
        54.983104153,
    ],
}
dataset = PEP725Phenor(species="Syringa vulgaris", years=[2000, 2002], area=germany)
print(dataset)
df_pep725 = dataset.load()
df_pep725
```

```
PEP725Phenor( dataset='PEP725Phenor', years=YearRange(start=2000, end=2002), credential_file=PosixPath('/home/peter/.config/springtime/pep725_credentials.txt'), species='Syringa vulgaris', phenophase=None, include_cols=['year', 'geometry', 'day'], area=NamedArea( name='Germany', bbox=BoundingBox(xmin=5.98865807458, ymin=47.3024876979, xmax=15.0169958839, ymax=54.983104153) ) )
```
| | year | geometry | day |
|---|---|---|---|
| 0 | 2001 | POINT (13.23330 47.78330) | 130 |
| 1 | 2000 | POINT (13.23330 47.78330) | 131 |
| 2 | 2002 | POINT (13.23330 47.78330) | 132 |
| 3 | 2002 | POINT (14.88330 48.68330) | 122 |
| 4 | 2000 | POINT (14.88330 48.68330) | 123 |
| ... | ... | ... | ... |
| 4718 | 2002 | POINT (11.98330 50.70000) | 130 |
| 4719 | 2000 | POINT (11.98330 50.70000) | 121 |
| 4720 | 2001 | POINT (11.98330 50.70000) | 133 |
| 4721 | 2002 | POINT (11.90000 50.65000) | 138 |
| 4722 | 2000 | POINT (11.90000 50.65000) | 120 |

4723 rows × 3 columns
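Under the hood, the area selection is essentially a bounding-box test on the station coordinates. A hedged sketch of that check (the function name is illustrative, not springtime's actual API):

```python
def in_bbox(lon: float, lat: float, bbox: list) -> bool:
    """Check whether a point falls inside a [xmin, ymin, xmax, ymax] box."""
    xmin, ymin, xmax, ymax = bbox
    return xmin <= lon <= xmax and ymin <= lat <= ymax

germany_bbox = [5.98865807458, 47.3024876979, 15.0169958839, 54.983104153]
print(in_bbox(13.2333, 47.7833, germany_bbox))  # station from the table above → True
print(in_bbox(-3.733, 58.567, germany_bbox))    # UK station from earlier → False
```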
## Dataset as recipe
You may wonder why we pass these additional arguments to the dataset itself; why not pass them directly to the load function? Part of the reason is standardization: most datasets already need to know the area and time period in order to download anything. By making it part of the dataset definition, datasets from several sources become more alike.
Another advantage of this model is that it allows us to export springtime datasets as "recipes".
```python
recipe = dataset.to_recipe()
print(recipe)
```

```yaml
dataset: PEP725Phenor
years:
- 2000
- 2002
species: Syringa vulgaris
include_cols:
- year
- geometry
- day
area:
  name: Germany
  bbox:
  - 5.98865807458
  - 47.3024876979
  - 15.0169958839
  - 54.983104153
```
These recipes are a `yaml` representation of the dataset definition. With their succinct and readable format, they can be stored and shared in a standardized way. We can easily load them again:
```python
from springtime.datasets import load_dataset

reloaded_ds = load_dataset(recipe)
reloaded_ds == dataset
```

```
True
```
Moreover, springtime can read and execute these recipes from the command line as well (see "command line interface"). The idea is that recipes can help to make data loading more reproducible and easier to automate.
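The equality check above works because springtime datasets appear to be value objects: they are defined entirely by their fields, so two instances built from the same recipe compare equal. A standard-library sketch of the same idea using a dataclass (the class and field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """Value object: equality is derived from the fields, not object identity."""
    dataset: str
    species: str
    years: tuple

a = Dataset("PEP725Phenor", "Syringa vulgaris", (2000, 2002))
b = Dataset("PEP725Phenor", "Syringa vulgaris", (2000, 2002))
print(a == b)  # → True
print(a is b)  # → False
```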