Adding new datasets¶
This walkthrough shows how to add your own datasets. It is not very detailed (yet), and some of the steps may be quite advanced. If you need help with any of them, don't hesitate to reach out via GitHub.
Base dataset¶
Springtime provides a semi-structured approach to loading datasets. Each dataset
is represented as a class with download, raw_load and load methods. This
structure is defined in an abstract base class that can be imported from the
package. To start your own class, you can inherit from it and start implementing
the aforementioned methods. This is illustrated below:
from springtime.datasets.abstract import Dataset


class MyNewDataset(Dataset):
    # You need to define a unique name
    dataset = "my-new-dataset"

    # Add any other attributes needed to define your dataset, e.g.
    species: list[str] = []

    def download(self):
        """Download the data."""
        # your implementation here
        # Check if the data already exists; if so, don't download again,
        # unless CONFIG.force_override is True
        # Return the path(s) to the downloaded/existing data
        return []

    def raw_load(self):
        """Load the data with minimal modification."""
        paths = self.download()
        data = ...
        return data

    def load(self):
        """Load the harmonized dataset.

        This should do everything to adhere to the springtime dataset standard
        format, i.e. a geopandas dataframe with a year and geometry column and
        other relevant features also in columns.
        """
        raw_data = self.raw_load()
        harmonized_data = ...
        return harmonized_data
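The caching behaviour described in the comments of the download method (skip the download if the data already exists, unless an override is forced) could look roughly like the sketch below. It uses only the standard library. The function download_if_missing and its fetch argument are hypothetical helpers for illustration, not part of springtime, and the force_override argument merely stands in for CONFIG.force_override.

```python
from pathlib import Path


def download_if_missing(path, fetch, force_override=False):
    """Return `path`, calling `fetch(path)` only when needed.

    `fetch` is a hypothetical callable that writes the data to `path`.
    It is skipped when the file already exists, unless `force_override`
    is True (a stand-in for CONFIG.force_override).
    """
    path = Path(path)
    if path.exists() and not force_override:
        # Data is already on disk: reuse it.
        return path

    # Make sure the target directory exists before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)
    return path
```

Inside a dataset class, download could then call such a helper once per file and return the collected paths.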
Examples¶
While developing new datasets, it can be useful to look at the source code for existing datasets, which you can browse in the src/springtime/datasets folder of the repository.
Pydantic¶
Good to know: the base Dataset class uses pydantic for runtime validation and
(de)serialization to/from recipes. You may want to read up on its
documentation.
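As a rough illustration of what runtime validation and (de)serialization mean in practice, the sketch below uses a plain pydantic BaseModel rather than the springtime Dataset base class. The class ExampleDataset, its fields, and the recipe_entry dictionary are hypothetical; how springtime itself reads and writes recipes is not shown here.

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError


class ExampleDataset(BaseModel):
    """Hypothetical stand-in; not the springtime Dataset base class."""

    dataset: Literal["my-new-dataset"] = "my-new-dataset"
    species: List[str] = []


# Deserialization: build an instance from plain data, e.g. parsed from a recipe.
recipe_entry = {"dataset": "my-new-dataset", "species": ["Syringa vulgaris"]}
example = ExampleDataset(**recipe_entry)

# Serialization: turn the validated instance back into plain data.
# dict(model) works in both pydantic 1.x and 2.x.
serialized = dict(example)

# Validation: a wrong dataset name is rejected when the object is created.
try:
    ExampleDataset(dataset="some-other-dataset")
    validation_failed = False
except ValidationError:
    validation_failed = True
```

Because validation happens when the object is created, typos in a recipe surface immediately instead of halfway through a download.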
Utils¶
Several datasets need to perform very similar operations, such as resampling. To
avoid duplication, such functions can be generalized and shared between
datasets. A couple of generalized functions are available in springtime.utils.
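To give an idea of what a generalized, shareable operation might look like, here is a minimal sketch of a resampling helper written with pandas. The function resample_time_series and its parameters are hypothetical; it is not taken from springtime.utils, whose actual functions and signatures may differ.

```python
import pandas as pd


def resample_time_series(df, time_column="datetime", frequency="MS", operator="mean"):
    """Aggregate a table of numeric time series to a coarser frequency.

    Hypothetical helper for illustration only. `frequency` is a pandas
    offset alias ("MS" means month start) and `operator` is the name of
    an aggregation such as "mean", "min" or "max".
    """
    resampled = df.set_index(time_column).resample(frequency).agg(operator)
    # Turn the time index back into a regular column.
    return resampled.reset_index()


# Usage: two months of daily values reduced to monthly means.
days = pd.date_range("2000-01-01", "2000-02-29", freq="D")
daily = pd.DataFrame({"datetime": days, "temperature": [1.0] * 31 + [3.0] * 29})
monthly = resample_time_series(daily)
```

Keeping the column name, frequency and operator as arguments is what lets several dataset classes reuse the same function.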
Adding your dataset to springtime¶
It probably makes sense to start developing your dataset class in a notebook or a simple Python script. However, it would be much nicer to make your dataset part of the springtime package. To this end, first have a look at the contributing guide.
After cloning the source code and making an editable installation, you can add your dataset class to a new file in the datasets folder.
Registering your dataset¶
To make sure your dataset is recognized by springtime, you have to add it to the list of known datasets in https://github.com/phenology/springtime/blob/main/src/springtime/datasets/__init__.py.
Testing¶
To ensure continuity, we have a couple of tests for each dataset. When you add a new dataset, it is probably a good idea to copy the tests of an existing dataset and adapt them to your needs.
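The existing tests in the repository remain the best template. Purely as an illustration of the kind of check that could be useful, the sketch below verifies that loaded data has the year and geometry columns required by the standard format. The helper check_standard_format and the test function are hypothetical, and a plain pandas dataframe with text geometries is used as a stand-in for the geopandas dataframe that load would return.

```python
import pandas as pd

# Columns required by the standard format described for load().
REQUIRED_COLUMNS = ("year", "geometry")


def check_standard_format(data):
    """Raise AssertionError if a required column is missing (hypothetical helper)."""
    missing = [column for column in REQUIRED_COLUMNS if column not in data.columns]
    assert not missing, f"Missing required columns: {missing}"


def test_load_has_required_columns():
    # Stand-in for MyNewDataset().load(); a real test would call the dataset.
    data = pd.DataFrame(
        {
            "year": [2000, 2001],
            "geometry": ["POINT (5 52)", "POINT (5 52)"],
            "value": [1.0, 2.0],
        }
    )
    check_standard_format(data)
```

A test runner such as pytest would collect the test function automatically based on its test_ prefix.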