Adding new datasets¶

Here we present a walkthrough for adding your own datasets. It is not very detailed (yet), and some of the steps are fairly advanced. If you need help with any of them, don't hesitate to reach out via GitHub.

Base dataset¶

Springtime provides a semi-structured approach to loading datasets. Each dataset is represented as a class with download, raw_load and load methods. This structure is defined in an abstract base class that can be imported from the package. To start your own class, you can inherit from it and start implementing the aforementioned methods. This is illustrated below:

from springtime.datasets.abstract import Dataset


class MyNewDataset(Dataset):
    # You need to define a unique name
    dataset = "my-new-dataset"

    # Add any other attributes needed to define your dataset, e.g.
    species: list[str] = []

    def download(self):
        """Download the data."""
        # your implementation here

        # Check whether the data already exists; if so, don't download it
        # again unless CONFIG.force_override is True

        # Return the path(s) to the downloaded/existing data
        return []

    def raw_load(self):
        """Load the data with minimal modification."""
        paths = self.download()
        data = ...
        return data

    def load(self):
        """Load the harmonized dataset.

        This should do everything to adhere to the springtime dataset standard
        format, i.e. a geopandas dataframe with a year and geometry column and
        other relevant features also in columns.
        """
        raw_data = self.raw_load()
        harmonized_data = ...
        return harmonized_data
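
The "check if it already exists" step in download could be sketched as below. This is a minimal illustration using only the standard library, not springtime's own implementation: the helper name `download_if_missing`, the `fetch` callable and the module-level `FORCE_OVERRIDE` flag (a stand-in for `CONFIG.force_override`) are all assumptions made for this example.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for springtime's CONFIG.force_override setting.
FORCE_OVERRIDE = False


def download_if_missing(
    target: Path,
    fetch: Callable[[], bytes],
    force: bool = FORCE_OVERRIDE,
) -> Path:
    """Write the fetched bytes to `target` unless a cached copy can be reused."""
    if target.exists() and not force:
        # A cached copy exists and no override was requested: skip the download.
        return target

    # Create the cache directory if needed, then store the fresh download.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return target
```

A download method could call such a helper once per file and return the resulting list of paths, so that raw_load always receives paths to data that exist on disk.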

Examples¶

While developing new datasets, it can be useful to look at the source code for existing datasets. You can browse that here.

Pydantic¶

Good to know: the base Dataset uses pydantic for runtime validation and (de)serialization to/from recipes. You may want to read up on the pydantic documentation.

Utils¶

Several datasets need to perform very similar operations, such as resampling. To avoid duplication, such functions can be generalized and shared between datasets. A couple of generalized functions are available in springtime.utils.
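
To give an idea of what a shareable helper might look like, here is a small sketch of a generic resampling function. It is hypothetical and not taken from springtime.utils: the name `resample_to_yearly_mean` and the `(date, value)` record layout are assumptions chosen to keep the example free of dependencies.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def resample_to_yearly_mean(records):
    """Average (date, value) records per calendar year.

    Hypothetical helper: it does not depend on any particular dataset,
    which is what makes it a candidate for sharing between datasets.
    """
    values_by_year = defaultdict(list)
    for day, value in records:
        values_by_year[day.year].append(value)
    # Return the years in ascending order with their mean value.
    return {year: mean(values) for year, values in sorted(values_by_year.items())}
```

Because the function only assumes a date and a value per record, any dataset class could call it from its load method instead of reimplementing the aggregation.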

Adding your dataset to springtime¶

It probably makes sense to start developing your dataset class in a notebook or a simple Python script. However, it is much nicer if you can make your dataset part of the springtime package. To this end, first have a look at the contributing guide.

After cloning the source code and making an editable installation, you can add your dataset class to a new file in the datasets folder.

Registering your dataset¶

To make sure your dataset is recognized by springtime, you have to add it to the list of known datasets in https://github.com/phenology/springtime/blob/main/src/springtime/datasets/__init__.py.
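
The following sketch illustrates why registration matters: a dataset can only be looked up by its unique name if it appears in some collection of known classes. The contents of the real `__init__.py` are not reproduced here, so the `Dataset` stand-in, the `KNOWN_DATASETS` mapping and `get_dataset_class` are hypothetical names used for illustration only.

```python
class Dataset:
    """Minimal stand-in for springtime's abstract Dataset base class."""

    dataset: str = ""


class MyNewDataset(Dataset):
    dataset = "my-new-dataset"


class AnotherDataset(Dataset):
    dataset = "another-dataset"


# Hypothetical registry: maps each unique dataset name to its class.
KNOWN_DATASETS = {cls.dataset: cls for cls in (MyNewDataset, AnotherDataset)}


def get_dataset_class(name: str) -> type[Dataset]:
    """Look up a dataset class by name, failing loudly for unknown names."""
    try:
        return KNOWN_DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(KNOWN_DATASETS))
        raise ValueError(f"Unknown dataset {name!r}; known datasets: {known}") from None
```

A class that is left out of the collection cannot be found by name, which is the kind of problem that registering your dataset avoids.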

Testing¶

To ensure continuity, we have a couple of tests for each dataset. When you add a new dataset, it is probably a good idea to copy the tests of an existing dataset and adapt them to your needs.
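
As a rough idea of what such tests might check, the sketch below verifies that load returns records with the year and geometry fields required by the standard format. It is not copied from the springtime test suite: `StubDataset` is hypothetical, and plain dictionaries stand in for the rows of a geopandas dataframe so that the example has no dependencies.

```python
class StubDataset:
    """Hypothetical dataset that serves fixed data instead of downloading."""

    dataset = "stub"

    def raw_load(self):
        # Raw records, as they might look straight after download.
        return [{"date": "2001-04-12", "lon": 5.0, "lat": 52.0, "value": 1.5}]

    def load(self):
        # Harmonize: derive the year and combine lon/lat into one geometry field.
        return [
            {
                "year": int(row["date"][:4]),
                "geometry": (row["lon"], row["lat"]),
                "value": row["value"],
            }
            for row in self.raw_load()
        ]


def test_load_has_required_columns():
    rows = StubDataset().load()
    assert rows, "load() should return at least one record"
    for row in rows:
        assert {"year", "geometry"}.issubset(row)


def test_year_is_integer():
    for row in StubDataset().load():
        assert isinstance(row["year"], int)
```

Functions named `test_*` like these are picked up automatically by pytest; for a real dataset, the stub would be replaced by your dataset class and the checks adapted to its columns.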