Introduction: data and modelling standards¶
The goal of springtime is to facilitate phenological modelling studies. Phenology is the study of the timing of life cycle events of plants and animals. For example, when do the leaves turn green? 🌿
We can model these phenological events, such as spring onset, in various ways: using "physics-based" models, statistical methods, or machine learning techniques. In any case, the problem can be formulated as follows:
$y = f(\vec{x})$
where $y$ is the event, $\vec{x}$ is a set of predictor variables, and $f$ an unknown function. $y$ can be the exact date of the event or, for example, a binary classification of whether the event has occurred. $f(\vec{x})$ could be a very simple function of latitude, or a complex relationship between various inputs such as temperature timeseries from weather models, greenness indices from satellites, and categorical variables such as species, land use, and soil type.
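To make the formulation concrete, here is a toy instance of $y = f(\vec{x})$ where the predictor is just latitude and the coefficients are made up purely for illustration; this is not a real phenological model.

```python
def spring_onset_doy(latitude: float) -> int:
    """Toy model y = f(x): spring onset (day of year) as a simple
    function of latitude. Coefficients are invented for illustration."""
    return round(60 + 1.5 * latitude)

# A hypothetical mid-latitude site:
doy = spring_onset_doy(52.0)  # 60 + 1.5 * 52 = 138
```

In practice, $f$ is of course learned from data rather than written down by hand.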
There are various datasets out there, and several modelling packages as well.
However, they all come with their own quirks. The goal of springtime is to
harmonize these datasets and modelling packages, such that it becomes easy to
use them together in a single study.
At the core of this harmonization effort lies standardization of the data
structure. A lot of standardization has already taken place in the machine
learning realm. For example, in scikit-learn, predictors are represented as a numpy array or pandas dataframe, typically called `X`, and target variables `y` as a one-dimensional array or series:
model.fit(X, y)
new_y = model.predict(new_X)
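Fleshed out, the two lines above could look like this minimal, self-contained scikit-learn sketch (assuming scikit-learn and numpy are installed; the data values are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X: predictors with shape (n_samples, n_features); y: target with shape (n_samples,)
X = np.array([[45.0], [50.0], [55.0], [60.0]])  # e.g. latitude of each site
y = np.array([100.0, 110.0, 120.0, 130.0])      # e.g. spring onset (DOY)

model = LinearRegression()
model.fit(X, y)                                  # learn f from (X, y)
new_y = model.predict(np.array([[52.0]]))        # apply f to unseen predictors
```

All scikit-learn estimators follow this same `fit`/`predict` interface, which is exactly what makes a shared data standard so useful.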
On top of that, there are packages like pycaret, that enable automated training and comparison of various models. Pycaret ingests both predictors and target as one dataframe and the name of the target column, like so:
experiment = RegressionExperiment()
experiment.setup(data = Xy, target = 'y')
experiment.compare_models(['list of models'])
Internally, pycaret then splits the data again and passes it to scikit learn just like above. But it can also pass data to other modelling packages with a similar interface.
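Conceptually, the split that pycaret performs internally amounts to slicing the combined dataframe by the target column name. A sketch with a toy frame (column names invented):

```python
import pandas as pd

# A toy combined frame, like the `Xy` that pycaret ingests.
Xy = pd.DataFrame({
    "x1": [1.0, 2.0],
    "x2": [3.0, 4.0],
    "y": [10.0, 20.0],
})

# Conceptually, pycaret splits it back into predictors and target:
X = Xy.drop(columns=["y"])
y = Xy["y"]
```

The resulting `X` and `y` can then be passed to any scikit-learn-style `fit(X, y)` method.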
With springtime, our goal is to make existing phenological data sources and modelling packages compatible with these standards. Thus, we need to get our data sources in the right format, and we need to make sure our models follow the same interface.
Dummy use case¶
We illustrate the "ideal" case with dummy data and models. This example is all about the data structure and modelling workflow, not (yet) about the science.
# Data format
from springtime import dummy
data = dummy.pycaret_ready(100)
data
| year | geometry | spring onset (DOY) | minimum temperature | mean temperature | maximum temperature |
|---|---|---|---|---|---|
| 2000 | POINT (0.22438 0.16026) | 143 | -3.033813 | 0.022000 | 3.039246 |
| 2001 | POINT (-0.65165 -0.31790) | 160 | -2.977279 | 0.043639 | 3.265090 |
| 2002 | POINT (2.21276 -1.31028) | 160 | -2.919174 | -0.043401 | 3.013467 |
| 2003 | POINT (-0.20714 -1.33902) | 121 | -2.872743 | -0.029902 | 2.987468 |
| 2004 | POINT (0.80240 0.02859) | 123 | -2.768337 | -0.003783 | 2.889369 |
| ... | ... | ... | ... | ... | ... |
| 2095 | POINT (-0.78062 0.62873) | 176 | -2.900270 | 0.015071 | 2.859301 |
| 2096 | POINT (-0.12186 0.82847) | 169 | -2.649552 | 0.111426 | 2.564799 |
| 2097 | POINT (-0.42651 0.80629) | 153 | -2.956537 | -0.015596 | 3.426379 |
| 2098 | POINT (0.20776 -0.96359) | 121 | -3.457283 | -0.006289 | 2.840914 |
| 2099 | POINT (1.10750 0.84402) | 142 | -3.344013 | -0.000756 | 2.716826 |

100 rows × 4 columns
Typical input data for phenological modelling contains unique observations for each year and location; together these form the index of the dataframe. In this dummy sample, the target variable is the spring onset day of year, and, for the purpose of illustration, the predictors are aggregated temperature features.
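For reference, a dataframe with this (year, location) index layout could be built by hand with plain pandas. This is a hypothetical sketch, not springtime's actual implementation; springtime's dummy generator uses proper geometry objects, so plain strings stand in for the POINT geometries here.

```python
import pandas as pd

# Hand-built frame mirroring the dummy data's layout (values invented).
records = {
    "year": [2000, 2001],
    "geometry": ["POINT (0.22 0.16)", "POINT (-0.65 -0.32)"],
    "spring onset (DOY)": [143, 160],
    "minimum temperature": [-3.03, -2.98],
    "mean temperature": [0.02, 0.04],
    "maximum temperature": [3.04, 3.27],
}

# Year and location become the (multi-)index; the rest are target + predictors.
df = pd.DataFrame(records).set_index(["year", "geometry"])
```

Any data source that can be massaged into this shape is immediately usable in the workflow below.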
With this data structure, we can run an automatic model comparison with pycaret:
# Run a model comparison
from pycaret.regression import RegressionExperiment
exp = RegressionExperiment()
exp.setup(data=data, target="spring onset (DOY)")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)
| | Description | Value |
|---|---|---|
| 0 | Session id | 6916 |
| 1 | Target | spring onset (DOY) |
| 2 | Target type | Regression |
| 3 | Original data shape | (100, 4) |
| 4 | Transformed data shape | (100, 4) |
| 5 | Transformed train set shape | (70, 4) |
| 6 | Transformed test set shape | (30, 4) |
| 7 | Numeric features | 3 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
| 10 | Numeric imputation | mean |
| 11 | Categorical imputation | mode |
| 12 | Fold Generator | KFold |
| 13 | Fold Number | 10 |
| 14 | CPU Jobs | -1 |
| 15 | Use GPU | False |
| 16 | Log Experiment | False |
| 17 | Experiment Name | reg-default-name |
| 18 | USI | 4b26 |
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| dummy | Dummy Regressor | 13.8748 | 269.9634 | 16.1924 | -0.0925 | 0.1091 | 0.0953 | 0.0180 |
| lr | Linear Regression | 14.2709 | 276.5581 | 16.3719 | -0.1425 | 0.1103 | 0.0981 | 0.0200 |
| rf | Random Forest Regressor | 14.2633 | 305.7650 | 17.1991 | -0.2704 | 0.1161 | 0.0977 | 0.1240 |
[DummyRegressor(), LinearRegression(n_jobs=-1), RandomForestRegressor(n_jobs=-1, random_state=6916)]
Notice that pycaret automatically cleans, splits, trains, and evaluates the various ML models, and can rank them to select the best ones according to the chosen evaluation metric.
Summary & next steps¶
There are various useful packages out there for (phenological) modelling and data retrieval. We chose pycaret as our "modelling framework", as it can automate common tasks such as model fitting, scoring, and saving an experiment. Thus, we need to make sure that all datasets and models of interest are compatible with pycaret.
In the next steps, we will dive into the harmonization of datasets and models.
We recommend you first look at two datasets with distinct characteristics: PEP725 and E-OBS. Then, you may fast-forward to the chapter on combining datasets. The other dataset walkthroughs can be treated as reference material.