Introduction: data and modelling standards¶
The goal of springtime is to facilitate phenological modelling studies. Phenology is the study of the timing of life cycle events of plants and animals. For example, when do the leaves turn green? 🌿
We can model these phenological events, such as spring onset, in various ways: using "physics-based" models, statistical methods, or machine learning techniques. In any case, the problem can be formulated as follows:
$y = f(\vec{x})$
where $y$ is the event, $\vec{x}$ is a set of predictor variables, and $f$ an unknown function. $y$ can be the exact date of the event or, for example, a binary classification of whether the event has occurred. $f(\vec{x})$ could be a very simple function of latitude, or a complex relationship between various inputs such as temperature timeseries from weather models, greenness indices from satellites, and categorical variables such as species, land use, and soil type.
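To make the formulation concrete, here is a toy instance of $y = f(\vec{x})$ where the predictor is just latitude and the coefficients are made up purely for illustration; this is not a real phenological model.

```python
def spring_onset_doy(latitude: float) -> int:
    """Toy model y = f(x): spring onset (day of year) as a simple
    function of latitude. Coefficients are invented for illustration."""
    return round(60 + 1.5 * latitude)

# A hypothetical mid-latitude site:
doy = spring_onset_doy(52.0)  # 60 + 1.5 * 52 = 138
```

In practice, $f$ is of course learned from data rather than written down by hand.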
There are various datasets out there, and several modelling packages as well.
However, they all come with their own quirks. The goal of springtime is to
harmonize these datasets and modelling packages, such that it becomes easy to
use them together in a single study.
At the core of this harmonization effort lies standardization of the data
structure. A lot of standardization has already taken place in the machine
learning realm. For example, in scikit-learn, predictors are represented as a numpy array or pandas dataframe, typically called `X`, and target variables `y` as a one-dimensional array or series:
model.fit(X, y)
new_y = model.predict(new_X)
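Fleshed out, the two lines above could look like this minimal, self-contained scikit-learn sketch (assuming scikit-learn and numpy are installed; the data values are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X: predictors with shape (n_samples, n_features); y: target with shape (n_samples,)
X = np.array([[45.0], [50.0], [55.0], [60.0]])  # e.g. latitude of each site
y = np.array([100.0, 110.0, 120.0, 130.0])      # e.g. spring onset (DOY)

model = LinearRegression()
model.fit(X, y)                                  # learn f from (X, y)
new_y = model.predict(np.array([[52.0]]))        # apply f to unseen predictors
```

All scikit-learn estimators follow this same `fit`/`predict` interface, which is exactly what makes a shared data standard so useful.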
On top of that, there are packages like pycaret, that enable automated training and comparison of various models. Pycaret ingests both predictors and target as one dataframe and the name of the target column, like so:
experiment = RegressionExperiment()
experiment.setup(data = Xy, target = 'y')
experiment.compare_models(['list of models'])
Internally, pycaret then splits the data again and passes it to scikit learn just like above. But it can also pass data to other modelling packages with a similar interface.
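Conceptually, the split that pycaret performs internally amounts to slicing the combined dataframe by the target column name. A sketch with a toy frame (column names invented):

```python
import pandas as pd

# A toy combined frame, like the `Xy` that pycaret ingests.
Xy = pd.DataFrame({
    "x1": [1.0, 2.0],
    "x2": [3.0, 4.0],
    "y": [10.0, 20.0],
})

# Conceptually, pycaret splits it back into predictors and target:
X = Xy.drop(columns=["y"])
y = Xy["y"]
```

The resulting `X` and `y` can then be passed to any scikit-learn-style `fit(X, y)` method.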
With springtime, our goal is to make existing phenological data sources and modelling packages compatible with these standards. Thus, we need to get our data sources in the right format, and we need to make sure our models follow the same interface.
Dummy use case¶
We illustrate the "ideal" case with dummy data and models. This example is all about the data structure and modelling workflow, not (yet) about the science.
# Data format
from springtime import dummy
data = dummy.pycaret_ready(100)
data
| year | geometry | spring onset (DOY) | minimum temperature | mean temperature | maximum temperature |
|---|---|---|---|---|---|
| 2000 | POINT (0.22438 0.16026) | 143 | -3.033813 | 0.022000 | 3.039246 |
| 2001 | POINT (-0.65165 -0.31790) | 160 | -2.977279 | 0.043639 | 3.265090 |
| 2002 | POINT (2.21276 -1.31028) | 160 | -2.919174 | -0.043401 | 3.013467 |
| 2003 | POINT (-0.20714 -1.33902) | 121 | -2.872743 | -0.029902 | 2.987468 |
| 2004 | POINT (0.80240 0.02859) | 123 | -2.768337 | -0.003783 | 2.889369 |
| ... | ... | ... | ... | ... | ... |
| 2095 | POINT (-0.78062 0.62873) | 176 | -2.900270 | 0.015071 | 2.859301 |
| 2096 | POINT (-0.12186 0.82847) | 169 | -2.649552 | 0.111426 | 2.564799 |
| 2097 | POINT (-0.42651 0.80629) | 153 | -2.956537 | -0.015596 | 3.426379 |
| 2098 | POINT (0.20776 -0.96359) | 121 | -3.457283 | -0.006289 | 2.840914 |
| 2099 | POINT (1.10750 0.84402) | 142 | -3.344013 | -0.000756 | 2.716826 |

100 rows × 4 columns
Typical input data for phenological modelling contains unique observations for each year and location; together these form the index of the dataframe. In this dummy sample, the target variable is the spring onset day of year, and, for the purpose of illustration, the predictors are aggregated temperature features.
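For reference, a dataframe with this (year, location) index layout could be built by hand with plain pandas. This is a hypothetical sketch, not springtime's actual implementation; springtime's dummy generator uses proper geometry objects, so plain strings stand in for the POINT geometries here.

```python
import pandas as pd

# Hand-built frame mirroring the dummy data's layout (values invented).
records = {
    "year": [2000, 2001],
    "geometry": ["POINT (0.22 0.16)", "POINT (-0.65 -0.32)"],
    "spring onset (DOY)": [143, 160],
    "minimum temperature": [-3.03, -2.98],
    "mean temperature": [0.02, 0.04],
    "maximum temperature": [3.04, 3.27],
}

# Year and location become the (multi-)index; the rest are target + predictors.
df = pd.DataFrame(records).set_index(["year", "geometry"])
```

Any data source that can be massaged into this shape is immediately usable in the workflow below.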
With this data structure, we can run an automatic model comparison with pycaret:
# Run a model comparison
from pycaret.regression import RegressionExperiment
exp = RegressionExperiment()
exp.setup(data=data, target="spring onset (DOY)")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)
| | Description | Value |
|---|---|---|
| 0 | Session id | 6916 |
| 1 | Target | spring onset (DOY) |
| 2 | Target type | Regression |
| 3 | Original data shape | (100, 4) |
| 4 | Transformed data shape | (100, 4) |
| 5 | Transformed train set shape | (70, 4) |
| 6 | Transformed test set shape | (30, 4) |
| 7 | Numeric features | 3 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
| 10 | Numeric imputation | mean |
| 11 | Categorical imputation | mode |
| 12 | Fold Generator | KFold |
| 13 | Fold Number | 10 |
| 14 | CPU Jobs | -1 |
| 15 | Use GPU | False |
| 16 | Log Experiment | False |
| 17 | Experiment Name | reg-default-name |
| 18 | USI | 4b26 |
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| dummy | Dummy Regressor | 13.8748 | 269.9634 | 16.1924 | -0.0925 | 0.1091 | 0.0953 | 0.0180 |
| lr | Linear Regression | 14.2709 | 276.5581 | 16.3719 | -0.1425 | 0.1103 | 0.0981 | 0.0200 |
| rf | Random Forest Regressor | 14.2633 | 305.7650 | 17.1991 | -0.2704 | 0.1161 | 0.0977 | 0.1240 |
[DummyRegressor(), LinearRegression(n_jobs=-1), RandomForestRegressor(n_jobs=-1, random_state=6916)]
Notice that pycaret automatically cleans, splits, trains, and evaluates the various ML models, and can rank them to select the best ones according to the chosen evaluation metric.
Summary & next steps¶
There are various useful packages out there for (phenological) modelling and data retrieval. We chose pycaret as our "modelling framework", as it can automate common tasks such as model fitting, scoring, and saving an experiment. Thus, we need to make sure that all datasets and models of interest are compatible with pycaret.
In the next steps, we will dive into the harmonization of datasets and models.
We recommend you first look at two datasets with distinct characteristics: PEP725 and E-OBS. Then, you may fast-forward to the chapter on combining datasets. The other dataset walkthroughs can be treated as reference material.