Modelling with PyCaret¶
In the springtime project we use PyCaret to train, evaluate, and compare (machine learning) models. PyCaret is a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn and XGBoost.
For a (ML) model to work with PyCaret, it needs to adhere to the scikit-learn API. This API specifies, for example, that each model must have a fit and a predict method. It also specifies the expected structure of the input data. This is why we put considerable effort into standardizing our datasets.
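The contract is small enough to illustrate with a toy estimator. This is a minimal sketch, not part of springtime or PyCaret; the class name and data are made up:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanDayRegressor(BaseEstimator, RegressorMixin):
    """Toy estimator that always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self  # fit must return self, per the scikit-learn convention

    def predict(self, X):
        # One prediction per input row, as the API requires
        return np.full(len(X), self.mean_)


X = np.arange(10).reshape(-1, 1)
y = np.linspace(100, 130, 10)  # fake day-of-year observations
model = MeanDayRegressor().fit(X, y)
preds = model.predict(X)
```

Any class with this fit/predict shape can, in principle, be dropped into the experiments below.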
Basic experiment¶
Let's see what a 'simple' experiment looks like. We'll load the same example data as before, and compare a few 'standard' models. For now, we'll stick to all the default settings of PyCaret.
from springtime.datasets import PEP725Phenor, EOBS
from springtime.utils import germany, PointsFromOther, join_dataframes
years = [2000, 2002]
pep725 = PEP725Phenor(
species="Syringa vulgaris",
years=years,
area=germany,
)
eobs = EOBS(
area=germany,
years=years,
variables=["mean_temperature"],
resample={"frequency": "M", "operator": "mean"},
points=PointsFromOther(source="pep725"),
)
df_pep725 = pep725.load()
eobs.points.get_points(df_pep725)
df_eobs = eobs.load()
df = join_dataframes([df_pep725, df_eobs])
df.head()
year | geometry | day | mean_temperature|1 | mean_temperature|2 | mean_temperature|3 | mean_temperature|4 | mean_temperature|5 | mean_temperature|6 | mean_temperature|7 | mean_temperature|8 | mean_temperature|9 | mean_temperature|10 | mean_temperature|11 | mean_temperature|12
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2000 | POINT (10.00000 49.48330) | 129 | 0.323548 | 3.664483 | 5.358709 | 9.966999 | 14.801293 | 17.880665 | 15.267097 | 18.310001 | 13.865000 | 10.093225 | 5.732333 | 2.525484
2000 | POINT (10.00000 50.85000) | 120 | 0.943226 | 3.795517 | 5.660645 | 10.001336 | 14.421612 | 16.608667 | 14.690001 | 17.148067 | 13.630667 | 9.831611 | 5.651332 | 2.396452
2000 | POINT (10.00000 51.71670) | 116 | 1.694194 | 4.053448 | 5.399354 | 10.563000 | 14.321937 | 16.427999 | 14.901935 | 17.539680 | 14.346666 | 10.444515 | 6.550000 | 3.295161
2000 | POINT (10.00000 52.10000) | 120 | 2.531935 | 4.937242 | 5.771289 | 10.993333 | 14.817741 | 16.890333 | 15.336773 | 17.556454 | 14.623999 | 11.281612 | 7.564668 | 4.204194
2000 | POINT (10.00000 53.08330) | 121 | 2.119677 | 3.988276 | 4.812258 | 9.906999 | 14.663547 | 16.165335 | 15.199677 | 16.573227 | 13.515334 | 10.373870 | 6.470334 | 3.108710
from pycaret.regression import RegressionExperiment
df.reset_index(drop=True, inplace=True) # drop index to avoid sorting errors
exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)
 | Description | Value
---|---|---
0 | Session id | 465 |
1 | Target | day |
2 | Target type | Regression |
3 | Original data shape | (4729, 13) |
4 | Transformed data shape | (4729, 13) |
5 | Transformed train set shape | (3310, 13) |
6 | Transformed test set shape | (1419, 13) |
7 | Numeric features | 12 |
8 | Rows with missing values | 0.8% |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 6b62 |
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---
lr | Linear Regression | 4.1291 | 41.5792 | 6.4204 | 0.4731 | 0.0551 | 0.0344 | 1.6070 |
rf | Random Forest Regressor | 4.2255 | 42.0004 | 6.4530 | 0.4663 | 0.0552 | 0.0351 | 2.6520 |
dummy | Dummy Regressor | 6.7229 | 78.6030 | 8.8533 | -0.0021 | 0.0739 | 0.0554 | 0.1870 |
[LinearRegression(n_jobs=-1), RandomForestRegressor(n_jobs=-1, random_state=465), DummyRegressor()]
PyCaret successfully trained the three models and ranked linear regression best, with a mean absolute error of about 4.1 days and an RMSE of about 6.4 days. This seems okay, but the modest coefficient of determination (R2 ≈ 0.47) suggests that the skill is still limited. Obviously, a more in-depth investigation is needed before drawing conclusions, but the initial scoring is already a great start.
Notice the default settings of PyCaret: it uses a 10-fold cross-validation strategy and performs basic preprocessing, such as simple (mean) imputation of missing data.
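These defaults correspond to standard scikit-learn building blocks. A sketch of the two pieces (PyCaret's actual internals differ; this only illustrates the behaviour):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold

# Mean imputation, like PyCaret's default "simple" numeric imputation
X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
# The NaN is replaced by the column mean: (4.0 + 6.0) / 2 = 5.0

# A 10-fold splitter, mirroring PyCaret's default KFold generator
kf = KFold(n_splits=10)
folds = list(kf.split(np.arange(50)))  # 10 (train, test) index pairs
```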
# TODO: eobs has NANs due to DOY matching of monthly resampled data with leap years.
Custom estimators: interpretML¶
Any model that adheres to the scikit-learn API can be used in PyCaret. For example, we are interested in using interpretML.
Note: you need to have interpretML installed in your springtime environment for this to work:
pip install interpret
from interpret.glassbox import ExplainableBoostingRegressor
ebm = ExplainableBoostingRegressor()
exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy", ebm], n_select=3)
 | Description | Value
---|---|---
0 | Session id | 7467 |
1 | Target | day |
2 | Target type | Regression |
3 | Original data shape | (4729, 13) |
4 | Transformed data shape | (4729, 13) |
5 | Transformed train set shape | (3310, 13) |
6 | Transformed test set shape | (1419, 13) |
7 | Numeric features | 12 |
8 | Rows with missing values | 0.8% |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | bce3 |
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---
3 | ExplainableBoostingRegressor | 4.0748 | 40.6350 | 6.3287 | 0.5021 | 0.0539 | 0.0338 | 2.7770 |
0 | Linear Regression | 4.2003 | 42.7276 | 6.4894 | 0.4761 | 0.0554 | 0.0350 | 0.1440 |
1 | Random Forest Regressor | 4.3170 | 43.7444 | 6.5764 | 0.4626 | 0.0559 | 0.0358 | 1.8840 |
2 | Dummy Regressor | 6.7645 | 80.7642 | 8.9682 | -0.0016 | 0.0746 | 0.0558 | 0.1680 |
[ExplainableBoostingRegressor(), LinearRegression(n_jobs=-1), RandomForestRegressor(n_jobs=-1, random_state=7467)]
Nice! Our explainable boosting machine outperformed both the random forest and the linear regression. This is promising!
Custom estimators: Mixed-effects models¶
Another interesting approach to (interpretable) modelling is the use of mixed-effects models. We found an existing package, MERF, that combines a random forest with random effects. However, the package was not scikit-learn compatible out of the box. Therefore, we've made an adapted version called DumME, which is fully scikit-learn compatible and replaces the default model with the scikit-learn dummy model. You can install this package with
pip install dumme
The nice thing is that you are not tied to the Dummy or Random Forest models for the fixed effects. You can combine it with any other sklearn-compatible (regression) model. Here, we create three mixed effects models and see how they perform.
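The idea behind this family of models can be sketched with a toy version: random intercepts only, a linear fixed-effects model, and a plain alternating loop. This is our own simplification for illustration, not DumME's actual implementation, and all variable names are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, n_clusters = 300, 3
X = rng.normal(size=(n, 2))
clusters = rng.integers(0, n_clusters, n)
b_true = np.array([-5.0, 0.0, 5.0])  # true random intercept per cluster
y = 3 * X[:, 0] + b_true[clusters] + rng.normal(scale=0.5, size=n)

# Alternate between fitting the fixed-effects model on the adjusted target
# and re-estimating the per-cluster intercepts from the residuals.
b_hat = np.zeros(n_clusters)
for _ in range(5):
    fixed = LinearRegression().fit(X, y - b_hat[clusters])
    resid = y - fixed.predict(X)
    for c in range(n_clusters):
        b_hat[c] = resid[clusters == c].mean()
# b_hat now approximates the true per-cluster intercepts
```

Swapping LinearRegression for any other sklearn-compatible regressor is exactly the flexibility described above.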
import numpy as np
from dumme.dumme import MixedEffectsModel
from pycaret.regression import RegressionExperiment
from sklearn.ensemble import RandomForestRegressor
# We need to add a cluster column, otherwise each unique sample will be treated
# as a cluster and the algorithm will be very slow.
df = df.copy()
df["cluster"] = np.random.randint(0, 3, len(df))
ebm = ExplainableBoostingRegressor()
me_dummy = MixedEffectsModel(max_iterations=2)
me_rf = MixedEffectsModel(RandomForestRegressor(), max_iterations=2)
me_ebm = MixedEffectsModel(ExplainableBoostingRegressor(), max_iterations=2)
exp = RegressionExperiment()
exp.setup(data=df, target="day")
# Notice: We can pass in the fit arguments for DumME, but that breaks the other models.
# exp.compare_models(["lr", "rf", "dummy", ebm, me_dummy, me_rf, me_ebm], n_select=3, fit_kwargs={"cluster_column": "cluster"})
# Therefore, instead, we rely on the default of DumME to use the last column as cluster column
exp.compare_models(["lr", "rf", "dummy", ebm, me_dummy, me_rf, me_ebm], n_select=3)
 | Description | Value
---|---|---
0 | Session id | 8566 |
1 | Target | day |
2 | Target type | Regression |
3 | Original data shape | (4729, 14) |
4 | Transformed data shape | (4729, 14) |
5 | Transformed train set shape | (3310, 14) |
6 | Transformed test set shape | (1419, 14) |
7 | Numeric features | 13 |
8 | Rows with missing values | 0.8% |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 51a1 |
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---
6 | MixedEffectsModel | 4.0400 | 41.5648 | 6.4140 | 0.4776 | 0.0551 | 0.0337 | 14.1230 |
3 | ExplainableBoostingRegressor | 4.0393 | 41.5784 | 6.4148 | 0.4775 | 0.0551 | 0.0336 | 4.1410 |
0 | Linear Regression | 4.1456 | 42.9686 | 6.5288 | 0.4586 | 0.0561 | 0.0346 | 0.1370 |
1 | Random Forest Regressor | 4.2927 | 44.7170 | 6.6625 | 0.4370 | 0.0570 | 0.0357 | 2.2600 |
5 | MixedEffectsModel | 4.2860 | 44.7678 | 6.6668 | 0.4365 | 0.0571 | 0.0356 | 10.4570 |
2 | Dummy Regressor | 6.6889 | 79.2088 | 8.8878 | -0.0022 | 0.0743 | 0.0552 | 0.1100 |
4 | MixedEffectsModel | 6.6915 | 79.2414 | 8.8895 | -0.0026 | 0.0743 | 0.0552 | 8.1230 |
[MixedEffectsModel(fe_model=ExplainableBoostingRegressor(), max_iterations=2), ExplainableBoostingRegressor(), LinearRegression(n_jobs=-1)]
That concludes this introduction.
Next steps: adding your own (physical) models¶
While many ML packages already adhere to the scikit-learn API, there are other methods that do not. In particular, we want to compare our ML-based approach with more traditional physics-based models. In the next chapter, we will show how to add a growing degree day model to our framework.