Modelling with PyCaret¶
In the springtime project we use PyCaret to train, evaluate, and compare (machine learning) models. PyCaret is a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn and XGBoost.
For a (ML) model to work with PyCaret, it needs to adhere to the scikit-learn API. This API specifies, for example, that each model must have a fit and a predict method. It also specifies the expected structure of the input data. This is why we put considerable effort into standardizing our datasets.
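The contract is small enough to illustrate with a toy estimator. This is a minimal sketch, not part of springtime or PyCaret; the class name and data are made up:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanDayRegressor(BaseEstimator, RegressorMixin):
    """Toy estimator that always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self  # fit must return self, per the scikit-learn convention

    def predict(self, X):
        # One prediction per input row, as the API requires
        return np.full(len(X), self.mean_)


X = np.arange(10).reshape(-1, 1)
y = np.linspace(100, 130, 10)  # fake day-of-year observations
model = MeanDayRegressor().fit(X, y)
preds = model.predict(X)
```

Any class with this fit/predict shape can, in principle, be dropped into the experiments below.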
Basic experiment¶
Let's see what a 'simple' experiment looks like. We'll load the same example data as before, and compare a few 'standard' models. For now, we'll stick to all the default settings of PyCaret.
from springtime.datasets import PEP725Phenor, EOBS
from springtime.utils import germany, PointsFromOther, join_dataframes
years = [2000, 2002]
pep725 = PEP725Phenor(
species="Syringa vulgaris",
years=years,
area=germany,
)
eobs = EOBS(
area=germany,
years=years,
variables=["mean_temperature"],
resample={"frequency": "M", "operator": "mean"},
points=PointsFromOther(source="pep725"),
)
df_pep725 = pep725.load()
eobs.points.get_points(df_pep725)
df_eobs = eobs.load()
df = join_dataframes([df_pep725, df_eobs])
df.head()
year | geometry | day | mean_temperature|1 | mean_temperature|2 | mean_temperature|3 | mean_temperature|4 | mean_temperature|5 | mean_temperature|6 | mean_temperature|7 | mean_temperature|8 | mean_temperature|9 | mean_temperature|10 | mean_temperature|11 | mean_temperature|12
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2000 | POINT (10.00000 49.48330) | 129 | 0.323548 | 3.664483 | 5.358709 | 9.966999 | 14.801293 | 17.880665 | 15.267097 | 18.310001 | 13.865000 | 10.093225 | 5.732333 | 2.525484
2000 | POINT (10.00000 50.85000) | 120 | 0.943226 | 3.795517 | 5.660645 | 10.001336 | 14.421612 | 16.608667 | 14.690001 | 17.148067 | 13.630667 | 9.831611 | 5.651332 | 2.396452
2000 | POINT (10.00000 51.71670) | 116 | 1.694194 | 4.053448 | 5.399354 | 10.563000 | 14.321937 | 16.427999 | 14.901935 | 17.539680 | 14.346666 | 10.444515 | 6.550000 | 3.295161
2000 | POINT (10.00000 52.10000) | 120 | 2.531935 | 4.937242 | 5.771289 | 10.993333 | 14.817741 | 16.890333 | 15.336773 | 17.556454 | 14.623999 | 11.281612 | 7.564668 | 4.204194
2000 | POINT (10.00000 53.08330) | 121 | 2.119677 | 3.988276 | 4.812258 | 9.906999 | 14.663547 | 16.165335 | 15.199677 | 16.573227 | 13.515334 | 10.373870 | 6.470334 | 3.108710
from pycaret.regression import RegressionExperiment
df.reset_index(drop=True, inplace=True) # drop index to avoid sorting errors
exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)
 | Description | Value
---|---|---
0 | Session id | 465 |
1 | Target | day |
2 | Target type | Regression |
3 | Original data shape | (4729, 13) |
4 | Transformed data shape | (4729, 13) |
5 | Transformed train set shape | (3310, 13) |
6 | Transformed test set shape | (1419, 13) |
7 | Numeric features | 12 |
8 | Rows with missing values | 0.8% |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 6b62 |
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---
lr | Linear Regression | 4.1291 | 41.5792 | 6.4204 | 0.4731 | 0.0551 | 0.0344 | 1.6070 |
rf | Random Forest Regressor | 4.2255 | 42.0004 | 6.4530 | 0.4663 | 0.0552 | 0.0351 | 2.6520 |
dummy | Dummy Regressor | 6.7229 | 78.6030 | 8.8533 | -0.0021 | 0.0739 | 0.0554 | 0.1870 |
[LinearRegression(n_jobs=-1), RandomForestRegressor(n_jobs=-1, random_state=465), DummyRegressor()]
PyCaret successfully trained the three models and ranked linear regression best, with a mean absolute error of about 4.1 days and an RMSE of about 6.4 days. This seems okay, but the modest coefficient of determination (R2 ≈ 0.47) suggests that the skill is still limited. Obviously, a more in-depth investigation is needed before drawing conclusions, but the initial scoring is already a great start.
Notice the default settings of PyCaret: it uses a 10-fold cross-validation strategy and performs basic preprocessing, such as simple (mean) imputation of missing data.
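These defaults correspond to standard scikit-learn building blocks. A sketch of the two pieces (PyCaret's actual internals differ; this only illustrates the behaviour):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold

# Mean imputation, like PyCaret's default "simple" numeric imputation
X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
# The NaN is replaced by the column mean: (4.0 + 6.0) / 2 = 5.0

# A 10-fold splitter, mirroring PyCaret's default KFold generator
kf = KFold(n_splits=10)
folds = list(kf.split(np.arange(50)))  # 10 (train, test) index pairs
```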
# TODO: eobs has NANs due to DOY matching of monthly resampled data with leap years.
Custom estimators: interpretML¶
Any model that adheres to the scikit-learn API can be used in PyCaret. For example, we are interested in using interpretML.
Note: you need to have interpretML installed in your springtime environment for this to work:
pip install interpret
from interpret.glassbox import ExplainableBoostingRegressor
ebm = ExplainableBoostingRegressor()
exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy", ebm], n_select=3)
 | Description | Value
---|---|---
0 | Session id | 7467 |
1 | Target | day |
2 | Target type | Regression |
3 | Original data shape | (4729, 13) |
4 | Transformed data shape | (4729, 13) |
5 | Transformed train set shape | (3310, 13) |
6 | Transformed test set shape | (1419, 13) |
7 | Numeric features | 12 |
8 | Rows with missing values | 0.8% |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | bce3 |
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---
3 | ExplainableBoostingRegressor | 4.0748 | 40.6350 | 6.3287 | 0.5021 | 0.0539 | 0.0338 | 2.7770 |
0 | Linear Regression | 4.2003 | 42.7276 | 6.4894 | 0.4761 | 0.0554 | 0.0350 | 0.1440 |
1 | Random Forest Regressor | 4.3170 | 43.7444 | 6.5764 | 0.4626 | 0.0559 | 0.0358 | 1.8840 |
2 | Dummy Regressor | 6.7645 | 80.7642 | 8.9682 | -0.0016 | 0.0746 | 0.0558 | 0.1680 |
[ExplainableBoostingRegressor(), LinearRegression(n_jobs=-1), RandomForestRegressor(n_jobs=-1, random_state=7467)]
Nice! Our explainable boosting machine outperformed both the random forest and the linear regression. This is promising!
Custom estimators: Mixed-effects models¶
Another interesting approach to (interpretable) modelling is the use of mixed-effects models. We found an existing package, MERF, that combines a random forest with random effects. However, the package was not scikit-learn compatible out of the box. Therefore, we've made an adapted version called DumME, which is fully scikit-learn compatible and replaces the default model with the scikit-learn dummy model. You can install this package with
pip install dumme
The nice thing is that you are not tied to the Dummy or Random Forest models for the fixed effects. You can combine it with any other sklearn-compatible (regression) model. Here, we create three mixed effects models and see how they perform.
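The idea behind this family of models can be sketched with a toy version: random intercepts only, a linear fixed-effects model, and a plain alternating loop. This is our own simplification for illustration, not DumME's actual implementation, and all variable names are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, n_clusters = 300, 3
X = rng.normal(size=(n, 2))
clusters = rng.integers(0, n_clusters, n)
b_true = np.array([-5.0, 0.0, 5.0])  # true random intercept per cluster
y = 3 * X[:, 0] + b_true[clusters] + rng.normal(scale=0.5, size=n)

# Alternate between fitting the fixed-effects model on the adjusted target
# and re-estimating the per-cluster intercepts from the residuals.
b_hat = np.zeros(n_clusters)
for _ in range(5):
    fixed = LinearRegression().fit(X, y - b_hat[clusters])
    resid = y - fixed.predict(X)
    for c in range(n_clusters):
        b_hat[c] = resid[clusters == c].mean()
# b_hat now approximates the true per-cluster intercepts
```

Swapping LinearRegression for any other sklearn-compatible regressor is exactly the flexibility described above.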
import numpy as np
from dumme.dumme import MixedEffectsModel
from pycaret.regression import RegressionExperiment
from sklearn.ensemble import RandomForestRegressor
# We need to add a cluster column, otherwise each unique sample will be treated
# as a cluster and the algorithm will be very slow.
df = df.copy()
df["cluster"] = np.random.randint(0, 3, len(df))
ebm = ExplainableBoostingRegressor()
me_dummy = MixedEffectsModel(max_iterations=2)
me_rf = MixedEffectsModel(RandomForestRegressor(), max_iterations=2)
me_ebm = MixedEffectsModel(ExplainableBoostingRegressor(), max_iterations=2)
exp = RegressionExperiment()
exp.setup(data=df, target="day")
# Notice: We can pass in the fit arguments for DumME, but that breaks the other models.
# exp.compare_models(["lr", "rf", "dummy", ebm, me_dummy, me_rf, me_ebm], n_select=3, fit_kwargs={"cluster_column": "cluster"})
# Therefore, instead, we rely on the default of DumME to use the last column as cluster column
exp.compare_models(["lr", "rf", "dummy", ebm, me_dummy, me_rf, me_ebm], n_select=3)
 | Description | Value
---|---|---
0 | Session id | 8566 |
1 | Target | day |
2 | Target type | Regression |
3 | Original data shape | (4729, 14) |
4 | Transformed data shape | (4729, 14) |
5 | Transformed train set shape | (3310, 14) |
6 | Transformed test set shape | (1419, 14) |
7 | Numeric features | 13 |
8 | Rows with missing values | 0.8% |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 51a1 |
 | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec)
---|---|---|---|---|---|---|---|---
6 | MixedEffectsModel | 4.0400 | 41.5648 | 6.4140 | 0.4776 | 0.0551 | 0.0337 | 14.1230 |
3 | ExplainableBoostingRegressor | 4.0393 | 41.5784 | 6.4148 | 0.4775 | 0.0551 | 0.0336 | 4.1410 |
0 | Linear Regression | 4.1456 | 42.9686 | 6.5288 | 0.4586 | 0.0561 | 0.0346 | 0.1370 |
1 | Random Forest Regressor | 4.2927 | 44.7170 | 6.6625 | 0.4370 | 0.0570 | 0.0357 | 2.2600 |
5 | MixedEffectsModel | 4.2860 | 44.7678 | 6.6668 | 0.4365 | 0.0571 | 0.0356 | 10.4570 |
2 | Dummy Regressor | 6.6889 | 79.2088 | 8.8878 | -0.0022 | 0.0743 | 0.0552 | 0.1100 |
4 | MixedEffectsModel | 6.6915 | 79.2414 | 8.8895 | -0.0026 | 0.0743 | 0.0552 | 8.1230 |
[MixedEffectsModel(fe_model=ExplainableBoostingRegressor(), max_iterations=2), ExplainableBoostingRegressor(), LinearRegression(n_jobs=-1)]
That concludes this introduction.
Next steps: adding your own (physical) models¶
While many ML packages already adhere to the scikit-learn API, there are other methods that do not. In particular, we want to compare our ML-based approach with more traditional physics-based models. In the next chapter, we will show how to add a growing degree day model to our framework.