Adding a (physics-based) model¶
So far we've only looked at machine learning models. We are very keen to know how these "smart" approaches compare to more traditional, physics-based models.
PyPhenology is a nice python package with a collection of physics-based models. We would like to compare those models, ideally within the same pycaret framework. However, pyPhenology is not consistent with the scikit-learn API. On the other hand, it is quite possible to cast their equations to a form that does adhere to these standards.
In this notebook, we will walk you through the steps to create a custom estimator, following the scikit-learn documentation. We will show how this is done for pyPhenology's ThermalTime model. At the end of the chapter, you should be able to repeat the trick for the other pyPhenology models as well.
The first blow is half the battle¶
As a starting point, we copied the scikit-learn project template and updated it with some of the information from the pyphenology ThermalTime class. Specifically, we:
- Added RegressorMixin from scikit-learn. This contains some methods specific to regression estimators.
- Replaced
check_X_y
andcheck_array
with the newer_validate_data()
(see SLEP010). - Merged docstrings of ThermalTime and sklearn template
- Changed to google-style docstrings and added type hints to method signatures instead of in the docstrings
- Set the default values of the parameters to sensible ints instead of valid ranges
import numpy as np
from sklearn.base import (
BaseEstimator,
RegressorMixin,
check_is_fitted,
)
from numpy.typing import ArrayLike
class ThermalTime(RegressorMixin, BaseEstimator):
"""Thermal Time Model
The classic growing degree day model using a fixed temperature threshold
above which forcing accumulates.
"""
def __init__(self):
pass
# TODO: consider adding DOY index series to fit/predict as optional argument
def fit(self, X: ArrayLike, y: ArrayLike):
"""Fit the model to the available observations.
Parameters:
X: 2D Array of shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and
for each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
y: 1D Array of length n_samples
Observed DOY of the spring onset for each unique site/year.
Returns:
Fitted model
"""
X, y = self._validate_data(X, y)
# TODO: check additional assumptions about input
# TODO: convert to proper fit; for now set some default values
self.t1_: int = 0
self.T_: int = 5
self.F_: int = 500
# `fit` should always return `self`
return self
def predict(self, X: ArrayLike):
"""Predict values of y given new predictors
Parameters:
X: array-like, shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and
for each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
Returns:
y: array-like, shape (n_samples,)
Predicted DOY of the spring onset for each sample in X.
"""
X = self._validate_data(X)
check_is_fitted(self, ["t1_", "T_", "F_"])
# TODO: Implement real predictions
return np.ones(X.shape[0], dtype=np.int64)
Phew! That's a big start! Notice that we're not passing in anything during initialization (yet). By convention of scikit-learn, the parameters of the model are only set during fit. Fitted parameters can be recognized by their trailing underscore.
This class can already be used, although it doesn't actually fit or predict anything (useful) yet.
However, before we proceed, let's see whether we already adhere to the scikit-learn API.
Checking sklearn compliance¶
Scikit-learn provide a nice compliance checker. With a bit of extra code we can print out which tests it fails (see below).
from sklearn.utils.estimator_checks import check_estimator
# This bit of code allows us to run the checks in a notebook
checks = check_estimator(ThermalTime(), generate_only=True)
passed_checks = 0
failed_checks = 0
for estimator, check in checks:
name = check.func.__name__
try:
check(estimator)
passed_checks += 1
except Exception as exc:
print(f"Check {name} failed with exception: {exc}")
failed_checks += 1
print(f"Passed checks: {passed_checks}, failed checks: {failed_checks}")
Check check_regressors_train failed with exception: Check check_regressors_train failed with exception: Check check_regressors_train failed with exception: Passed checks: 37, failed checks: 3
So far, so good. Most of the checks passed, and if we dive deep into what's being check, we can figure out that the others failed because the predictions were not that good. That makes sense...
We can inform the sklearn checker that this is expect by overriding the
poor_score
tag. Add the following method to your class:
def _more_tags(self):
# Pass checks related to performance of model as the thermaltime model
# cannot be expected to perform well for random data.
# https://scikit-learn.org/stable/developers/develop.html#estimator-tags
return {
'poor_score': True
}
Then, all checks should pass.
Using pytest and source files¶
While it is possible to do all this in a notebook, a neater and more convenient way is to use pytest. To this end:
Store the class definition above in a new file called
thermaltime.py
Create a new file called
test_thermaltime.py
and add the following contentfrom thermaltime import ThermalTime from sklearn.utils.estimator_checks import parametrize_with_checks @parametrize_with_checks([ThermalTime(),]) def test_sklearn_compatible_estimator(estimator, check): check(estimator)
Install pytest:
pip install pytest
Run pytest:
pytest test_thermaltime.py
So far, so good! Next, we need to
- Provide an implementation for the predict method
- Provide an implementation for the fit method
- Consider doing more validation of the input, since now we simply assume that the data is in ordere columns from DOY 1 up to (max) 366.
- Consider allowing an additional argument to fit and predict that contains the column indices in case they're not neatly formatted from 1 to (max) 366. This is allowed by scikit-learn as long as it's an optional argument.
Implementing predict¶
We'll start by implementing the predict method. This is relatively straightforward. We'll write a simple function that takes both X and the parameters, and returns the expected DOY. For ease of reference, we copied the docstrings from above. This implementation should be exactly the same as in pyPhenology.
# Note: Copy this function to your file thermaltime.py
def predict_thermaltime(X, t1: int = 1, T: int = 5, F: int = 500):
"""Make prediction with the thermaltime model.
X: array-like, shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and for
each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
t1: The DOY at which forcing accumulating beings (should be within [-67,298])
T: The threshold above which forcing accumulates (should be within [-25,25])
F: The total forcing units required (should be within [0,1000])
"""
# This allows us to pass both 1D and 2D arrays of temperature
# Copying X to safely modify it later on (may not be necessary, but readable)
X_2d = np.atleast_2d(np.copy(X))
# DOY starts at 1, where python array index start at 0
# TODO: Make this an optional argument?
doy = np.arange(X_2d.shape[1]) + 1
# Exclude days before the start of the growing season
X_2d[:, doy < t1] = 0
# Exclude days with temperature below threshold
X_2d[X_2d < T] = 0
# Accumulate remaining data
S = np.cumsum(X_2d, axis=-1)
# Find first entry that exceeds the total forcing units required.
doy = np.argmax(S > F, axis=-1)
return doy
Let's see how we can use this model:
# 10 degrees every day:
X_test = np.ones(365) * 10
# Predicted spring onset:
predict_thermaltime(X_test)
array([50])
# Also check for 2D X inputs:
X_test = np.ones((10, 365)) * 10
predict_thermaltime(X_test)
array([50, 50, 50, 50, 50, 50, 50, 50, 50, 50])
Good, it seems this works nicely, both for indivual prediction and for 2D arrays of inputs.
Adding tests¶
These quick checks are super useful! We can quickly add a few more and add them to our test file (test_thermaltime.py
). Note: also copy the predict_thermaltime
function from above to your file thermaltime.py
.
# Note: Copy these tests to your file thermaltime.py, uncomment the imports, and remove the bottom part.
# Note: these imports must be uncommented in your test file
# import numpy as np
# from thermaltime import ThermalTime, thermaltime
def test_1d_base_case():
# 10 degrees every day:
X_test = np.ones(365) * 10
assert predict_thermaltime(X_test) == 50
def test_late_growing_season():
# If the growing season starts later, the spring onset is later as well.
X_test = np.ones(365) * 10
assert predict_thermaltime(X_test, t1=11) == 60
def test_higher_threshold():
# If the total accumulated forcing required is higher, spring onset is later.
X_test = np.ones(365) * 10
assert predict_thermaltime(X_test, F=600) == 60
def test_exclude_cold_days():
# If some days are below the minimum growing T, spring onset is later.
X_test = np.ones(365) * 10
X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 3
assert predict_thermaltime(X_test) == 60
def test_lower_temperature_threshold():
# If the minimum growing T is lower, fewer days are exluded. However, the
# accumulated temperature rises more slowly.
X_test = np.ones(365) * 10
X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 5
assert predict_thermaltime(X_test, T=2) == 55
def test_2d():
# Should be able to predict for multiple samples at once
X_test = np.ones((10, 365)) * 10
expected = np.ones(10) * 50
result = predict_thermaltime(X_test)
assert np.all(result == expected)
# Note: The following lines are not needed in your test file. Pytest will
# automatically call all functions starting with "test_".
test_1d_base_case()
test_late_growing_season()
test_higher_threshold()
test_exclude_cold_days()
test_lower_temperature_threshold()
test_2d()
After you've copied the code to your files, you can run pytest again to check that all new tests pass.
Now that we're confident our new predict function works, the last thing we need to do is update the predict method on the class. Change it to look like this:
def predict(self, X: ArrayLike):
"""Predict values of y given new predictors
Parameters:
X: array-like, shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and
for each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
Returns:
y: array-like, shape (n_samples,)
Predicted DOY of the spring onset for each sample in X.
"""
X = self._validate_data(X)
check_is_fitted(self, ["t1_", "T_", "F_"])
return predict_thermaltime(X, self.t1_, self.T_, self.F_)
Implementing the fit
method¶
Now that we can make predictions, we can also think about optimizing the parameters of the model.
Generate training data¶
Our aim is to minimize the difference between the predictions based on the training data X
and the target data y
. Let's first prepare some training data to test the method once we have it.
# Prepare some training data. Take a base temperature of 10 degrees and add a
# random fluctuation on top of it. The corresponding spring onset should
# correlate with the random temperature fluctations.
temp_signal = np.random.randn(10, 365)
X_train = np.ones((10, 365)) * 10 + temp_signal * 10
y_train = np.ones(10) * 50 + temp_signal.mean(axis=1) * 100
# Check that the values are within somewhat realistic ranges
print(X_train.min(), X_train.max())
print(y_train.min(), y_train.max())
-25.002365102085015 50.045377945651 44.55695961294945 57.3285262390998
Exploring algorithms¶
Scipy has a lot of optimization routines. Looking at the list of optimizers, it seems curve-fit
has the exact signature we are looking for. However, if we try it, we will quickly find out that it is not fit for purpose.
from scipy.optimize import curve_fit
initial_guess = [1, 5, 500]
lower_bounds = [-67, -25, 0]
upper_bounds = [298, 25, 1000]
curve_fit(
predict_thermaltime,
X_train,
y_train,
p0=initial_guess,
bounds=(lower_bounds, upper_bounds),
)
(array([ 1.00000001, 5. , 500. ]), array([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]]))
This gives terrible fits! Or rather, it doesn't seem to do much fitting at all. If you play around with the initial settings you'll see that most of the output is rubbish. That's probably because our prediction model is not a nice, continuous functional form. Also, curve-fit
doesn't really like integer parameters such as the DOY (though our implementation doesn't really care either).
Clearly, we need to look further.
Reformulating the problem¶
Alternatively, following pyphenology, we can try the global optimizers instead. Looking at the documentation of the brute-force optimizer, it seems we need to reformulate our problem a little bit. First, we need to define a loss function, i.e. the function that scipy can minimize.
Intuitively, our loss function could look something like this:
def rmse_loss(X, y, t1: int = 1, T: int = 5, F: int = 500):
y_pred = predict_thermaltime(X, t1=t1, T=T, F=F)
sq_err = (y_pred - y) ** 2
return np.mean(sq_err) ** 0.5
# Try it with the default parameters:
rmse_loss(X_train, y_train)
5.6190548414478645
However, this is not consistent with the form expected by scipy's global optimizers: they want something of the form rmse_loss(x, *args)
. So we need to reformulate our loss function in terms of x
and args
.
Nota bene: where the brute-force example finds the minimum of the function in the x-y space given fixed parameters, we want to do the opposite: we want to find the minimum in the parameter space, given the (fixed for each training batch) X and y. So, x
is a tuple of our models params, *args
corresponds to X
and y
, and our loss function becomes something like this:
def rmse_loss(params, X, y):
"""RMSE loss between thermaltime predictions and observations.
params is a tuple with the parameters to thermaltime: (t1, T, F)
"""
y_pred = predict_thermaltime(X, *params)
sq_err = (y_pred - y) ** 2
return np.mean(sq_err) ** 0.5
# Try it with the default parameters:
rmse_loss([1, 5, 500], X_train, y_train)
5.6190548414478645
While the above is good enough, we can stil tweak our loss function a bit further while we're at it. Currently, our loss has the predict_thermaltime
function hardcoded. It would be much nicer if this was more flexible, so it could work for any (roughly similar) model, for example, all the pyphenology models.
One way to do this is to pass the predictor function as one of the arguments.
def rmse_loss(params, f, X, y):
"""RMSE loss between thermaltime predictions and observations.
f is a prediction model of the form f(X, *params)
params is a tuple with the parameters to f
"""
y_pred = f(X, *params)
sq_err = (y_pred - y) ** 2
return np.mean(sq_err) ** 0.5
# Try it with the default parameters:
rmse_loss([1, 5, 500], predict_thermaltime, X_train, y_train)
5.6190548414478645
Brute force optimization¶
Now we're ready to run the brute force optimization.
from scipy.optimize import brute
ranges = ((-67, 298), (-25, 25), (0, 1000))
args = (predict_thermaltime, X_train, y_train)
brute(rmse_loss, ranges=ranges, args=args)
array([48.26315789, 14.47368421, 0. ])
That works! If you look carefully at the results, you will see that the optimization algorithm finds a clever solution to the problem: it sets the total forcing required to 0, and the first day of the growing season very close to the typical spring onset. While this is a perfectly valid solution, especially considering the fake training data we used, it illustrates that you need to be really careful in setting the expected bounds of the parameters, and in interpreting the results.
Another thing to note is how scipy creates the search space. The brute
function has an additional parameter Ns
that determines the number of 'grid points' used for each of the parameters. The default is 20: the ranges we supplied for t1
, T
, and F
are divided in 20 points, such that the algorithm executes our loss function 20 * 20 * 20
times. By increasing the number of points, we get a better estimate, but the number of function calls and therefore the training time increases exponentially.
There are two ways around this:
- Set the points yourself
- Use a more clever algorithm
The first solution is quite simple. We can also define the ranges as slices with a custom step
:
ranges = (slice(-67, 298, 10), slice(-25, 25, 2), slice(0, 1000, 50))
brute(rmse_loss, ranges=ranges, args=args)
array([ 43., -13., 105.])
Basin hopping¶
For the second solution, again, we follow the pyphenology developers. They support "differential evolution" and "basin_hopping". Let's try the basin hopping algorithm. This algorithm works in two steps, delegating calls to our loss function to a local minimization function. Thus, we need to pass in our bounds and args as minimzer_kwargs
instead. We must also pick a method for the local minizer. For consistency with pyphenology, we choose 'LL-BFGS-B'.
from scipy.optimize import basinhopping
initial_guess = (1, 5, 500)
minimizer_args = {
"method": "L-BFGS-B",
"args": (predict_thermaltime, X_train, y_train),
"bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
basinhopping(rmse_loss, x0=initial_guess, minimizer_kwargs=minimizer_args)
message: ['requested number of basinhopping iterations completed successfully'] success: True fun: 5.2379373393649695 x: [ 3.063e+00 2.137e+00 5.034e+02] nit: 100 minimization_failures: 0 nfev: 472 njev: 118 lowest_optimization_result: message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL success: True status: 0 fun: 5.2379373393649695 x: [ 3.063e+00 2.137e+00 5.034e+02] nit: 0 jac: [ 0.000e+00 0.000e+00 0.000e+00] nfev: 4 njev: 1 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
This also works, but notice that the result is still very close to the initial guess. If we change the initial guess to something that's more like the clever solution of the brute method, we find another fit much smaller RMSE value:
initial_guess = (45, 15, 0)
minimizer_args = {
"method": "L-BFGS-B",
"args": (predict_thermaltime, X_train, y_train),
"bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
bh = basinhopping(rmse_loss, x0=initial_guess, minimizer_kwargs=minimizer_args)
print(f"params: {bh.x}, minimum: {bh.fun}")
params: [48.16010104 13.83079594 0.77096103], minimum: 3.2652253589931624
Notice that we can pass additional arguments to the basinhopping algorithm to tweak its performance. For example, the default of 100 iterations is quite small. Using the settings from pyphenology, we get a result that looks much more realistic (and takes much longer to train):
from scipy.optimize import basinhopping
initial_guess = (1, 5, 500)
minimizer_args = {
"method": "L-BFGS-B",
"args": (predict_thermaltime, X_train, y_train),
"bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
optimizer_params = {"niter": 50000, "T": 0.5, "stepsize": 0.5, "disp": False}
basinhopping(
rmse_loss, x0=initial_guess, **optimizer_params, minimizer_kwargs=minimizer_args
)
message: ['requested number of basinhopping iterations completed successfully'] success: True fun: 2.893246475688441 x: [ 4.712e+01 1.243e+01 3.338e+01] nit: 50000 minimization_failures: 0 nfev: 200072 njev: 50018 lowest_optimization_result: message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL success: True status: 0 fun: 2.893246475688441 x: [ 4.712e+01 1.243e+01 3.338e+01] nit: 0 jac: [ 0.000e+00 0.000e+00 0.000e+00] nfev: 4 njev: 1 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
Great! Now we have a good grip on the different ways to fit our model. A final thing to point out is that the basinhopping algorithm returns floats for each parameter. This doesn't really make sense for the start of the growing season, so in the post-processing we should probably round that to the nearest int.
Bringing everything together¶
If we bring everything from above over to the fit
method, we get something like this:
def fit(self, X: ArrayLike, y: ArrayLike):
"""Fit the model to the available observations.
Parameters:
X: 2D Array of shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and
for each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
y: 1D Array of length n_samples
Observed DOY of the spring onset for each unique site/year.
Returns:
Fitted model
"""
X, y = self._validate_data(X, y)
# TODO: check additional assumptions about input
initial_guess = (1, 5, 500)
bounds = ((-67, 298), (-25, 25), (0, 1000))
minimizer_args = {
"method": "L-BFGS-B",
"args": (predict_thermaltime, X, y),
"bounds": bounds,
}
optimizer_params = {"niter": 50000, "T": 0.5, "stepsize": 0.5, "disp": False}
bh = basinhopping(
rmse_loss,
x0=initial_guess,
minimizer_kwargs=minimizer_args,
**optimizer_params,
)
self.t1_, self.T_, self.F_ = bh.x
return self
This is not yet ready:
Initial guess
andbounds
are hardcoded and closely related to the equally hardcodedthermaltime
and the loss function. If we want to reuse this for other physics-based models, it would be nice to try and separate these a bit better.- The
optimizer_params
are hardcoded as well. These are typical (hyper) parameters that can be passed to the__init__
method of our class, since these settings are not data-dependent. - Currently the only optimization routine supported is the basinhopping algorithm.
Putting it to the test¶
Below we pasted a slighly more polished version, as a complete code. We
generalized the loss function a bit better, and the core_model-related
parameters are grouped together at the top. The fit method is mostly concerned
with formulating the call to the basinhopping algorithm. We used the
staticmethod
decorator to make the thermaltime function a method on the class.
Notice that, in contrast with standard class methods, it doesn't need access to
self
.
import numpy as np
from sklearn.base import (
BaseEstimator,
RegressorMixin,
check_is_fitted,
)
from numpy.typing import ArrayLike
from scipy.optimize import basinhopping
def rmse_loss(params, f, X, y):
"""RMSE loss between thermaltime predictions and observations.
f is a prediction model of the form f(X, *params)
params is a tuple with the parameters to f
"""
y_pred = f(X, *params)
sq_err = (y_pred - y) ** 2
return np.mean(sq_err) ** 0.5
LOSS_FUNCTIONS = {
"RMSE": rmse_loss,
}
class ThermalTime(RegressorMixin, BaseEstimator):
"""Thermal Time Model
The classic growing degree day model using a fixed temperature threshold
above which forcing accumulates.
This implementation uses scipy's basinhopper optimization algorithm for
fitting the model.
"""
_core_params_names = ("t1", "T", "F")
_core_params_defaults = (1, 5, 500)
_core_params_bounds = ((-67, 298), (-25, 25), (0, 1000))
def __init__(
self,
niter=50000,
T=0.5,
stepsize=0.5,
disp=False,
minimizer_method="L-BFGS-B",
loss_function="RMSE",
):
self.niter = niter
self.T = T
self.stepsize = stepsize
self.disp = disp
self.minimizer_method = minimizer_method
self.loss_function = loss_function
@staticmethod
def _core_model(X, t1: int = 1, T: int = 5, F: int = 500):
"""Make prediction with the thermaltime model.
Args:
X: array-like, shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and for
each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
t1: The DOY at which forcing accumulating beings (should be within [-67,298])
T: The threshold above which forcing accumulates (should be within [-25,25])
F: The total forcing units required (should be within [0,1000])
"""
# This allows us to pass both 1D and 2D arrays of temperature
# Copying X to safely modify it later on (may not be necessary, but readable)
X_2d = np.atleast_2d(np.copy(X))
# DOY starts at 1, where python array index start at 0
# TODO: Make this an optional argument?
doy = np.arange(X_2d.shape[1]) + 1
# Exclude days before the start of the growing season
X_2d[:, doy < t1] = 0
# Exclude days with temperature below threshold
X_2d[X_2d < T] = 0
# Accumulate remaining data
S = np.cumsum(X_2d, axis=-1)
# Find first entry that exceeds the total forcing units required.
doy = np.argmax(S > F, axis=-1)
return doy
# TODO: consider adding DOY index series to fit/predict as optional argument
def fit(self, X: ArrayLike, y: ArrayLike):
"""Fit the model to the available observations.
Parameters:
X: 2D Array of shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and
for each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
y: 1D Array of length n_samples
Observed DOY of the spring onset for each unique site/year.
Returns:
Fitted model
"""
X, y = self._validate_data(X, y)
# TODO: check additional assumptions about input
loss_function = LOSS_FUNCTIONS[self.loss_function]
# Perform the fit
bh = basinhopping(
loss_function,
x0=self._core_params_defaults,
niter=self.niter,
T=self.T,
stepsize=self.stepsize,
disp=self.disp,
minimizer_kwargs={
"method": self.minimizer_method,
"args": (self._core_model, X, y),
"bounds": self._core_params_bounds,
},
)
# Store the fitted parameters
self.core_params_ = bh.x
return self
def predict(self, X: ArrayLike):
"""Predict values of y given new predictors
Parameters:
X: array-like, shape (n_samples, n_features).
Daily mean temperatures for each unique site/year (n_samples) and
for each DOY (n_features). The first feature should correspond to
the first DOY, and so forth up to (max) 366.
Returns:
y: array-like, shape (n_samples,)
Predicted DOY of the spring onset for each sample in X.
"""
X = self._validate_data(X)
check_is_fitted(self, "core_params_")
return self._core_model(X, *self.core_params_)
While still not perfect, this implementation is good enough to see if we can use it in our pycaret framework. For that, we first need to combine our data in a pandas dataframe.
import pandas as pd
df = pd.DataFrame(X_train)
df["y"] = y_train
df.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15.218994 | 6.738299 | 13.954624 | 29.450302 | -3.373456 | 17.349404 | 11.489199 | -0.946024 | 9.447005 | 14.721903 | ... | 17.564457 | 8.209593 | 10.716687 | 2.374732 | 22.605012 | 2.639938 | 25.742186 | 9.589601 | -4.323429 | 48.583246 |
1 | 22.773274 | 19.317239 | -0.591967 | 13.633791 | 4.640599 | 2.588708 | 11.031384 | 10.298080 | -1.175578 | -20.222365 | ... | -1.453457 | 7.593247 | 4.730469 | 11.483520 | 14.818011 | 25.799254 | 15.006352 | 3.442988 | -3.945578 | 50.461037 |
2 | 16.166114 | 6.776026 | 3.693059 | 8.057823 | 5.018413 | 23.824257 | 0.922434 | -1.134289 | 17.514239 | 10.534730 | ... | -1.523276 | 15.888692 | -3.116856 | 3.443013 | 4.458324 | 12.345812 | 8.165304 | 21.404692 | 5.190300 | 52.001860 |
3 | -4.454030 | 12.983264 | 18.690565 | 14.266377 | 13.291256 | 9.783686 | 18.153094 | 15.724669 | -13.733181 | 7.353611 | ... | 11.454641 | 11.915009 | 10.903428 | 8.284341 | 0.811677 | 9.052229 | 7.078002 | 0.597021 | 10.312079 | 57.328526 |
4 | 11.336732 | 5.656002 | 9.542669 | 15.932466 | 22.050319 | 5.539453 | 10.183360 | 4.398088 | 4.344155 | 10.000675 | ... | 6.672839 | 29.814395 | 8.529970 | 1.693037 | 1.629311 | 0.112947 | 12.897052 | 24.386493 | 3.704314 | 44.556960 |
5 rows × 366 columns
from pycaret.regression import RegressionExperiment
exp = RegressionExperiment()
exp.setup(data=df, target="y")
exp.compare_models(["lr", "rf", "dummy", ThermalTime()], cross_validation=False)
Description | Value | |
---|---|---|
0 | Session id | 8334 |
1 | Target | y |
2 | Target type | Regression |
3 | Original data shape | (10, 366) |
4 | Transformed data shape | (10, 366) |
5 | Transformed train set shape | (7, 366) |
6 | Transformed test set shape | (3, 366) |
7 | Numeric features | 365 |
8 | Preprocess | True |
9 | Imputation type | simple |
10 | Numeric imputation | mean |
11 | Categorical imputation | mode |
12 | Fold Generator | KFold |
13 | Fold Number | 10 |
14 | CPU Jobs | -1 |
15 | Use GPU | False |
16 | Log Experiment | False |
17 | Experiment Name | reg-default-name |
18 | USI | c9d5 |
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
3 | ThermalTime | 3.7702 | 20.0271 | 4.4752 | 0.2647 | 0.0856 | 0.0736 | 64.4000 |
1 | Random Forest Regressor | 4.1401 | 25.6182 | 5.0614 | 0.0594 | 0.0977 | 0.0818 | 0.6500 |
0 | Linear Regression | 4.2148 | 25.9996 | 5.0990 | 0.0454 | 0.0977 | 0.0823 | 0.0600 |
2 | Dummy Regressor | 4.4073 | 27.8321 | 5.2756 | -0.0218 | 0.1014 | 0.0863 | 0.0500 |
ThermalTime()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ThermalTime()
Unfortunately, but not surprisingly, our new model is not very skillful. If you execute this notebook several times, scores will vary from pretty okay to very bad. But this is to be expected of a training set of 10 samples based on some vaguely invented data...
Towards a generic implementation¶
We hope this tutorial has provided a good basis for developing this further.
There's still a number of things that could be improved further. For example, we
mentioned the option to pass in additional arguments to the fit
method based,
e.g. to use a deviating DOY index. If you look at the code in
springtime.models,
you will see that we have generalized the code a bit further. To add a new
model, as long as it's similar enough to the thermaltime model, you only need to
implement the analogue of the thermaltime_predict
method.
Compared to pyphenology, an important difference is that we based our main class off the correpsonding scipy estimator, and you can pass in different phenology models. By contrast, in pyphenology, there is one main class for each phenology model, while you can specify the optimization algorithm. There is something to say for both choices; we feel the present implementation aligns a bit better with the way in which sklearn defines hyperparameters of a model. Thus, you can now use this model in springtime by importing it like so:
from springtime.models.basinhopper import BasinHopper
model = BasinHopper(core_model="thermaltime", loss_function="RMSE")
model.fit(X_train, y_train)
# model.predict(X_new)
BasinHopper()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BasinHopper()
And that's it!