Adding a (physics-based) model¶

So far we've only looked at machine learning models. We are very keen to know how these "smart" approaches compare to more traditional, physics-based models.

PyPhenology is a nice python package with a collection of physics-based models. We would like to compare those models, ideally within the same pycaret framework. However, pyPhenology is not consistent with the scikit-learn API. On the other hand, it is quite possible to cast their equations to a form that does adhere to these standards.

In this notebook, we will walk you through the steps to create a custom estimator, following the scikit-learn documentation. We will show how this is done for pyPhenology's ThermalTime model. At the end of the chapter, you should be able to repeat the trick for the other pyPhenology models as well.

The first blow is half the battle¶

As a starting point, we copied the scikit-learn project template and updated it with some of the information from the pyphenology ThermalTime class. Specifically, we:

Added RegressorMixin from scikit-learn. This contains some methods specific to regression estimators.
Replaced check_X_y and check_array with the newer _validate_data() (see SLEP010).
Merged docstrings of ThermalTime and sklearn template
Changed to google-style docstrings and added type hints to method signatures instead of in the docstrings
Set the default values of the parameters to sensible ints instead of valid ranges

In [1]:

Copied!





import numpy as np
from sklearn.base import (
    BaseEstimator,
    RegressorMixin,
    check_is_fitted,
)
from numpy.typing import ArrayLike


class ThermalTime(RegressorMixin, BaseEstimator):
    """Thermal Time Model

    The classic growing degree day model using a fixed temperature threshold
    above which forcing accumulates.
    """

    def __init__(self):
        pass

    # TODO: consider adding DOY index series to fit/predict as optional argument
    def fit(self, X: ArrayLike, y: ArrayLike):
        """Fit the model to the available observations.

        Parameters:
            X: 2D Array of shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and
                for each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            y: 1D Array of length n_samples
                Observed DOY of the spring onset for each unique site/year.

        Returns:
            Fitted model
        """
        X, y = self._validate_data(X, y)
        # TODO: check additional assumptions about input

        # TODO: convert to proper fit; for now set some default values
        self.t1_: int = 0
        self.T_: int = 5
        self.F_: int = 500

        # `fit` should always return `self`
        return self

    def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        X = self._validate_data(X)
        check_is_fitted(self, ["t1_", "T_", "F_"])

        # TODO: Implement real predictions
        return np.ones(X.shape[0], dtype=np.int64)
import numpy as np
from sklearn.base import (
    BaseEstimator,
    RegressorMixin,
    check_is_fitted,
)
from numpy.typing import ArrayLike


class ThermalTime(RegressorMixin, BaseEstimator):
    """Thermal Time Model

    The classic growing degree day model using a fixed temperature threshold
    above which forcing accumulates.
    """

    def __init__(self):
        pass

    # TODO: consider adding DOY index series to fit/predict as optional argument
    def fit(self, X: ArrayLike, y: ArrayLike):
        """Fit the model to the available observations.

        Parameters:
            X: 2D Array of shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and
                for each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            y: 1D Array of length n_samples
                Observed DOY of the spring onset for each unique site/year.

        Returns:
            Fitted model
        """
        X, y = self._validate_data(X, y)
        # TODO: check additional assumptions about input

        # TODO: convert to proper fit; for now set some default values
        self.t1_: int = 0
        self.T_: int = 5
        self.F_: int = 500

        # `fit` should always return `self`
        return self

    def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        X = self._validate_data(X)
        check_is_fitted(self, ["t1_", "T_", "F_"])

        # TODO: Implement real predictions
        return np.ones(X.shape[0], dtype=np.int64)

Phew! That's a big start! Notice that we're not passing in anything during initialization (yet). By convention of scikit-learn, the parameters of the model are only set during fit. Fitted parameters can be recognized by their trailing underscore.

This class can already be used, although it doesn't actually fit or predict anything (useful) yet.

However, before we proceed, let's see whether we already adhere to the scikit-learn API.

Checking sklearn compliance¶

Scikit-learn provide a nice compliance checker. With a bit of extra code we can print out which tests it fails (see below).

In [2]:

Copied!





from sklearn.utils.estimator_checks import check_estimator

# This bit of code allows us to run the checks in a notebook
checks = check_estimator(ThermalTime(), generate_only=True)
passed_checks = 0
failed_checks = 0
for estimator, check in checks:
    name = check.func.__name__
    try:
        check(estimator)
        passed_checks += 1
    except Exception as exc:
        print(f"Check {name} failed with exception: {exc}")
        failed_checks += 1
print(f"Passed checks: {passed_checks}, failed checks: {failed_checks}")
from sklearn.utils.estimator_checks import check_estimator

# This bit of code allows us to run the checks in a notebook
checks = check_estimator(ThermalTime(), generate_only=True)
passed_checks = 0
failed_checks = 0
for estimator, check in checks:
    name = check.func.__name__
    try:
        check(estimator)
        passed_checks += 1
    except Exception as exc:
        print(f"Check {name} failed with exception: {exc}")
        failed_checks += 1
print(f"Passed checks: {passed_checks}, failed checks: {failed_checks}")

Check check_regressors_train failed with exception: 
Check check_regressors_train failed with exception: 
Check check_regressors_train failed with exception: 
Passed checks: 37, failed checks: 3

So far, so good. Most of the checks passed, and if we dive deep into what's being check, we can figure out that the others failed because the predictions were not that good. That makes sense...

We can inform the sklearn checker that this is expect by overriding the poor_score tag. Add the following method to your class:

def _more_tags(self):
        # Pass checks related to performance of model as the thermaltime model
        # cannot be expected to perform well for random data.
        # https://scikit-learn.org/stable/developers/develop.html#estimator-tags
        return {
            'poor_score': True
        }

Then, all checks should pass.

Using pytest and source files¶

While it is possible to do all this in a notebook, a neater and more convenient way is to use pytest. To this end:

Store the class definition above in a new file called thermaltime.py

Create a new file called test_thermaltime.py and add the following content

from thermaltime import ThermalTime

from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([ThermalTime(),])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

Install pytest: pip install pytest
Run pytest: pytest test_thermaltime.py

So far, so good! Next, we need to

Provide an implementation for the predict method
Provide an implementation for the fit method
Consider doing more validation of the input, since now we simply assume that the data is in ordere columns from DOY 1 up to (max) 366.
Consider allowing an additional argument to fit and predict that contains the column indices in case they're not neatly formatted from 1 to (max) 366. This is allowed by scikit-learn as long as it's an optional argument.

Implementing predict¶

We'll start by implementing the predict method. This is relatively straightforward. We'll write a simple function that takes both X and the parameters, and returns the expected DOY. For ease of reference, we copied the docstrings from above. This implementation should be exactly the same as in pyPhenology.

In [3]:

Copied!





# Note: Copy this function to your file thermaltime.py


def predict_thermaltime(X, t1: int = 1, T: int = 5, F: int = 500):
    """Make prediction with the thermaltime model.

    X: array-like, shape (n_samples, n_features).
       Daily mean temperatures for each unique site/year (n_samples) and for
       each DOY (n_features). The first feature should correspond to
       the first DOY, and so forth up to (max) 366.
    t1: The DOY at which forcing accumulating beings (should be within [-67,298])
    T: The threshold above which forcing accumulates (should be within [-25,25])
    F: The total forcing units required (should be within [0,1000])
    """
    # This allows us to pass both 1D and 2D arrays of temperature
    # Copying X to safely modify it later on (may not be necessary, but readable)
    X_2d = np.atleast_2d(np.copy(X))

    # DOY starts at 1, where python array index start at 0
    # TODO: Make this an optional argument?
    doy = np.arange(X_2d.shape[1]) + 1

    # Exclude days before the start of the growing season
    X_2d[:, doy < t1] = 0

    # Exclude days with temperature below threshold
    X_2d[X_2d < T] = 0

    # Accumulate remaining data
    S = np.cumsum(X_2d, axis=-1)

    # Find first entry that exceeds the total forcing units required.
    doy = np.argmax(S > F, axis=-1)

    return doy
# Note: Copy this function to your file thermaltime.py


def predict_thermaltime(X, t1: int = 1, T: int = 5, F: int = 500):
    """Make prediction with the thermaltime model.

    X: array-like, shape (n_samples, n_features).
       Daily mean temperatures for each unique site/year (n_samples) and for
       each DOY (n_features). The first feature should correspond to
       the first DOY, and so forth up to (max) 366.
    t1: The DOY at which forcing accumulating beings (should be within [-67,298])
    T: The threshold above which forcing accumulates (should be within [-25,25])
    F: The total forcing units required (should be within [0,1000])
    """
    # This allows us to pass both 1D and 2D arrays of temperature
    # Copying X to safely modify it later on (may not be necessary, but readable)
    X_2d = np.atleast_2d(np.copy(X))

    # DOY starts at 1, where python array index start at 0
    # TODO: Make this an optional argument?
    doy = np.arange(X_2d.shape[1]) + 1

    # Exclude days before the start of the growing season
    X_2d[:, doy < t1] = 0

    # Exclude days with temperature below threshold
    X_2d[X_2d < T] = 0

    # Accumulate remaining data
    S = np.cumsum(X_2d, axis=-1)

    # Find first entry that exceeds the total forcing units required.
    doy = np.argmax(S > F, axis=-1)

    return doy

Let's see how we can use this model:

In [4]:

Copied!

# 10 degrees every day:
X_test = np.ones(365) * 10

# Predicted spring onset:
predict_thermaltime(X_test)
# 10 degrees every day:
X_test = np.ones(365) * 10

# Predicted spring onset:
predict_thermaltime(X_test)

Out[4]:

array([50])

In [5]:

Copied!

# Also check for 2D X inputs:
X_test = np.ones((10, 365)) * 10
predict_thermaltime(X_test)
# Also check for 2D X inputs:
X_test = np.ones((10, 365)) * 10
predict_thermaltime(X_test)

Out[5]:

array([50, 50, 50, 50, 50, 50, 50, 50, 50, 50])

Good, it seems this works nicely, both for indivual prediction and for 2D arrays of inputs.

Adding tests¶

These quick checks are super useful! We can quickly add a few more and add them to our test file (test_thermaltime.py). Note: also copy the predict_thermaltime function from above to your file thermaltime.py.

In [6]:

Copied!





# Note: Copy these tests to your file thermaltime.py, uncomment the imports, and remove the bottom part.

# Note: these imports must be uncommented in your test file
# import numpy as np
# from thermaltime import ThermalTime, thermaltime


def test_1d_base_case():
    # 10 degrees every day:
    X_test = np.ones(365) * 10
    assert predict_thermaltime(X_test) == 50


def test_late_growing_season():
    # If the growing season starts later, the spring onset is later as well.
    X_test = np.ones(365) * 10
    assert predict_thermaltime(X_test, t1=11) == 60


def test_higher_threshold():
    # If the total accumulated forcing required is higher, spring onset is later.
    X_test = np.ones(365) * 10
    assert predict_thermaltime(X_test, F=600) == 60


def test_exclude_cold_days():
    # If some days are below the minimum growing T, spring onset is later.
    X_test = np.ones(365) * 10
    X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 3
    assert predict_thermaltime(X_test) == 60


def test_lower_temperature_threshold():
    # If the minimum growing T is lower, fewer days are exluded. However, the
    # accumulated temperature rises more slowly.
    X_test = np.ones(365) * 10

    X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 5
    assert predict_thermaltime(X_test, T=2) == 55


def test_2d():
    # Should be able to predict for multiple samples at once
    X_test = np.ones((10, 365)) * 10
    expected = np.ones(10) * 50
    result = predict_thermaltime(X_test)
    assert np.all(result == expected)


# Note: The following lines are not needed in your test file. Pytest will
# automatically call all functions starting with "test_".
test_1d_base_case()
test_late_growing_season()
test_higher_threshold()
test_exclude_cold_days()
test_lower_temperature_threshold()
test_2d()
# Note: Copy these tests to your file thermaltime.py, uncomment the imports, and remove the bottom part.

# Note: these imports must be uncommented in your test file
# import numpy as np
# from thermaltime import ThermalTime, thermaltime


def test_1d_base_case():
    # 10 degrees every day:
    X_test = np.ones(365) * 10
    assert predict_thermaltime(X_test) == 50


def test_late_growing_season():
    # If the growing season starts later, the spring onset is later as well.
    X_test = np.ones(365) * 10
    assert predict_thermaltime(X_test, t1=11) == 60


def test_higher_threshold():
    # If the total accumulated forcing required is higher, spring onset is later.
    X_test = np.ones(365) * 10
    assert predict_thermaltime(X_test, F=600) == 60


def test_exclude_cold_days():
    # If some days are below the minimum growing T, spring onset is later.
    X_test = np.ones(365) * 10
    X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 3
    assert predict_thermaltime(X_test) == 60


def test_lower_temperature_threshold():
    # If the minimum growing T is lower, fewer days are exluded. However, the
    # accumulated temperature rises more slowly.
    X_test = np.ones(365) * 10

    X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 5
    assert predict_thermaltime(X_test, T=2) == 55


def test_2d():
    # Should be able to predict for multiple samples at once
    X_test = np.ones((10, 365)) * 10
    expected = np.ones(10) * 50
    result = predict_thermaltime(X_test)
    assert np.all(result == expected)


# Note: The following lines are not needed in your test file. Pytest will
# automatically call all functions starting with "test_".
test_1d_base_case()
test_late_growing_season()
test_higher_threshold()
test_exclude_cold_days()
test_lower_temperature_threshold()
test_2d()

After you've copied the code to your files, you can run pytest again to check that all new tests pass.

Now that we're confident our new predict function works, the last thing we need to do is update the predict method on the class. Change it to look like this:

def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        X = self._validate_data(X)
        check_is_fitted(self, ["t1_", "T_", "F_"])

        return predict_thermaltime(X, self.t1_, self.T_, self.F_)

Implementing the `fit` method¶

Now that we can make predictions, we can also think about optimizing the parameters of the model.

Generate training data¶

Our aim is to minimize the difference between the predictions based on the training data X and the target data y. Let's first prepare some training data to test the method once we have it.

In [7]:

Copied!





# Prepare some training data. Take a base temperature of 10 degrees and add a
# random fluctuation on top of it. The corresponding spring onset should
# correlate with the random temperature fluctations.

temp_signal = np.random.randn(10, 365)
X_train = np.ones((10, 365)) * 10 + temp_signal * 10
y_train = np.ones(10) * 50 + temp_signal.mean(axis=1) * 100

# Check that the values are within somewhat realistic ranges
print(X_train.min(), X_train.max())
print(y_train.min(), y_train.max())
# Prepare some training data. Take a base temperature of 10 degrees and add a
# random fluctuation on top of it. The corresponding spring onset should
# correlate with the random temperature fluctations.

temp_signal = np.random.randn(10, 365)
X_train = np.ones((10, 365)) * 10 + temp_signal * 10
y_train = np.ones(10) * 50 + temp_signal.mean(axis=1) * 100

# Check that the values are within somewhat realistic ranges
print(X_train.min(), X_train.max())
print(y_train.min(), y_train.max())

-25.002365102085015 50.045377945651
44.55695961294945 57.3285262390998

Exploring algorithms¶

Scipy has a lot of optimization routines. Looking at the list of optimizers, it seems curve-fit has the exact signature we are looking for. However, if we try it, we will quickly find out that it is not fit for purpose.

In [8]:

Copied!





from scipy.optimize import curve_fit

initial_guess = [1, 5, 500]
lower_bounds = [-67, -25, 0]
upper_bounds = [298, 25, 1000]

curve_fit(
    predict_thermaltime,
    X_train,
    y_train,
    p0=initial_guess,
    bounds=(lower_bounds, upper_bounds),
)
from scipy.optimize import curve_fit

initial_guess = [1, 5, 500]
lower_bounds = [-67, -25, 0]
upper_bounds = [298, 25, 1000]

curve_fit(
    predict_thermaltime,
    X_train,
    y_train,
    p0=initial_guess,
    bounds=(lower_bounds, upper_bounds),
)

Out[8]:

(array([  1.00000001,   5.        , 500.        ]),
 array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]))

This gives terrible fits! Or rather, it doesn't seem to do much fitting at all. If you play around with the initial settings you'll see that most of the output is rubbish. That's probably because our prediction model is not a nice, continuous functional form. Also, curve-fit doesn't really like integer parameters such as the DOY (though our implementation doesn't really care either).

Clearly, we need to look further.

Reformulating the problem¶

Alternatively, following pyphenology, we can try the global optimizers instead. Looking at the documentation of the brute-force optimizer, it seems we need to reformulate our problem a little bit. First, we need to define a loss function, i.e. the function that scipy can minimize.

Intuitively, our loss function could look something like this:

In [9]:

Copied!





def rmse_loss(X, y, t1: int = 1, T: int = 5, F: int = 500):
    y_pred = predict_thermaltime(X, t1=t1, T=T, F=F)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


# Try it with the default parameters:
rmse_loss(X_train, y_train)
def rmse_loss(X, y, t1: int = 1, T: int = 5, F: int = 500):
    y_pred = predict_thermaltime(X, t1=t1, T=T, F=F)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


# Try it with the default parameters:
rmse_loss(X_train, y_train)

Out[9]:

5.6190548414478645

However, this is not consistent with the form expected by scipy's global optimizers: they want something of the form rmse_loss(x, *args). So we need to reformulate our loss function in terms of x and args.

Nota bene: where the brute-force example finds the minimum of the function in the x-y space given fixed parameters, we want to do the opposite: we want to find the minimum in the parameter space, given the (fixed for each training batch) X and y. So, x is a tuple of our models params, *args corresponds to X and y, and our loss function becomes something like this:

In [10]:

Copied!





def rmse_loss(params, X, y):
    """RMSE loss between thermaltime predictions and observations.

    params is a tuple with the parameters to thermaltime: (t1, T, F)
    """
    y_pred = predict_thermaltime(X, *params)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


# Try it with the default parameters:
rmse_loss([1, 5, 500], X_train, y_train)
def rmse_loss(params, X, y):
    """RMSE loss between thermaltime predictions and observations.

    params is a tuple with the parameters to thermaltime: (t1, T, F)
    """
    y_pred = predict_thermaltime(X, *params)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


# Try it with the default parameters:
rmse_loss([1, 5, 500], X_train, y_train)

Out[10]:

5.6190548414478645

While the above is good enough, we can stil tweak our loss function a bit further while we're at it. Currently, our loss has the predict_thermaltime function hardcoded. It would be much nicer if this was more flexible, so it could work for any (roughly similar) model, for example, all the pyphenology models.

One way to do this is to pass the predictor function as one of the arguments.

In [11]:

Copied!





def rmse_loss(params, f, X, y):
    """RMSE loss between thermaltime predictions and observations.

    f is a prediction model of the form f(X, *params)
    params is a tuple with the parameters to f
    """
    y_pred = f(X, *params)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


# Try it with the default parameters:
rmse_loss([1, 5, 500], predict_thermaltime, X_train, y_train)
def rmse_loss(params, f, X, y):
    """RMSE loss between thermaltime predictions and observations.

    f is a prediction model of the form f(X, *params)
    params is a tuple with the parameters to f
    """
    y_pred = f(X, *params)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


# Try it with the default parameters:
rmse_loss([1, 5, 500], predict_thermaltime, X_train, y_train)

Out[11]:

5.6190548414478645

Brute force optimization¶

Now we're ready to run the brute force optimization.

In [12]:

Copied!

from scipy.optimize import brute

ranges = ((-67, 298), (-25, 25), (0, 1000))
args = (predict_thermaltime, X_train, y_train)

brute(rmse_loss, ranges=ranges, args=args)
from scipy.optimize import brute

ranges = ((-67, 298), (-25, 25), (0, 1000))
args = (predict_thermaltime, X_train, y_train)

brute(rmse_loss, ranges=ranges, args=args)

Out[12]:

array([48.26315789, 14.47368421,  0.        ])

That works! If you look carefully at the results, you will see that the optimization algorithm finds a clever solution to the problem: it sets the total forcing required to 0, and the first day of the growing season very close to the typical spring onset. While this is a perfectly valid solution, especially considering the fake training data we used, it illustrates that you need to be really careful in setting the expected bounds of the parameters, and in interpreting the results.

Another thing to note is how scipy creates the search space. The brute function has an additional parameter Ns that determines the number of 'grid points' used for each of the parameters. The default is 20: the ranges we supplied for t1, T, and F are divided in 20 points, such that the algorithm executes our loss function 20 * 20 * 20 times. By increasing the number of points, we get a better estimate, but the number of function calls and therefore the training time increases exponentially.

There are two ways around this:

Set the points yourself
Use a more clever algorithm

The first solution is quite simple. We can also define the ranges as slices with a custom step:

In [13]:

Copied!

ranges = (slice(-67, 298, 10), slice(-25, 25, 2), slice(0, 1000, 50))

brute(rmse_loss, ranges=ranges, args=args)
ranges = (slice(-67, 298, 10), slice(-25, 25, 2), slice(0, 1000, 50))

brute(rmse_loss, ranges=ranges, args=args)

Out[13]:

array([ 43., -13., 105.])

Basin hopping¶

For the second solution, again, we follow the pyphenology developers. They support "differential evolution" and "basin_hopping". Let's try the basin hopping algorithm. This algorithm works in two steps, delegating calls to our loss function to a local minimization function. Thus, we need to pass in our bounds and args as minimzer_kwargs instead. We must also pick a method for the local minizer. For consistency with pyphenology, we choose 'LL-BFGS-B'.

In [14]:

Copied!





from scipy.optimize import basinhopping

initial_guess = (1, 5, 500)
minimizer_args = {
    "method": "L-BFGS-B",
    "args": (predict_thermaltime, X_train, y_train),
    "bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
basinhopping(rmse_loss, x0=initial_guess, minimizer_kwargs=minimizer_args)
from scipy.optimize import basinhopping

initial_guess = (1, 5, 500)
minimizer_args = {
    "method": "L-BFGS-B",
    "args": (predict_thermaltime, X_train, y_train),
    "bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
basinhopping(rmse_loss, x0=initial_guess, minimizer_kwargs=minimizer_args)

Out[14]:

                    message: ['requested number of basinhopping iterations completed successfully']
                    success: True
                        fun: 5.2379373393649695
                          x: [ 3.063e+00  2.137e+00  5.034e+02]
                        nit: 100
      minimization_failures: 0
                       nfev: 472
                       njev: 118
 lowest_optimization_result:  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
                              success: True
                               status: 0
                                  fun: 5.2379373393649695
                                    x: [ 3.063e+00  2.137e+00  5.034e+02]
                                  nit: 0
                                  jac: [ 0.000e+00  0.000e+00  0.000e+00]
                                 nfev: 4
                                 njev: 1
                             hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>

This also works, but notice that the result is still very close to the initial guess. If we change the initial guess to something that's more like the clever solution of the brute method, we find another fit much smaller RMSE value:

In [15]:

Copied!





initial_guess = (45, 15, 0)
minimizer_args = {
    "method": "L-BFGS-B",
    "args": (predict_thermaltime, X_train, y_train),
    "bounds": ((-67, 298), (-25, 25), (0, 1000)),
}

bh = basinhopping(rmse_loss, x0=initial_guess, minimizer_kwargs=minimizer_args)
print(f"params: {bh.x}, minimum: {bh.fun}")
initial_guess = (45, 15, 0)
minimizer_args = {
    "method": "L-BFGS-B",
    "args": (predict_thermaltime, X_train, y_train),
    "bounds": ((-67, 298), (-25, 25), (0, 1000)),
}

bh = basinhopping(rmse_loss, x0=initial_guess, minimizer_kwargs=minimizer_args)
print(f"params: {bh.x}, minimum: {bh.fun}")

params: [48.16010104 13.83079594  0.77096103], minimum: 3.2652253589931624

Notice that we can pass additional arguments to the basinhopping algorithm to tweak its performance. For example, the default of 100 iterations is quite small. Using the settings from pyphenology, we get a result that looks much more realistic (and takes much longer to train):

In [16]:

Copied!





from scipy.optimize import basinhopping

initial_guess = (1, 5, 500)
minimizer_args = {
    "method": "L-BFGS-B",
    "args": (predict_thermaltime, X_train, y_train),
    "bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
optimizer_params = {"niter": 50000, "T": 0.5, "stepsize": 0.5, "disp": False}
basinhopping(
    rmse_loss, x0=initial_guess, **optimizer_params, minimizer_kwargs=minimizer_args
)
from scipy.optimize import basinhopping

initial_guess = (1, 5, 500)
minimizer_args = {
    "method": "L-BFGS-B",
    "args": (predict_thermaltime, X_train, y_train),
    "bounds": ((-67, 298), (-25, 25), (0, 1000)),
}
optimizer_params = {"niter": 50000, "T": 0.5, "stepsize": 0.5, "disp": False}
basinhopping(
    rmse_loss, x0=initial_guess, **optimizer_params, minimizer_kwargs=minimizer_args
)

Out[16]:

                    message: ['requested number of basinhopping iterations completed successfully']
                    success: True
                        fun: 2.893246475688441
                          x: [ 4.712e+01  1.243e+01  3.338e+01]
                        nit: 50000
      minimization_failures: 0
                       nfev: 200072
                       njev: 50018
 lowest_optimization_result:  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
                              success: True
                               status: 0
                                  fun: 2.893246475688441
                                    x: [ 4.712e+01  1.243e+01  3.338e+01]
                                  nit: 0
                                  jac: [ 0.000e+00  0.000e+00  0.000e+00]
                                 nfev: 4
                                 njev: 1
                             hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>

Great! Now we have a good grip on the different ways to fit our model. A final thing to point out is that the basinhopping algorithm returns floats for each parameter. This doesn't really make sense for the start of the growing season, so in the post-processing we should probably round that to the nearest int.

Bringing everything together¶

If we bring everything from above over to the fit method, we get something like this:

def fit(self, X: ArrayLike, y: ArrayLike):
        """Fit the model to the available observations.

        Parameters:
            X: 2D Array of shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and
                for each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            y: 1D Array of length n_samples
                Observed DOY of the spring onset for each unique site/year.

        Returns:
            Fitted model
        """
        X, y = self._validate_data(X, y)
        # TODO: check additional assumptions about input

        initial_guess = (1, 5, 500)
        bounds = ((-67, 298), (-25, 25), (0, 1000))

        minimizer_args = {
            "method": "L-BFGS-B",
            "args": (predict_thermaltime, X, y),
            "bounds": bounds,
        }
        optimizer_params = {"niter": 50000, "T": 0.5, "stepsize": 0.5, "disp": False}

        bh = basinhopping(
            rmse_loss,
            x0=initial_guess,
            minimizer_kwargs=minimizer_args,
            **optimizer_params,
        )

        self.t1_, self.T_, self.F_ = bh.x

        return self

This is not yet ready:

Initial guess and bounds are hardcoded and closely related to the equally hardcoded thermaltime and the loss function. If we want to reuse this for other physics-based models, it would be nice to try and separate these a bit better.
The optimizer_params are hardcoded as well. These are typical (hyper) parameters that can be passed to the __init__ method of our class, since these settings are not data-dependent.
Currently the only optimization routine supported is the basinhopping algorithm.

Putting it to the test¶

Below we pasted a slighly more polished version, as a complete code. We generalized the loss function a bit better, and the core_model-related parameters are grouped together at the top. The fit method is mostly concerned with formulating the call to the basinhopping algorithm. We used the staticmethod decorator to make the thermaltime function a method on the class. Notice that, in contrast with standard class methods, it doesn't need access to self.

In [17]:

Copied!





import numpy as np
from sklearn.base import (
    BaseEstimator,
    RegressorMixin,
    check_is_fitted,
)
from numpy.typing import ArrayLike
from scipy.optimize import basinhopping


def rmse_loss(params, f, X, y):
    """RMSE loss between thermaltime predictions and observations.

    f is a prediction model of the form f(X, *params)
    params is a tuple with the parameters to f
    """
    y_pred = f(X, *params)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


LOSS_FUNCTIONS = {
    "RMSE": rmse_loss,
}


class ThermalTime(RegressorMixin, BaseEstimator):
    """Thermal Time Model

    The classic growing degree day model using a fixed temperature threshold
    above which forcing accumulates.

    This implementation uses scipy's basinhopper optimization algorithm for
    fitting the model.
    """

    _core_params_names = ("t1", "T", "F")
    _core_params_defaults = (1, 5, 500)
    _core_params_bounds = ((-67, 298), (-25, 25), (0, 1000))

    def __init__(
        self,
        niter=50000,
        T=0.5,
        stepsize=0.5,
        disp=False,
        minimizer_method="L-BFGS-B",
        loss_function="RMSE",
    ):
        self.niter = niter
        self.T = T
        self.stepsize = stepsize
        self.disp = disp
        self.minimizer_method = minimizer_method
        self.loss_function = loss_function

    @staticmethod
    def _core_model(X, t1: int = 1, T: int = 5, F: int = 500):
        """Make prediction with the thermaltime model.

        Args:
            X: array-like, shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and for
                each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            t1: The DOY at which forcing accumulating beings (should be within [-67,298])
            T: The threshold above which forcing accumulates (should be within [-25,25])
            F: The total forcing units required (should be within [0,1000])
        """
        # This allows us to pass both 1D and 2D arrays of temperature
        # Copying X to safely modify it later on (may not be necessary, but readable)
        X_2d = np.atleast_2d(np.copy(X))

        # DOY starts at 1, where python array index start at 0
        # TODO: Make this an optional argument?
        doy = np.arange(X_2d.shape[1]) + 1

        # Exclude days before the start of the growing season
        X_2d[:, doy < t1] = 0

        # Exclude days with temperature below threshold
        X_2d[X_2d < T] = 0

        # Accumulate remaining data
        S = np.cumsum(X_2d, axis=-1)

        # Find first entry that exceeds the total forcing units required.
        doy = np.argmax(S > F, axis=-1)

        return doy

    # TODO: consider adding DOY index series to fit/predict as optional argument
    def fit(self, X: ArrayLike, y: ArrayLike):
        """Fit the model to the available observations.

        Parameters:
            X: 2D Array of shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and
                for each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            y: 1D Array of length n_samples
                Observed DOY of the spring onset for each unique site/year.

        Returns:
            Fitted model
        """
        X, y = self._validate_data(X, y)
        # TODO: check additional assumptions about input

        loss_function = LOSS_FUNCTIONS[self.loss_function]

        # Perform the fit
        bh = basinhopping(
            loss_function,
            x0=self._core_params_defaults,
            niter=self.niter,
            T=self.T,
            stepsize=self.stepsize,
            disp=self.disp,
            minimizer_kwargs={
                "method": self.minimizer_method,
                "args": (self._core_model, X, y),
                "bounds": self._core_params_bounds,
            },
        )

        # Store the fitted parameters
        self.core_params_ = bh.x

        return self

    def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        X = self._validate_data(X)
        check_is_fitted(self, "core_params_")

        return self._core_model(X, *self.core_params_)
import numpy as np
from sklearn.base import (
    BaseEstimator,
    RegressorMixin,
    check_is_fitted,
)
from numpy.typing import ArrayLike
from scipy.optimize import basinhopping


def rmse_loss(params, f, X, y):
    """RMSE loss between thermaltime predictions and observations.

    f is a prediction model of the form f(X, *params)
    params is a tuple with the parameters to f
    """
    y_pred = f(X, *params)
    sq_err = (y_pred - y) ** 2
    return np.mean(sq_err) ** 0.5


LOSS_FUNCTIONS = {
    "RMSE": rmse_loss,
}


class ThermalTime(RegressorMixin, BaseEstimator):
    """Thermal Time Model

    The classic growing degree day model using a fixed temperature threshold
    above which forcing accumulates.

    This implementation uses scipy's basinhopper optimization algorithm for
    fitting the model.
    """

    _core_params_names = ("t1", "T", "F")
    _core_params_defaults = (1, 5, 500)
    _core_params_bounds = ((-67, 298), (-25, 25), (0, 1000))

    def __init__(
        self,
        niter=50000,
        T=0.5,
        stepsize=0.5,
        disp=False,
        minimizer_method="L-BFGS-B",
        loss_function="RMSE",
    ):
        self.niter = niter
        self.T = T
        self.stepsize = stepsize
        self.disp = disp
        self.minimizer_method = minimizer_method
        self.loss_function = loss_function

    @staticmethod
    def _core_model(X, t1: int = 1, T: int = 5, F: int = 500):
        """Make prediction with the thermaltime model.

        Args:
            X: array-like, shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and for
                each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            t1: The DOY at which forcing accumulating beings (should be within [-67,298])
            T: The threshold above which forcing accumulates (should be within [-25,25])
            F: The total forcing units required (should be within [0,1000])
        """
        # This allows us to pass both 1D and 2D arrays of temperature
        # Copying X to safely modify it later on (may not be necessary, but readable)
        X_2d = np.atleast_2d(np.copy(X))

        # DOY starts at 1, where python array index start at 0
        # TODO: Make this an optional argument?
        doy = np.arange(X_2d.shape[1]) + 1

        # Exclude days before the start of the growing season
        X_2d[:, doy < t1] = 0

        # Exclude days with temperature below threshold
        X_2d[X_2d < T] = 0

        # Accumulate remaining data
        S = np.cumsum(X_2d, axis=-1)

        # Find first entry that exceeds the total forcing units required.
        doy = np.argmax(S > F, axis=-1)

        return doy

    # TODO: consider adding DOY index series to fit/predict as optional argument
    def fit(self, X: ArrayLike, y: ArrayLike):
        """Fit the model to the available observations.

        Parameters:
            X: 2D Array of shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and
                for each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            y: 1D Array of length n_samples
                Observed DOY of the spring onset for each unique site/year.

        Returns:
            Fitted model
        """
        X, y = self._validate_data(X, y)
        # TODO: check additional assumptions about input

        loss_function = LOSS_FUNCTIONS[self.loss_function]

        # Perform the fit
        bh = basinhopping(
            loss_function,
            x0=self._core_params_defaults,
            niter=self.niter,
            T=self.T,
            stepsize=self.stepsize,
            disp=self.disp,
            minimizer_kwargs={
                "method": self.minimizer_method,
                "args": (self._core_model, X, y),
                "bounds": self._core_params_bounds,
            },
        )

        # Store the fitted parameters
        self.core_params_ = bh.x

        return self

    def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        X = self._validate_data(X)
        check_is_fitted(self, "core_params_")

        return self._core_model(X, *self.core_params_)

While still not perfect, this implementation is good enough to see if we can use it in our pycaret framework. For that, we first need to combine our data in a pandas dataframe.

In [18]:

Copied!

import pandas as pd

df = pd.DataFrame(X_train)
df["y"] = y_train
df.head()
import pandas as pd

df = pd.DataFrame(X_train)
df["y"] = y_train
df.head()

Out[18]:

	0	1	2	3	4	5	6	7	8	9	...	356	357	358	359	360	361	362	363	364	y
0	15.218994	6.738299	13.954624	29.450302	-3.373456	17.349404	11.489199	-0.946024	9.447005	14.721903	...	17.564457	8.209593	10.716687	2.374732	22.605012	2.639938	25.742186	9.589601	-4.323429	48.583246
1	22.773274	19.317239	-0.591967	13.633791	4.640599	2.588708	11.031384	10.298080	-1.175578	-20.222365	...	-1.453457	7.593247	4.730469	11.483520	14.818011	25.799254	15.006352	3.442988	-3.945578	50.461037
2	16.166114	6.776026	3.693059	8.057823	5.018413	23.824257	0.922434	-1.134289	17.514239	10.534730	...	-1.523276	15.888692	-3.116856	3.443013	4.458324	12.345812	8.165304	21.404692	5.190300	52.001860
3	-4.454030	12.983264	18.690565	14.266377	13.291256	9.783686	18.153094	15.724669	-13.733181	7.353611	...	11.454641	11.915009	10.903428	8.284341	0.811677	9.052229	7.078002	0.597021	10.312079	57.328526
4	11.336732	5.656002	9.542669	15.932466	22.050319	5.539453	10.183360	4.398088	4.344155	10.000675	...	6.672839	29.814395	8.529970	1.693037	1.629311	0.112947	12.897052	24.386493	3.704314	44.556960

5 rows × 366 columns

In [19]:

Copied!

from pycaret.regression import RegressionExperiment

exp = RegressionExperiment()
exp.setup(data=df, target="y")
exp.compare_models(["lr", "rf", "dummy", ThermalTime()], cross_validation=False)
from pycaret.regression import RegressionExperiment

exp = RegressionExperiment()
exp.setup(data=df, target="y")
exp.compare_models(["lr", "rf", "dummy", ThermalTime()], cross_validation=False)

	Description	Value
0	Session id	8334
1	Target	y
2	Target type	Regression
3	Original data shape	(10, 366)
4	Transformed data shape	(10, 366)
5	Transformed train set shape	(7, 366)
6	Transformed test set shape	(3, 366)
7	Numeric features	365
8	Preprocess	True
9	Imputation type	simple
10	Numeric imputation	mean
11	Categorical imputation	mode
12	Fold Generator	KFold
13	Fold Number	10
14	CPU Jobs	-1
15	Use GPU	False
16	Log Experiment	False
17	Experiment Name	reg-default-name
18	USI	c9d5

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
3	ThermalTime	3.7702	20.0271	4.4752	0.2647	0.0856	0.0736	64.4000
1	Random Forest Regressor	4.1401	25.6182	5.0614	0.0594	0.0977	0.0818	0.6500
0	Linear Regression	4.2148	25.9996	5.0990	0.0454	0.0977	0.0823	0.0600
2	Dummy Regressor	4.4073	27.8321	5.2756	-0.0218	0.1014	0.0863	0.0500

Out[19]:

ThermalTime()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Unfortunately, but not surprisingly, our new model is not very skillful. If you execute this notebook several times, scores will vary from pretty okay to very bad. But this is to be expected of a training set of 10 samples based on some vaguely invented data...

Towards a generic implementation¶

We hope this tutorial has provided a good basis for developing this further. There's still a number of things that could be improved further. For example, we mentioned the option to pass in additional arguments to the fit method based, e.g. to use a deviating DOY index. If you look at the code in springtime.models, you will see that we have generalized the code a bit further. To add a new model, as long as it's similar enough to the thermaltime model, you only need to implement the analogue of the thermaltime_predict method.

Compared to pyphenology, an important difference is that we based our main class off the correpsonding scipy estimator, and you can pass in different phenology models. By contrast, in pyphenology, there is one main class for each phenology model, while you can specify the optimization algorithm. There is something to say for both choices; we feel the present implementation aligns a bit better with the way in which sklearn defines hyperparameters of a model. Thus, you can now use this model in springtime by importing it like so:

In [20]:

Copied!

from springtime.models.basinhopper import BasinHopper

model = BasinHopper(core_model="thermaltime", loss_function="RMSE")
model.fit(X_train, y_train)
# model.predict(X_new)
from springtime.models.basinhopper import BasinHopper

model = BasinHopper(core_model="thermaltime", loss_function="RMSE")
model.fit(X_train, y_train)
# model.predict(X_new)

Out[20]:

BasinHopper()