In this tutorial we'll use Stan, via CmdStanPy, to perform Bayesian linear regression. Visit the CmdStanPy GitHub page for installation instructions.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from cmdstanpy import CmdStanModel
from sklearn.datasets import load_boston
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

The dataset we are using comes from Lesson 1 of Andrew Ng's Coursera course on Machine Learning. I've chosen it because it has only one feature, so we can easily visualize what is happening. You can find the contents here.

df = pd.read_csv("ex1data1.txt", header=None, names=["population", "profit"])
df.head()
population profit
0 6.1101 17.5920
1 5.5277 9.1302
2 8.5186 13.6620
3 7.0032 11.8540
4 5.8598 6.8233
df.plot('population', 'profit', lw=0, marker='o', legend=False)

Let's do the typical train/test split of the data, with population being our feature, and profit our target.

X_train, X_test, y_train, y_test = train_test_split(df["population"], df["profit"], random_state=0)

Data dictionary and model file

Now we need to create a data dictionary for Stan. The keys in the data dictionary correspond to the variables declared in the model file I have written, model.stan. Here are the contents of that file, which will be loaded and compiled later in this notebook. I'm generally following this example in the Stan User's Guide.

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[N] y;
    int<lower=0> N_new;
    matrix[N_new, K] x_new;
}
parameters {
    vector[K] beta;
    real<lower=0> sigma;
}
model {
    y ~ normal(x*beta, sigma);
}
generated quantities {
    vector[N] y_train_pred;
    vector[N_new] y_new;
    for (n in 1:N) {
        y_train_pred[n] = normal_rng(x[n]*beta, sigma);
    }
    for (n in 1:N_new) {
        y_new[n] = normal_rng(x_new[n]*beta, sigma);
    }
}
N_train = X_train.values.shape[0]
N_test = X_test.values.shape[0]
X_train = np.hstack((np.ones((N_train,1)), X_train.values.reshape(-1,1)))
X_test = np.hstack((np.ones((N_test,1)), X_test.values.reshape(-1,1)))
y_train = y_train.values
K = 2
data_dict = {"N": N_train, "K": K, "x": X_train, "y": y_train, "N_new": N_test, "x_new": X_test}

Here is the meaning of each variable and parameter:

  • N = Number of data points in the training set.
  • K = Number of columns in our data. We have two, since we add an intercept column.
  • x = N x K matrix. The training data.
  • y = N-sized vector. The target variable.
  • N_new = Number of data points in the test set.
  • x_new = N_new x K matrix. The test data.
  • beta = Parameters to be learned.
  • sigma = Parameter to be learned: the standard deviation of the normal distribution.

What about the blocks in the Stan file? You will want to check out the reference here. In short:

The data block is where variables that are read in as data are declared.

The parameters block is where variables that are sampled by Stan's MCMC algorithm are declared.

The model block defines our model. Here we use sampling notation to indicate that our target variable, y, is distributed according to the right-hand side: in this case, normally, with mean X$\beta$ and standard deviation $\sigma$. We use this distribution because our major assumption is that the errors are normally distributed with mean zero and standard deviation $\sigma$; the model statement is a direct consequence of that assumption, which again is described in more detail here.
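To make that assumption concrete, here is a small simulation showing that y ~ Normal(X$\beta$, $\sigma$) is the same as adding normal errors to X$\beta$. The values of beta and sigma here are made up for illustration, not the ones fit below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, for illustration only
N, K = 100_000, 2
X = np.hstack((np.ones((N, 1)), rng.uniform(5, 25, size=(N, 1))))  # intercept + feature
beta = np.array([-4.0, 1.2])
sigma = 3.0

# y = X @ beta + eps with eps ~ Normal(0, sigma), equivalently y ~ Normal(X @ beta, sigma)
y = X @ beta + rng.normal(0.0, sigma, size=N)

# The residuals around the mean X @ beta should have standard deviation close to sigma
resid = y - X @ beta
print(resid.std())  # close to 3.0
```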

The generated quantities block does not affect the learned parameters. We use it here to make predictions on the training set (in order to compute the R$^{2}$ metric on it) as well as on the test set for evaluation.

Next we load and compile the Stan file as described above.

model = CmdStanModel(stan_file="model.stan")
INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:compiled model file: /home/wes/Documents/data-science/bayesian-regression/model

Sampling

Now that we have compiled the Stan code, we can do the sampling. The sample() method performs MCMC sampling, following this process for each chain:

  1. Draw a value from the auxiliary momentum ($\rho$) distribution.
  2. Use the leapfrog integrator to update $\rho$ and the parameters $\theta$.
  3. Measure the Hamiltonian of the new state.
  4. Use a Metropolis-Hastings step to compare the new Hamiltonian with the previous one, and accept or reject the proposal.
  5. Save the parameters $\theta$. If accepted, the new state becomes the current state, and the process repeats.

For more details, see this section of the Stan reference manual.
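As a rough sketch of those five steps (not Stan's actual implementation — Stan uses the NUTS variant with an adapted step size and mass matrix), here is a minimal HMC transition for a standard normal target. All names and tuning values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def U(theta):          # negative log density of a standard normal target
    return 0.5 * theta ** 2

def grad_U(theta):
    return theta

def hmc_step(theta, step_size=0.1, n_leapfrog=20):
    rho = rng.normal()                                # 1. draw auxiliary momentum
    theta_new, rho_new = theta, rho
    # 2. leapfrog updates of momentum and position
    rho_new -= 0.5 * step_size * grad_U(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * rho_new
        rho_new -= step_size * grad_U(theta_new)
    theta_new += step_size * rho_new
    rho_new -= 0.5 * step_size * grad_U(theta_new)
    # 3. Hamiltonian of the previous and proposed states
    H_old = U(theta) + 0.5 * rho ** 2
    H_new = U(theta_new) + 0.5 * rho_new ** 2
    # 4. Metropolis accept/reject on the change in Hamiltonian
    if rng.random() < np.exp(H_old - H_new):
        return theta_new                              # 5. accepted state becomes current
    return theta

theta, draws = 0.0, []
for _ in range(2000):
    theta = hmc_step(theta)
    draws.append(theta)
# draws should look like samples from a standard normal (mean ~0, std ~1)
```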

fit = model.sample(data=data_dict)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4

Results

The following gives a summary of each parameter and of each sample in the generated quantities.

df_summary = fit.summary()
df_summary.head(10)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ -114.13800 0.033104 1.233760 -116.576000 -113.80600 -112.83200 1388.96 1834.80 1.000320
beta[1] -4.06342 0.020626 0.827210 -5.422190 -4.06876 -2.64867 1608.43 2124.71 1.000920
beta[2] 1.20494 0.002254 0.089512 1.054030 1.20780 1.34983 1577.67 2084.08 1.001430
sigma 3.04292 0.005602 0.258644 2.656110 3.02426 3.49347 2131.48 2815.65 1.001030
y_train_pred[1] 9.97330 0.048726 3.076020 4.879570 9.95605 15.03880 3985.34 5264.57 0.999336
y_train_pred[2] 4.91241 0.049360 3.056540 -0.181923 4.98684 9.75746 3834.55 5065.38 0.999112
y_train_pred[3] 2.54711 0.047909 3.034660 -2.349200 2.58090 7.57581 4012.16 5300.00 0.999527
y_train_pred[4] 2.80746 0.049370 3.090090 -2.509290 2.82996 7.85182 3917.56 5175.03 0.999559
y_train_pred[5] 3.86753 0.048720 3.072960 -1.087240 3.76285 8.91217 3978.38 5255.37 1.001370
y_train_pred[6] 6.61481 0.048569 3.100650 1.529320 6.58784 11.64820 4075.50 5383.66 1.000040

Here are all the samples from all the chains:

df_samples = fit.get_drawset()
df_samples.head()
lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__ beta.1 beta.2 sigma ... y_new.16 y_new.17 y_new.18 y_new.19 y_new.20 y_new.21 y_new.22 y_new.23 y_new.24 y_new.25
0 -114.345 0.940039 0.268831 3.0 7.0 0.0 115.687 -4.87938 1.23486 2.80795 ... 4.038810 0.820634 1.564890 0.89525 3.42023 0.159389 22.1371 0.955606 9.501570 6.104850
1 -114.143 0.993854 0.268831 2.0 3.0 0.0 114.673 -4.33896 1.29027 2.83174 ... -0.155484 0.652295 1.579160 3.95726 4.14529 -0.440211 21.8035 10.706000 11.092300 1.476410
2 -114.119 0.993898 0.268831 4.0 15.0 0.0 116.779 -4.20318 1.18980 2.63140 ... 2.484450 1.607140 4.684500 5.79743 4.11147 4.977450 21.1929 0.983426 12.367600 7.188790
3 -113.834 0.900557 0.268831 3.0 11.0 0.0 115.391 -3.66850 1.11506 3.13985 ... 2.165080 -0.119275 0.539258 2.41229 4.41436 1.592670 21.3086 8.015130 0.810029 0.892391
4 -114.232 0.941207 0.268831 3.0 15.0 0.0 115.936 -4.54294 1.31403 2.90515 ... -1.166400 4.986770 1.633330 5.03816 5.23080 3.522200 20.4852 -1.422330 12.302300 8.824610

5 rows × 107 columns

As expected, each parameter's posterior distribution is approximately normal. The dashed lines in each trace plot separate the individual chains that were sampled.

def plot_dist(df, param):
    _, ax = plt.subplots(1, 2, figsize=(12,4))
    df[param].plot.density(ax=ax[0])
    df[param].plot(ax=ax[1])
    for i in [1000, 2000, 3000]:
        ax[1].axvline(i, color="black", ls="--")
    plt.tight_layout()
plot_dist(df_samples, "beta.1")
plot_dist(df_samples, "beta.2")
plot_dist(df_samples, "sigma")

Model evaluation

The model is evaluated by comparing the means of the posterior predictive distribution with the actual values.

y_test_pred = df_summary.loc["y_new[1]":,"Mean"].values
y_train_pred = df_summary.loc["y_train_pred[1]":"y_train_pred[72]","Mean"].values

Here is the training set R$^{2}$:

r2_score(y_train, y_train_pred)
0.7328456670720971

Here is the test set R$^{2}$:

r2_score(y_test, y_test_pred)
0.5840665444879212

Inference: generated quantities in same file

We've already made predictions on both the training and test sets using the generated quantities block in our Stan code.

y_train_pred = df_samples.loc[:,"y_train_pred.1":"y_train_pred.72"]
y_test_pred = df_samples.loc[:, "y_new.1":]

Note that we actually have 4,000 predicted values for each example in our dataset. From the Bayesian point of view, one repeatedly samples from a normal distribution with mean X$\beta$ and standard deviation $\sigma$ (where $\beta$ and $\sigma$ were learned during MCMC sampling) and then takes the mean of those samples as the prediction. This is the posterior predictive mean.
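To make that concrete, here is a small NumPy sketch using stand-in draws (the means and scale below are hypothetical, not taken from the fit above); the posterior predictive mean is just the column-wise average over the 4,000 samples for each point:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the sampler output: 4000 draws for each of 3 hypothetical test
# points, each drawn around a different predictive mean with scale 3
means = np.array([5.0, 10.0, 15.0])
draws = rng.normal(means, 3.0, size=(4000, 3))

# Posterior predictive mean: average over the draws (axis 0), one value per point
post_pred_mean = draws.mean(axis=0)
print(post_pred_mean)  # close to [5., 10., 15.]
```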

y_test_pred
y_new.1 y_new.2 y_new.3 y_new.4 y_new.5 y_new.6 y_new.7 y_new.8 y_new.9 y_new.10 ... y_new.16 y_new.17 y_new.18 y_new.19 y_new.20 y_new.21 y_new.22 y_new.23 y_new.24 y_new.25
0 6.08919 0.831190 2.771020 1.19964 2.946870 0.176526 2.269890 12.43920 -0.387497 6.639150 ... 4.038810 0.820634 1.564890 0.895250 3.42023 0.159389 22.1371 0.955606 9.501570 6.104850
1 12.44440 5.118900 2.246190 2.82562 -0.079497 -0.518796 1.880710 7.30614 4.465200 5.113740 ... -0.155484 0.652295 1.579160 3.957260 4.14529 -0.440211 21.8035 10.706000 11.092300 1.476410
2 11.89770 3.838600 5.517290 1.51142 -0.398633 0.979285 7.150900 4.49293 2.036590 1.760600 ... 2.484450 1.607140 4.684500 5.797430 4.11147 4.977450 21.1929 0.983426 12.367600 7.188790
3 9.53034 -1.520570 7.847520 1.49295 0.917469 4.955140 2.428350 3.39252 5.642480 1.699600 ... 2.165080 -0.119275 0.539258 2.412290 4.41436 1.592670 21.3086 8.015130 0.810029 0.892391
4 13.55290 6.947870 -0.947383 3.79690 3.489320 0.367080 0.305908 12.57860 4.602460 3.596310 ... -1.166400 4.986770 1.633330 5.038160 5.23080 3.522200 20.4852 -1.422330 12.302300 8.824610
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3995 12.68620 1.036800 4.300930 4.01384 0.033404 4.478290 5.797940 5.23659 4.104620 5.144340 ... 1.530610 -2.970480 3.818540 -0.302308 9.37741 2.563720 15.6282 1.196290 3.683390 6.422640
3996 10.12300 5.759870 8.194520 -1.85087 1.570890 -3.516320 1.199860 12.75210 5.988360 8.680570 ... -0.882405 4.475370 -0.016239 5.493630 11.20450 2.397010 22.5335 5.389330 5.559980 5.270210
3997 13.25500 2.881080 5.618130 3.52824 2.178100 1.196510 -0.754878 13.79700 4.082110 0.358575 ... -0.214590 4.076230 4.240610 1.991630 3.35809 -0.671965 29.6522 2.818600 13.254700 -0.115525
3998 12.08630 -0.392227 -0.154641 4.51602 2.124500 0.875612 3.252300 9.75291 6.481030 5.653030 ... 3.016780 -1.002880 2.177480 6.925280 4.95899 1.920580 21.8189 4.591150 6.822760 0.935449
3999 7.96612 4.395740 -1.077040 0.24359 1.732280 10.293700 4.920210 9.94235 5.173920 10.439600 ... 8.242310 2.557110 1.761830 4.149350 3.29892 -4.806080 20.8138 6.282450 13.570600 8.343440

4000 rows × 25 columns

def plot_pred(X_train, X_test, y_train_pred, y_test_pred, y_test):
    test_mean = np.mean(y_test_pred, axis=0)
    plt.plot(X_test[:, 1:].ravel(), test_mean, lw=0, marker='o', color='C3', label="Test post. pred. mean")
    plt.plot(X_test[:, 1:].ravel(), y_test, lw=0, marker='o', color='C4', label="Test actual")

    sort_mask = np.argsort(X_train[:, 1:].ravel())
    y_upper = np.percentile(y_train_pred, 97.5, axis=0)[sort_mask]
    y_lower = np.percentile(y_train_pred, 2.5, axis=0)[sort_mask]
    plt.fill_between(X_train[:, 1:].ravel()[sort_mask], y_lower, y_upper, color="C1", alpha=0.1, label="Train post. pred. 95% confid.")

    plt.plot(X_train[:, 1:].ravel()[sort_mask], np.mean(y_train_pred, axis=0)[sort_mask], label="Train post. pred. mean", color="C1")
    plt.legend()
    plt.ylabel("Profit in $10,000s")
    plt.xlabel("Population of city in 10,000s")
    plt.show()

Here's the plot with the posterior predictive distribution's mean for each test point. The shaded area indicates the 95% interval. One could calculate this for every point predicted in the test set; in this case we computed it only for the series of training points and filled in the entire range to illustrate it.
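The interval itself is just a pair of percentiles over the draws, computed per point. Here is a minimal sketch on stand-in values (hypothetical means and scale, not the fit above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in draws: 4000 posterior predictive samples for 2 hypothetical test points
draws = rng.normal([5.0, 10.0], 3.0, size=(4000, 2))

# Central 95% interval per point: 2.5th and 97.5th percentiles over the draws,
# roughly mean +/- 1.96 * sigma for normal draws
lower = np.percentile(draws, 2.5, axis=0)
upper = np.percentile(draws, 97.5, axis=0)
```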

plot_pred(X_train, X_test, y_train_pred, y_test_pred, y_test)

Inference: generated quantities in separate file

In this method, we take the parameters learned during MCMC sampling and use them as data inputs for another Stan file, where we generate our new quantities. Here are the contents of the new file. This time beta and sigma are declared in the data block, since they are no longer parameters to be learned, and there is no model. The file's only purpose is to generate predictions on new data.

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[K] beta;
    real<lower=0> sigma;
}
parameters {
}
model {
}
generated quantities {
    vector[N] y;
    for (n in 1:N) {
        y[n] = normal_rng(x[n]*beta, sigma);
    }
}

Here are the learned parameters from our previous MCMC sampling:

beta, sigma = df_summary.loc["beta[1]":"beta[2]","Mean"].values, df_summary.loc["sigma", "Mean"]

Here is our data dictionary. We are making predictions on our test set.

data_dict = {"N": N_test, "K": K, "x": X_test, "beta": beta, "sigma": sigma}
predict = CmdStanModel(stan_file="predict.stan")
INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:compiled model file: /home/wes/Documents/data-science/bayesian-regression/predict

Now for the sampling. We're not doing any MCMC sampling here, though; we're just generating new quantities. Note that fixed_param is set to True.

predict_fit = predict.sample(data_dict, fixed_param=True)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
predict_fit.summary().head()
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ 0.00000 NaN 0.00000 0.000000 0.00000 0.00000 NaN NaN NaN
y[1] 11.21580 0.087441 2.96484 6.347790 11.08770 16.03550 1149.680 41859.8 0.999105
y[2] 3.72469 0.112697 3.13389 -1.419590 3.73300 8.73753 773.297 28155.7 0.999482
y[3] 6.21616 0.095375 3.00444 1.420470 6.28273 11.15670 992.327 36130.6 0.999321
y[4] 3.77119 0.094859 3.00454 -0.931101 3.70973 8.62259 1003.220 36527.3 1.001740
predict_fit.get_drawset().head()
lp__ accept_stat__ y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8 ... y.16 y.17 y.18 y.19 y.20 y.21 y.22 y.23 y.24 y.25
0 0.0 0.0 9.93461 0.795845 9.37373 4.86905 -1.936690 2.730590 6.71166 4.63234 ... 2.47424 -2.106800 4.328990 -4.09508 3.22460 0.928246 19.1007 2.54161 -0.72615 4.27538
1 0.0 0.0 9.78002 3.650580 2.46062 5.53808 0.517296 11.795200 3.18090 4.58967 ... 7.63182 -3.622690 6.968730 2.36065 3.72439 1.174960 19.6114 3.16476 6.78070 6.59602
2 0.0 0.0 16.65800 8.105250 10.15530 3.86963 3.501780 3.688170 4.22701 8.99956 ... -1.82703 3.957610 1.580840 1.65086 1.26456 5.537600 24.4481 1.57349 10.43740 3.81600
3 0.0 0.0 11.98580 1.890660 0.21170 4.94291 7.858740 0.943565 4.44094 13.05240 ... -3.27595 -0.240402 0.048488 -2.81462 6.59352 -1.051190 21.3794 6.60292 2.83685 4.78816
4 0.0 0.0 8.98131 8.780330 8.03854 -3.58606 5.412100 1.770760 2.06025 9.23470 ... 1.95720 5.884040 8.074480 4.45466 -1.19269 1.124830 23.8451 13.15670 2.52032 9.40728

5 rows × 27 columns

df_fit_samples = pd.DataFrame(predict_fit.sample[:,0,:], columns=predict_fit.column_names)
y_test_pred = df_fit_samples.loc[:, "y.1":]
plot_pred(X_train, X_test, y_train_pred, y_test_pred, y_test)

Inference: using numpy

Lastly, we don't actually need Stan to generate the posterior predictive distribution's mean. We can do it with NumPy by writing the function ourselves, which we do here.

y_test_pred = np.zeros((N_test, 1000))
for i in range(N_test):
    y_test_pred[i,:] = np.random.normal(np.matmul(X_test[i], beta), sigma, size=1000)

Note that we need to transpose our predictions to use the plot function from earlier.

plot_pred(X_train, X_test, y_train_pred, y_test_pred.transpose(), y_test)