A full-scope description of GPs is well beyond a short presentation. For a comprehensive introduction to the theory, I recommend Rasmussen & Williams (2006), which you can download freely. Instead, I will give a qualitative motivation for using GPs: when your data contain correlated noise, you want to model it so that you neither bias your parameter estimates nor underestimate the error bars. Other approaches to this problem include the wavelet-transform method of Carter & Winn (2009). Both methods have their advantages. For an implementation of the latter, please see Szilard, who uses it to great effect.
I will simulate some data that include Poisson (white) noise, as well as some residual systematics that are correlated between data points. The effect of this correlated noise is, by definition, impossible to characterise fully, but ignoring it will introduce significant biases into your parameter inferences.
As a base model, I choose a single Gaussian feature with three parameters: amplitude $\alpha$, centre $\ell$ and variance $\sigma^2$. You can think of this as an absorption line or an exoplanet transit :-)
#import the necessary python modules
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
plt.rc('text', usetex=True)
matplotlib.rc('font', family='serif')
import george
from george import kernels
#define the base model, a single Gaussian
def model(params, t):
    amp, loc, sig2 = params
    return amp * np.exp(-0.5 * (t - loc) ** 2 / sig2)

#define a function that will generate our synthetic data
def generate_data(params, N, rng=(-5, 5)):
    gp = george.GP(0.1 * kernels.ExpSquaredKernel(3.3)) #create red noise from the george.GP class (explained later)
    t = rng[0] + np.diff(rng) * np.sort(np.random.rand(N)) #create unevenly sampled x values
    y = gp.sample(t) #create correlated noise values by sampling the GP object
    y += model(params, t) #add values from the model
    yerr = 0.05 + 0.05 * np.random.rand(N) #create Poisson (white) noise with some arbitrary characteristics
    y += yerr * np.random.randn(N) #add white noise to the correlated-noise data
    return t, y, yerr
#now let's generate the data
np.random.seed(1234) #seed the generator in order to generate the same data every time the code runs
truth = [-1.0, 0.1, 0.4] #model parameters used to simulate data (=> true)
t, y, yerr = generate_data(truth, 50) #generate the data
rcParams['figure.figsize'] = 10, 7
plt.figure(1)
plt.errorbar(t, y, yerr=yerr, fmt=".k", capsize=0)
plt.ylabel(r"$y$",fontsize=20)
plt.xlabel(r"$t$",fontsize=20)
plt.xlim(-5, 5)
plt.title("synthetic data")
#plt.savefig('synthetic_data.pdf', dpi=150) #uncomment if you would like to save the plot
plt.show()
You will notice that I have drawn the red noise from a Gaussian process. Therefore, in what follows, I will use a different kernel to mitigate this fact. Please bear in mind that the model parameters used, and therefore our target values, are:
$$ \alpha = -1,~~~~~~ \ell = 0.1,~~~~~~ \sigma^2=0.4 $$Before starting the analysis, I would like to refresh my memory of the Bayesian framework in which we will be working. That is:
$$ {\tt ~P ( ~pars ~| ~data(y,t) , ~noise) \propto ~P(pars) \times ~P( ~y| ~t, ~noise, ~pars)} ~~~~~~~~~~~(1)$$Namely, our inference of the parameters is dictated by our prior knowledge of the parameters and the likelihood of the outcome (posterior probability = prior probability $\times$ likelihood function / marginal likelihood).
This is also written as: log posterior = log prior + log likelihood - log(marginal likelihood).
Now let's consider the standard case, in which we assume only white (Poisson) noise in our data. Working with the log-likelihood is standard practice, as it avoids numerical underflow and turns multiplication into addition. The log-likelihood function of the data, given the parameters $\theta$, is:
$$ \ln P\left({y_n}|{t_n},{\sigma_n^2},\theta\right) = -\frac{1}{2} \sum_{n=1}^N \frac{[y_n-f_{\theta}(t_n)]^2}{\sigma_n^2} + \text{const} ~~~~~~~~~~~(2)$$Looking at the data, it seems that there is a systematic trend, so we include it in the descriptive model to be fitted to the data. Such trends are quite common in transit light curves, due to airmass variations for instance. Therefore, our final analytical model is:
$$f_{\theta}(t) = mt + b + \alpha \exp\left(-\frac{[t-\ell]^2}{2\sigma^2}\right) ~~~~~~~~~~~(3)$$where $\theta$ is the 5-dimensional parameter vector $\{ m,b,\alpha,\ell,\sigma^2\}$.
#Define the likelihood function, which also includes a linear trend
def lnlike_ind(p, t, y, invar):
    m = model(p[2:], t) + p[0] * t + p[1]
    return -0.5 * np.sum((y - m) ** 2 * invar) #invar = 1/sigma**2
Now, to fit this model using MCMC, we also have to define our priors. Depending on your situation, you might want to define these based on previous knowledge of your parameters. Here, I will just assume simple box (uniform) priors.
#Priors of the base model
def lnprior_base(p):
    amp, loc, sig2 = p
    if not -10 < amp < 10:
        return -np.inf
    if not -5 < loc < 5:
        return -np.inf
    if not 0 < sig2 < 3.0:
        return -np.inf
    return 0.0

#add the priors of the linear trend to the prior function of the base model
def lnprior_ind(p):
    m, b = p[:2]
    if not -10 < m < 10:
        return -np.inf
    if not -10 < b < 10:
        return -np.inf
    return lnprior_base(p[2:])
We also need to define the posterior function:
#Define the posterior probability distribution
def lnprob_ind(p, t, y, invar):
    lp = lnprior_ind(p)
    if not np.isfinite(lp):
        return -np.inf
    return lp + lnlike_ind(p, t, y, invar) #note that I do not include the log(marg. likelihood)
Now that the model is implemented, we run the MCMC: we initialize the walkers and run both a burn-in and a production chain. In practice, you should run multiple independent chains and check their consistency and mutual convergence.
import emcee
initial = np.array([0, 0, -1.0, 0.1, 0.4]) #initial guess of the inferred parameters
ndim = len(initial)
nwalkers = 32 #number of walkers
data = (t, y, 1 / yerr**2) #the data go in a tuple
p0 = [np.array(initial) + 1e-8 * np.random.randn(ndim)
      for i in range(nwalkers)] #initialize the walkers
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob_ind, args=data)
#Settling-in phase
print("Running burn-in...")
p0, _, _ = sampler.run_mcmc(p0, 500)
#reset the sampler and run the production phase
sampler.reset()
print("Running production...")
sampler.run_mcmc(p0, 1000)
samples = sampler.flatchain
np.shape(samples)
# Plot the samples in data space. We draw 24 posterior samples randomly.
print("Making plots")
samples = sampler.flatchain #flatten the chain (stitch the chains of all 32 walkers, 1000 steps each, together)
rcParams['figure.figsize'] = 10, 7
plt.figure(2)
plt.ylabel(r"$y$",fontsize=15)
plt.xlabel(r"$t$",fontsize=15)
plt.xlim(-5, 5)
plt.errorbar(t, y, yerr=yerr, fmt=".k", capsize=0)
x = np.linspace(-5, 5, 500)
for s in samples[np.random.randint(len(samples), size=24)]:
    plt.plot(x, model(s[2:], x) + s[0]*x + s[1], color="#4682b4", alpha=0.3)
plt.title("Results assuming uncorrelated noise")
#plt.savefig("final_fit_WN.pdf", dpi=150)
#now we can look at the parameter evolution in the production chain
from matplotlib.ticker import MaxNLocator
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.clf()
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 6))
for i in range(3):
    axes[i].yaxis.label.set_size(20)
    axes[i].yaxis.set_major_locator(MaxNLocator(3))
    axes[i].plot(sampler.chain[:, :, i+2].T, color="k", alpha=0.4)
    axes[i].axhline(truth[i], color="#888888", lw=2)
axes[0].set_ylabel(r"$\alpha$")
axes[1].set_ylabel(r"$l$")
axes[2].set_ylabel(r"$\sigma^2$")
axes[2].set_xlabel("step number",fontsize=20)
fig.tight_layout(h_pad=0.0)
plt.show()
#And now the final results. From the analysis of the chain, we obtain our posterior distributions.
import corner
plt.figure()
plt.rcParams['axes.titlesize'] = 17
plt.rcParams['axes.labelsize'] = 17
labels = [r"$\alpha$", r"$\ell$", r"$\sigma^2$"]
fig = corner.corner(samples[:, 2:], bins=50,truths=truth,labels=labels,show_titles=True,
quantiles=(0.16,0.5,0.84))
#fig.savefig("posteriors.pdf", dpi=150)
Looking at the final fit, at first glance the result seems quite satisfactory. However, on closer inspection of the final posterior probability distributions, it is clear that our final result for the $\ell$ parameter is WRONG given the current error estimates.
Now instead of assuming that the noise is white, we generalise our likelihood function to include covariances between the data points (red noise).
(adapted from Neale Gibson's papers) A Gaussian process (GP) is a non-parametric method for regression, used extensively for regression and classification problems in the machine-learning community. A GP is formally defined as a collection of random variables $y$, any finite number of which have a joint Gaussian distribution, e.g.:
$$\boldsymbol{y} \sim N(\mu,\Sigma)$$Generally speaking, when we fit a model or a mean function with parameters $\theta$ to data t and y, we construct a likelihood function under the assumption of independent Gaussian uncertainties, $\sigma_i$.
$$ \mathrm{likelihood} = \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma_n^2}} ~~ \exp \left(-\frac{[y_n-f_{\theta}(t_n)]^2}{2\sigma_n^2} \right) ~~~~~~~~ (4.1) $$This likelihood is central to inferring the probability distributions of our parameters $\theta$. We can optimise it with respect to $\theta$, or construct a posterior distribution for $\theta$ via Bayes' theorem after defining priors. Note that the term inside the exponential is $-\frac{1}{2}\chi^2$, so optimising the likelihood is equivalent to minimising $\chi^2$ if the uncertainties are held fixed. This is already a Gaussian process, albeit a trivial one.
More generally, we can use a multi-variate Gaussian distribution as our likelihood, where we consider covariance between the data points:
$$ \mathrm{likelihood} = \frac{1}{|2\pi\Sigma|^{1/2}} ~~ \exp \left( -\frac{1}{2} \boldsymbol{r}^T \Sigma^{-1}\boldsymbol{r} \right) ~~~~~~~~ (4.2)$$with $\Sigma$ being the covariance matrix and $\boldsymbol{r} = \{y_n-f_\theta (t_n)\}$.
Based on the description above, we can rewrite our likelihood function (2) as a matrix equation (essentially just rewriting equation (4.2)):
$$ \ln P\left({y_n}|{t_n},{\sigma_n^2},\theta\right) = -\frac{1}{2} \boldsymbol{r}^\mathrm{T}\Sigma^{-1}\boldsymbol{r} -\frac{1}{2} \ln |\Sigma| - \frac{N}{2} \ln 2\pi ~~~~~~~~ (5)$$If we assume purely white noise, as we did in the previous section, then what we are saying is that the covariance matrix, $\Sigma$, is diagonal. Subsequently, we include the presence of red noise by populating the off-diagonal elements (essentially introducing co-variance).
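To make the connection between equations (2) and (5) concrete, here is a minimal numpy sketch, separate from the tutorial's pipeline (the residuals and variances are made up), showing that eq. (5) with a purely diagonal $\Sigma$ reduces to the familiar sum over independent Gaussian terms:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 20
r = rng.normal(size=N)        # made-up residuals y_n - f_theta(t_n)
sigma2 = 0.1 + rng.random(N)  # made-up per-point variances

# Equation (5) with a purely diagonal covariance matrix
Sigma = np.diag(sigma2)
lnlike_matrix = (-0.5 * r @ np.linalg.solve(Sigma, r)
                 - 0.5 * np.linalg.slogdet(Sigma)[1]
                 - 0.5 * N * np.log(2 * np.pi))

# Equation (2) written out as a sum, with the constants kept explicitly
lnlike_sum = (-0.5 * np.sum(r**2 / sigma2)
              - 0.5 * np.sum(np.log(sigma2))
              - 0.5 * N * np.log(2 * np.pi))

print(np.allclose(lnlike_matrix, lnlike_sum))  # True
```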
Ideally, we would like to make every off-diagonal element a free parameter, but that would be too many free parameters. So instead we "model" these elements using a covariance function, or a kernel. In this framework, the covariance matrix is given by:
$$ \Sigma_{ij} = \sigma_i^2\delta_{ij} + k(t_i,t_j) ~~~~~~~~ (6)$$where $\delta$ is the Kronecker delta and $k$ is the kernel.
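Equation (6) is straightforward to build directly. The sketch below is a standalone illustration: `se_kernel` is a hypothetical helper using the squared-exponential form (the same family used to simulate the data), and the white-noise variances are arbitrary. It constructs $\Sigma$ on a grid of times and verifies that the result is a valid covariance matrix:

```python
import numpy as np

def se_kernel(ti, tj, amp2=0.1, ell2=3.3):
    # hypothetical helper: squared-exponential kernel
    return amp2 * np.exp(-0.5 * (ti - tj) ** 2 / ell2)

t = np.linspace(-5, 5, 50)
sigma2 = np.full(t.shape, 0.05 ** 2)  # arbitrary white-noise variances

# Equation (6): white-noise term on the diagonal plus the kernel
# evaluated at every pair of times
Sigma = np.diag(sigma2) + se_kernel(t[:, None], t[None, :])

# A valid covariance matrix is symmetric and positive definite,
# so the Cholesky factorization must succeed
L = np.linalg.cholesky(Sigma)
print(np.allclose(Sigma, Sigma.T))  # True
```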
There are many different kernels to choose from, and the appropriate form is usually dictated by the form of the systematics present in the data. For instance, if the systematics are highly correlated with some physical variables (the orbital phase of HST, say), better known as optical state parameters, the Matérn family of kernels is preferred. In that case, those physical variables become the inputs of the kernel, i.e. the elements of the covariance matrix are modelled from physical measurements of the causes of the systematics. Examples of covariance kernels include the squared-exponential (SE) kernel and the Matérn family.
You should always think about your data and which kernel suits the modelling best. If in doubt, try different ones and compare their goodness of fit through statistical tests (BIC, likelihood-ratio test, ...). Also consider defining your own kernel if none of the standard forms suits your purpose.
To model the systematics in our previously defined data, I will choose the Matérn-3/2 kernel. It would of course be ideal to use the SE kernel, from which the systematics in the simulated data were drawn, but to make the exercise a bit more realistic I use a different kernel. This kernel is given by:
$$ k(t_i,t_j) = \zeta^2 \left(1+\frac{\sqrt{3}~|t_i-t_j|}{\tau} \right) ~ \exp \left(-\frac{\sqrt{3}~|t_i-t_j|}{\tau} \right) ~~~~~~~(7)$$where $\zeta$ and $\tau$ are the parameters of the kernel, describing the amplitude and smoothness of the function. We are now ready to redefine the log-likelihood function to include the GP. The python module george, already used to simulate the data, can be used to create our GP object. There are other freely available modules for defining a GP, such as sklearn from scikit-learn, or Neale Gibson's GeePea (which I personally use).
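Before handing the kernel to george, equation (7) can be sanity-checked in plain numpy. This is only an illustration; `matern32` is a hypothetical helper, not a george function:

```python
import numpy as np

def matern32(ti, tj, zeta=1.0, tau=1.0):
    # hypothetical helper implementing equation (7)
    d = np.sqrt(3.0) * np.abs(ti - tj) / tau
    return zeta ** 2 * (1.0 + d) * np.exp(-d)

t = np.linspace(-5, 5, 50)
K = matern32(t[:, None], t[None, :])

# k(t, t) = zeta^2 on the diagonal, and correlation decays with separation
print(np.allclose(np.diag(K), 1.0))  # True
print(K[0, 1] > K[0, -1])            # True
```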
import george
from george import kernels

#likelihood function with a GP
def lnlike_gp(p, t, y, yerr):
    a, tau = np.exp(p[:2])
    gp = george.GP(a * kernels.Matern32Kernel(tau)) #create the base Gaussian process object
    gp.compute(t, yerr) #compute and factorize the covariance matrix for the given times and uncertainties
    return gp.lnlikelihood(y - model(p[2:], t))
#add priors for the kernel parameters to the general base prior
def lnprior_gp(p):
    lna, lntau = p[:2]
    if not -5 < lna < 5:
        return -np.inf
    if not -5 < lntau < 5:
        return -np.inf
    return lnprior_base(p[2:])
"""
Calcualte posterior function. You should notice that the linear trend function is no longer
added to the model. This trend will now be accounted for by our GP model. However, if there
is a physical measurable that is the cause of this trend (airmass for instance), then the
trend can still be included.
"""
def lnprob_gp(p, t, y, yerr):
lp = lnprior_gp(p)
if not np.isfinite(lp):
return -np.inf
return lp + lnlike_gp(p, t, y, yerr)
#wrap the entire fitting process in a function, for general-purpose use
def fit_gp(initial, data, nwalkers=32):
    ndim = len(initial)
    p0 = [np.array(initial) + 1e-8 * np.random.randn(ndim)
          for i in range(nwalkers)]
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob_gp, args=data)
    print("Running burn-in")
    p0, lnp, _ = sampler.run_mcmc(p0, 500)
    sampler.reset()
    print("Running second burn-in")
    p = p0[np.argmax(lnp)] #restart all walkers around the best point found so far
    p0 = [p + 1e-8 * np.random.randn(ndim) for i in range(nwalkers)]
    p0, _, _ = sampler.run_mcmc(p0, 500)
    sampler.reset()
    print("Running production")
    p0, _, _ = sampler.run_mcmc(p0, 1000)
    return sampler
print("Fitting GP")
data = (t, y, yerr)
truth_gp = [0.0, 0.0] + truth #initial guesses for the parameters
sampler = fit_gp(truth_gp, data)
print("Making plots")
samples = sampler.flatchain
x = np.linspace(-5, 5, 500)
plt.figure()
plt.errorbar(t, y, yerr=yerr, fmt=".k", capsize=0)
#draw 100 random instances of the GP model from the posterior samples
for s in samples[np.random.randint(len(samples), size=100)]:
    gp = george.GP(np.exp(s[0]) * kernels.Matern32Kernel(np.exp(s[1])))
    gp.compute(t, yerr)
    m = gp.sample_conditional(y - model(s[2:], t), x) + model(s[2:], x)
    plt.plot(x, m, color="#4682b4", alpha=0.2)
plt.ylabel(r"$y$",fontsize=20)
plt.xlabel(r"$t$",fontsize=20)
plt.xlim(-5, 5)
plt.title("results with Gaussian process noise model",fontsize=15)
#plt.savefig("GP-model-fit.pdf", dpi=150)
# Make the corner plot.
plt.figure()
plt.rcParams['axes.titlesize'] = 17
plt.rcParams['axes.labelsize'] = 17
fig = corner.corner(samples[:, 2:], bins=50, truths=truth, labels=labels,show_titles=True,
quantiles=(0.16,0.5,0.84))
#fig.savefig("GP-posteriors.pdf", dpi=150)
Et voilà! The model that includes correlated noise is clearly less precise (it does not underestimate the error bars), and it is more accurate. The moral of the story: modelling correlated noise gives less biased results.
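For reference, the quoted error bars in corner plots like these come from percentiles of the flattened chain. A minimal sketch, using a synthetic normal sample in place of a real `sampler.flatchain` column:

```python
import numpy as np

# Stand-in for one column of sampler.flatchain: a synthetic normal draw,
# purely for illustration
rng = np.random.default_rng(0)
samples_ell = rng.normal(loc=0.1, scale=0.05, size=32000)

# 16th/50th/84th percentiles give the median and a 1-sigma credible interval
lo, med, hi = np.percentile(samples_ell, [16, 50, 84])
print("ell = {:.3f} +{:.3f} -{:.3f}".format(med, hi - med, med - lo))
```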
If you are by any chance inspired to start using python, I highly recommend spyder as your editor. It gives you multiple running environments (be it python or ipython), as well as cell creation, line-by-line execution, debugging features and much more. At the beginning I went through a few different options, and this is far and away my favourite.