# Chapter 7: Bayesian inference using Markov chain Monte Carlo (MCMC) sampling

## Section: Some theory

Markov chain Monte Carlo (MCMC) is a numerical method for approximating high-dimensional integrals and/or summations. The method was first described by (???) and later modified by (???). The algorithm that we will implement in this chapter is often referred to as the Metropolis-Hastings algorithm.

Use of the Metropolis-Hastings algorithm was mostly restricted to the field of statistical physics until it was discovered by statisticians in the mid 1980s (???). Statisticians, as a group, were largely unaware of the numerical method until (???) provided a description of the method as a way to approximate intractable probability densities. It is fair to say that MCMC has transformed the field of statistics, and in particular, Bayesian statistics.

Bayesian statisticians are interested in the posterior probability distribution of a parameter, $$\theta$$, which can be calculated using Bayes’s theorem as $\Pr(\theta | D) = {\Pr(D | \theta) \, \Pr(\theta) \over \Pr(D)}$ Here, $$\Pr(\theta | D)$$ is the posterior probability distribution of the parameter. The posterior probability is a conditional probability: the probability of the parameter conditioned on the observations, $$D$$. The other factors in the equation include the likelihood [$$\Pr(D | \theta)$$], prior probability distribution of the parameter [$$\Pr(\theta)$$], and marginal likelihood [$$\Pr(D)$$].

How does Bayesian analysis work in practice? Consider an experiment in which a coin is repeatedly tossed with the objective to estimate the probability that heads appears on a single toss of the coin, a parameter we call $$\theta$$. We observe $$x$$ heads on $$n$$ tosses of the coin. In a Bayesian analysis, the objective is to calculate the posterior probability of the parameter, which for coin tossing is $f(\theta | x) = {f(x | \theta) f(\theta) \over \int_{0}^{1} f(x | \theta) f(\theta) \, d\theta}$ The likelihood, $$f(x | \theta)$$, is given by the binomial probability distribution, $f(x | \theta) = {n \choose x} \theta^x (1-\theta)^{n-x}$ where the binomial coefficient is $${n \choose x} = {n! \over x! (n-x)!}$$. In addition to the likelihood function, however, we must also specify a prior probability distribution for the parameter, $$f(\theta)$$. This prior distribution should describe the investigator’s beliefs about the hypothesis before the experiment was performed. The problem, of course, is that different people may have different prior beliefs. In this case, it makes sense to use a prior distribution that is flexible, allowing different people to specify different prior probability distributions and also allowing for an easy investigation of the sensitivity of the results to the prior assumptions. A Beta distribution is often used as a prior probability distribution for the binomial parameter. The Beta distribution has two parameters, $$\alpha$$ and $$\beta$$. Depending upon the specific values chosen for $$\alpha$$ and $$\beta$$, one can generate a large number of different prior probability distributions for the parameter $$\theta$$. Figure shows several possible prior distributions for coin tossing. Figure a shows a uniform prior distribution for the probability of heads appearing on a single toss of a coin. In effect, a person who adopts this prior distribution is claiming total ignorance of the dynamics of coin tossing. Figure b shows a prior distribution for a person who has some experience tossing coins; anyone who has tossed a coin realizes that it is impossible to predict which side will face up when tossed, but that heads appears about as frequently as tails, suggesting more prior weight on values around $$\theta = 0.5$$ than on values near $$\theta = 0$$ or $$\theta = 1$$. Lastly, Figure c shows a prior distribution for a person who suspects he is being tricked. Perhaps the coin that is being tossed is from a friend with a long history of practical jokes, or perhaps this friend has tricked the investigator with a two-headed coin in the past. Figure c, then, might represent the ‘trick-coin’ prior distribution. Figure 7.1. The Beta distribution can take a variety of shapes depending on the values of the parameters $$\alpha$$ and $$\beta$$. Here, (a) $$\alpha = \beta = 1$$, (b) $$\alpha = \beta = 4$$, and (c) $$\alpha = \beta = 1/2$$

Besides being flexible, the Beta prior probability distribution has one other admirable property: when combined with a binomial likelihood, the posterior distribution also has a Beta probability distribution (but with the parameters changed). Prior distributions that have this property — that is, the posterior probability distribution has the same functional form as the prior distribution — are called conjugate priors in the Bayesian literature. The Bayesian treatment of the coin tossing experiment can be summarized as follows:

Prior [$$f(\theta)$$, Beta] Likelihood [$$f(x | \theta)$$, Binomial] Posterior [$$f(\theta | x)$$, Beta]
$${\Gamma(\alpha+\beta) \over \Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1}$$ $${n \choose x} \theta^x (1-\theta)^{n-x}$$ $${\Gamma(\alpha+\beta + n) \over \Gamma(\alpha + x) \Gamma(\beta + n - x)} \theta^{\alpha + x - 1} (1-\theta)^{\beta + n - x - 1}$$

[The gamma function, not to be confused with the gamma probability distribution, is defined as $$\Gamma(y) = \int_{0}^{\infty} u^{y-1} e^{-u} \, du$$. $$\Gamma(n) = (n - 1)!$$ for integer $$n = 1, 2, 3, \ldots$$ and $$\Gamma({1 \over 2}) = \pi$$.] We started with a Beta prior distribution with parameters $$\alpha$$ and $$\beta$$. We used a binomial likelihood, and after some use of our undergraduate calculus we calculated the posterior probability distribution, which is also a Beta distribution but with parameters $$\alpha + x$$ and $$\beta+n-x$$. Figure 7.2. As more data are collected for the coin-tossing example, the investigator’s prior opinions play a smaller role in the conclusions. (a) The prior distribution, likelihood function, and posterior probability density when $$\alpha = \beta = 4$$, $$n = 10$$ and $$x = 8$$. (b) The prior distribution, likelihood function, and posterior probability density when $$\alpha = \beta = 4$$, $$n = 100$$ and $$x = 80$$

Figure 7.2 shows the relationship between the prior probability distribution, likelihood, and posterior probability distribution of $$\theta$$ for two different cases, differing only in the number of coin tosses. The posterior probability of a parameter is a compromise between the prior probability and the likelihood. When the number of observations is small, as is the case for Figure 7.2a, the posterior probability distribution is similar to the prior distribution. However, when the number of observations is large, as is the case for Figure 7.2b, the posterior probability distribution is dominated by the likelihood.

Above, we casually claimed that the posterior distribution was a Beta distribution. Working out the math for this problem is informative. The posterior probability distribution for the coin tossing problem can be calculated analytically. Here, we assume a binomial probability distribution for the likelihood, and a Beta prior distribution on $$\theta$$, which means the posterior probability distribution of the parameter $$\theta$$ is: $f(\theta | x) = \frac{ {n \choose x} \theta^x (1-\theta)^{n-x} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} }{ \int_{0}^{1} {n \choose x} \theta^x (1-\theta)^{n-x} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} d\theta }$ Note that the denominator involves integration over all possible values of the parameter $$\theta$$. Specifically, the probability of heads on a single toss of a coin, $$\theta$$, can take values between 0 and 1.

Evaluating the integral in the denominator depends upon the following observation: The Beta probability distribution, like all continuous probability distributions, evaluates to one when integrated over all possible values for the parameter: $\begin{eqnarray*} \int_{0}^{1} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}dx & = & 1 \\ {\Gamma(\alpha+\beta) \over \Gamma(\alpha)\Gamma(\beta)} \int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} dx & = & 1 \\ \int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} dx & = & {\Gamma(\alpha)\Gamma(\beta) \over \Gamma(\alpha+\beta) } \end{eqnarray*}$ The denominator in the equation for the posterior probability distribution of $$\theta$$, above, is $\begin{eqnarray*} \int_{0}^{1} {n \choose x} \theta^x (1-\theta)^{n-x} {\Gamma(\alpha+\beta) \over \Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} d\theta \\ \frac{ {n \choose x} \Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta) } \int_{0}^{1} \theta^x (1-\theta)^{n-x} \theta^{\alpha-1} (1-\theta)^{\beta-1} d\theta \\ \frac{ {n \choose x} \Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta) } \int_{0}^{1} \theta^{\alpha+x-1} (1-\theta)^{\beta+n-x-1} d\theta \end{eqnarray*}$ Because ${0}^{1} x^{-1} (1-x)^{-1} dx = ()() / (+)$, this means that ${0}^{1} x^{+x-1} (1-x)^{+n-x-1} dx = (+x)(+n-x) /(++ n)$. The posterior probability distribution for $$\theta$$, then, is $\begin{eqnarray*} f(\theta | x) & = & \frac{ {n \choose x} \theta^x (1-\theta)^{n-x} {\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} \over \frac{ {n \choose x} \Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta) } { \Gamma(\alpha+x)\Gamma(\beta+n-x) \over \Gamma(\alpha+\beta + n) } }\\ & = & {\Gamma(\alpha+\beta+n) \theta^x (1-\theta)^{n-x} \theta^{\alpha-1} (1-\theta)^{\beta-1} \over \Gamma(\alpha+x)\Gamma(\beta+n-x)}\\ & = & { \Gamma(\alpha+\beta+n) \theta^{\alpha+x-1} (1-\theta)^{\beta+n-x-1} \over \Gamma(\alpha+x)\Gamma(\beta+n-x)} \end{eqnarray*}$ which is a Beta probability distribution with parameters $$\alpha+x$$ and $$\beta+n-x$$.

I will make one comment about the analytical solution of the posterior probability distribution for the coin-tossing problem: it involved hard math! Certainly, working through the math was at the edge of my ability. However, for problems that are only slightly more complicated than coin tossing (i.e., almost all other statistical problems), it is not possible to obtain the posterior probability distribution analytically. Until the advent of computers and algorithms for approximating high-dimensional integrals/summation, people were up a river without a paddle if their problem was somewhat complicated (i.e., problems that are actually interesting).

## Section: Markov chain Monte Carlo (MCMC)

Even though we know the posterior probability distribution for the coin-tossing problem, we will act as if we don’t and approximate the posterior probability distribution using MCMC. The goal of MCMC is to construct a Markov chain that has as its state space the parameters of the model and a stationary distribution that is the posterior probability of the parameters. How do we construct a Markov chain that has these properties? The answer: Use the Metropolis-Hastings algorithm. The steps of the Metropolis-Hastings algorithm are as follows:

1. Call the current state of the Markov chain $$\theta$$. If this is the first cycle of the chain, then initialize $$\theta$$ to some value. (Perhaps initialize $$\theta$$ by choosing a value from the prior distribution.)
2. Propose a new value for $$\theta$$, called $$\theta'$$. The details of the proposal mechanism are at the discretion of the programmer. However, he/she must follow some rules. For one, the proposal mechanism that is coded must involve random numbers. That is, the proposal mechanism cannot be a deterministic one. Some other requirements are that every state must be potentially reachable given enough applications of the proposal mechanism and the proposal mechanism cannot result in a periodic chain. (The last requirement is generally not an issue.) Finally, you should be able to calculate the probability of proposing the proposed state, $$q(\theta \rightarrow \theta')$$ as well as the imagined reverse move, which is not actually made in computer memory, $$q(\theta' \rightarrow \theta)$$.
3. Calculate the probability of accepting the proposed state as the next state of the Markov chain. This probability is the ratio of the posterior probabilities of the proposed and current states times the Hastings ratio: $R = \min \left(1, {f(\theta' | X) \over f(\theta | X)} \times \underbrace{q(\theta' \rightarrow \theta) \over q(\theta \rightarrow \theta')}_{\mbox{Hastings Ratio}} \right)$ It does not appear that we can solve this. After all, the entire point of MCMC is to approximate probability distributions that we cannot solve analytically. Here, things seem especially intractable, because we need to calculate the ratio of two probability distributions that we cannot calculate analytically! Relax. Let’s rewrite the ratio of the posterior probabilities using Bayes’s theorem: $R = \min \left(1, {f(X | \theta') f(\theta')/ f(X) \over f(X | \theta) f(\theta)/f(X)} \times {q(\theta' \rightarrow \theta) \over q(\theta \rightarrow \theta')} \right)$ Note that the marginal likelihood, $$f(X)$$, cancels. Importantly, it was the marginal likelihood that we cannot calculate analytically. The portions that are left over, after cancelling $$f(X)$$, are readily calculable. In the end, our acceptance probability is: $R = \min \left(1, \underbrace{f(X | \theta') \over f(X | \theta)}_{\mbox{Likelihood Ratio}} \times \underbrace{f(\theta') \over f(\theta)}_{\mbox{Prior Ratio}} \times \underbrace{q(\theta' \rightarrow \theta) \over q(\theta \rightarrow \theta')}_{\mbox{Hastings Ratio}} \right)$
4. Generate a uniform(0,1) random variable called $$u$$. If $$u < R$$, then accept the proposed state as the next state of the Markov chain, setting $$\theta = \theta'$$. Otherwise, the chain remains in its current state.
5. Go to Step # 2.

The steps of the Metropolis-Hastings algorithm are repeated many thousands, or millions, of times. The states that are visited form a Markov chain. The key to MCMC is that the sampled states are valid, albeit dependent, draws from the posterior probability distribution.

## Section 7.3: Implementing the Markov chain Monte Carlo algorithm

### Subsection 7.3.1: Defining the Model class

Create a new file called Model.h Work through the code below which needs to be added to the to file.

#ifndef Model_h
#define Model_h

#include <stdio.h>
#include <vector>

// forward declarations
class Alignment;
class ExponentialDistribution;
class PhyloCTMC;
class PureBirthProcess;
class RateMatrix_JC;
class Tree;
class UniformDistribution;

/**
* \class Model
*
* \brief This class represents the phylogenetic model.
*
*
*
* \author Sebastian Höhna
*
*/
class Model {

public:

Model(const std::string &fn);
virtual                                        ~Model();

// parameters
double                                          getBirthRate(void) const;
double                                          getClockRate(void) const;

void                                            setBirthRate( double r );
void                                            setClockRate( double r );

// distributions
double                                          getBirthRatePrior(void);
double                                          getClockRatePrior(void);
double                                          getLikelihood(void);
double                                          getTreePrior(void);

private:

Alignment*                                      my_alignment;
double*                                         my_birth_rate;
double*                                         my_clock_rate;
RateMatrix_JC*                                  my_rate_matrix;
Tree*                                           my_tree;

PureBirthProcess*                               tree_prior;
ExponentialDistribution*                        birth_rate_prior;
UniformDistribution*                            clock_rate_prior;
PhyloCTMC*                                      likelihood;

};

#endif /* Model_h */

### Subsection 7.3.2: Implementing the Model class

Create a new file called Model.cpp Work through the code below which needs to be added to the to file.

#include "Model.h"

#include "Alignment.h"
#include "ExponentialDistribution.h"
#include "PhyloCTMC.h"
#include "PureBirthProcess.h"
#include "RateMatrix_JC.h"
#include "Tree.h"
#include "UniformDistribution.h"

Model::Model(const std::string &fn)
{
Tree* my_tree = new Tree(trees);

// read in the data (i.e., the alignment)

// create a transition rate matrix
my_rate_matrix = new RateMatrix_JC();

// create the clock rate prior
double* min_clock_rate = new double(0.0);
double* max_clock_rate = new double(100.0);
clock_rate_prior = new UniformDistribution(min_clock_rate, max_clock_rate);

my_clock_rate = new double(1.0);
clock_rate_prior->setValue( my_clock_rate );

// now set up the tree prior
double* birth_rate_mean = new double(1.0);
birth_rate_prior = new ExponentialDistribution(birth_rate_mean);

// specify the birth rate
my_birth_rate = new double(1.0);
birth_rate_prior->setValue( my_birth_rate );

tree_prior = new PureBirthProcess( my_alignment->getTaxonNames(), my_birth_rate );
tree_prior->setValue( my_tree );

likelihood = new PhyloCTMC( my_tree, my_rate_matrix, my_clock_rate );
likelihood->setValue( my_alignment );

}

Model::~Model()
{
delete my_alignment;
delete my_birth_rate;
delete my_clock_rate;
delete my_rate_matrix;
delete my_tree;

delete tree_prior;
delete birth_rate_prior;
delete clock_rate_prior;
delete likelihood;
}

double Model::getBirthRate(void) const
{
return *my_birth_rate;
}

double Model::getBirthRatePrior(void)
{
return birth_rate_prior->lnProbability();
}

double Model::getClockRate(void) const
{
return *my_clock_rate;
}

double Model::getClockRatePrior(void)
{
return clock_rate_prior->lnProbability();
}

double Model::getLikelihood(void)
{
return likelihood->lnProbability();
}

double Model::getTreePrior(void)
{
return tree_prior->lnProbability();
}

void Model::setBirthRate( double r )
{
*my_birth_rate = r;
}
void Model::setClockRate( double r )
{
*my_clock_rate = r;
}

### Subsection 7.3.3: Defining the MCMC class

Create a new file called MCMC.h Work through the code below which needs to be added to the to file.

#ifndef MCMC_h
#define MCMC_h

#include <stdio.h>
#include <vector>
#include <string>
#include <iostream>
#include <fstream>

#include "Model.h"
#include "RandomNumberGenerator.h"

// forward declarations

/**
* \class MCMC
*
* \brief This class represents the MCMC algorithm.
*
*
*
* \author Sebastian Höhna
*
*/
class MCMC {

public:

MCMC(const std::string& filename);
MCMC(const MCMC& m);
virtual                                        ~MCMC();

void                                            run(int n_gen);

private:
// our moves
void                                            monitorParameters(int g);
void                                            updateBirthRate(void);
void                                            updateClockRate(void);

Model                                           my_model;
RandomNumberGenerator                           rng;
std::ofstream                                   parameter_file;
std::ofstream                                   tree_file;

};

#endif /* MCMC_h */

### Subsection 7.3.4: Implementing the MCMC class

Create a new file called MCMC.cpp Work through the code below which needs to be added to the to file.

#include "MCMC.h"

#include <math.h>       /* isnan, sqrt */

MCMC::MCMC( const std::string &filename ) :
my_model( filename ),
parameter_file( "my_analysis.log" ),
tree_file( "my_analysis.trees" )
{

}

MCMC::MCMC( const MCMC &m ) :
my_model( "" ),
parameter_file( "my_analysis.log" ),
tree_file( "my_analysis.trees" )
{

}

MCMC::~MCMC( void )
{
// nothing to do here
}

void MCMC::monitorParameters( int gen )
{
parameter_file << gen;
parameter_file << "\t";
parameter_file << my_model.getBirthRate();
parameter_file << "\t";
parameter_file << my_model.getClockRate();
parameter_file << "\t";
parameter_file << std::endl;
}

void MCMC::updateBirthRate( void )
{
// get the current birth rate
double current_rate = my_model.getBirthRate();

// draw a new random uniform number
double u = rng.uniform01();

// define the window size
double delta = 7.32;

// now we compute the new proposed values
double new_rate = current_rate + (u-0.5) * delta;

// compute the prior ratio and likelihood ratio
double current_prior = my_model.getBirthRatePrior();
double current_likelihood = my_model.getTreePrior();

// set the new value
my_model.setBirthRate( new_rate );

double new_prior = my_model.getBirthRatePrior();
double new_likelihood = my_model.getTreePrior();

double acceptance_prob = new_prior - current_prior + new_likelihood - current_likelihood;

u = rng.uniform01();

if ( isnan(acceptance_prob) || u > acceptance_prob )
{
// reject -> reset the value
my_model.setBirthRate( current_rate );
}
else
{
// accept! nothing to do, because we changed our value already
}

}

void MCMC::updateClockRate( void )
{
}

void MCMC::run( int num_generations )
{
// we assume that the model and all parameters are initialized.

int thinning = 10;

for ( int current_gen=0; current_gen<num_generations; current_gen++ )
{
// draw new values:
updateBirthRate();

updateClockRate();

// now we monitor the variables
if ( current_gen % thinning == 0 )
{
monitorParameters( current_gen );
}

}

std::cout << "Hooray, we finished our MCMC run!" << std::endl;
}

## Section 7.4: Exercises

• Implement the method updateClockRate using the same logic as we used for updateBirthRate.
• Experiment with the window size parameter, which we arbitrarily set to 7.32. That is, run the MCMC with different values for the window size and look at the results in the software Tracer by loading the file called my_analysis.log.
• In our MCMC, we implement a sliding window move. Now let’s implement a scaling move. To do so, you will propose new values by $$\theta' = \theta * e^{u-0.5}$$, where $$u \sim \text{unif}(0,1)$$. Note that you will also need to compute the Hastings ratio, which is $$e^{u-0.5}$$ (or $$u-0.5$$ since we need to add the log-transformed Hastings ratio).
• Compare which MCMC works best (e.g., using only the sliding window move, only the scaling move, or both moves combined). Also check if the samples generated from MCMC runs using the different moves give (approximately) the same posterior distribution.