The Substitution Rate Matrix

All model-based methods of phylogenetic inference, including maximum likelihood and Bayesian estimation of phylogeny, currently assume that character change occurs according to a continuous-time Markov chain. At the heart of any continuous-time Markov chain is a matrix of rates, specifying the rate of change from one state to another. For example, the instantaneous rate of change under the model described by [hereafter called the HKY85 model](1985) is \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} - & \pi_C & \kappa \pi_G & \pi_T \\ \pi_A & - & \pi_G & \kappa \pi_T \\ \kappa \pi_A & \pi_C & - & \pi_T \\ \pi_A & \kappa \pi_C & \pi_G & - \\ \end{array} \right) \mu \] This matrix specifies the rate of change from one nucleotide to another; the rows and columns of the matrix are ordered \(A, C, G, T\), so that the rate of change from \(C \rightarrow G\) is \(q_{CG} = \pi_G\). Similarly, the rates of change between \(C \rightarrow T\), \(G \rightarrow A\), and \(T \rightarrow C\), are \(q_{CT} = \kappa \pi_T\), \(q_{GA} = \kappa \pi_A\), and \(q_{TG} = \pi_G\), respectively. The diagonals of the rate matrix, denoted with the dashes, are specified such that each row sums to zero. Finally, the rate matrix is rescaled such that the mean rate of substitution is one. This can be accomplished by setting \(\mu = -1 / \sum_{i\in \{A,C,G,T\}} \pi_i q_{ii}\). This rescaling of the rate matrix such that the mean rate is one allows the branch lengths on the phylogenetic tree to be interpreted as expected number of nucleotide substitutions per site.

We will make a few important points about the rate matrix. First, the rate matrix may have free parameters. For example, the HKY85 model has the parameters \(\kappa\), \(\pi_A\), \(\pi_C\), \(\pi_G\), and \(\pi_T\). The parameter \(\kappa\) is the transition/transversion rate bias; when \(\kappa = 1\) transitions occur at the same rate as transversions. Typically, the transition/transversion rate ratio, estimated using maximum likelihood or Bayesian inference, is greater than one; transitions occur at a higher rate than transversions. The other parameters — \(\pi_A\), \(\pi_C\), \(\pi_G\), and \(\pi_T\) — are the base frequencies, and have a biological interpretation as the frequency of the different nucleotides and are also, incidentally, the stationary probabilities of the process (more on stationary probabilities later). Second, the rate matrix, \({\mathbf Q}\), can be used to calculate the transition probabilities and the stationary distribution of the substitution process. The transition probabilities and stationary distribution play a key role in calculating the likelihood, and we will spend more time here developing an intuitive understanding of these concepts.

Subsection 5.1.1.: Transition probabilities

Let us consider a specific example of a rate matrix, with all of the parameters of the model taking specific values. For example, if we use the HKY85 model and fix the parameters to \(\kappa = 5\), \(\pi_A = 0.4\), \(\pi_C = 0.3\), \(\pi_G = 0.2\), and \(\pi_T = 0.1\), we get the following matrix of instantaneous rates \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{rrrr} -0.886 & 0.190 & 0.633 & 0.063 \\ 0.253 & -0.696 & 0.127 & 0.316 \\ 1.266 & 0.190 & -1.519 & 0.063 \\ 0.253 & 0.949 & 0.127 & -1.329 \\ \end{array} \right) \] Note that these numbers are not special in any particular way. That is to say, they are not based upon any observations from a real data set, but are rather arbitrarily picked to illustrate a point. The point is that one can interpret the rate matrix in the physical sense of specifying how changes occur on a phylogenetic tree. Consider the very simple case of a single branch on a phylogenetic tree. Let’s assume that the branch is \(v=0.5\) in length and that the ancestor of the branch is the nucleotide \(G\). The situation we have is something like that shown in Figure a. How can we simulate the evolution of the site starting from the \(G\) at the ancestor? The rate matrix tells us how to do this. First of all, because the current state of the process is \(G\), the only relevant row of the rate matrix is the third one: \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ 1.266 & 0.190 & -1.519 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right) \] The overall rate of change away from nucleotide \(G\) is \(q_{GA} + q_{GC} + q_{GT} = 1.266 + 0.190 + 0.063 = 1.519\). Equivalently, the rate of change away from nucleotide \(G\) is simply \(-q_{GG} = 1.519\). In a continuous-time Markov model, the waiting time between substitutions is exponentially distributed. The exact shape of the exponential distribution is determined by its rate, which is the same as the rate of the corresponding process in the \({\mathbf Q}\) matrix. For instance, if we are in state \(G\), we wait an exponentially distributed amount of time with rate 1.519 until the next substitution occurs. One can easily construct exponential random variables from uniform random variables using the equation

\[ t = -\frac{1}{\lambda} \log_e(u) \] where \(\lambda\) is the rate and \(u\) is a uniform(0,1) random number. For example, my calculator has a uniform(0,1) random number generator. The first number it generated is \(u = 0.794\). This means that the next time at which a substitution occurs is 0.152 up from the root of the tree (using \(\lambda = 1.519\); Figure b). The rate matrix also specifies the probabilities of a change from \(G\) to the nucleotides \(A\), \(C\), and \(T\).
These probabilities are \[ \begin{array}{ccc} G \rightarrow A: \frac{1.266}{1.519}=0.833, & G \rightarrow C: \frac{0.190}{1.519}=0.125, & G \rightarrow T: \frac{0.063}{1.519}=0.042 \\ \end{array} \] To determine what nucleotide the process changes to we would generate another uniform(0,1) random number (again called \(u\)). If \(u\) is between 0 and 0.833, we will say that we had a change from \(G\) to \(A\). If the random number is between 0.833 and 0.958 we will say that we had a change from \(G\) to \(C\). Finally, if the random number \(u\) is between 0.958 and 1.000, we will say we had a change from \(G\) to \(T\). The next number generated on our calculator was \(u = 0.102\), which means the change was from \(G\) to \(A\). The process is now in a different state (the nucleotide \(A\)) and the relevant row of the rate matrix is \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} -0.886 & 0.190 & 0.633 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right) \] We wait an exponentially distributed amount of time with parameter \(\lambda = 0.886\) until the next substitution occurs. When the substitution occurs, it is to a \(C\), \(G\), or \(T\) with probabilities \(\frac{0.190}{0.886} = 0.214\), \(\frac{0.633}{0.886} = 0.714\), and \(\frac{0.063}{0.886} = 0.072\), respectively. This process of generating random and exponentially distributed times until the next substitution occurs and then determining (randomly) what nucleotide the change is to is repeated until the process exceeds the length of the branch. The state the process is in when it passes the end of the branch is recorded. In the example of Figure~, the process started in state \(G\) and ended in state \(A\). (The next uniform random variable generated on our calculator was \(u = 0.371\), which means that the next substitution would occur 1.119 units above the substitution from \(G \rightarrow A\). The process is in the state \(A\) when it passed the end of the branch.) The only non-random part of the entire procedure was the initial decision to start the process in state \(G\). All other aspects of the simulation used a uniform random number generator and our knowledge of the rate matrix to simulate a single realization of the HKY85 process of DNA substitution.

Clearly, these numbers are only crude approximations to the true transition probabilities; after all, each row in the table is based on only 100 Monte Carlo simulations. However, they do illustrate the meaning of the transition probabilities; the transition probability, \(p_{ij}(v)\), is the probability that the substitution process ends in nucleotide \(j\) conditioned on it starting in nucleotide \(i\) after an evolutionary amount of time \(v\). The table of approximate transition probabilities, above, can be interpreted as a matrix of probabilities, usually denoted \({\mathbf P}(v)\). Fortunately, we do not need to rely on Monte Carlo simulation to approximate the transition probability matrix. Instead, we can calculate the transition probability matrix exactly using matrix exponentiation: \[ {\mathbf P}(v) = e^{Qv} \] For the case we have been simulating, the exact transition probabilities (to four decimal places) are \[ {\mathbf P}(0.5) = \{p_{ij}(0.5)\} = \left( \begin{array}{rrrr} 0.7079 & 0.0813 & 0.1835 & 0.0271 \\ 0.1085 & 0.7377 & 0.0542 & 0.0995 \\ 0.3670 & 0.0813 & 0.5244 & 0.0271 \\ 0.1085 & 0.2985 & 0.0542 & 0.5387 \\ \end{array} \right) \] The transition probability matrix accounts for all the possible ways the process could end up in nucleotide \(j\) after starting in nucleotide \(i\). In fact, each of the infinite possibilities is weighted by its probability under the substitution model.

Subsection 5.1.2: Stationary distribution

The transition probabilities provide the probability of ending in a particular nucleotide after some specific amount of time (or opportunity for substitution, \(v\)). These transition probabilities are conditioned on starting in a particular nucleotide. What do the transition probability matrices look like as \(v\) increases? The following transition probability matrices show the effect of increasing branch length: \[ \begin{array}{cc} {\mathbf P}(0.00) = \left( \begin{array}{rrrr} 1.000 & 0.000 & 0.000 & 0.000 \\ 0.000 & 1.000 & 0.000 & 0.000 \\ 0.000 & 0.000 & 1.000 & 0.000 \\ 0.000 & 0.000 & 0.000 & 1.000 \\ \end{array} \right) & {\mathbf P}(0.01) = \left( \begin{array}{rrrr} 0.991 & 0.002 & 0.006 & 0.001 \\ 0.003 & 0.993 & 0.001 & 0.003 \\ 0.013 & 0.002 & 0.985 & 0.001 \\ 0.003 & 0.009 & 0.001 & 0.987 \\ \end{array} \right) \end{array} \] \[ \begin{array}{cc} {\mathbf P}(0.10) = \left( \begin{array}{rrrr} 0.919 & 0.018 & 0.056 & 0.006 \\ 0.024 & 0.934 & 0.012 & 0.029 \\ 0.113 & 0.018 & 0.863 & 0.006 \\ 0.025 & 0.086 & 0.012 & 0.877 \\ \end{array} \right) & {\mathbf P}(0.50) = \left( \begin{array}{rrrr} 0.708 & 0.081 & 0.184 & 0.027 \\ 0.106 & 0.738 & 0.054 & 0.100 \\ 0.367 & 0.081 & 0.524 & 0.027 \\ 0.109 & 0.299 & 0.054 & 0.539 \\ \end{array} \right) \end{array} \] \[ \begin{array}{cc} {\mathbf P}(1.00) = \left( \begin{array}{rrrr} 0.580 & 0.141 & 0.232 & 0.047 \\ 0.188 & 0.587 & 0.094 & 0.131 \\ 0.464 & 0.141 & 0.348 & 0.047 \\ 0.188 & 0.394 & 0.094 & 0.324 \\ \end{array} \right) & {\mathbf P}(5.00) = \left( \begin{array}{rrrr} 0.411 & 0.287 & 0.206 & 0.096 \\ 0.383 & 0.319 & 0.192 & 0.106 \\ 0.411 & 0.287 & 0.206 & 0.096 \\ 0.383 & 0.319 & 0.192 & 0.107 \\ \end{array} \right) \end{array} \] \[ \begin{array}{cc} {\mathbf P}(10.0) = \left( \begin{array}{rrrr} 0.401 & 0.299 & 0.200 & 0.099 \\ 0.399 & 0.301 & 0.199 & 0.100 \\ 0.401 & 0.299 & 0.200 & 0.099 \\ 0.399 & 0.301 & 0.199 & 0.100 \\ \end{array} \right) & {\mathbf P}(100) = \left( \begin{array}{rrrr} 0.400 & 0.300 & 0.200 & 0.100 \\ 0.400 & 0.300 & 0.200 & 0.100 \\ 0.400 & 0.300 & 0.200 & 0.100 \\ 0.400 & 0.300 & 0.200 & 0.100 \\ \end{array} \right) \\ \end{array} \] (Each matrix was calculated under the HKY85 model with \(\kappa = 5\), \(\pi_A = 0.4\), \(\pi_C = 0.3\), \(\pi_G = 0.2\), and \(\pi_T = 0.1\).) Note that as the length of a branch, \(v\), increases, the probability of ending up in a particular nucleotide converges to a single number, regardless of the starting state. For example, the probability of ending up in \(C\) is about 0.300 when the branch length is \(v=100\). This is true regardless of whether the process starts in \(A\), \(C\), \(G\), or \(T\). The substitution process has in a sense ‘forgotten’ its starting state.

The stationary distribution is the probability of observing a particular state when the branch length increases without limit (\(v \rightarrow \infty\)). The stationary probabilities of the four nucleotides are \(\pi_A = 0.4\), \(\pi_C = 0.3\), \(\pi_G = 0.2\), and \(\pi_T = 0.1\) for the example discussed above. The models typically used in phylogenetic analyses have the stationary probabilities built into the rate matrix, \({\mathbf Q}\). You will notice that the rate matrix for the HKY85 model has parameters \(\pi_A\), \(\pi_C\), \(\pi_G\), and \(\pi_T\), and that the stationary frequencies of the four nucleotides for our example match the input values for our simulations. Building the stationary frequency of the process into the rate matrix, while somewhat unusual, makes calculating the likelihood function easier. For one, specifying the stationary distribution saves the time of figuring out what the stationary distribution is (which involves solving the equation \(\pi {\mathbf Q} = {\mathbf 0}\), which simply says that, if we start with the nucleotide frequencies reflecting the stationary distribution, the process will have no effect on the nucleotide frequencies). For another, it allows one to more easily specify a time reversible substitution model. [A time reversible substitution model has the property that \(\pi_i q_{ij} = \pi_j q_{ji}\) for all \(i, j \in (A,C,G,T)\), \(i \neq j\).] Practically speaking, time reversibility means that we can work with unrooted trees instead of rooted trees (assuming that the molecular clock is not enforced).

Section 5.2: Definition of the Jukes-Cantor Rate Matrix class

In todays class we will first start with the implementation of the Jukes-Cantor substitution model. The Jukes-Cantor substitution model has the rate matrix defined as \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} 1 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & 1 & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & 1 & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 1 \\ \end{array} \right) \] You can see in the rate matrix that the rates of substitutions do not depend on the current state. They are all equal. In fact, the Jukes-Cantor substitution model has only one free parameter: the overall rate of evolution, which is confounded with the branch lengths and thus usually omitted. One nice feature of the Jukes-Cantor substitution model is that we can compute the transition probabilities fairly easy, as given by \[ {\mathbf P}_{ij}(\nu) = \begin{cases} {1\over4} + {3\over4}e^{-4\nu/3} & \mbox{ if } i = j \\ {1\over4} - {1\over4}e^{-4\nu/3} & \mbox{ if } i \neq j \end{cases} \] Let’s put this into C++ code.

Create a new file called RateMatrix_JC.h which defines the rate matrix class. There shouldn’t be much new in this class and you should be familiar with the syntax by now. The one main function provided by the class is the method calculateTransitionProbabilities, which compute the matrix of transition probabilities.

#ifndef RateMatrix_JC_h
#define RateMatrix_JC_h

#include <stdio.h>
#include <vector>

/**
 * \class RateMatrix_JC
 *
 * \brief Represents the rate matrix class for the Jukes-Cantor (1969) model.
 *
 * This class provides the transition probability calculation using for the Jukes-Cantor (1969) rate matrix.
 * The purpose is simply to calculate these transition probabilities for a given branch length.
 * For the Jukes-Cantor rate matrix we know the analytical solution for the transition probabilities.
 * The JC has no parameter but can be applied to any number of states.
 * The resulting rate matrix is computed by:
 *
 *      |   -     1/3    1/3    1/3  |
 *      |                            |
 *      |  1/3     -     1/3    1/3  |
 * Q =  |                            |
 *      |  1/3    1/3     -     1/3  |
 *      |                            |
 *      |  1/3    1/3    1/3     -   |
 *
 *
 * \author Sebastian Höhna
 *
 */
class RateMatrix_JC {

public:

    RateMatrix_JC();
    virtual                                        ~RateMatrix_JC();

    std::vector<std::vector<double> >               calculateTransitionProbabilities(double time, double rate) const;       //!< Calculate the transition matrix

};

#endif /* RateMatrix_JC_h */

Section 5.3: Implementation of the Jukes-Cantor Rate Matrix class

Now continue with the implementation of the Jukes-Cantor rate matrix class by actually implementing it. Create a new file called RateMatrix_JC.cpp. As usual, this file will contain our implementation.

First, we start with the necessary header files and libraries that we need to include. Here, this is only the definition of the the RateMatrix_JC and cmath for computing exponential numbers.

#include "RateMatrix_JC.h"

#include <cmath>

The next step is to add the default constructor of the class that doesn’t actually need to do anything.

/**
 * Constructor of the Jukes-Cantor Rate Matrix class.
 *
 * There are no variables that need to be initialized.
 *
 */
RateMatrix_JC::RateMatrix_JC( void )
{

}

Similarly, we add the destructor for completeness although there is no allocated memory to free.

/**
 * Destructor.
 *
 * Since we do not have any member variables, there is no memory to free here.
 *
 */
RateMatrix_JC::~RateMatrix_JC( void )
{

}

Now we get the juicy part: the implementation of the calculateTransitionProbabilities method. However, this method is not too complicated. We simply need to create a transition probability matrix, which we define as being of type std::vector<std::vector >, that is, a vector of a vector of real numbers. Then, we compute the entries for each cell by either using \(1/4 - 1/4 * exp(-v*4/3)\) for off diagonal entries and \(1/4 + 3/4 * exp(-v*4/3)\) for entries along the diagonal.

/**
 * Calculate the transition probabilities.
 *
 * The transition probabilities are computed as
 *  for off diagonals:      1/4 - 1/4 * exp(-v*4/3)
 *  for diagonals:          1/4 + 3/4 * exp(-v*4/3)
 * Where v = t * r
 *
 * @param   time            The branch time (t) for which to compute the transition probabilities (bl = t * r).
 * @param   rate            The branch rate (r) for which to compute the transition probabilities (bl = t * r).
 * @return                  The transition probability matrix.
 */
std::vector<std::vector<double> > RateMatrix_JC::calculateTransitionProbabilities(double time, double rate) const
{

    size_t num_states = 4;

    std::vector<std::vector<double> > P = std::vector<std::vector<double> >(num_states, std::vector<double>(num_states, 0.0));

    double v = rate * time;

    // calculate the transition probabilities
    double bf = 1.0 / num_states;
    double one_minus_bf = 1.0 - bf;
    double p_ii = bf + one_minus_bf * exp(-v/one_minus_bf);
    double p_ij = bf - bf * exp(-v/one_minus_bf);
    for (size_t i=0; i<num_states; i++)
    {
        P[i][i] = p_ii;
        for (size_t j=i+1; j<num_states; j++)
        {
            P[i][j] = p_ij;
            P[j][i] = p_ij;
        }

    }

    return P;
}

This concludes our rate matrix class. That wasn’t too bad :) This could actually be sufficient for computing the phylogenetic likelihood function and this simple model. Let’s have a look at the transition probabilities before we proceed.

Section 5.4: Testing your rate matrix

// create a transition rate matrix
RateMatrix_JC *my_rate_matrix = new RateMatrix_JC();
std::vector<std::vector<double> > P;
P = my_rate_matrix->calculateTransitionProbabilities( 0.0, 1.0 );
for (size_t i=0; i<4; i++)
{
    for (size_t j=0; j<4; j++)
        std::cout << P[i][j] << "  ";
    std::cout << std::endl;
}
P = my_rate_matrix->calculateTransitionProbabilities( 1.0, 1.0 );
for (size_t i=0; i<4; i++)
{
    for (size_t j=0; j<4; j++)
        std::cout << P[i][j] << "  ";
    std::cout << std::endl;
}
P = my_rate_matrix->calculateTransitionProbabilities( 10.0, 1.0 );
for (size_t i=0; i<4; i++)
{
    for (size_t j=0; j<4; j++)
        std::cout << P[i][j] << "  ";
    std::cout << std::endl;
}

Don’t forget to also include the new classes that we just wrote today. Thus, add the two include statements at the beginning of the main.cpp file.

#include "RateMatrix_JC.h"

Now compile and run your BabyBayes. What are the transition probabilities for the different branch lengths?

Section 5.5: Simulating DNA Sequences

In the previous section, we covered some of the most basic assumptions made in a phylogenetic analysis; DNA sequences are assumed to evolve on a phylogenetic tree (with branch lengths) under a continuous-time Markov model of DNA substitution. The substitution process is assumed to be independent across sites. These assumptions apply to maximum likelihood, Bayesian inference, and distance methods (when the distances are ‘corrected’ under some evolutionary model).

In this section, we will test our knowledge obtained in the previous section by simulating DNA sequences on a phylogenetic tree. Simulation is often used in phylogenetics. Simulation has been used to elucidate the statistical properties of different phylogenetic methods, and it can be used to generate the null distribution of a test statistic in phylogenetic hypothesis testing. In other words, learning to simulate DNA sequences is not a wasted effort. Not only can you strengthen your intuition of phylogenetic methods, but you may also be able to apply simulation for your own research.

How exactly should one simulate evolution on a phylogeneitc tree? One basic point is that an alignment should be simulated on a site-by-site basis. That is, we first simulate the data at the first site, then the second site, and so on. We can take this approach because of the assumption of independence of the substitution process across sites; to simulate the data at a particular site (column in the alignment), we don’t need to know the results of the simulation at any other site.

Another basic point is that we must know all of the parameters of the simulation: we have to decide on the precise phylogenetic tree on which to simulate the DNA sequences; we need to know the branch lengths on this tree; and, finally, we must pick a substitution model. The substitution model is a matrix of rates, specifying the rate of change from one nucleotide to another. In other words, we are taking a God-like view of the situation. We know everything about the evolutionary history and process. Of course, in reality we never know everything about how organisms evolved, but must make strong assumptions about how evolution occurred in order to estimate (make educated guesses) at the underlying evolutionary history. However, pretending to be a God, even for a little while, is a great feeling.

In the following, we will evolve DNA sequences on the four-taxon tree shown in Figure . We will also assume that DNA substitution occurs according to the JC69 model.

Now, we are ready to simulate data on the tree of Figure . We will go over two different methods for simulating data, each of which takes advantage of our knowledge of continuous-time Markov chains.

The first simulation method takes advantage of our knowledge of the stationary distribution and of our ability to calculate transition probabilities. There are only three different lengths of branches on our model tree (0.1, 0.2, and 0.3). The transition probabilities are \[ \begin{array}{cc} {\mathbf P}(0.10) = \left( \begin{array}{rrrr} 0.919 & 0.018 & 0.056 & 0.006 \\ 0.024 & 0.934 & 0.012 & 0.029 \\ 0.113 & 0.018 & 0.863 & 0.006 \\ 0.025 & 0.086 & 0.012 & 0.877 \\ \end{array} \right) & {\mathbf P}(0.20) = \left( \begin{array}{rrrr} 0.851 & 0.035 & 0.100 & 0.011 \\ 0.047 & 0.876 & 0.023 & 0.052 \\ 0.201& 0.035 & 0.750 & 0.011 \\ 0.047 & 0.156 & 0.023 & 0.771 \\ \end{array} \right) \end{array} \] \[ {\mathbf P}(0.30) = \left( \begin{array}{rrrr} 0.795 & 0.051 & 0.135 & 0.017 \\ 0.069 & 0.824 & 0.034 & 0.071 \\ 0.270 & 0.051 & 0.659 & 0.017 \\ 0.069 & 0.214 & 0.034 & 0.681 \\ \end{array} \right) \] Instead of drawing exponential random variables, and generating the process continuously across the entire tree (see the exercise section), our simulation jumps from node to node on the tree. First, we generate the nucleotide at the root of the tree (the first split leading to all four taxa) by drawing from the stationary distribution. Then, we use the transition probabilities to simulate from one end to the other of each branch on the tree. We start from the root of the tree, and simulate up the tree to progressively higher branches until we have simulated a nucleotide at each tip of the tree.

Section 5.5.1: Definition of the PhyloCTMC class

Create a new file called PhyloCTMC.h and write the following class definition.

#ifndef PhyloCTMC_h
#define PhyloCTMC_h

#include "RandomNumberGenerator.h"

#include <stdio.h>
#include <vector>

// forward declaration(s)
class Alignment;
class RateMatrix_JC;
class Tree;
class TreeNode;

/**
 * \class PhyloCTMC
 *
 * \brief A Continuous Time Markov chain (CTMC) process for simulating alignments and computing probabilities.
 *
 * This class represents a CTMC process along a phylogeny.
 * This class contains the main probability an likelihood computation function.
 *
 *
 * \author Sebastian Höhna
 *
*/
class PhyloCTMC {

public:
    PhyloCTMC(const Tree* phy, const RateMatrix_JC* q);
    virtual                    ~PhyloCTMC();

    double                      lnProbability(void) const ;
    Alignment*                  rv(RandomNumberGenerator *rng);
    void                        setValue(const Alignment *x);

private:
    void                        simulateAlignmentRecursively( std::vector<std::vector<char> >& m,
                                                              const std::vector<char>& ps,
                                                              const TreeNode* n,
                                                              RandomNumberGenerator *rng );

    const Tree*                 phylogeny;
    const RateMatrix_JC*        Q;
    const Alignment*            value;

};

#endif /* PhyloCTMC_h */

Section 5.5.2: Implementation of the PhyloCTMC class

First, add the include statements for all our header files for other classes that we need.

#include "PhyloCTMC.h"

#include "Alignment.h"
#include "RateMatrix_JC.h"
#include "Tree.h"
#include "TreeNode.h"

#include <cmath>

/**
 * Constructor of the PhyloCTMC.
 *
 * @param   phy         The phylogeny parameter.
 * @param   q           The rate matrix parameter.
 */
PhyloCTMC::PhyloCTMC(const Tree* phy, const RateMatrix_JC* q) :
    phylogeny( phy ),
    Q( q )
{

}

/**
 * Destructor of the pure birth process.
 */
PhyloCTMC::~PhyloCTMC( void )
{
    // we do not own the rate parameter nor the value, so we have nothing to delete
}

Here we also add a stub of the probability computation. However, we will actually add the content next week because it is slightly more involved.

/**
 * Compute the log-probability of the values given the CTMC process.
 *
 * More soon ...
 *
 * @return              Returns the log-probability.
 */
double PhyloCTMC::lnProbability( void ) const
{
    // initialize the log probability
    double ln_prob = 0.0;

    // we will add code here next week

    // return the computed log probability
    return ln_prob;
}

The important method for today is the simulation routine for sequence alignments.

/**
 * Draw a random alignment from a CTMC process along the phylogeny.
 *
 * @param   rng         The random number generator used for the simulation.
 * @return              The random alignment.
 */
Alignment* PhyloCTMC::rv(RandomNumberGenerator *rng)
{
    // we assume a rooted phylogeny with 2*n-1 nodes
    size_t num_taxa  = (phylogeny->getNumberOfNodes()+1) / 2;
    size_t num_sites = 100;

    // create an empty alignment
    Alignment* a = new Alignment( num_taxa, num_sites );
    std::vector<std::vector<char> > data_matrix = std::vector<std::vector<char> >( num_taxa, std::vector<char>(num_sites, '?') );

    // now we want to start our simulation for the root node
    // so we need to get the root node first
    const TreeNode* root = phylogeny->getRootNode();
    // create and empty sequence
    std::vector<char> root_sequence = std::vector<char>(num_sites, '?');
    // and then simulate each character from the prior
    // the prior under the model are the stationary frequencies,
    // which are all 1/4 for the Jukes-Cantor model
    for ( size_t i=0; i<num_sites; i++ )
    {
        double u = rng->uniform01();
        if ( u < 0.25 )
        {
            root_sequence[i] = 'A';
        }
        else if ( u < 0.5 )
        {
            root_sequence[i] = 'C';
        }
        else if ( u < 0.75 )
        {
            root_sequence[i] = 'G';
        }
        else
        {
            root_sequence[i] = 'T';
        }
    }

    // simulate the sequence along the left and right branches of the root
    simulateAlignmentRecursively( data_matrix, root_sequence, root->getLeftChild(), rng );
    simulateAlignmentRecursively( data_matrix, root_sequence, root->getRightChild(), rng );

    // collect the taxon names
    // we start with an empty vector
    std::vector<std::string> names = std::vector<std::string>(num_taxa, "");
    // then we get all nodes on ask for the names of the tips
    const std::vector<TreeNode*>& nodes = phylogeny->getNodes();
    for (size_t i=0; i<num_taxa; i++)
    {
        names[i] = nodes[i]->getName();
    }

    // fill the alignment with data
    a->setTaxonNames( names );
    a->setMatrix( data_matrix );


    // and return the new alignment
    return a;
}

/**
 * Draw a random alignment from a CTMC process along the pylogeny.
 *
 * @param   rng         The random number generator used for the simulation.
 */
void PhyloCTMC::simulateAlignmentRecursively( std::vector<std::vector<char> >& data, const std::vector<char>& parent_sequence, const TreeNode* node, RandomNumberGenerator *rng )
{

    size_t num_sites = parent_sequence.size();

    // create and empty sequence
    std::vector<char> sequence = std::vector<char>(num_sites, '?');

    double time = node->getBranchLength();
    std::vector<std::vector<double> > P = Q->calculateTransitionProbabilities(time, 1.0);

    // and then simulate each character using the transition probabilities
    for ( size_t i=0; i<num_sites; i++ )
    {
        // first, we need to get the transition probabilities for this starting character
        std::vector<double> this_transition_probability;

        switch (parent_sequence[i])
        {
            case 'A':
                this_transition_probability = P[0];
                break;

            case 'C':
                this_transition_probability = P[1];
                break;

            case 'G':
                this_transition_probability = P[2];
                break;

            case 'T':
                this_transition_probability = P[3];
                break;

            default:
                break;
        }

        // simulate the
        double u = rng->uniform01();
        if ( u < this_transition_probability[0] )
        {
            sequence[i] = 'A';
        }
        else if ( u < (this_transition_probability[0]+this_transition_probability[1]) )
        {
            sequence[i] = 'C';
        }
        else if ( u < (this_transition_probability[0]+this_transition_probability[1]+this_transition_probability[2]) )
        {
            sequence[i] = 'G';
        }
        else
        {
            sequence[i] = 'T';
        }
    }

    // finally, we need to set the sequence into the data matrix (if this node is a tip node)
    // or continue with the recursive simulation if this is an interior node
    if ( node->isTip() )
    {
        data[ node->getIndex() ] = sequence;
    }
    else
    {
        simulateAlignmentRecursively( data, sequence, node->getLeftChild(), rng );
        simulateAlignmentRecursively( data, sequence, node->getRightChild(), rng );
    }


}

Subsection 5.5.3: Testing your Alignment Simulator

RandomNumberGenerator *rng = new RandomNumberGenerator();

PhyloCTMC ctmc = PhyloCTMC( &trees[0], my_rate_matrix );
Alignment *sim_align = ctmc.rv(rng);
sim_align->print(std::cout);

Don’t forget to also include the new classes that we just wrote today. Thus, add the two include statements at the beginning of the main.cpp file.

#include "PhyloCTMC.h"

Now compile and run your BabyBayes. How does your simulated alignment look? How many times did you observe the patter ‘AAAA’, ‘CCCC’, ‘GGGG’, ‘TTTT’? Compare with your neighbor. How many times did you observe the patter ‘AACC’, ‘CCTT’, ‘GGTT’? Perhaps you want to write some code to check this, and the frequency of all \(4^4\) possible patterns.

Section 5.6: Exercises

Subsection 5.6.1: The Felsenstein 1981 Model

In our example, we used the very simple Jukes-Cantor substitution model. In your first exercise, implement the Felsenstein 1981 substitution model. This F81 model has unequal stationary frequencies. The resulting Q matric looks like. \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} 1 - \pi_A & \pi_C & \pi_G & \pi_T \\ \pi_A & 1 - \pi_C & \pi_G & \pi_T \\ \pi_A & \pi_C & 1 - \pi_G & \pi_T \\ \pi_A & \pi_C & \pi_G & 1 - \pi_T \\ \end{array} \right) \] You can compute the transition probabilities using the following equations: \[ \beta = 1/(1-\pi_A^2-\pi_C^2-\pi_G^2-\pi_T^2) \] and \[ {\mathbf P}_{ij}(\nu) = \begin{cases} e^{-\beta\nu}+\pi_j\left(1- e^{-\beta\nu}\right) & \mbox{ if } i = j \\ \pi_j\left(1- e^{-\beta\nu}\right) & \mbox{ if } i \neq j \end{cases} \] Implement this rate matrix and compute transition probabilities for a range of branch lengths.

Subsection 5.6.2: The Hasegawa-Kishino-Yano Model

Similar to the JC69 and F81 models, we can develop the HKY85 substitution process. The rate matrix is \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} {*} & \pi_C & \kappa\pi_G & \pi_T \\ \pi_A & {*} & \pi_G & \kappa\pi_T \\ \kappa\pi_A & \pi_C & {*} & \pi_T \\ \pi_A & \kappa\pi_C & \pi_G & {*} \\ \end{array} \right) \]

If we express the branch length, ’‘ν’’ in terms of the expected number of changes per site then: \[ \begin{eqnarray} \beta = & \frac{1}{2(\pi_A + \pi_G)(\pi_C + \pi_T) + 2\kappa[(\pi_A\pi_G) + (\pi_C\pi_T)]} \\ P_{AA}(\nu,\kappa,\pi) = & \left[\pi_A\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) + \pi_G e^{-(1 + (\pi_A + \pi_G)(\kappa - 1.0))\beta\nu}\right]/(\pi_A + \pi_G)\\ P_{AC}(\nu,\kappa,\pi) = & \pi_C\left(1.0 - e^{-\beta\nu}\right)\\ P_{AG}(\nu,\kappa,\pi) = & \left[\pi_G\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) - \pi_Ge^{-(1 + (\pi_A + \pi_G)(\kappa - 1.0))\beta\nu}\right] /\left(\pi_A + \pi_G\right) \\ P_{AT}(\nu,\kappa,\pi) = & \pi_T\left(1.0 - e^{-\beta\nu}\right) \end{eqnarray} \]

Subsection 5.6.3: Using Substitution Rates to Simulate Alignments

The second method only relies on our ability to generate exponentially distributed random numbers. If we generate a uniform random number on the interval (0,1), we can generate an exponential random number (with parameters \(\lambda\)) using the transformation \(t = -{1 \over \lambda} \log_e(u)\) (where \(u\) is the uniform random number and \(t\) is the exponentially distributed random number). This second method involves an addition to the tree of Figure~ which seems unusual: We take the tree of Figure~, and add a ‘tail’ to it —a branch that extends for some distance from the root of the tree. In this case, the branch at the root of the tree is \(v_0 = 10.0\) in length. Moreover, we assume that the process is in state (nucleotide) A at the very root of the tree. The situation we have is like that shown in Figure~.

We simulate the process starting at the root of the tree. The process is in state A, meaning that the only relevant row of the rate matrix is the first one: \[ {\mathbf Q} = \{q_{ij}\} = \left( \begin{array}{cccc} -0.886 & 0.190 & 0.633 & 0.063 \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \end{array} \right) \] We wait an exponentially distributed amount of time with parameter \(\lambda = 0.886\) until the next substitution occurs. When the substitution occurs, it is to a \(C\), \(G\), or \(T\) with probabilities \({0.190 \over 0.886} = 0.214\), \({0.633 \over 0.886} = 0.714\), and \({0.063 \over 0.886} = 0.072\), respectively. In the first section, we used this method for simulating along a single branch of a tree. Here we apply the method with vigor, applying it to each branch in the tree from the root to the tips. We continue to simulate up the root branch of the tree until our simulation exceeds the length of the branch. We then record the nucleotide state the process was in when it exceeded a length of 10. We write this state at the end of the root branch, where it splits into branches 1 and 6. We then repeat the simulation process for branch 1 and then branch 6, recording the state the process is in at the end of those two branches. We then concentrate our attention on branches 4 and 5, and then on branches 2 and 3. At the end, we should have nucleotides at the ends of branches 1, 2, 3, and 4.

One puzzling aspect of this simulation is why we always start the process in nucleotide A, and why we even bothered to add the tail to the root of the tree. We did this because for this method, we assume ignorance of the stationary distribution of the rate matrix. If this is the case, we can use our understanding of the rate matrix as specifying waiting times between substitutions to complete our simulation.
Hence, we always start our simulations in a particular nucleotide (in this case we chose to start in the nucleotide A), and then simulate the process for a long time along the root (tail) branch of the tree. The hope is that if we make the length of the tail branch long enough, that the process is at stationarity by the time it reaches the first split in the tree (the speciation event that eventually produces the four species at the tips of the tree).

This method relies on the idea that we can come pretty near to stationarity with a moderately long branch. We know that the stationary distribution of the HKY85 process of nucleotide substitution with the specific parameters we chose is \(\pi_A = 0.4\), \(\pi_C = 0.3\), \(\pi_G = 0.2\), \(\pi_T = 0.1\). We also know that the transition probability for a branch of \(v = 10.0\) is \[ {\mathbf P}(10.0) = \left( \begin{array}{rrrr} 0.401 & 0.299 & 0.200 & 0.099 \\ 0.399 & 0.301 & 0.199 & 0.100 \\ 0.401 & 0.299 & 0.200 & 0.099 \\ 0.399 & 0.301 & 0.199 & 0.100 \\ \end{array} \right) \] The transition probability matrix tells us that if we start in nucleotide A, then we end up in state A, C, G, and T with probabilities 0.401, 0.299, 0.200, and 0.099, respectively. These numbers are very close to the actual stationary probabilities, so perhaps this method is not such a bad one.