The Phylogenetic Model

In our previous chapters, especially in Chapter 2 when we implemented the Alignment class, we left out one method that we need today. The specific method that we need is one that compute the initial likelihoods for a given nucleotide character. For example, the nucleotide ‘A’ should give a vector of [1,0,0,0] whereas the nucleotide ‘T’ should return [0,0,0,1]. Furthermore, we code ambiguous states, such as ‘?’, by saying that if this nucleotide were actually a ‘C’, then with probability one we would have called it a ‘?’. Thus, a ‘?’ returns [1,1,1,1] but a ‘R’ which stands for a purine, coding an ‘A’ or ‘G’, returns a [1,0,1,0].

std::vector<int>                    getNucleotideStates(char nuc) const;

/**
 *
 *  Get the states for a nucleotide represented as a character
 *
 *   This function initializes a vector, nuc[MAX_NUM_STATES]. The four elements
 *   of nuc correspond to the four nucleotides in alphabetical order.
 *   We are assuming that the nucCode is a binary representation of
 *   the nucleotides that are consistent with the observation. For
 *   example, if we observe an A, then the nucCode is 1 and the
 *   function initalizes nuc[0] = 1 and the other elements of nuc
 *   to be 0.
 *
 *   Observation    nucCode        nuc
 *        A            1           1000
 *        C            2           0100
 *        G            4           0010
 *        T            8           0001
 *        R            5           1010
 *        Y           10           0101
 *        M            3           1100
 *        K           12           0011
 *        S            6           0110
 *        W            9           1001
 *        H           11           1101
 *        B           14           0111
 *        V            7           1110
 *        D           13           1011
 *        N - ?       15           1111
 *
 */
std::vector<int> Alignment::getNucleotideStates (char nuc) const
{

    std::vector<int> states(4,0);

    switch ( nuc )
    {
        case 'A':
            states[0] = 1;
            states[1] = 0;
            states[2] = 0;
            states[3] = 0;
            break;
        case 'C':
            states[0] = 0;
            states[1] = 1;
            states[2] = 0;
            states[3] = 0;
            break;
        case 'M':
            states[0] = 1;
            states[1] = 1;
            states[2] = 0;
            states[3] = 0;
            break;
        case 'G':
            states[0] = 0;
            states[1] = 0;
            states[2] = 1;
            states[3] = 0;
            break;
        case 'R':
            states[0] = 1;
            states[1] = 0;
            states[2] = 1;
            states[3] = 0;
            break;
        case 'S':
            states[0] = 0;
            states[1] = 1;
            states[2] = 1;
            states[3] = 0;
            break;
        case 'V':
            states[0] = 1;
            states[1] = 1;
            states[2] = 1;
            states[3] = 0;
            break;
        case 'T':
            states[0] = 0;
            states[1] = 0;
            states[2] = 0;
            states[3] = 1;
            break;
        case 'W':
            states[0] = 1;
            states[1] = 0;
            states[2] = 0;
            states[3] = 1;
            break;
        case 'Y':
            states[0] = 0;
            states[1] = 1;
            states[2] = 0;
            states[3] = 1;
            break;
        case 'H':
            states[0] = 1;
            states[1] = 1;
            states[2] = 0;
            states[3] = 1;
            break;
        case 'K':
            states[0] = 0;
            states[1] = 0;
            states[2] = 1;
            states[3] = 1;
            break;
        case 'D':
            states[0] = 1;
            states[1] = 0;
            states[2] = 1;
            states[3] = 1;
            break;
        case 'B':
            states[0] = 0;
            states[1] = 1;
            states[2] = 1;
            states[3] = 1;
            break;
        case '?':
        case '-':
        case 'N':
            states[0] = 1;
            states[1] = 1;
            states[2] = 1;
            states[3] = 1;
    }

    return states;
 }

This was a lot of basic copy and paste. Now we can go back to our PhyloCTMC class.

mutable std::vector<size_t>             active_likelihood_index;
mutable LikelihoodVector                conditional_likelihoods;
size_t                                  num_states;

The mutable keyword says that you are allowed this change this variable even in a constant method. In that sense, it allows you to cheat. You should use this keyword only if it is really necessary.

Here we use the mutable keyword because we will fill the conditional likelihoods while we compute the probability, which is done by the function call to lnProbability, which is a constant function. We could instead make the function lnProbability non-constant.

Next, you might have noticed that the conditional_likelihoods are of type LikelihoodVector. This is not a standard C++ type; I have simply made it up similar to when I define a new class. So what is this the LikelihoodVector?

typedef std::vector< std::vector< std::vector< std::vector<double> > > > LikelihoodVector;

It is simpler to use a LikelihoodVector than using a vector of a vector of a vector of a vector of doubles. But they are in fact identical. So it is just a different name that we define.

void                                    computeLnLikelihoodRecursively(const TreeNode* n) const;
std::vector<std::vector<double> >       computeTransitionProbabilityMatrix(const TreeNode* n) const;
void                                    initializeConditionalLikelihoods(void);

Let’s continue with the implementation which is in the file PhyloCTMC.cpp. Make sure to initialize num_states( 4 ) in the constructor.

/**
 * Constructor of the PhyloCTMC.
 *
 * @param   phy         The phylogeny parameter.
 * @param   q           The rate matrix parameter.
 */
PhyloCTMC::PhyloCTMC(const Tree* phy, const RateMatrix_JC* q, double* cr) :
    phylogeny( phy ),
    Q( q ),
    clock_rate( cr ),
    num_states( 4 )
{

}

The main computations are performed in the recursive computation of the likelihoods.

/**
 * Recursively compute the ln likelihood for the given tree and parameters.
 * Specifically, we compute the likelihood along the given branch, but make sure that we first compute likelihood for the two descendant branches
 */
void PhyloCTMC::computeLnLikelihoodRecursively(const TreeNode *node) const
{

    // to compute the likelihood, we need to get the likelihoods of our children first
    // so get the two children
    // this is safe even for tip nodes because the children will be NULL for tips
    const TreeNode* left_child    = node->getLeftChild();
    const TreeNode* right_child   = node->getRightChild();

    // now check of this is a tip node
    if ( left_child != NULL && right_child != NULL )
    {
        // if not, we need to compute the likelihoods along the two descendant branches
        // let's start with the indices of the nodes
        size_t node_index           = node->getIndex();
        size_t left_child_index     = left_child->getIndex();
        size_t right_child_index    = right_child->getIndex();

        // and recursively call this method to compute the likelihoods on the child branches
        computeLnLikelihoodRecursively( left_child );
        computeLnLikelihoodRecursively( right_child );

        // now we can get the likelihood values, as stored in the conditional likelihood vectors
        const std::vector< std::vector<double> >& cl_left     = conditional_likelihoods[active_likelihood_index[left_child_index] ][left_child_index ];
        const std::vector< std::vector<double> >& cl_right    = conditional_likelihoods[active_likelihood_index[right_child_index]][right_child_index];
              std::vector< std::vector<double> >& cl_node     = conditional_likelihoods[active_likelihood_index[node_index]       ][node_index       ];

        // don't forget that these represent the likelihoods at the beginning of the branches
        // therefore, we need to get two transition probability matrices
        std::vector<std::vector<double> > transition_probabilities_left   = std::vector<std::vector<double> >();
        std::vector<std::vector<double> > transition_probabilities_right  = std::vector<std::vector<double> >();

        // now fill the transition probability matrices
        transition_probabilities_left    = computeTransitionProbabilityMatrix(left_child);
        transition_probabilities_right   = computeTransitionProbabilityMatrix(right_child);

        // iterate over all sites/columns of the alignment
        size_t num_sites = value->getNumberOfSites();
        for (size_t c=0; c<num_sites; c++)
        {
            // get the condition likelihood for this specific site
            const std::vector<double>& cl_left_site  = cl_left [c];
            const std::vector<double>& cl_right_site = cl_right[c];
                  std::vector<double>& cl_node_site  = cl_node [c];

            // if you were using among-site-rate-variation,
            // then you need to iterate over all rate categories here

            // create two variables for the sum of probabilities for the left and right child
            double sum_left = 0.0, sum_right = 0.0;

            // iterate over all possible starting states
            for (size_t from=0; from<num_states; from++)
            {
                // make sure that the sums are always 0.0 at the start
                sum_left    = 0.0;
                sum_right   = 0.0;

                // and then iterate over all end states
                for (size_t to=0; to<num_states; to++)
                {
                    // we add the probability that we started in i and terminated in j
                    // multiplied with the probability that the child had state j
                    sum_left   += transition_probabilities_left [from][to] * cl_left_site [to];
                    sum_right  += transition_probabilities_right[from][to] * cl_right_site[to];
                } // end for loop over all end states

                // to compute the likelihood of state i at this node,
                // we need to multiply the likelihoods that we started with state i
                // for both descendant branches
                cl_node_site[from] = sum_left * sum_right;

            } // end for loop over all starting states


        } // end for loop over all sites

    } // end-if this was a not a tip node

}

Mostly for convenience, we break apart the transition probability computation in its own method. However, this makes our implementation of the Felsenstein’s pruning algorithm more generic because we can simply change the transition probability computation elsewhere.

/**
 * Compute the transition probability matrix for a given branch.
 *
 * @param   node                  The node/branch for which we need to compute the transition probability matrix.
 * @return              A matrix, represented as a vector of a vector, of the transition probabilities.
 */
std::vector<std::vector<double> > PhyloCTMC::computeTransitionProbabilityMatrix(const TreeNode *node) const
{
    // most importantly, we need to get the length of the branch to represent the time of evolution
    double branch_length = node->getBranchLength();

    // we use a global clock rate
    // this should be a variable to be estimated.
    double c = 1.0;
    if ( clock_rate != NULL)
    {
        c = *clock_rate;
    }

    // here is where you would introduce the site rates
    double rate = c;

    return Q->calculateTransitionProbabilities(branch_length, rate);
}

You might have noticed that we never used tip states so far. This is done when we initialize the conditional likelihood vector.

/**
 * Initialize the vectors of conditional likelihoods
 */
void PhyloCTMC::initializeConditionalLikelihoods( void )
{

    // for convenience, get a reference to the alignment so that we don't need to work with a pointer
    const Alignment& alignment = *value;

    // how many nodes and sites do we have?
    size_t n_nodes          = 2 * alignment.getNumberOfTaxa()-1;
    size_t n_sites          = alignment.getNumberOfSites();


    // allocate conditional likelihoods
    conditional_likelihoods = std::vector< std::vector<std::vector< std::vector< double > > > >( 2,
                                std::vector< std::vector< std::vector< double> > >( n_nodes,
                                    std::vector< std::vector< double> >(n_sites,
                                        std::vector< double >( num_states, 0) ) ) );

    // initialize the tip conditional likelihoods
    for (int i=0; i<alignment.getNumberOfTaxa(); i++)
    {
        std::vector< std::vector<double> >& cl_0 = conditional_likelihoods[0][i];
        std::vector< std::vector<double> >& cl_1 = conditional_likelihoods[1][i];
        for (int j=0; j<n_sites; j++)
        {
            std::vector<double>& cl_0_site = cl_0[j];
            std::vector<double>& cl_1_site = cl_1[j];
            char nuc = alignment.getNucleotide(i, j);
            std::vector<int> states = alignment.getNucleotideStates(nuc);

            // if we were using among-site-rate variation,
            // then we need to iterate here over the rate category too
            for (int s=0; s<num_states; s++)
            {
                cl_0_site[s] = (double)states[s];
                cl_1_site[s] = (double)states[s];
            }
        }
    }

    // also, allocate the vector of which likelihoods are active
    active_likelihood_index = std::vector<size_t>(n_nodes,0);
}

Finally, we can put the Felsenstein’s pruning algorithm together to compute the joint probability of all sites, which is our probability of observing the alignment.

/**
 * Compute the log-probability of the values given the CTMC process.
 *
 * The main idea for computing the probability of the alignment is to use a recursive version of Felsenstein's pruning algorithm.
 * We also assume that each site is i.i.d., and therefore multiply the probabilities for all sites together.
 *
 * @return              Returns the log-probability.
 */
double PhyloCTMC::lnProbability( void ) const
{
    // initialize the log probability
    double ln_prob = 0.0;

    // for the recursive computation, we start with the root node.
    const TreeNode *root = phylogeny->getRootNode();
    size_t root_index = root->getIndex();

    // start the computation of the conditional likelihoods recursively
    computeLnLikelihoodRecursively( root );

    // get the prior probabilities for the root states
    // in the Jukes-Cantor model these are equal probabilities.
    // Otherwise, they are given by the stationary frequencies,
    // which should be a parameter here too then
    std::vector<double> f = std::vector<double>(num_states, 1.0/num_states);
    //    const std::vector<double>& f = *stationary_frequencies;

    // now get the conditional likelihoods at the root node
    const std::vector< std::vector<double> >& cl_node = conditional_likelihoods[active_likelihood_index[root_index] ][root_index];

    //    double category_prob = 1.0 / num_gamma_categories;
    // iterate over all sites
    size_t num_sites = value->getNumberOfSites();
    for (size_t c=0; c<num_sites; c++)
    {
        // compute the probability for this site
        double site_prob = 0.0;

        // get the conditional likelihood for this specific site
        const std::vector<double>& cl_site = cl_node[c];

        // and compute the weighted sum
        // if we allow for among-site-rate-variation, we need to sum over the rate categories here too
        // and weight them by the prior for the rate category (usually assumed to be equal -> 1/n)

        for (size_t j=0; j<num_states; j++)
        {
            site_prob += cl_site[j] * f[j];
        }

        // add the log-probability for this site to our overall probability
        ln_prob += log(site_prob);
    }

    // return the computed log probability
    return ln_prob;
}

Last time we didn’t add the setValue method, so we need to it now. Importantly, when the value/alignment has been set, we can initialize the condition likelihoods.

void PhyloCTMC::setValue(const Alignment *x)
{
    value = x;

    initializeConditionalLikelihoods();
}

Finally, you can put everything together and test it in your main function. You should write the following lines into your main.cpp file.

// read in some trees
NewickTreeReader tree_reader = NewickTreeReader();
std::vector<Tree> trees = tree_reader.readTrees("primates.tre");
Tree* my_tree = new Tree(trees[0]);

// create a transition rate matrix
RateMatrix_JC *my_rate_matrix = new RateMatrix_JC();

// create a transition rate matrix
double *my_clock_rate = new double(1.0);

PhyloCTMC ctmc = PhyloCTMC( my_tree, my_rate_matrix, my_clock_rate );

AlignmentReader reader;
Alignment* my_alignment = new Alignment( reader.readPhylip( "primates.phy" ) );
ctmc.setValue( my_alignment );

double likelihood = ctmc.lnProbability();

std::cout << "ln(likelihood) = " << likelihood << std::endl;

delete my_alignment;
delete my_rate_matrix;
delete my_tree;

Section 6.xxx: Exercises

Section 6.xxx.1: Compute the likelihood under the F81 substitution model

Replace the Jukes-Cantor substitution matrix in the PhyloCTMC class and use the F81 substitution model instead. Don’t forget to add another parameter to the PhyloCTMC class for the stationary frequencies, we should be the same parameter as used in the F81 class.

Section 6.xxx.2: Add among site rate variation using a discretize Gamma distribution

In our first implementation, we assumed that all sites evolve under the exactly same rate of evolution. However, this is often a terrible assumption. For example, nucleotides at the first codon position are often more conserved than nucleotides at the third codon position. Therefore, we should allow for some rate variation across sites. Commonly, researchers assume that rates for sites are drawn from a gamma distribution. To make computations easier, we actually integrate over this distribution by discretization.

double q = (cat_index+0.5) / double(num_gamma_categories);
double site_rate = Statistics::Gamma::quantile(alpha, alpha, q);

For this computation you need to use the two additional class Math and Statistics. I provided these because they include a lot of technical details that are not of primary interest here.

You also need to change the code to use one additional layer in the conditional likelihood vectors. Generally speaking, you need to have one likelihood vector for each of the rate categories. Thus, you should initialize the conditional likelihoods with

    conditional_likelihoods = std::vector< std::vector< std::vector<std::vector< std::vector< double > > > > >( 2,
                                std::vector< std::vector< std::vector< std::vector< double> > > >( n_nodes,
                                    std::vector< std::vector< std::vector< double> > >(n_sites,
                                        std::vector< std::vector< double > >( num_gamma_categories,
                                            std::vector< double >( num_states, 0) ) ) ) );

Make the other changes that you iterate over the rate categories where indicate.

The Phylogenetic Model

Chapter 6: Computing the Likelihood

Section 6.1: Felsenstein’s pruning algorithm

Section 6.xxx: Exercises

Section 6.xxx.1: Compute the likelihood under the F81 substitution model

Section 6.xxx.2: Add among site rate variation using a discretize Gamma distribution

References