In our previous chapters, especially in Chapter 2 when we implemented the Alignment class, we left out one method that we need today. The specific method that we need is one that compute the initial likelihoods for a given nucleotide character. For example, the nucleotide ‘A’ should give a vector of [1,0,0,0] whereas the nucleotide ‘T’ should return [0,0,0,1]. Furthermore, we code ambiguous states, such as ‘?’, by saying that if this nucleotide were actually a ‘C’, then with probability one we would have called it a ‘?’. Thus, a ‘?’ returns [1,1,1,1] but a ‘R’ which stands for a purine, coding an ‘A’ or ‘G’, returns a [1,0,1,0].
Add this method the the Alignment.h file.
Also, add the implementation to the file Alignment.cpp.
/**
*
* Get the states for a nucleotide represented as a character
*
* This function initializes a vector, nuc[MAX_NUM_STATES]. The four elements
* of nuc correspond to the four nucleotides in alphabetical order.
* We are assuming that the nucCode is a binary representation of
* the nucleotides that are consistent with the observation. For
* example, if we observe an A, then the nucCode is 1 and the
* function initalizes nuc[0] = 1 and the other elements of nuc
* to be 0.
*
* Observation nucCode nuc
* A 1 1000
* C 2 0100
* G 4 0010
* T 8 0001
* R 5 1010
* Y 10 0101
* M 3 1100
* K 12 0011
* S 6 0110
* W 9 1001
* H 11 1101
* B 14 0111
* V 7 1110
* D 13 1011
* N - ? 15 1111
*
*/
std::vector<int> Alignment::getNucleotideStates (char nuc) const
{
std::vector<int> states(4,0);
switch ( nuc )
{
case 'A':
states[0] = 1;
states[1] = 0;
states[2] = 0;
states[3] = 0;
break;
case 'C':
states[0] = 0;
states[1] = 1;
states[2] = 0;
states[3] = 0;
break;
case 'M':
states[0] = 1;
states[1] = 1;
states[2] = 0;
states[3] = 0;
break;
case 'G':
states[0] = 0;
states[1] = 0;
states[2] = 1;
states[3] = 0;
break;
case 'R':
states[0] = 1;
states[1] = 0;
states[2] = 1;
states[3] = 0;
break;
case 'S':
states[0] = 0;
states[1] = 1;
states[2] = 1;
states[3] = 0;
break;
case 'V':
states[0] = 1;
states[1] = 1;
states[2] = 1;
states[3] = 0;
break;
case 'T':
states[0] = 0;
states[1] = 0;
states[2] = 0;
states[3] = 1;
break;
case 'W':
states[0] = 1;
states[1] = 0;
states[2] = 0;
states[3] = 1;
break;
case 'Y':
states[0] = 0;
states[1] = 1;
states[2] = 0;
states[3] = 1;
break;
case 'H':
states[0] = 1;
states[1] = 1;
states[2] = 0;
states[3] = 1;
break;
case 'K':
states[0] = 0;
states[1] = 0;
states[2] = 1;
states[3] = 1;
break;
case 'D':
states[0] = 1;
states[1] = 0;
states[2] = 1;
states[3] = 1;
break;
case 'B':
states[0] = 0;
states[1] = 1;
states[2] = 1;
states[3] = 1;
break;
case '?':
case '-':
case 'N':
states[0] = 1;
states[1] = 1;
states[2] = 1;
states[3] = 1;
}
return states;
}
This was a lot of basic copy and paste. Now we can go back to our PhyloCTMC class.
We need to add some new member variables.
mutable std::vector<size_t> active_likelihood_index;
mutable LikelihoodVector conditional_likelihoods;
size_t num_states;
The mutable keyword says that you are allowed this change this variable even in a constant method. In that sense, it allows you to cheat. You should use this keyword only if it is really necessary.
Here we use the mutable keyword because we will fill the conditional likelihoods while we compute the probability, which is done by the function call to lnProbability, which is a constant function. We could instead make the function lnProbability non-constant.
Next, you might have noticed that the conditional_likelihoods are of type LikelihoodVector. This is not a standard C++ type; I have simply made it up similar to when I define a new class. So what is this the LikelihoodVector?
You have to declare this type
It is simpler to use a LikelihoodVector than using a vector of a vector of a vector of a vector of doubles. But they are in fact identical. So it is just a different name that we define.
void computeLnLikelihoodRecursively(const TreeNode* n) const;
std::vector<std::vector<double> > computeTransitionProbabilityMatrix(const TreeNode* n) const;
void initializeConditionalLikelihoods(void);
Let’s continue with the implementation which is in the file PhyloCTMC.cpp. Make sure to initialize num_states( 4 ) in the constructor.
/**
* Constructor of the PhyloCTMC.
*
* @param phy The phylogeny parameter.
* @param q The rate matrix parameter.
*/
PhyloCTMC::PhyloCTMC(const Tree* phy, const RateMatrix_JC* q, double* cr) :
phylogeny( phy ),
Q( q ),
clock_rate( cr ),
num_states( 4 )
{
}
The main computations are performed in the recursive computation of the likelihoods.
/**
* Recursively compute the ln likelihood for the given tree and parameters.
* Specifically, we compute the likelihood along the given branch, but make sure that we first compute likelihood for the two descendant branches
*/
void PhyloCTMC::computeLnLikelihoodRecursively(const TreeNode *node) const
{
// to compute the likelihood, we need to get the likelihoods of our children first
// so get the two children
// this is safe even for tip nodes because the children will be NULL for tips
const TreeNode* left_child = node->getLeftChild();
const TreeNode* right_child = node->getRightChild();
// now check of this is a tip node
if ( left_child != NULL && right_child != NULL )
{
// if not, we need to compute the likelihoods along the two descendant branches
// let's start with the indices of the nodes
size_t node_index = node->getIndex();
size_t left_child_index = left_child->getIndex();
size_t right_child_index = right_child->getIndex();
// and recursively call this method to compute the likelihoods on the child branches
computeLnLikelihoodRecursively( left_child );
computeLnLikelihoodRecursively( right_child );
// now we can get the likelihood values, as stored in the conditional likelihood vectors
const std::vector< std::vector<double> >& cl_left = conditional_likelihoods[active_likelihood_index[left_child_index] ][left_child_index ];
const std::vector< std::vector<double> >& cl_right = conditional_likelihoods[active_likelihood_index[right_child_index]][right_child_index];
std::vector< std::vector<double> >& cl_node = conditional_likelihoods[active_likelihood_index[node_index] ][node_index ];
// don't forget that these represent the likelihoods at the beginning of the branches
// therefore, we need to get two transition probability matrices
std::vector<std::vector<double> > transition_probabilities_left = std::vector<std::vector<double> >();
std::vector<std::vector<double> > transition_probabilities_right = std::vector<std::vector<double> >();
// now fill the transition probability matrices
transition_probabilities_left = computeTransitionProbabilityMatrix(left_child);
transition_probabilities_right = computeTransitionProbabilityMatrix(right_child);
// iterate over all sites/columns of the alignment
size_t num_sites = value->getNumberOfSites();
for (size_t c=0; c<num_sites; c++)
{
// get the condition likelihood for this specific site
const std::vector<double>& cl_left_site = cl_left [c];
const std::vector<double>& cl_right_site = cl_right[c];
std::vector<double>& cl_node_site = cl_node [c];
// if you were using among-site-rate-variation,
// then you need to iterate over all rate categories here
// create two variables for the sum of probabilities for the left and right child
double sum_left = 0.0, sum_right = 0.0;
// iterate over all possible starting states
for (size_t from=0; from<num_states; from++)
{
// make sure that the sums are always 0.0 at the start
sum_left = 0.0;
sum_right = 0.0;
// and then iterate over all end states
for (size_t to=0; to<num_states; to++)
{
// we add the probability that we started in i and terminated in j
// multiplied with the probability that the child had state j
sum_left += transition_probabilities_left [from][to] * cl_left_site [to];
sum_right += transition_probabilities_right[from][to] * cl_right_site[to];
} // end for loop over all end states
// to compute the likelihood of state i at this node,
// we need to multiply the likelihoods that we started with state i
// for both descendant branches
cl_node_site[from] = sum_left * sum_right;
} // end for loop over all starting states
} // end for loop over all sites
} // end-if this was a not a tip node
}
Mostly for convenience, we break apart the transition probability computation in its own method. However, this makes our implementation of the Felsenstein’s pruning algorithm more generic because we can simply change the transition probability computation elsewhere.
/**
* Compute the transition probability matrix for a given branch.
*
* @param node The node/branch for which we need to compute the transition probability matrix.
* @return A matrix, represented as a vector of a vector, of the transition probabilities.
*/
std::vector<std::vector<double> > PhyloCTMC::computeTransitionProbabilityMatrix(const TreeNode *node) const
{
// most importantly, we need to get the length of the branch to represent the time of evolution
double branch_length = node->getBranchLength();
// we use a global clock rate
// this should be a variable to be estimated.
double c = 1.0;
if ( clock_rate != NULL)
{
c = *clock_rate;
}
// here is where you would introduce the site rates
double rate = c;
return Q->calculateTransitionProbabilities(branch_length, rate);
}
You might have noticed that we never used tip states so far. This is done when we initialize the conditional likelihood vector.
/**
* Initialize the vectors of conditional likelihoods
*/
void PhyloCTMC::initializeConditionalLikelihoods( void )
{
// for convenience, get a reference to the alignment so that we don't need to work with a pointer
const Alignment& alignment = *value;
// how many nodes and sites do we have?
size_t n_nodes = 2 * alignment.getNumberOfTaxa()-1;
size_t n_sites = alignment.getNumberOfSites();
// allocate conditional likelihoods
conditional_likelihoods = std::vector< std::vector<std::vector< std::vector< double > > > >( 2,
std::vector< std::vector< std::vector< double> > >( n_nodes,
std::vector< std::vector< double> >(n_sites,
std::vector< double >( num_states, 0) ) ) );
// initialize the tip conditional likelihoods
for (int i=0; i<alignment.getNumberOfTaxa(); i++)
{
std::vector< std::vector<double> >& cl_0 = conditional_likelihoods[0][i];
std::vector< std::vector<double> >& cl_1 = conditional_likelihoods[1][i];
for (int j=0; j<n_sites; j++)
{
std::vector<double>& cl_0_site = cl_0[j];
std::vector<double>& cl_1_site = cl_1[j];
char nuc = alignment.getNucleotide(i, j);
std::vector<int> states = alignment.getNucleotideStates(nuc);
// if we were using among-site-rate variation,
// then we need to iterate here over the rate category too
for (int s=0; s<num_states; s++)
{
cl_0_site[s] = (double)states[s];
cl_1_site[s] = (double)states[s];
}
}
}
// also, allocate the vector of which likelihoods are active
active_likelihood_index = std::vector<size_t>(n_nodes,0);
}
Finally, we can put the Felsenstein’s pruning algorithm together to compute the joint probability of all sites, which is our probability of observing the alignment.
/**
* Compute the log-probability of the values given the CTMC process.
*
* The main idea for computing the probability of the alignment is to use a recursive version of Felsenstein's pruning algorithm.
* We also assume that each site is i.i.d., and therefore multiply the probabilities for all sites together.
*
* @return Returns the log-probability.
*/
double PhyloCTMC::lnProbability( void ) const
{
// initialize the log probability
double ln_prob = 0.0;
// for the recursive computation, we start with the root node.
const TreeNode *root = phylogeny->getRootNode();
size_t root_index = root->getIndex();
// start the computation of the conditional likelihoods recursively
computeLnLikelihoodRecursively( root );
// get the prior probabilities for the root states
// in the Jukes-Cantor model these are equal probabilities.
// Otherwise, they are given by the stationary frequencies,
// which should be a parameter here too then
std::vector<double> f = std::vector<double>(num_states, 1.0/num_states);
// const std::vector<double>& f = *stationary_frequencies;
// now get the conditional likelihoods at the root node
const std::vector< std::vector<double> >& cl_node = conditional_likelihoods[active_likelihood_index[root_index] ][root_index];
// double category_prob = 1.0 / num_gamma_categories;
// iterate over all sites
size_t num_sites = value->getNumberOfSites();
for (size_t c=0; c<num_sites; c++)
{
// compute the probability for this site
double site_prob = 0.0;
// get the conditional likelihood for this specific site
const std::vector<double>& cl_site = cl_node[c];
// and compute the weighted sum
// if we allow for among-site-rate-variation, we need to sum over the rate categories here too
// and weight them by the prior for the rate category (usually assumed to be equal -> 1/n)
for (size_t j=0; j<num_states; j++)
{
site_prob += cl_site[j] * f[j];
}
// add the log-probability for this site to our overall probability
ln_prob += log(site_prob);
}
// return the computed log probability
return ln_prob;
}
Last time we didn’t add the setValue method, so we need to it now. Importantly, when the value/alignment has been set, we can initialize the condition likelihoods.
Finally, you can put everything together and test it in your main function. You should write the following lines into your main.cpp file.
// read in some trees
NewickTreeReader tree_reader = NewickTreeReader();
std::vector<Tree> trees = tree_reader.readTrees("primates.tre");
Tree* my_tree = new Tree(trees[0]);
// create a transition rate matrix
RateMatrix_JC *my_rate_matrix = new RateMatrix_JC();
// create a transition rate matrix
double *my_clock_rate = new double(1.0);
PhyloCTMC ctmc = PhyloCTMC( my_tree, my_rate_matrix, my_clock_rate );
AlignmentReader reader;
Alignment* my_alignment = new Alignment( reader.readPhylip( "primates.phy" ) );
ctmc.setValue( my_alignment );
double likelihood = ctmc.lnProbability();
std::cout << "ln(likelihood) = " << likelihood << std::endl;
delete my_alignment;
delete my_rate_matrix;
delete my_tree;
Replace the Jukes-Cantor substitution matrix in the PhyloCTMC class and use the F81 substitution model instead. Don’t forget to add another parameter to the PhyloCTMC class for the stationary frequencies, we should be the same parameter as used in the F81 class.
In our first implementation, we assumed that all sites evolve under the exactly same rate of evolution. However, this is often a terrible assumption. For example, nucleotides at the first codon position are often more conserved than nucleotides at the third codon position. Therefore, we should allow for some rate variation across sites. Commonly, researchers assume that rates for sites are drawn from a gamma distribution. To make computations easier, we actually integrate over this distribution by discretization.
In your code, you can compute the rate categories using
double q = (cat_index+0.5) / double(num_gamma_categories);
double site_rate = Statistics::Gamma::quantile(alpha, alpha, q);
For this computation you need to use the two additional class Math and Statistics. I provided these because they include a lot of technical details that are not of primary interest here.
You also need to change the code to use one additional layer in the conditional likelihood vectors. Generally speaking, you need to have one likelihood vector for each of the rate categories. Thus, you should initialize the conditional likelihoods with
conditional_likelihoods = std::vector< std::vector< std::vector<std::vector< std::vector< double > > > > >( 2,
std::vector< std::vector< std::vector< std::vector< double> > > >( n_nodes,
std::vector< std::vector< std::vector< double> > >(n_sites,
std::vector< std::vector< double > >( num_gamma_categories,
std::vector< double >( num_states, 0) ) ) ) );
Make the other changes that you iterate over the rate categories where indicate.