Chapter 4: Simulating Trees

Section 4.1: The random number generator

In this chapter, we need a random number generator. As it turns out, computers can’t generate randomness on their own. All they can do is logical operations and math. So, how do we make randomness from one of the most deterministic tools made by man? The answer is that we don’t. Rather, we generate sequences of numbers that in many respects behave randomly. To be more correct, I will modify my original statement: In this chapter, we are going to make a pseudo random number generator.

The sequence of pseudorandom numbers is completely determined by the initial value, called the seed. If we initialize the pseudorandom number generator with the same seed, we will get the same sequence out. If the sequence of random numbers revisits the value for the seed, it will repeat. The period of the pseudorandom number is the length of this repeat. Good pseudorandom number generators will have a very long period. Moreover, the values will be difficult to distinguish from truly random numbers. For example, they shouldn’t be correlated with previous values.

You should be wary of using pseudorandom number generators. There is an entire sub-discipline in computer science dedicated to developing better pseudorandom numbers.

We will implement a simple linear congruential generator described by (Park and Miller 1988). It is not a state-of-the-art pseudorandom number generator, but it will give you an idea of how apparent randomness can be generated on a computer. It should be good enough for the applications in this book.

A linear congruential generator starts with an initial value of an int variable called the seed. The seed value is changed sequentially using the following equation: $s_{i+1} = (a \times s_i + c) \mod m$ where $$s_i$$ is the $$i$$th value of the seed, $$a$$ is the multiplier, $$c$$ is the increment, and $$m$$ is the modulus. Good pseudorandom number generators that use the linear congruential generator choose $$a$$, $$c$$, and $$m$$ wisely. The seed value for a good choice of $$a$$, $$c$$, and $$m$$ will bounce between 1 and the maximum value possible in a way that has the appearance of randomness.

Subsection 4.1.1: The RandomNumberGenerator.h file

Let us start, as usual, by creating the definition of the random number generator. Create a new file called RandomNumberGenerator.h. In this file, write the following class definition.

#ifndef RandomNumberGenerator_h
#define RandomNumberGenerator_h

#include <stdio.h>

/**
* \class RandomNumberGenerator
*
* \brief Draw a random number between 0 and 1.
*
* The random number generator draws values in the range between 0 and 1.
*
*
*
* \author Sebastian Höhna
*
*/
class RandomNumberGenerator {

public:

RandomNumberGenerator(void);
RandomNumberGenerator(int x);

// Regular functions
double                              uniform01(void);

private:
int                                 seed;

};

#endif /* RandomNumberGenerator_h */

Overall, this class definition is fairly simple. We have two constructors, one without a user-provided seed and one with a user-provided seed. The seed variable is stored privately in this class. The only actual function which the random number generator provides is the uniform01 method. This method, as its name suggests, draws (pseudo-)random numbers in the range between 0 and 1. This is all that is usually needed. As we see later in this chapter, we can transform uniform01 random variables into random draws from an exponential distribution.

Subsection 4.1.2: The RandomNumberGenerator.cpp file

Now let’s implement the random number generator. Create a new file called RandomNumberGenerator.cpp.

#include "RandomNumberGenerator.h"

#include <ctime>        // needed for time()

The first include statement should be familiar to you; we include the definition of the class that we want to implement. The second include statement provides us with a function to call the system time.

Our implementation of the default constructor sets the seed variable to be equal to the current time, obtained using the time function. The variable that is returned by the time function is not an int variable, so we cannot directly equate it to seed. The time function returns a variable of type time_t. However, we can force the time_t variable to act as an int using the casting argument, which was the (int) before the time(NULL). Casting can be dangerous. You can’t always sensibly cast one variable type into another. In this case, however, you can sensibly cast a time_t variable into an int variable. All is well.

/**
* Constructor for RandomNumberGenerator class. This constructor does not take
* any parameters and initializes the seed using the current
* system time.
*
*/
RandomNumberGenerator::RandomNumberGenerator(void)
{

// initialize the seed with the system time
seed = (int)( time( 0 ) );

}

Our second constructor takes in a user-provided seed. This makes it much simpler and only set the private class variable for the seed.

/**
* Constructor for RandomNumberGenerator class which takes as a
* parameter an integer with the user-supplied seed for the
* random number generator.
*
* @param   x           is an integer with the user-supplied random number seed.
*/
RandomNumberGenerator::RandomNumberGenerator(int x)
{

seed = x;

}

Now we turn to the only really interesting function: the uniform01. The code for pseudorandom number generators is always disturbing to look at. The non-randomness of a pseudorandom number generator sort of smacks you in the face when you see the code. You should be feeling a little queasy at this point.

Clearly, the seed variable is changed every time the uniform01 function is called. The number that is returned is the new value for the seed divided by its maximum possible value. We use casting, again, to force the compiler to treat seed as a double (real-valued number) when the division occurs in the last line. If we didn’t do this, the function would return 0 every time.

/**
* This function generates a uniformly-distributed random variable on the interval [0,1).
* It is a version of simple linear congruential generator.
*
*
* @return                  Returns a uniformly-distributed random variable on the interval [0,1).
*
*/
double RandomNumberGenerator::uniform01(void)
{

// Returns a pseudo-random number between 0 and 1.
int hi = seed / 127773;
int lo = seed % 127773;
int test = 16807 * lo - 2836 * hi;
if (test > 0)
seed = test;
else
seed = test + 2147483647;

return (double)(seed) / (double)2147483647;

}

You now finished writing your own random number generator. You should explore this random number generator a bit more and try the exercises.

Section 4.2: The pure birth process

The birth-death process of cladogenesis is widely-used in evolutionary biology to model the Tree of Life. The process models the splitting (speciation) and termination (extinction) of lineages through time. The probability of a speciation event occurring in an infinitesimal interval of time, $$\Delta t$$, is $$\lambda \Delta t$$ whereas the probability of an extinction event occurring in the same infinitesimal interval of time is $$\mu \Delta t$$. This implies that the time between speciation or extinction events is exponentially distributed with parameter $$\lambda + \mu$$. When an event occurs, it is a speciation with probability $$\frac{\lambda}{\lambda + \mu}$$ or an extinction with probability $$\frac{\mu}{\lambda + \mu}$$.

Birth-death processes are used for two main purposes in evolutionary biology: (1) they are used to estimate speciation and extinction rates to study macroevolutionary questions. (2) they are used as prior distributions on phylogenetic trees in Bayesian phylogenetic analyses. Here, we use the birth-death process for the second scenario as a prior distribution.

To make things a bit simpler in the beginning, we will start with the pure birth process, sometime also called the Yule process. The pure birth process is a birth-death process where the extinction (or death) rate is equal to zero. Extinction events are not allowed to happen. This makes simulation and probability computation a fair bit easier. We have added all information to change the code and write a birth-death process as an exercise.

Subsection 4.2.1: The PureBirthProcess.h file

Create a new file called PureBirthProcess.h. In this file, we will add the following definition of the pure birth process.

#ifndef PureBirthProcess_h
#define PureBirthProcess_h

#include "RandomNumberGenerator.h"

#include <stdio.h>
#include <vector>

// forward declaration(s)
class Tree;

/**
* \class PureBirthProcess
*
* \brief A pure birth process for simulating trees and computing probabilities.
*
* This class represents a pure birth process where birth/speciation events
* occur after an exponential distributed waiting time.
* There are no death/extinction events.
* The two main functions in this class are the rv function to simulate trees
* and the lnProbability function to compute the log-probability of a given tree.
*
*
* \author Sebastian Höhna
*
*/
class PureBirthProcess {

public:
PureBirthProcess(const double *b);
virtual                    ~PureBirthProcess();

double                      lnProbability(void) const ;
Tree*                       rv(RandomNumberGenerator *rng, double age);
void                        setValue(const Tree *x);

protected:

const double*               birth_rate;
const Tree*                 value;

};

#endif /* PureBirthProcess_h */

Our class has two member variables which are both constant pointers. A constant pointer written in this way, e.g., double*, means that the value at the memory address cannot be changeg. We still can replace the pointer with a new one. We usually use constant pointers to signal that we do not own the pointer. We are going to need values written at the memory addresses of these pointers, but we are not going to change the values. Our model class will be responsible for the memory and changing the values.

Here, the three important function are lnProbability, rv and setValue. We will cover each one of them in detail in the implementation.

Subsection 4.2.2: The PureBirthProcess.cpp file

Create a new file called PureBirthProcess.cpp. Start by adding the include statements.

#include "PureBirthProcess.h"

#include "Tree.h"
#include "TreeNode.h"

#include <cmath>

Next, we will add the constructor. This constructor will simply set the birth rate variable but otherwise will do nothing.

/**
* Constructor of the PureBirthProcess.
*
* @param   n           The taxon names used for the tips.
* @param   b           The birth rate parameter.
*/
PureBirthProcess::PureBirthProcess(const double *b) :
birth_rate( b )
{

}

As good practice, we will also add the destructor although it is not actually doing anything. In this case here, it is nice to have the destructor because it makes it clear that the member variables which are pointers do not point to memory that this class needs to clean up.

/**
* Destructor of the pure birth process.
*/
PureBirthProcess::~PureBirthProcess( void )
{
// we do not own the rate parameter nor the value, so we have nothing to delete
}

The probability of a tree under the pure birth process is computed by two parts: (1) computing the probability of the waiting time until the next event, and (2) computing the probability density of the actual events, the birth events. As we mentioned in the beginning of this chapter, waiting times under the birth-death process are exponentially distributed. Since we have here only birth events, the waiting time is simply exponentiall distributed. So we can say that the probability along a branch is computed by $$e^{-\lambda t}$$ where $$\lambda$$ is the birth rate. The probability density of the birth events themselves is $$\lambda$$.

Putting this together, we need to compute the waiting time along each branch, and the probability $$\lambda$$ for each birth event. That’s all!

/**
* Compute the log-probability of the values given the birth rate.
*
* The algorithm here is to get all branches of the tree and compute the
* probability of nothing happening along the branch and the
* speciation/birth event at the end of the branch.
*
* @return              Returns the log-probability.
*/
double PureBirthProcess::lnProbability( void ) const
{
// initialize the log probability
double ln_prob = 0.0;

// for simplicity of use, store the birth-rate as a local double variable
double l = *birth_rate;

// get all the nodes of the tree
const std::vector<TreeNode*> &nodes = value->getNodes();

// iterate over all the nodes to get the branch lengths
for (size_t i=0; i<nodes.size(); i++)
{
// get the current node
TreeNode *node = nodes[i];

// make sure that it is not the root node
if ( node->isRoot() == false )
{
// get the current branch length
double bl = node->getBranchLength();
// add the log-probability of no event along the branch (-l*bl)
ln_prob -= l * bl;
// the log-probability density for the event itself: log(l)
if ( node->isTip() == false )
{
ln_prob += std::log(l);
}
}

} // end for loop over all nodes

// return the computed log probability
return ln_prob;
}

Our second function that is more intricate is the simulator under the pure birth process. The idea of the algorithm is that we start at some time in the past and keep on simulating until we reach the present time, denoted by $$age = 0$$. At the root, we will start with two nodes, the two children of the root node. We will keep track of the the active nodes, which are the current lineages alive in the simulation that do not have descendants. Since each of the lineages has the same rate until an event occurs, we can sum the rates together by using $$num\_nodes * \lambda$$. Then, we need to consider how the exponential distribution works for drawing random waiting times.

The exponential probability distribution accurately models the waiting time until an event occurs when that event occurs at a constant rate, $$\lambda$$. The exponential probability distribution is a continuous distribution with density function, $$f(t) = \lambda e^{-\lambda t}$$, for $$t > 0, \lambda > 0$$. How can we generate an exponentially-distributed random number?

The idea is to transform the uniform(0,1) random variable into an exponential($$\lambda$$) random variable. We know that if we integrate over all possible values of $$t$$, that the overall probability is one ({}, the exponential is a probability distribution): $\int_0^{\infty} \lambda e^{-\lambda x} dx = 1$ The key to transforming our uniform into an exponential is to generate a uniform(0,1) random variable called $$u$$ and set the integral equal to that value: $\int_0^{t} \lambda e^{-\lambda x} dx = u$ We can solve for $$t$$ as follows: $\begin{eqnarray*} \int_0^{t} \lambda e^{-\lambda x} dx & = & u \\ -e^{-\lambda x} \Big|_0^t & = & u \\ -e^{-\lambda t} - -e^{0} & = & u \\ 1 - -e^{-\lambda t} & = & u \\ e^{-\lambda t} & = & 1 - u \\ -\lambda t & = & \ln(1 - u) \\ t & = & -\frac{1}{\lambda} \ln(1-u) \end{eqnarray*}$ Here, $$t$$ is an exponential($$\lambda$$) random variable. It was obtained by transforming our uniform(0,1) random variable, $$u$$. Equivalently, we could use the equation $$t = -\frac{1}{\lambda} \ln(u)$$.

So let’s see how this plays together.

/**
* Draw a random tree from a birth death process.
*
* @param   rng         The random number generator used for the simulation.
* @return              The random tree.
*/
Tree* PureBirthProcess::rv(RandomNumberGenerator *rng, double age)
{
// prepare some variables, such as the birth rate
double l = *birth_rate;

// now we need some helper vectors with the active and inactive nodes
std::vector<TreeNode*> active;
std::vector<TreeNode*> inactive;

// we also start our simulation with the root node
// this means that we create the root node and the two children of it
TreeNode *root = new TreeNode();
TreeNode *left = new TreeNode();
left->setParent( root );
root->setLeftChild( left );
TreeNode *right = new TreeNode();
right->setParent( root );
root->setRightChild( right );

// now we also know that the age of the root was the age for the simulation
root->setAge( age );

// add the two children of the root to the active list and the root to the inactive list
active.push_back(left);
active.push_back(right);
inactive.push_back(root);

// now we simulate until we reached the present time (a time of 0)
// so we initialize the current time variable with the age of the tree
double current_time = age;
while ( current_time > 0.0 )
{
// randomly draw a new speciation event
double rate = l*active.size();
double u = rng->uniform01();
double next_time = current_time + (1.0/rate) * std::log(u);

// make sure that the next speciation time was not in the future
if ( next_time < 0 )
{
break;
}

// randomly pick a node that speciated
size_t index = rng->uniform01() * active.size();
TreeNode *node = active[index];

// set the age of the speciation event
node->setAge( next_time );

// add new descendants to the node
TreeNode *left = new TreeNode();
left->setParent( node );
node->setLeftChild( left );
TreeNode *right = new TreeNode();
right->setParent( node );
node->setRightChild( right );

// remove the current node and add it's two children to our active list
active.erase( active.begin() + index );
active.push_back( left );
active.push_back( right );

// for later use, store the current node
inactive.push_back( node );

// update the current time with the time of the next birth event
current_time = next_time;

} // end while loop until the end of time

// get how many tips we had at the end of the simulation
size_t num_taxa = active.size();

// set the indices of the simulated
for ( size_t i=0; i<inactive.size(); i++ )
{
TreeNode *node = inactive[i];
node->setIndex( 2*num_taxa-2-i );
}

// now also create some dummy tip names
for ( size_t i=0; i<num_taxa; i++ )
{

// randomly pick a node
size_t index = rng->uniform01() * active.size();
TreeNode *node = active[index];
// don't forget to erase it from our list
active.erase( active.begin() + index );

// set the index of this tip
node->setIndex( i );

// and create an arbitrary name for it
node->setName( "Tip " + std::to_string(i) );
}

// finally, we create our tree object by passing in the root node
Tree *t = new Tree( root );

// and return the new tree
return t;
}

This it it. You have now written a simulator for a pure birth process!

Section 4.4: Putting it together

RandomNumberGenerator *rng = new RandomNumberGenerator();

// simulate a starting tree
double *birth_rate = new double(1.0);
PureBirthProcess *tree_prior = new PureBirthProcess( birth_rate );
Tree *my_tree = tree_prior->rv(rng, 10);
std::cout << my_tree->getNewickRepresentation() << std::endl;

Don’t forget to also include the new classes that we just wrote today. Thus, add the two include statements at the beginning of the main.cpp file.

#include "PureBirthProcess.h"
#include "RandomNumberGenerator.h"

Also, make sure to update the Makefile to include all files:

SRC       = Alignment.cpp AlignmentReader.cpp TreeNode.cpp Tree.cpp NewickTreeReader.cpp RandomNumberGenerator.cpp PureBirthProcess.cpp main.cpp
OBJECTS   = Alignment.o AlignmentReader.o TreeNode.o Tree.o NewickTreeReader.o RandomNumberGenerator.o PureBirthProcess.o main.o

Now compile and run your BabyBayes. How do the trees look when you simulate them? When you run BabyBayes again, do you get the same trees?

Section 4.5: Exercises

Subsection 4.5.1: How random are the random numbers

Write a new main.cpp and simulate 10,000 random uniform01. Write each number in a single line of a file. Then, create a histogram in R from these random numbers. Does this histogram show a uniform distribution?

Subsection 4.5.2: Implement another random number generator

There are more complex random number generators than the one we used. Let us play around with another one, a version of the Marsaglia Multi-Carry. You need to change your random number generator to use two seeds: seed1 and seed2. You can initialize both seeds with

unsigned int x = (unsigned int)( time( 0 ) );
seed1 = x & 0xFFFF;
seed2 = x >> 16;

Then, change the code of the uniform01 function as follows.

/**
* This function generates a uniformly-distributed random variable on the interval [0,1).
* It is a version of the Marsaglia Multi-Carry.
*
* Taken from:
*   Mathlib : A C Library of Special Functions
*   Copyright (C) 2000, 2003  The R Development Core Team
*
* This random generator has a period of 2^60, which ensures it has the maximum
* period of 2^32 for unsigned ints (32 bit ints).
*
* @return                  Returns a uniformly-distributed random variable on the interval [0,1).
*
* @see http://stat.fsu.edu/~geo/diehard.html
*/
double RandomNumberGenerator::uniform01(void)
{

// Returns a pseudo-random number between 0 and 1.
seed1 = 36969 * (seed1 & 0177777) + (seed1 >> 16);
seed2 = 18000 * (seed2 & 0177777) + (seed2 >> 16);
return (((seed1 << 16)^(seed2 & 0177777)) * 2.328306435996595e-10) +  2.328306437080797e-10 ;
}

Create 10,000 or 100,000 draws from this random number generator. Which one looks more uniform?

Subsection 4.5.3: Implement a birth-death process

We all know that extinction events happen. So we should try to make our models as realistic as possible, which means we should include death events into our tree prior and simulator. There are two parts for this. First, you need to adapt the simulation routine. Remember that the time between speciation or extinction events is exponentially distributed with parameter $$\lambda + \mu$$. When an event occurs, it is a speciation with probability $$\frac{\lambda}{\lambda + \mu}$$ or an extinction with probability $$\frac{\mu}{\lambda + \mu}$$. When an speciation event occurs, you can proceed as before. When an extinction event occurs, you should simply place the affected node into the inactive list and set the age for the node.

You should make sure that the process does not go extinct in your simulation. So the while loop should stop if there are no active lineages left; everyone went extinct. It might be fruitful to prune of all extinct lineages, but this should be an extra exercise.

Second, you also need to adapt the probability compution. We have derived for you the probability of the waiting time under the birth-death process: $$D(t) = \frac{4 e^{-(\lambda - \mu) t}}{(1 + \frac{\lambda + \mu}{\lambda - \mu} + e^{-(\lambda - \mu)t}(1 - \frac{\lambda + \mu}{\lambda - \mu}))^2}$$ To use this equation along a branch with begin age $$t_b$$ and end age $$t_e$$ where $$t_b > t_e$$, you need to compute $$\frac{D(t_b)}{D_e}$$. The rest can remain the same!

Simulate and plot some trees under a birth-death process!

References

Park, S. K., and K. W. Miller. 1988. Random number generators: Good ones are hard to find. Communications of the ACM 31:1192–1201.