# Chapter 3: Trees

## Section 3.1: Background information

### Subsection 3.1.1: The basic idea of representing phylogenetic trees in computer memory Figure 3.1. The nodes of a tree are objects, each of which contains information on its neighbors. The boxes represent the node object and the numbers next to each box, such as 0x01 represent the memory address of each object.

Figure 3.1a shows an example of a phylogenetic tree of three species. The species are named Sp1, Sp2, and Sp3. The branches of the tree have lengths in some unit, for example, in terms of expected number of substitutions per site or time. The branch lengths are also indicated on the tree. A tree is a complex thing consisting of nodes and branches. The nodes are represented by the circles on the tree of Figure 3.1a whereas the branches are the lines between the nodes. The tree of Figure 3.1a is rooted; there is a direction to tree with some nodes occurring before others. (The terminology we use to describe trees as consisting of nodes and branches is one that comes from the field of evolutionary biology. Mathematicians and computer scientists, however, see a tree as a type of graph, which has vertices and edges, which are equivalent to nodes and branches, respectively.) One of the nodes, the one furthest from the tips of the tree, is denoted the root node.

How can we represent such a complicated object in computer memory? The idea is to represent a tree as a series of nodes, each of which has information on its neighbors. Figure 3.1b illustrates the basic idea. Each node on the tree is represented as a box in Figure 3.1b, but would be an object in C++. The address of each node is indicated next to each node in Figure 3.1b. You can see that the ancestor of the node with address 0x01, which represents Sp1, is the node with address 0x04, which happens to be the ancestor of Sp1 and Sp2.

Do you see why we need to use pointers here? It is because we need to point to the exact same variable multiple times. For example, both tip nodes Sp1 and Sp2 point to their ancestor/parent, the node at the memory address 0x04. Furthermore, the root node has this node as its left child and thus points to the memory address 0x04 too.

This connection between multiple object is best achieved with pointers. However, using pointers come with more burden. We need to allocated and deallocated the memory ourselves. We will explain more about allocation and deallocation below.

The general strategy we will take in this chapter is to build up the machinery we need to represent and manipulate trees in computer memory. We will start with the implementation of a TreeNode class, which will represent the nodes of the tree (i.e., the boxes of Figure 3.1b). We will also implement a Tree class that will manage the nodes and allow us to manipulate the tree.

### Subsection 3.1.2: The Newick tree format

The Newick tree format was designed and adopted by several of the early developers of phylogeny software packages — J. Felsenstein, D. Swofford, F. J. Rohlf, C. Meacham, J. Archie, and W. Maddison — at Newick’s restaurant in Durhan, New Hampshire, on June 24, 1986. They wanted a standardized format describing trees so that any of their programs could read trees produced by another program. The meeting at the restaurant was motivated by conversations at the 1986 Evolution Meetings (the joint meetings of the Society of the Systematic Biologists, the American Society of Naturalists, and the Society for the Study of Evolution). They hunkered down to work at Newick’s restaurant while the meeting was underway. The Newick tree format was the result. According to D. Swofford, the meal at Newick’s restaurant was a good one.

The Newick format represents trees using parentheses. For example, the tree of Figure 3.1a can be represented as
((Sp1, Sp2), Sp3)
This format has several advantages: a single tree can be written to a plain text file on a single line; trees can be directly created in computer memory by reading the Newick string from left to right; and it requires few character to represent a tree.

We will use the Newick tree format here both for printing and reading trees from a file.

### Subsection 3.1.3: Working with dynamically allocated memory

In C++, we dynamically allocate memory using the new operator. This will create a dynamically allocated memory space for the variable. Dynamically in this context means that it happens while the program runs and the actual size of the memory might not be known when you compile your software. You have to free dynamically allocated memory with the command delete my_pointer.

Every class that has dynamically allocated memory as one of their member variables, must to implement these three functions:

• A copy constructor The copy constructor creates a new copy of the variable with the same value. The copy constructor is called, for example, when you use statements such as TreeNode b = a where a was some other TreeNode. The copy constructor must create deep copies of all dynamically allocated member variables. Bu default, the C++ compiler creates a default copy constructor that simply creates a shallow copy of all member variables.
• A destructor The destructor must free all dynamically allocated member variables.
• An assignment operator The assignment operator is called when you use b = a and both b and a are variables that did exist before. You must ensure that the assignment operator creates proper deep assignments (the same as the copy constructor). Bu default, the C++ compiler creates a default assignment operator that simply creates a shallow assignment of all member variables.

The reason why the deep copies are so important is that otherwise two objects point to exactly the same memory address. Which of the two is responsible for deleting the memory? If the memory is not properly deleted, then you will have memory leaks. What happens if the first deletes the memory while the second still needs it? If the memory is deleted too soon, you will have segmentation faults.

In our TreeNode class we will assume that the parent owns the children. Therefore, the parent has to create deep copies for the children and need to make sure that the children are deleted when needed.

## Section 3.2: The TreeNode class

### Subsection 3.2.1 The TreeNode.h file

We start with writing the definition of the TreeNode class. Create a new file called TreeNode.h and copy the code below into the file.

``````#ifndef TreeNode_h

#include <string>

/**
* \class TreeNode
*
* \brief The node class used in phylogenetic tree.
*
* This class represent a tree node of a phylogenetic tree.
* Here we store the main structure of the trees and all necessary methods.
*
*
* \author Sebastian Höhna
*
*/
class TreeNode {

public:
TreeNode(void);
TreeNode(const TreeNode& tn);
~TreeNode(void);

TreeNode&                           operator=(const TreeNode& tn);

double                              getAge(void) const;
double                              getBranchLength(void) const;
int                                 getIndex(void) const;
const TreeNode*                     getLeftChild(void) const;
TreeNode*                           getLeftChild(void);
const std::string&                  getName(void) const;
const TreeNode*                     getParent(void) const;
TreeNode*                           getParent(void);
const TreeNode*                     getRightChild(void) const;
TreeNode*                           getRightChild(void);
bool                                isRoot(void) const;
bool                                isTip(void) const;
void                                setAge(double a);
void                                setLeftChild(TreeNode *n);
void                                setIndex(int i);
void                                setName(const std::string& n);
void                                setParent(TreeNode* n);
void                                setRightChild(TreeNode* n);

private:

TreeNode*                           parent;
TreeNode*                           left;
TreeNode*                           right;

int                                 index;
double                              age;
std::string                         name;
};

#endif /* TreeNode_h */``````

Let’s deconstruct this class definition. First, this class contains six member variable: we have three pointers of type TreeNode*, an int, a double, and a std::string. The idea is that the TreeNode pointers will hold the memory address of the neighbors to the node. The index variable will simply be a number that will help us distinguish nodes (this will be useful later). The age variable will hold the age of the node. You can compute the branch length leading to a node by computing the difference between the age of the parent node and the age of the node itself. Finally, the string variable will hold the name of the node, if it happens to be a tip node that represents a species. Because we are using the std::string type in this class, we have to include the string header file. You can see we do this before the class definition:

``#include <string>``

We made all of the member variables for this class protected; other classes cannot directly access these variables. We allow other parts of the code to change the instance variables with public functions. Each of these functions starts with set (e.g., setParent, which sets the parent variable). Similarly, we provide functions that start with get that will return the current value of the member variables. The set and get functions are called setters and getters among computer scientists.

The last unfamiliar element of the TreeNode class is ~TreeNode(void) (although we have used in our Alignment class before too). This looks a lot like our default constructor, except for the addition of the tilde symbol. The addition of the tilde indicates that this is a destructor function. As mentioned earlier, constructor functions are called when an object is created. The destructor function, as the name implies, is called just before the object is destroyed. Constructors are useful places to carry out initialization. Destructors, by contrast, are an excellent place to carry out clean-up activities just before an object is destroyed.

### Subsection 3.2.2 The TreeNode.cpp file

Now let’s write the file that implements the TreeNode class. Create a new file called TreeNode.cpp. Add the following code.

``````#include "TreeNode.h"

#include <stdio.h>
#include <iostream>

/**
* Constructor of the tree node class.
*
* We initialize all variables with default values.
*
*/
TreeNode::TreeNode( void ) :
parent( NULL ),
left( NULL ),
right( NULL ),
index( 0 ),
age( 0.0 ),
name( "" )
{

}

/**
* Copy constructor of the tree node class.
*
* We need to make sure that we get a deep copy of the children.
*
* @param   tn              The object that we want to copy.
*/
TreeNode::TreeNode( const TreeNode& tn ) :
parent( tn.parent ),
left( NULL ),
right( NULL ),
index( tn.index ),
age( tn.age ),
name( tn.name )
{
// only make a deep copy of the left child if it exists
if ( tn.left != NULL )
{
left = new TreeNode( *tn.left );
left->setParent( this );
}
// only make a deep copy of the right child if it exists
if ( tn.right != NULL )
{
right = new TreeNode( *tn.right );
right->setParent( this );
}
}

/**
* Destructor.
*
* We need to delete the children because the parent nodes own their children.
*/
TreeNode::~TreeNode( void )
{
// delete both the left and right child.
// it's actually safe to delete a pointer even it point to NULL
delete left;
delete right;
}

/**
*
* We need to free the memorythe children and make a deep copy.
*
* @param   tn              The object that we want to assign to us.
* @return          The newly assigned object.
*/
TreeNode& TreeNode::operator=(const TreeNode& tn)
{
// first, we need to check and avoid self-assignment
if ( this != &tn )
{
// we are not assigning ourselves to us.
// so let's free the memore first before we copy the new values
delete left;
delete right;

// to be safe, set the pointer to NULL
left  = NULL;
right = NULL;

// now we can copy over the values
parent  = tn.parent;
index   = tn.index;
age     = tn.age;
name    = tn.name;

// only make a deep copy of the left child if it exists
if ( tn.left != NULL )
{
left = new TreeNode( *tn.left );
left->setParent( this );
}
// only make a deep copy of the right child if it exists
if ( tn.right != NULL )
{
right = new TreeNode( *tn.right );
right->setParent( this );
}

// we done now with our assignment
}

// we return this object now
return *this;
}

/**
* Get the age of this node.
*
* @return          Return the local variable that stores the age.
*/
double TreeNode::getAge( void ) const
{
// simply return the local age variable
return age;
}

/**
* Get the branch length of this node.
*
* The branch length is computed by the age of the parent minus my own age.
* If this is the root node, then we don't have parent, so no parent age.
* By convention, we say that the root branch has a length of 0.0.
*
* @return          Return the branch length leading to this node.
*/
double TreeNode::getBranchLength( void ) const
{
// check if this node has a parent, i.e., is not a root node.
if ( parent != NULL )
{
// we do have a parent, so it's safe to compute the branch length
// by the difference of my parent's age and my own age
return parent->getAge() - age;
}
else
{
// otherwise, we return 0.0 by convention.
return 0.0;
}
}

/**
* Get the index of this node.
*
* @return              The index for this node.
*/
size_t TreeNode::getIndex( void ) const
{
// return the local variable for the index
return index;
}

/**
* Get the left child of this node.
*
* @return              The left child.
*/
const TreeNode* TreeNode::getLeftChild( void ) const
{
// return the local variable for the left child node
return left;
}

/**
* Get the left child of this node. (non-const)
*
* @return              The left child.
*/
TreeNode* TreeNode::getLeftChild( void )
{
// return the local variable for the left child node
return left;
}

/**
* Get the taxon name of this node.
*
* @return              The name of the node.
*/
const std::string& TreeNode::getName( void ) const
{
// return the local variable for the name of this node
// tip nodes will have the taxon name stored here.
return name;
}

/**
* Get the parent of this node.
*
* @return              The parent node.
*/
const TreeNode* TreeNode::getParent( void ) const
{
// return the local variable for the parent node
return parent;
}

/**
* Get the parent of this node.  (non-const)
*
* @return              The parent node.
*/
TreeNode* TreeNode::getParent( void )
{
// return the local variable for the parent node
return parent;
}

/**
* Get the right child of this node.
*
* @return              The right child.
*/
const TreeNode* TreeNode::getRightChild( void ) const
{
// return the local variable for the right child
return right;
}

/**
* Get the right child of this node.  (non-const)
*
* @return              The right child.
*/
TreeNode* TreeNode::getRightChild( void )
{
// return the local variable for the right child
return right;
}

/**
* Is this node a root node.
*
* @return              True if this node is a root node, false otherwise.
*/
bool TreeNode::isRoot( void ) const
{
// this node is a root node if the parent node does not exist, i.e., is NULL
return parent == NULL;
}

/**
* Is this node a tip node?
*
* @return              True if both children don't exist, false otherwise.
*/
bool TreeNode::isTip( void ) const
{
// check if either child node exists, i.e., is not NULL
return left == NULL && right == NULL;
}

/**
* Set the age of this node.
*
* @param   a               The new age of the node.
*/
void TreeNode::setAge(double a)
{
// set the internal variable with the new age
age = a;
}

/**
* Set the left child of this node.
*
* @param   n               The new left child of the node.
*/
void TreeNode::setLeftChild(TreeNode* n)
{
// first, we need to free the memory
delete left;

// set the internal variable with the new left child
left = n;
}

/**
* Set the index of this node.
*
* @param   i               The new index of the node.
*/
void TreeNode::setIndex(size_t i)
{
// set the internal variable with the new index
index = i;
}

/**
* Set the name of this node.
*
* @param   n               The new name of the node.
*/
void TreeNode::setName(const std::string& n)
{
// set the internal variable with the new name
name = n;
}

/**
* Set the parent of this node.
*
* @param   n               The new parent of the node.
*/
void TreeNode::setParent(TreeNode* n)
{
// set the internal variable with the new parent
parent = n;
}

/**
* Set the right child of this node.
*
* @param   n               The new right child of the node.
*/
void TreeNode::setRightChild(TreeNode* n)
{
// first, we need to free the memory
delete right;

// set the internal variable with the new right child
right = n;
}``````

## Section 3.3: The Tree class

### Subsection 3.3.1: The Tree.h file

Now create a new file called Tree.h for the definition of our Tree class. Copy the code below into the file.

``````#ifndef Tree_h
#define Tree_h

#include <vector>
#include <string>

// forward declaration of the class TreeNode
class TreeNode;

/**
* \class Tree
*
* \brief The phylogenetic tree class.
*
* This class represent the phylogenetic tree.
* The main structure is actually stored in the TreeNode itself.
* Here we store the root node and for convenience also a vector of all nodes.
*
*
* \author Sebastian Höhna
*
*/
class Tree {

public:
Tree(TreeNode* r);
Tree(const Tree& t);
~Tree(void);

Tree&                               operator=(const Tree& t);

std::string                         getNewickRepresentation(void) const;
const std::vector<TreeNode*>        getNodes(void) const;
size_t                              getNumberOfNodes(void) const;
const TreeNode*                     getRootNode(void) const;

private:
std::string                         computeNewickRecursively(const TreeNode& n) const;
void                                fillNodesByPhylogeneticTraversal(TreeNode* node);

TreeNode*                           root;
std::vector<TreeNode*>              nodes;

};

#endif /* Tree_h */``````

In many respects, this class should look familiar to you. As you scan though the class definition, you should be comfortable with the idea that the class is defined as:

``````class Tree {

};``````

Seeing variables that are public and protected shouldn’t freak you out. The idea of a instance variable that is a pointer, such as TreeNode* root, should make you feel warm and fuzzy. That said, there are a few aspects of this class definition that are new. We will go through those one at a time.

One of the unfamiliar instance variables is:

``std::vector<TreeNode*>  nodes;``

This declares a vector of TreeNode pointers called nodes. This is the first time you have brushed shoulders with the Standard Template Library (STL) in C++. The STL is a collection of tools available to the C++ programmer that include containers, algorithms, and iterators. The vector we use in the Tree class is an example of a container. As objects, STL containers have functions associated with them. For example, the vector container has two functions that we will take advantage of: push_back and size. The push_back function adds a new thing-to-be-contained to the end of the vector object. The size function returns the number of elements in the vector. In the Tree class, the elements in the vector are TreeNode pointers. When we declare the vector object, we also indicate what type of thing will be contained in it. We do this with the < and > signs: vector<Node*>. Note that we need to include the file vector if we are to use a vector in our code. This explains the compiler directive, #include .

You will also note that before the class definition, we added the line:

``class TreeNode;``

This is called ‘forward declaration.’ We forward declare TreeNode because we use a Node pointer in the class definition. (In fact, we do this three times, once when we declare the return type of the method getRootNode, once when we declare the root variable, and again when we declare the nodes variable.) We could also have simply included the TreeNode.h file, right before or after we include the vector header,

``````#include <vector>
#include "TreeNode.h"``````

in which case we could forgo the statement class TreeNode. Both are equivalent in as much as the code will compile and run. However, in keeping with the principle of including the least amount of information necessary, the method employed here is preferable. In the class definition, we only declare TreeNode pointers (TreeNode*}). All pointers take up the same amount of memory. Essentially, all the compiler has to do when examining this file is trust that we have implemented the TreeNode class elsewhere.

I made the getNewickRepresentation function public but the computeNewickRecursively function protected. Why do you think that is? It is through the getNewickRepresentation function that other parts of the program can obtain a string with the Newick representation of the tree. This function returns a string.

### Subsection 3.3.2: The Tree.cpp file

``````#include "Tree.h"
#include "TreeNode.h"

#include <stdio.h>
#include <iostream>

/**
* Constructor of the tree class.
*
* @param   r           The root node of the tree.
*/
Tree::Tree( TreeNode* r ) :
root( r )
{

// fill our nodes by phylogenetic traversal
fillNodesByPhylogeneticTraversal( root );
}

/**
* Copy constructor of the tree class.
*
* @param   t           The tree that we want to copy.
*/
Tree::Tree( const Tree& t ) :
root( NULL )
{
// only create a deep copy if the root exists
if ( t.root != NULL )
{
// create a deep copy
root = new TreeNode( *t.root );

// fill our nodes by phylogenetic traversal
fillNodesByPhylogeneticTraversal( root );
}

}

/**
* Destructor of the tree class.
*
* Since we own the root node, we need to free it.
* This will also make sure that all nodes will be freed.
*
*/
Tree::~Tree( void )
{
// free the memory of the root node which will free the memory of all nodes recursively.
delete root;
}

/**
*
* @param   t               The tree that we want to copy.
* @return          The newly assigned object.
*/
Tree& Tree::operator=(const Tree &t)
{
// first, we need to check and avoid self-assignment
if ( this != &t )
{

// free the memory of the root node which will free the memory of all nodes recursively.
delete root;

// for safety, set the root to NULL
root = NULL;

nodes.clear();

// only create a deep copy if the root exists
if ( t.root != NULL )
{
// create a deep copy
root = new TreeNode( *t.root );

// fill our nodes by phylogenetic traversal
fillNodesByPhylogeneticTraversal( root );
}

// we done now with our assignment
}

// we return this object now
return *this;
}``````
``````/**
* Compute the newick string for this tree.
*
* This function is a recursive function, i.e., it calls itself again.
* You should always start with the root node when calling this function and then it terminates ones reaching the tips.
* The newick string should be something like
*      (((A:1.0,B:1.0):0.5,C:1.5):1.0,(D:2.0,E:2.0):2.5);
*
* @return          The newick string for the clade starting at this node.
*/
std::string Tree::computeNewickRecursively(const TreeNode &node) const
{
// initialize the newick string as an empty string
std::string newick = "";

// check if this is a tip node
if ( node.isTip() == true )
{
// this is a tip node
// so we can terminate the recursion
// we simply have to use the name of this node for the newick string
newick = node.getName();
}
else
{
// since it is not a tip node it must be an interior node with two children.
// we therefore compute the newick string for both children
// this is done with a recursive call to this function with the left and right child as an argument
std::string left    = computeNewickRecursively( *node.getLeftChild() );
std::string right   = computeNewickRecursively( *node.getRightChild() );

// we want to print the newick trees in a unique format, so we order it alphabetically
// hence, check if the newick string of the left child is alphabetically before the newick string of the right child.
// we then put together the newick strings of the two children
if ( left < right )
{
newick = "(" + left + "," + right + ")";
}
else
{
newick = "(" + right + "," + left + ")";
}
}
// finally, we need to add the branch length to this newick string
newick += ":" + std::to_string(node.getBranchLength());

// and then we return this newick string
return newick;
}

/**
* Fill the nodes vector in phylogenetic ordering.
*
* Fill the nodes vector by a phylogenetic traversal recursively starting with this node.
* The tips fill the slots 0,...,n-1 followed by the internal nodes and then the root.
*/
void Tree::fillNodesByPhylogeneticTraversal(TreeNode* node)
{

if ( node->isTip() )
{
// all the tips go to the beginning
nodes.insert(nodes.begin(), node);
}
else
{
// now call this function recursively for all your children
fillNodesByPhylogeneticTraversal( node->getLeftChild() );
fillNodesByPhylogeneticTraversal( node->getRightChild() );

// this is phylogenetic ordering so the internal nodes come last
nodes.push_back(node);
}
}

/**
* Compute the newick string for this tree.
*
* This method simply starts by traversing the tree recursively from the root.
*
* @return          The newick string for the entire tree.
*/
std::string Tree::getNewick( void ) const
{

return computeNewickRecursively( *root );
}

/**
* Get the vector with all the nodes in the tree.
*
* @return              The vector of nodes,
*/
const std::vector<TreeNode*> Tree::getNodes( void ) const
{
return nodes;
}

/**
* Get the number of nodes contained in the tree.
*
* This number should be 2*N-1 for rooted trees where N is the number of tips.
*
* @return              The number of nodes in the tree,
*/
size_t Tree::getNumberOfNodes( void ) const
{

return nodes.size();
}

/**
* Get the root node of the tree.
*
* @return              The root node.
*/
const TreeNode* Tree::getRootNode( void ) const
{

return root;
}``````

### Subsection 3.4.1: The NewickTreeReader.h file

``````#ifndef NewickTreeReader_H

#include <string>
#include <vector>

class Tree;
class TreeNode;

/**
*
* The newick tree reader provides convenience methods for reading trees in Newick format.
*
*
*
*/

public:

private:
TreeNode*                           createTreeFromNewick(const std::string& newick);
std::vector<std::string>            parseNewickString(const std::string& ns);

};

#endif``````

### Subsection 3.4.2: The NewickTreeReader.cpp file

``````#include <fstream>
#include <string>
#include <vector>
#include <stdio.h>
#include <iostream>
#include <sstream>

#include "Tree.h"
#include "TreeNode.h"

/**
* Default constructor.
*
* The default constructor does nothing except allocating the object.
*/
{

}

/**
* Read all trees in the file which should be in Newick tree format.
*
* @param   fn          The filename.
* @return              A vector with all the trees read from the fil.
*/
{

// open the file and create a file input stream
std::ifstream inFile( fn );

// check if we were successful in opening the file
if ( !inFile )
{
// the file could not be opened
std::cerr << "Could not open file \"" << fn << "\"" << std::endl;
}

// initialize the vector of trees
// we will fill these later
std::vector<Tree> trees = std::vector<Tree>();

// now loop over each line in the file
while ( inFile.good() )
{
// create a string variable that will hold the line
std::string line;

getline(inFile, line);

// skip empty lines
if (line.length() == 0)
{
continue;
}
// note that we assume here for simplicity that our Newick trees don't contain spaces.
// you may want to remove all spaces ...

// construct the tree by calling the create tree from Newick string function
TreeNode *root = createTreeFromNewick( trimmed );

// create and allocate the tree object
Tree t =  Tree( root );

// add the tree to our vector
trees.push_back( t );
}

// finally, return the vector of trees
return trees;
}``````

Parsing is a real pain in the neck.

The parser introduces a few new ideas. First of all, you can see in the conditional if statement that we use two parallel lines. This is read or. You can check for several conditions in a if statement:

• == : is the thing on the left equal to the thing on the right (equal to)
• != : is the thing on the left not equal to the thing on the right (not equal to)
• || : the statement on the left or the statement on the right is true (logical OR)
• && : the statements to the left and to the right are both true (logical AND)

Walk through this code line-by-line.

``````/**
* Create a tree from a Newick string.
*
* @param   newick      The tree as a Newick string.
* @return              The root node of the create tree.
*/
{
// break the string into its component parts
std::vector<std::string> tokens = parseNewickString(newick);

// we need some variables
// first, if the last token was a ':' so that we next read in the branch length
// we also need a pointer to the ancestor
TreeNode* ancestor  = NULL;
// and also a pointer to the root for checking that the algorithm worked correctly
TreeNode* root      = NULL;

// build up the tree from the parsed Newick string
for (int i=0; i<tokens.size(); i++)
{
// get the current token
const std::string& token = tokens[i];

// check what the token corresponds to
if (token == "(")
{
// we go one level deeper into the tree with a '('
// create new node
TreeNode* next_node = new TreeNode();
// if this was the first '(', then we need to initialize the root
if (ancestor == NULL)
{
root = next_node;
}
else
{
// we have an ancestral node for this new node
// so we set the it as the parent
next_node->setParent(ancestor);
// and the new node will be either the left or right child
if (ancestor->getLeftChild() == NULL)
{
ancestor->setLeftChild(next_node);
}
else
{
ancestor->setRightChild(next_node);
}
}
// now we set the new node as the next ancestor
ancestor = next_node;
}
else if (token == ")" || token == ",")
{
// we finished with the previous node
// move one level back
// for safety, check that we can do so
if (ancestor->getParent() == NULL)
{
std::cout << "Error: We cannot find an expected ancestor" << std::endl;
exit(1);
}
// we will reset the ancestor as the ancestor's parent
ancestor = ancestor->getParent();
}
else if (token == ":")
{
// begin reading a branch length
}
else if (token == ";")
{
// finished!
// just make sure that we also terminated at the root
if (ancestor != root)
{
std::cout << "Error: We expect to finish at the root node" << std::endl;
exit(1);
}
}
else
{
// we have a taxon name or a branch length
{
// convert the token into a double
double x = std::stod(token);
// since we are actually storing ages, we compute the age as
// the child's age plus the branch length.
// thus, we set the age of the parent/ancestor and need to make sure that we have one
if ( ancestor->getParent() != NULL )
{
ancestor->getParent()->setAge( ancestor->getAge() + x );
}
// now flip back the bool that we are not expecting a branch length anymore
}
else
{
// we got a new tip node
// create this tip
TreeNode* next_node = new TreeNode();
// and tell it who its parent was
next_node->setParent(ancestor);
// also tell the ancestor that this was a left/right child
if (p->getLeftChild() == NULL)
{
p->setLeftChild(next_node);
}
else
{
p->setRightChild(next_node);
}
// also set the name of the tip
next_node->setName(token);
// and reset, as usual, reset the ancestor to be the current node
ancestor = next_node;
} // finished the if-else for the taxon name vs branch length

} // finished the if-else for what type of token it was

} // finished the foor loop over all tokens

// return the root node for this tree
return root;
}``````

You will see that we have a loop over the parsed elements in the Newick string. I call these parsed elements token. We do different things depending on the identity of the token:

• token == “(” — The token is a left parentheses, in which case we add a new node. We also set up the pointers for this new node and the ancestor of this node, if there is one. The first time we create a node, three are no other nodes that can be neighbors (to the left, right, or ancestrally). This first node is the root node.
• token == “)” || token == “,” — The token is either a right parentheses or a comma. In either case, we move to the ancestor of the current point in the tree.
• token == “:” — The token is a colon, which means the next token is a branch length. We set a bool variable named reading_branch_length to be true.
• token == “;” — The token is a semicolon, in which case we should be finished reading all of the Newick string elements. The algorithm works such that we should end at the root. We check that this is true. If it isn’t, we exit the program abruptly. Something is very wrong with the Newick string that was passed to the constructor.
• If none of the above are true, then the token must be either a taxon name or a branch length. If it is a taxon name, we add a new node, just as we did when we encountered a left parentheses, but this time we also add the taxon name information. If the token is a branch length, we simply set the branch length for the node we are currently pointed at.

The parseNewickString first declares a vector of strings. It then enters a loop, incremented by a variable i, that reads the Newick string one character at a time. If the character is one of the special ones — ‘(’, ‘)’, ‘,’, ‘:’ and ‘;’ — it immediately adds the character to the vector of strings. Otherwise, the character is part of a taxon name or a branch length. In this case, the parser races ahead to the next special symbol using a new type of loop (while), incrementing a new counter (j), adding the entire string (taxon name or branch length) to the vector.

``````/**
* Parse the Newick string into separate tokens.
*
* A token is separated by a '(', ')', ',', ':' or ','
*
* @param   ns          The newick string that we want to parse.
* @return              A vector of strings containing all the tokens and separators.
*/
{
// create an empty vector of strings for the tokens
std::vector<std::string> tks;

// now we iterate over each character in the string
for (int i=0; i<ns.size(); i++)
{
// get the current character of the string
char c = ns[i];

// check if it was on of the separators
if (c == '(' || c == ')' || c == ',' || c == ':' || c == ';')
{
// add the separator-token to our vector
tks.push_back( std::string(c) );
}
else
{
// it was not a separator, so it was actually some token
// we need to extract it until the next separator
// we move over the string until we hit the next separator
int j = i;
// create
std::string temp_token = "";
while ( !(c == '(' || c == ')' || c == ',' || c == ':' || c == ';') )
{
// add the current character to our token string
temp_token += c;
// move the index forward
j++;
// get the next character
c = ns[j];
}
// fast forward the index of the newick string
i = j-1;
// add the token to our vector
tks.push_back(temp_token);
}

} // end for loop over all characters in the string

// return the vector of tokens
return tks;
}``````