Figure 3.1a shows an example of a phylogenetic tree of three species. The species are named Sp1, Sp2, and Sp3. The branches of the tree have lengths in some unit, for example, in terms of expected number of substitutions per site or time. The branch lengths are also indicated on the tree. A tree is a complex thing consisting of nodes and branches. The nodes are represented by the circles on the tree of Figure 3.1a whereas the branches are the lines between the nodes. The tree of Figure 3.1a is rooted; there is a direction to tree with some nodes occurring before others. (The terminology we use to describe trees as consisting of nodes and branches is one that comes from the field of evolutionary biology. Mathematicians and computer scientists, however, see a tree as a type of graph, which has vertices and edges, which are equivalent to nodes and branches, respectively.) One of the nodes, the one furthest from the tips of the tree, is denoted the root node.
How can we represent such a complicated object in computer memory? The idea is to represent a tree as a series of nodes, each of which has information on its neighbors. Figure 3.1b illustrates the basic idea. Each node on the tree is represented as a box in Figure 3.1b, but would be an object in C++. The address of each node is indicated next to each node in Figure 3.1b. You can see that the ancestor of the node with address 0x01, which represents Sp1, is the node with address 0x04, which happens to be the ancestor of Sp1 and Sp2.
Do you see why we need to use pointers here? It is because we need to point to the exact same variable multiple times. For example, both tip nodes Sp1 and Sp2 point to their ancestor/parent, the node at the memory address 0x04. Furthermore, the root node has this node as its left child and thus points to the memory address 0x04 too.
This connection between multiple object is best achieved with pointers. However, using pointers come with more burden. We need to allocated and deallocated the memory ourselves. We will explain more about allocation and deallocation below.
The general strategy we will take in this chapter is to build up the machinery we need to represent and manipulate trees in computer memory. We will start with the implementation of a TreeNode class, which will represent the nodes of the tree (i.e., the boxes of Figure 3.1b). We will also implement a Tree class that will manage the nodes and allow us to manipulate the tree.
The Newick tree format was designed and adopted by several of the early developers of phylogeny software packages — J. Felsenstein, D. Swofford, F. J. Rohlf, C. Meacham, J. Archie, and W. Maddison — at Newick’s restaurant in Durhan, New Hampshire, on June 24, 1986. They wanted a standardized format describing trees so that any of their programs could read trees produced by another program. The meeting at the restaurant was motivated by conversations at the 1986 Evolution Meetings (the joint meetings of the Society of the Systematic Biologists, the American Society of Naturalists, and the Society for the Study of Evolution). They hunkered down to work at Newick’s restaurant while the meeting was underway. The Newick tree format was the result. According to D. Swofford, the meal at Newick’s restaurant was a good one.
The Newick format represents trees using parentheses. For example, the tree of Figure 3.1a can be represented as
((Sp1, Sp2), Sp3)
This format has several advantages: a single tree can be written to a plain text file on a single line; trees can be directly created in computer memory by reading the Newick string from left to right; and it requires few character to represent a tree.
We will use the Newick tree format here both for printing and reading trees from a file.
In C++, we dynamically allocate memory using the new operator. This will create a dynamically allocated memory space for the variable. Dynamically in this context means that it happens while the program runs and the actual size of the memory might not be known when you compile your software. You have to free dynamically allocated memory with the command delete my_pointer.
Every class that has dynamically allocated memory as one of their member variables, must to implement these three functions:
The reason why the deep copies are so important is that otherwise two objects point to exactly the same memory address. Which of the two is responsible for deleting the memory? If the memory is not properly deleted, then you will have memory leaks. What happens if the first deletes the memory while the second still needs it? If the memory is deleted too soon, you will have segmentation faults.
In our TreeNode class we will assume that the parent owns the children. Therefore, the parent has to create deep copies for the children and need to make sure that the children are deleted when needed.
We start with writing the definition of the TreeNode class. Create a new file called TreeNode.h and copy the code below into the file.
#ifndef TreeNode_h
#include <string>
/**
* \class TreeNode
*
* \brief The node class used in phylogenetic tree.
*
* This class represent a tree node of a phylogenetic tree.
* Here we store the main structure of the trees and all necessary methods.
*
*
* \author Sebastian Höhna
*
*/
class TreeNode {
public:
TreeNode(void);
TreeNode(const TreeNode& tn);
~TreeNode(void);
TreeNode& operator=(const TreeNode& tn);
double getAge(void) const;
double getBranchLength(void) const;
int getIndex(void) const;
const TreeNode* getLeftChild(void) const;
TreeNode* getLeftChild(void);
const std::string& getName(void) const;
const TreeNode* getParent(void) const;
TreeNode* getParent(void);
const TreeNode* getRightChild(void) const;
TreeNode* getRightChild(void);
bool isRoot(void) const;
bool isTip(void) const;
void setAge(double a);
void setLeftChild(TreeNode *n);
void setIndex(int i);
void setName(const std::string& n);
void setParent(TreeNode* n);
void setRightChild(TreeNode* n);
private:
TreeNode* parent;
TreeNode* left;
TreeNode* right;
int index;
double age;
std::string name;
};
#endif /* TreeNode_h */
Let’s deconstruct this class definition. First, this class contains six member variable: we have three pointers of type TreeNode*, an int, a double, and a std::string. The idea is that the TreeNode pointers will hold the memory address of the neighbors to the node. The index variable will simply be a number that will help us distinguish nodes (this will be useful later). The age variable will hold the age of the node. You can compute the branch length leading to a node by computing the difference between the age of the parent node and the age of the node itself. Finally, the string variable will hold the name of the node, if it happens to be a tip node that represents a species. Because we are using the std::string type in this class, we have to include the string header file. You can see we do this before the class definition:
We made all of the member variables for this class protected; other classes cannot directly access these variables. We allow other parts of the code to change the instance variables with public functions. Each of these functions starts with set (e.g., setParent, which sets the parent variable). Similarly, we provide functions that start with get that will return the current value of the member variables. The set and get functions are called setters and getters among computer scientists.
The last unfamiliar element of the TreeNode class is ~TreeNode(void) (although we have used in our Alignment class before too). This looks a lot like our default constructor, except for the addition of the tilde symbol. The addition of the tilde indicates that this is a destructor function. As mentioned earlier, constructor functions are called when an object is created. The destructor function, as the name implies, is called just before the object is destroyed. Constructors are useful places to carry out initialization. Destructors, by contrast, are an excellent place to carry out clean-up activities just before an object is destroyed.
Now let’s write the file that implements the TreeNode class. Create a new file called TreeNode.cpp. Add the following code.
#include "TreeNode.h"
#include <stdio.h>
#include <iostream>
/**
* Constructor of the tree node class.
*
* We initialize all variables with default values.
*
*/
TreeNode::TreeNode( void ) :
parent( NULL ),
left( NULL ),
right( NULL ),
index( 0 ),
age( 0.0 ),
name( "" )
{
}
/**
* Copy constructor of the tree node class.
*
* We need to make sure that we get a deep copy of the children.
*
* @param tn The object that we want to copy.
*/
TreeNode::TreeNode( const TreeNode& tn ) :
parent( tn.parent ),
left( NULL ),
right( NULL ),
index( tn.index ),
age( tn.age ),
name( tn.name )
{
// only make a deep copy of the left child if it exists
if ( tn.left != NULL )
{
left = new TreeNode( *tn.left );
left->setParent( this );
}
// only make a deep copy of the right child if it exists
if ( tn.right != NULL )
{
right = new TreeNode( *tn.right );
right->setParent( this );
}
}
/**
* Destructor.
*
* We need to delete the children because the parent nodes own their children.
*/
TreeNode::~TreeNode( void )
{
// delete both the left and right child.
// it's actually safe to delete a pointer even it point to NULL
delete left;
delete right;
}
/**
* Overloaded assignment operator.
*
* We need to free the memorythe children and make a deep copy.
*
* @param tn The object that we want to assign to us.
* @return The newly assigned object.
*/
TreeNode& TreeNode::operator=(const TreeNode& tn)
{
// first, we need to check and avoid self-assignment
if ( this != &tn )
{
// we are not assigning ourselves to us.
// so let's free the memore first before we copy the new values
delete left;
delete right;
// to be safe, set the pointer to NULL
left = NULL;
right = NULL;
// now we can copy over the values
parent = tn.parent;
index = tn.index;
age = tn.age;
name = tn.name;
// only make a deep copy of the left child if it exists
if ( tn.left != NULL )
{
left = new TreeNode( *tn.left );
left->setParent( this );
}
// only make a deep copy of the right child if it exists
if ( tn.right != NULL )
{
right = new TreeNode( *tn.right );
right->setParent( this );
}
// we done now with our assignment
}
// we return this object now
return *this;
}
/**
* Get the age of this node.
*
* @return Return the local variable that stores the age.
*/
double TreeNode::getAge( void ) const
{
// simply return the local age variable
return age;
}
/**
* Get the branch length of this node.
*
* The branch length is computed by the age of the parent minus my own age.
* If this is the root node, then we don't have parent, so no parent age.
* By convention, we say that the root branch has a length of 0.0.
*
* @return Return the branch length leading to this node.
*/
double TreeNode::getBranchLength( void ) const
{
// check if this node has a parent, i.e., is not a root node.
if ( parent != NULL )
{
// we do have a parent, so it's safe to compute the branch length
// by the difference of my parent's age and my own age
return parent->getAge() - age;
}
else
{
// otherwise, we return 0.0 by convention.
return 0.0;
}
}
/**
* Get the index of this node.
*
* @return The index for this node.
*/
size_t TreeNode::getIndex( void ) const
{
// return the local variable for the index
return index;
}
/**
* Get the left child of this node.
*
* @return The left child.
*/
const TreeNode* TreeNode::getLeftChild( void ) const
{
// return the local variable for the left child node
return left;
}
/**
* Get the left child of this node. (non-const)
*
* @return The left child.
*/
TreeNode* TreeNode::getLeftChild( void )
{
// return the local variable for the left child node
return left;
}
/**
* Get the taxon name of this node.
*
* @return The name of the node.
*/
const std::string& TreeNode::getName( void ) const
{
// return the local variable for the name of this node
// tip nodes will have the taxon name stored here.
return name;
}
/**
* Get the parent of this node.
*
* @return The parent node.
*/
const TreeNode* TreeNode::getParent( void ) const
{
// return the local variable for the parent node
return parent;
}
/**
* Get the parent of this node. (non-const)
*
* @return The parent node.
*/
TreeNode* TreeNode::getParent( void )
{
// return the local variable for the parent node
return parent;
}
/**
* Get the right child of this node.
*
* @return The right child.
*/
const TreeNode* TreeNode::getRightChild( void ) const
{
// return the local variable for the right child
return right;
}
/**
* Get the right child of this node. (non-const)
*
* @return The right child.
*/
TreeNode* TreeNode::getRightChild( void )
{
// return the local variable for the right child
return right;
}
/**
* Is this node a root node.
*
* @return True if this node is a root node, false otherwise.
*/
bool TreeNode::isRoot( void ) const
{
// this node is a root node if the parent node does not exist, i.e., is NULL
return parent == NULL;
}
/**
* Is this node a tip node?
*
* @return True if both children don't exist, false otherwise.
*/
bool TreeNode::isTip( void ) const
{
// check if either child node exists, i.e., is not NULL
return left == NULL && right == NULL;
}
/**
* Set the age of this node.
*
* @param a The new age of the node.
*/
void TreeNode::setAge(double a)
{
// set the internal variable with the new age
age = a;
}
/**
* Set the left child of this node.
*
* @param n The new left child of the node.
*/
void TreeNode::setLeftChild(TreeNode* n)
{
// first, we need to free the memory
delete left;
// set the internal variable with the new left child
left = n;
}
/**
* Set the index of this node.
*
* @param i The new index of the node.
*/
void TreeNode::setIndex(size_t i)
{
// set the internal variable with the new index
index = i;
}
/**
* Set the name of this node.
*
* @param n The new name of the node.
*/
void TreeNode::setName(const std::string& n)
{
// set the internal variable with the new name
name = n;
}
/**
* Set the parent of this node.
*
* @param n The new parent of the node.
*/
void TreeNode::setParent(TreeNode* n)
{
// set the internal variable with the new parent
parent = n;
}
/**
* Set the right child of this node.
*
* @param n The new right child of the node.
*/
void TreeNode::setRightChild(TreeNode* n)
{
// first, we need to free the memory
delete right;
// set the internal variable with the new right child
right = n;
}
Now create a new file called Tree.h for the definition of our Tree class. Copy the code below into the file.
#ifndef Tree_h
#define Tree_h
#include <vector>
#include <string>
// forward declaration of the class TreeNode
class TreeNode;
/**
* \class Tree
*
* \brief The phylogenetic tree class.
*
* This class represent the phylogenetic tree.
* The main structure is actually stored in the TreeNode itself.
* Here we store the root node and for convenience also a vector of all nodes.
*
*
* \author Sebastian Höhna
*
*/
class Tree {
public:
Tree(TreeNode* r);
Tree(const Tree& t);
~Tree(void);
Tree& operator=(const Tree& t);
std::string getNewickRepresentation(void) const;
const std::vector<TreeNode*> getNodes(void) const;
size_t getNumberOfNodes(void) const;
const TreeNode* getRootNode(void) const;
private:
std::string computeNewickRecursively(const TreeNode& n) const;
void fillNodesByPhylogeneticTraversal(TreeNode* node);
TreeNode* root;
std::vector<TreeNode*> nodes;
};
#endif /* Tree_h */
In many respects, this class should look familiar to you. As you scan though the class definition, you should be comfortable with the idea that the class is defined as:
Seeing variables that are public and protected shouldn’t freak you out. The idea of a instance variable that is a pointer, such as TreeNode* root, should make you feel warm and fuzzy. That said, there are a few aspects of this class definition that are new. We will go through those one at a time.
One of the unfamiliar instance variables is:
This declares a vector of TreeNode pointers called nodes. This is the first time you have brushed shoulders with the Standard Template Library (STL) in C++. The STL is a collection of tools available to the C++ programmer that include containers, algorithms, and iterators. The vector we use in the Tree class is an example of a container. As objects, STL containers have functions associated with them. For example, the vector container has two functions that we will take advantage of: push_back and size. The push_back function adds a new thing-to-be-contained to the end of the vector object. The size function returns the number of elements in the vector. In the Tree class, the elements in the vector are TreeNode pointers. When we declare the vector object, we also indicate what type of thing will be contained in it. We do this with the < and > signs: vector<Node*>. Note that we need to include the file vector if we are to use a vector in our code. This explains the compiler directive, #include
You will also note that before the class definition, we added the line:
This is called ‘forward declaration.’ We forward declare TreeNode because we use a Node pointer in the class definition. (In fact, we do this three times, once when we declare the return type of the method getRootNode, once when we declare the root variable, and again when we declare the nodes variable.) We could also have simply included the TreeNode.h file, right before or after we include the vector header,
in which case we could forgo the statement class TreeNode. Both are equivalent in as much as the code will compile and run. However, in keeping with the principle of including the least amount of information necessary, the method employed here is preferable. In the class definition, we only declare TreeNode pointers (TreeNode*}). All pointers take up the same amount of memory. Essentially, all the compiler has to do when examining this file is trust that we have implemented the TreeNode class elsewhere.
I made the getNewickRepresentation function public but the computeNewickRecursively function protected. Why do you think that is? It is through the getNewickRepresentation function that other parts of the program can obtain a string with the Newick representation of the tree. This function returns a string.
#include "Tree.h"
#include "TreeNode.h"
#include <stdio.h>
#include <iostream>
/**
* Constructor of the tree class.
*
* @param r The root node of the tree.
*/
Tree::Tree( TreeNode* r ) :
root( r )
{
// fill our nodes by phylogenetic traversal
fillNodesByPhylogeneticTraversal( root );
}
/**
* Copy constructor of the tree class.
*
* @param t The tree that we want to copy.
*/
Tree::Tree( const Tree& t ) :
root( NULL )
{
// only create a deep copy if the root exists
if ( t.root != NULL )
{
// create a deep copy
root = new TreeNode( *t.root );
// fill our nodes by phylogenetic traversal
fillNodesByPhylogeneticTraversal( root );
}
}
/**
* Destructor of the tree class.
*
* Since we own the root node, we need to free it.
* This will also make sure that all nodes will be freed.
*
*/
Tree::~Tree( void )
{
// free the memory of the root node which will free the memory of all nodes recursively.
delete root;
}
/**
* Overloaded assignment operator.
*
* @param t The tree that we want to copy.
* @return The newly assigned object.
*/
Tree& Tree::operator=(const Tree &t)
{
// first, we need to check and avoid self-assignment
if ( this != &t )
{
// free the memory of the root node which will free the memory of all nodes recursively.
delete root;
// for safety, set the root to NULL
root = NULL;
nodes.clear();
// only create a deep copy if the root exists
if ( t.root != NULL )
{
// create a deep copy
root = new TreeNode( *t.root );
// fill our nodes by phylogenetic traversal
fillNodesByPhylogeneticTraversal( root );
}
// we done now with our assignment
}
// we return this object now
return *this;
}
/**
* Compute the newick string for this tree.
*
* This function is a recursive function, i.e., it calls itself again.
* You should always start with the root node when calling this function and then it terminates ones reaching the tips.
* The newick string should be something like
* (((A:1.0,B:1.0):0.5,C:1.5):1.0,(D:2.0,E:2.0):2.5);
*
* @return The newick string for the clade starting at this node.
*/
std::string Tree::computeNewickRecursively(const TreeNode &node) const
{
// initialize the newick string as an empty string
std::string newick = "";
// check if this is a tip node
if ( node.isTip() == true )
{
// this is a tip node
// so we can terminate the recursion
// we simply have to use the name of this node for the newick string
newick = node.getName();
}
else
{
// since it is not a tip node it must be an interior node with two children.
// we therefore compute the newick string for both children
// this is done with a recursive call to this function with the left and right child as an argument
std::string left = computeNewickRecursively( *node.getLeftChild() );
std::string right = computeNewickRecursively( *node.getRightChild() );
// we want to print the newick trees in a unique format, so we order it alphabetically
// hence, check if the newick string of the left child is alphabetically before the newick string of the right child.
// we then put together the newick strings of the two children
if ( left < right )
{
newick = "(" + left + "," + right + ")";
}
else
{
newick = "(" + right + "," + left + ")";
}
}
// finally, we need to add the branch length to this newick string
newick += ":" + std::to_string(node.getBranchLength());
// and then we return this newick string
return newick;
}
/**
* Fill the nodes vector in phylogenetic ordering.
*
* Fill the nodes vector by a phylogenetic traversal recursively starting with this node.
* The tips fill the slots 0,...,n-1 followed by the internal nodes and then the root.
*/
void Tree::fillNodesByPhylogeneticTraversal(TreeNode* node)
{
if ( node->isTip() )
{
// all the tips go to the beginning
nodes.insert(nodes.begin(), node);
}
else
{
// now call this function recursively for all your children
fillNodesByPhylogeneticTraversal( node->getLeftChild() );
fillNodesByPhylogeneticTraversal( node->getRightChild() );
// this is phylogenetic ordering so the internal nodes come last
nodes.push_back(node);
}
}
/**
* Compute the newick string for this tree.
*
* This method simply starts by traversing the tree recursively from the root.
* See the method computeNewickRecursively for more information.
*
* @return The newick string for the entire tree.
*/
std::string Tree::getNewick( void ) const
{
return computeNewickRecursively( *root );
}
/**
* Get the vector with all the nodes in the tree.
*
* @return The vector of nodes,
*/
const std::vector<TreeNode*> Tree::getNodes( void ) const
{
return nodes;
}
/**
* Get the number of nodes contained in the tree.
*
* This number should be 2*N-1 for rooted trees where N is the number of tips.
*
* @return The number of nodes in the tree,
*/
size_t Tree::getNumberOfNodes( void ) const
{
return nodes.size();
}
/**
* Get the root node of the tree.
*
* @return The root node.
*/
const TreeNode* Tree::getRootNode( void ) const
{
return root;
}
#ifndef NewickTreeReader_H
#define NewickTreeReader_H
#include <string>
#include <vector>
class Tree;
class TreeNode;
/**
* Newick tree reader.
*
* The newick tree reader provides convenience methods for reading trees in Newick format.
*
*
*
*/
class NewickTreeReader {
public:
NewickTreeReader();
std::vector<Tree> readTrees(const std::string &fn);
private:
TreeNode* createTreeFromNewick(const std::string& newick);
std::vector<std::string> parseNewickString(const std::string& ns);
};
#endif
#include <fstream>
#include <string>
#include <vector>
#include <stdio.h>
#include <iostream>
#include <sstream>
#include "NewickTreeReader.h"
#include "Tree.h"
#include "TreeNode.h"
/**
* Default constructor.
*
* The default constructor does nothing except allocating the object.
*/
NewickTreeReader::NewickTreeReader()
{
}
/**
* Read all trees in the file which should be in Newick tree format.
*
* @param fn The filename.
* @return A vector with all the trees read from the fil.
*/
std::vector<Tree> NewickTreeReader::readTrees(const std::string& fn)
{
// open the file and create a file input stream
std::ifstream inFile( fn );
// check if we were successful in opening the file
if ( !inFile )
{
// the file could not be opened
std::cerr << "Could not open file \"" << fn << "\"" << std::endl;
}
// initialize the vector of trees
// we will fill these later
std::vector<Tree> trees = std::vector<Tree>();
// now loop over each line in the file
while ( inFile.good() )
{
// create a string variable that will hold the line
std::string line;
// read a line
getline(inFile, line);
// skip empty lines
if (line.length() == 0)
{
continue;
}
// note that we assume here for simplicity that our Newick trees don't contain spaces.
// you may want to remove all spaces ...
// construct the tree by calling the create tree from Newick string function
TreeNode *root = createTreeFromNewick( trimmed );
// create and allocate the tree object
Tree t = Tree( root );
// add the tree to our vector
trees.push_back( t );
}
// finally, return the vector of trees
return trees;
}
Parsing is a real pain in the neck.
The parser introduces a few new ideas. First of all, you can see in the conditional if statement that we use two parallel lines. This is read or. You can check for several conditions in a if statement:
Walk through this code line-by-line.
/**
* Create a tree from a Newick string.
*
* @param newick The tree as a Newick string.
* @return The root node of the create tree.
*/
TreeNode* NewickTreeReader::createTreeFromNewick(const std::string& newick)
{
// break the string into its component parts
std::vector<std::string> tokens = parseNewickString(newick);
// we need some variables
// first, if the last token was a ':' so that we next read in the branch length
bool reading_branch_length = false;
// we also need a pointer to the ancestor
TreeNode* ancestor = NULL;
// and also a pointer to the root for checking that the algorithm worked correctly
TreeNode* root = NULL;
// build up the tree from the parsed Newick string
for (int i=0; i<tokens.size(); i++)
{
// get the current token
const std::string& token = tokens[i];
// check what the token corresponds to
if (token == "(")
{
// we go one level deeper into the tree with a '('
// create new node
TreeNode* next_node = new TreeNode();
// if this was the first '(', then we need to initialize the root
if (ancestor == NULL)
{
root = next_node;
}
else
{
// we have an ancestral node for this new node
// so we set the it as the parent
next_node->setParent(ancestor);
// and the new node will be either the left or right child
if (ancestor->getLeftChild() == NULL)
{
ancestor->setLeftChild(next_node);
}
else
{
ancestor->setRightChild(next_node);
}
}
// now we set the new node as the next ancestor
ancestor = next_node;
}
else if (token == ")" || token == ",")
{
// we finished with the previous node
// move one level back
// for safety, check that we can do so
if (ancestor->getParent() == NULL)
{
std::cout << "Error: We cannot find an expected ancestor" << std::endl;
exit(1);
}
// we will reset the ancestor as the ancestor's parent
ancestor = ancestor->getParent();
}
else if (token == ":")
{
// begin reading a branch length
reading_branch_length = true;
}
else if (token == ";")
{
// finished!
// just make sure that we also terminated at the root
if (ancestor != root)
{
std::cout << "Error: We expect to finish at the root node" << std::endl;
exit(1);
}
}
else
{
// we have a taxon name or a branch length
if (reading_branch_length == true)
{
// convert the token into a double
double x = std::stod(token);
// since we are actually storing ages, we compute the age as
// the child's age plus the branch length.
// thus, we set the age of the parent/ancestor and need to make sure that we have one
if ( ancestor->getParent() != NULL )
{
ancestor->getParent()->setAge( ancestor->getAge() + x );
}
// now flip back the bool that we are not expecting a branch length anymore
reading_branch_length = false;
}
else
{
// we got a new tip node
// create this tip
TreeNode* next_node = new TreeNode();
// and tell it who its parent was
next_node->setParent(ancestor);
// also tell the ancestor that this was a left/right child
if (p->getLeftChild() == NULL)
{
p->setLeftChild(next_node);
}
else
{
p->setRightChild(next_node);
}
// also set the name of the tip
next_node->setName(token);
// and reset, as usual, reset the ancestor to be the current node
ancestor = next_node;
} // finished the if-else for the taxon name vs branch length
} // finished the if-else for what type of token it was
} // finished the foor loop over all tokens
// return the root node for this tree
return root;
}
You will see that we have a loop over the parsed elements in the Newick string. I call these parsed elements token. We do different things depending on the identity of the token:
The parseNewickString first declares a vector of strings. It then enters a loop, incremented by a variable i, that reads the Newick string one character at a time. If the character is one of the special ones — ‘(’, ‘)’, ‘,’, ‘:’ and ‘;’ — it immediately adds the character to the vector of strings. Otherwise, the character is part of a taxon name or a branch length. In this case, the parser races ahead to the next special symbol using a new type of loop (while), incrementing a new counter (j), adding the entire string (taxon name or branch length) to the vector.
/**
* Parse the Newick string into separate tokens.
*
* A token is separated by a '(', ')', ',', ':' or ','
*
* @param ns The newick string that we want to parse.
* @return A vector of strings containing all the tokens and separators.
*/
std::vector<std::string> NewickTreeReader::parseNewickString(const std::string& ns)
{
// create an empty vector of strings for the tokens
std::vector<std::string> tks;
// now we iterate over each character in the string
for (int i=0; i<ns.size(); i++)
{
// get the current character of the string
char c = ns[i];
// check if it was on of the separators
if (c == '(' || c == ')' || c == ',' || c == ':' || c == ';')
{
// add the separator-token to our vector
tks.push_back( std::string(c) );
}
else
{
// it was not a separator, so it was actually some token
// we need to extract it until the next separator
// we move over the string until we hit the next separator
int j = i;
// create
std::string temp_token = "";
while ( !(c == '(' || c == ')' || c == ',' || c == ':' || c == ';') )
{
// add the current character to our token string
temp_token += c;
// move the index forward
j++;
// get the next character
c = ns[j];
}
// fast forward the index of the newick string
i = j-1;
// add the token to our vector
tks.push_back(temp_token);
}
} // end for loop over all characters in the string
// return the vector of tokens
return tks;
}