In our fist class we learnt about how to set up our c++ environment and we wrote our first, very simple program. Importantly, you should know by now how to compile a c++ software. In this chapter we will start with our BabyBayes implementation. The first part of this implementation is to implement an alignment class and a simple alignment reader, i.e., another class that reads in alignments from a file.
Here we will start with one of the most basic alignment formats: the phylip file format. You can read up about this on Joe Felsenstein’s website. Here is an example of the primates.phy alignment file.
12 898
Tarsius_sy AAGTTTCATT GGAGCCACCA CTCTTATAAT TGCCCATGGC CTCACCTCCT CCCTATTATT
Lemur_catt AAGCTTCATA GGAGCAACCA TTCTAATAAT CGCACATGGC CTTACATCAT CCATATTATT
Homo_sapie AAGCTTCACC GGCGCAGTCA TTCTCATAAT CGCCCACGGG CTTACATCCT CATTACTATT
Pan AAGCTTCACC GGCGCAATTA TCCTCATAAT CGCCCACGGA CTTACATCCT CATTATTATT
Gorilla AAGCTTCACC GGCGCAGTTG TTCTTATAAT TGCCCACGGA CTTACATCAT CATTATTATT
Pongo AAGCTTCACC GGCGCAACCA CCCTCATGAT TGCCCATGGA CTCACATCCT CCCTACTGTT
Hylobates AAGCTTTACA GGTGCAACCG TCCTCATAAT CGCCCACGGA CTAACCTCTT CCCTGCTATT
Macaca_fus AAGCTTTTCC GGCGCAACCA TCCTTATGAT CGCTCACGGA CTCACCTCTT CCATATATTT
M_mulatta AAGCTTTTCT GGCGCAACCA TCCTCATGAT TGCTCACGGA CTCACCTCTT CCATATATTT
M_fascicul AAGCTTCTCC GGCGCAACCA CCCTTATAAT CGCCCACGGG CTCACCTCTT CCATGTATTT
M_sylvanus AAGCTTCTCC GGTGCAACTA TCCTTATAGT TGCCCATGGA CTCACCTCTT CCATATACTT
Saimiri_sc AAGCTTCACC GGCGCAATGA TCCTAATAAT CGCTCACGGG TTTACTTCGT CTATGCTATT
TTGCCTAGCA AATACAAACT ACGAACGAGT CCACAGTCGA ACAATAGCAC TAGCCCGTGG
CTGTCTAGCC AACTCTAACT ACGAACGAAT CCATAGCCGT ACAATACTAC TAGCACGAGG
CTGCCTAGCA AACTCAAACT ACGAACGCAC TCACAGTCGC ATCATAATCC TCTCTCAAGG
CTGCCTAGCA AACTCAAATT ATGAACGCAC CCACAGTCGC ATCATAATTC TCTCCCAAGG
CTGCCTAGCA AACTCAAACT ACGAACGAAC CCACAGCCGC ATCATAATTC TCTCTCAAGG
CTGCCTAGCA AACTCAAACT ACGAACGAAC CCACAGCCGC ATCATAATCC TCTCTCAAGG
CTGCCTTGCA AACTCAAACT ACGAACGAAC TCACAGCCGC ATCATAATCC TATCTCGAGG
CTGCCTAGCC AATTCAAACT ATGAACGCAC TCACAACCGT ACCATACTAC TGTCCCGAGG
CTGCCTAGCC AATTCAAACT ATGAACGCAC TCACAACCGT ACCATACTAC TGTCCCGGGG
CTGCTTGGCC AATTCAAACT ATGAGCGCAC TCATAACCGT ACCATACTAC TATCCCGAGG
CTGCTTGGCC AACTCAAACT ACGAACGCAC CCACAGCCGC ATCATACTAC TATCCCGAGG
CTGCCTAGCA AACTCAAATT ACGAACGAAT TCACAGCCGA ACAATAACAT TTACTCGAGG
...
A phylip formatted alignment file always starts with a line containing two numbers: the number of taxa and the number of characters. The two numbers are separated by any number of whitespace characters (e.g., spaces or tabs). Then, we have the sequence data in each row. The row starts with the name of the taxon which can be at most 10 characters long.
Each class start with the header file to define the class with its methods and variables. Write a new files called Alignment.h with the following content.
#ifndef Alignment_h
#define Alignment_h
#include <stdio.h>
#include <string>
#include <vector>
/**
* \class Alignment
*
* \brief Represents the multiple sequence alignment class holding the data matrix.
*
* This class represent the multiple sequence alignment.
* It's main purpose is to hold and manage the character data matrix.
*
*
* \author Sebastian Höhna
*
*/
class Alignment {
public:
Alignment(size_t nt, size_t ns);
~Alignment(void);
char getNucleotide(size_t i, size_t j) const;
size_t getNumberOfSites(void) const;
size_t getNumberOfTaxa(void) const;
const std::vector<std::string>& getTaxonNames(void) const;
std::vector<int> getPossibleNucleotides (char nuc) const;
void print(std::ostream & o) const;
void setMatrix(const std::vector<std::vector<char> > &m);
void setNucleotide(char x, size_t i, size_t j);
void setTaxonNames(const std::vector<std::string> &n);
private:
size_t num_taxa;
size_t num_sites;
std::vector<std::string> taxon_names;
std::vector<std::vector<char> > data_matrix;
};
#endif /* Alignment_h */
This class has all the definition that we need.
Then, we write a new file called Alignment.cpp. We start implementing each function that we defined in Alignment.h now. First, we load the libraries and other header files needed,
Then, we continue with the constructor function. The constructor has always the same name as the class. Here the constructor will create an empty data matrix.
/**
* Constructor of the alignment class.
* We will construct an empty data matrix of size nt rows and nc columns.
*
* @param nt number of taxa/rows in the alignment
* @param ns number of sites/characters/columns in the alignment
*/
Alignment::Alignment( size_t nt, size_t ns ) :
num_taxa( nt ),
num_sites( ns )
{
// create an empty data matrix of num_taxa rows and num_sites columns.
// By default we set every entry to '?', the missing character
data_matrix = std::vector<std::vector<char> >(num_taxa, std::vector<char>(num_sites, '?'));
}
It is a good style to also always add a destructor. The destructor is always called when the object is destroyed and should free all the memory from an object. So we will do this here that you will learn how to add one.
/**
* Desctructor of the alignment class.
*
* Since we don't have any allocated memory we don't need to free any.
*/
Alignment::~Alignment( void )
{
// nothing to do in here.
}
Now, our first actual method is the getNucleotide method. This is a faily simple method that exactly does what the name suggests; it returns a nucleotide from the data matrix.
/**
* Get the character for the i-th taxon/row and j-th site/column.
*
*
* @param i the i-th taxa in the alignment
* @param j the j-th site in the alignment
*
* @return the character.
*/
char Alignment::getNucleotide(size_t i, size_t j) const
{
// simply return
return data_matrix[i][j];
}
For future use, we might also provide more helper methods that return, for example, the number of sites.
/**
* Get the number of sites/characters/columns for this alignment.
*
* @return the number of sites.
*/
size_t Alignment::getNumberOfSites( void ) const
{
// return our private variable for the number of sites
return num_sites;
}
Similary, we implement a method for the number of taxa.
/**
* Get the number of taxa/rows for this alignment.
*
* @return the number of taxa.
*/
size_t Alignment::getNumberOfTaxa( void ) const
{
// return our private variable for the number of taxa
return num_taxa;
}
And we also implement a method to retrieve the taxon names.
/**
* Get the taxa names.
*
* @return a vector of strings with the names of the taxa.
*/
const std::vector<std::string>& Alignment::getTaxonNames(void) const
{
// return our private variable for the taxon names
return taxon_names;
}
Most Importantly, we want to see what we actually stored in the data matrix. So for our own purpose, we print the data matrix.
/**
* Print the alignment to the given stream.
* For simplicity, we will assume that the number of character
*
*/
void Alignment::print(std::ostream & o) const
{
// first, we print some information about the alignment
// print the number of sites
o << "Number of sites = " << num_sites << '\n';
o << " ";
// then print the index for each taxon
for (int i=0; i<num_taxa; i++)
o << std::setw(3) << i << " ";
o << '\n';
// we nicely separate these by '-'
o << "------------------------";
for (int i=0; i<num_taxa; i++)
o << "---";
o << '\n';
// now we print on each line the column/site
for (int j=0; j<num_sites; j++)
{
o << std::setw(10) << j+1 << " -- ";
// so we iterate over all taxa to print the actual data matrix
for (int i=0; i<num_taxa; i++)
{
o << std::setw(3) << data_matrix[i][j] << " ";
}
o << '\n';
}
// make sure to flush the stream so that it is printed.
o.flush();
}
As before with the get methods, we provide a set of set methods. In general, it is a good programming style to have your variables to be private. This means, no one can change them from the outside. Only through getters and setters can the variables be used.
First, we add a set method for a specific character.
/**
* Set the nucleotide for taxon i at site j.
*
* @param x The new nucleotide.
* @param i The index of the taxon.
* @param j The index of the site.
*/
void Alignment::setNucleotide(char x, size_t i, size_t j)
{
// set the character of our internal data matrix
data_matrix[i][j] = x;
}
Then, we also add a set function for the entire matrix, just for convenience.
/**
* Set the data matrix.
*
* @param m The new data matrix.
*/
void Alignment::setMatrix(const std::vector<std::vector<char> > &m)
{
// set the internal variable
data_matrix = m;
}
And another set function for the taxon names.
/**
* Set the taxon names.
*
* @param n A vector of strings with the new taxon names.
*/
void Alignment::setTaxonNames(const std::vector<std::string> &n)
{
taxon_names = n;
}
This was it for our alignment class. If you finished with this, you can take a coffee break :)
In this next section we write a reader class to read in alignments in phylip format.
#ifndef AlignmentReader_h
#define AlignmentReader_h
#include <string>
/**
* \class AlignmentReader
*
* \brief A reader for alignments from files in different formats.
*
* This class provides the functionality of a reader for alignments.
* The alignments will be read from file. Currently, we support only
* the Phylip data format.
*
*
* \author Sebastian Höhna
*
*/
class AlignmentReader {
public:
AlignmentReader(void);
~AlignmentReader(void);
Alignment readPhylip(const std::string &fn);
};
#endif /* Alignment_h */
This concludes the definition of our AlignmentReader in the file AlignmentReader.h.
#include <iostream> // for std::cerr
#include <sstream> // for splitting strings
#include <fstream> // for reading from a file
#include <iterator> // std::istream_iterator
#include <string>
#include <vector>
#include "Alignment.h"
#include "AlignmentReader.h"
/**
* A default constructor that does nothing besides creating an object of this class.
*/
AlignmentReader::AlignmentReader( void )
{
// nothing to initialize
}
/**
* A default destructor that does nothing because we don't have any allocated variables.
*/
AlignmentReader::~AlignmentReader( void )
{
// nothing to delete here
}
Alignment AlignmentReader::readPhylip(const std::string& fn)
{
// first, we open the file stream
// this uses a ifstream object and needs the filesname
std::ifstream in_stream(fn);
// let's see if the stream was opened
if (!in_stream)
{
// it was not opened, so the file did not exist
// we need to tell the user and terminate
std::cerr << "Cannot open file \"" + fn + "\"" << std::endl;
exit(1);
}
// now we can initialize some helper variables
// we a string for the current line of the file
std::string linestring = "";
// we also need the current taxon number
size_t taxon_num = 0;
// the number of taxa
size_t num_taxa = 0;
// the number of sites
size_t num_sites = 0;
// and if all the taxon names have been read in
bool taxon_names_read = false;
// finally we also need a vector of strings for the taxon names
std::vector<std::string> taxon_names;
// first, we need to read the number of taxa and sites
// this should be the first line, so we load the first line into the linestring variable
getline(in_stream, linestring);
// now we split the line into tokens separated by a white space
// this assumme that the number of taxa and sites are provided in this line with only whitespaces separating them
// we create stringstream for this
std::istringstream word_iss(linestring);
// and then separate the line by whitespaces
std::vector<std::string> words((std::istream_iterator<std::string>(word_iss)),
std::istream_iterator<std::string>());
// now we store the number of taxa and sites into our variables
num_taxa = std::stoi( words[0] );
num_sites = std::stoi( words[1] );
// with these two, we can now create our alignment
Alignment this_alignment = Alignment(num_taxa, num_sites);
// for the actual data, and for convenience, we create a data matrix that has num_taxa rows but no columns yet
std::vector<std::vector<char> > matrix = std::vector<std::vector<char> >( num_taxa, std::vector<char>() );
// then we start to loop over the entire file
// we do this with an endless loop which we will stop when the end of file has been reached.
while ( true )
{
// first, check if there are more lines to be read
bool more_lines_in_file = getline(in_stream, linestring).good();
// then, we skip forward over any empty line
while ( more_lines_in_file && linestring == "" )
{
// check again if we reached the end of file
more_lines_in_file = getline(in_stream, linestring).good();
}
// if we reached the end of the file, then we should break out of endless loop
if ( more_lines_in_file == false )
{
// this break command stops the endless loop
break;
}
// now we split the line by whitespaces using again a stringstream
std::istringstream iss(linestring);
std::vector<std::string> tokens((std::istream_iterator<std::string>(iss)),
std::istream_iterator<std::string>());
// we then iterate over all the tokens/words that we received.
for ( int i=0; i<tokens.size(); i++ )
{
// the first word should be the taxon name
// this should only be done if we haven't read all taxon names in interleaved file format
if ( i == 0 && taxon_names_read == false )
{
// so we push the first word to the end of the names of the taxa
taxon_names.push_back(tokens[i]);
}
else
{
// otherwise, we have an actual sequence
const std::string &the_sequence = tokens[i];
// so we add the sequence to the end of the sequence for this taxon
// thus, we append our data matrix using the insert function
matrix[taxon_num].insert(matrix[taxon_num].end(), the_sequence.begin(), the_sequence.end());
}
}
// since we finished we this line, we need to move on to the next taxon.
taxon_num++;
// if we have finished with all taxa, then we start with the first one again (for interleaved files)
if ( taxon_num == num_taxa )
{
// hence, set the taxon index to 0 (we start counting with 0)
taxon_num = 0;
// and remember that we read all taxon names
taxon_names_read = true;
}
} // the end of our while ( true ) loop
// always remember to close the file
in_stream.close();
// now we propagate the taxon names and data matrix into our alignment
this_alignment.setTaxonNames( taxon_names );
this_alignment.setMatrix( matrix );
// and finally return this new alignment
return this_alignment;
}
Finally, so that you can test your program, you need to add your new classes to the main function. Change your main class to something like this.
#include <iostream> // for std::cout
#include "Alignment.h"
#include "AlignmentReader.h"
int main (int argc, char* argv[]) {
// print header
std::cout << "BabyBayes v1.0\n";
std::cout << "Sebastian Höhna\n";
std::cout << "Ludwig-Maximilians-Universität München\n";
std::cout << "\n\n";
// read the data
AlignmentReader reader;
Alignment my_alignment = reader.readPhylip( "primates.phy" );
my_alignment.print(std::cout);
return 0;
}
Make sure that your Makefile has the correct source and object files:
SRC = Alignment.cpp AlignmentReader.cpp main.cpp
OBJECTS = Alignment.o AlignmentReader.o main.o
Then you can compile BabyBayes using your Makefile and run it. See what it does!