Alignments and Reading Data

pdf version

Chapter 2: Alignments and Reading Data

In our fist class we learnt about how to set up our c++ environment and we wrote our first, very simple program. Importantly, you should know by now how to compile a c++ software. In this chapter we will start with our BabyBayes implementation. The first part of this implementation is to implement an alignment class and a simple alignment reader, i.e., another class that reads in alignments from a file.

Here we will start with one of the most basic alignment formats: the phylip file format. You can read up about this on Joe Felsenstein’s website. Here is an example of the primates.phy alignment file.

12   898
Tarsius_sy  AAGTTTCATT GGAGCCACCA CTCTTATAAT TGCCCATGGC CTCACCTCCT CCCTATTATT
Lemur_catt  AAGCTTCATA GGAGCAACCA TTCTAATAAT CGCACATGGC CTTACATCAT CCATATTATT
Homo_sapie  AAGCTTCACC GGCGCAGTCA TTCTCATAAT CGCCCACGGG CTTACATCCT CATTACTATT
Pan         AAGCTTCACC GGCGCAATTA TCCTCATAAT CGCCCACGGA CTTACATCCT CATTATTATT
Gorilla     AAGCTTCACC GGCGCAGTTG TTCTTATAAT TGCCCACGGA CTTACATCAT CATTATTATT
Pongo       AAGCTTCACC GGCGCAACCA CCCTCATGAT TGCCCATGGA CTCACATCCT CCCTACTGTT
Hylobates   AAGCTTTACA GGTGCAACCG TCCTCATAAT CGCCCACGGA CTAACCTCTT CCCTGCTATT
Macaca_fus  AAGCTTTTCC GGCGCAACCA TCCTTATGAT CGCTCACGGA CTCACCTCTT CCATATATTT
M_mulatta   AAGCTTTTCT GGCGCAACCA TCCTCATGAT TGCTCACGGA CTCACCTCTT CCATATATTT
M_fascicul  AAGCTTCTCC GGCGCAACCA CCCTTATAAT CGCCCACGGG CTCACCTCTT CCATGTATTT
M_sylvanus  AAGCTTCTCC GGTGCAACTA TCCTTATAGT TGCCCATGGA CTCACCTCTT CCATATACTT
Saimiri_sc  AAGCTTCACC GGCGCAATGA TCCTAATAAT CGCTCACGGG TTTACTTCGT CTATGCTATT

            TTGCCTAGCA AATACAAACT ACGAACGAGT CCACAGTCGA ACAATAGCAC TAGCCCGTGG
            CTGTCTAGCC AACTCTAACT ACGAACGAAT CCATAGCCGT ACAATACTAC TAGCACGAGG
            CTGCCTAGCA AACTCAAACT ACGAACGCAC TCACAGTCGC ATCATAATCC TCTCTCAAGG
            CTGCCTAGCA AACTCAAATT ATGAACGCAC CCACAGTCGC ATCATAATTC TCTCCCAAGG
            CTGCCTAGCA AACTCAAACT ACGAACGAAC CCACAGCCGC ATCATAATTC TCTCTCAAGG
            CTGCCTAGCA AACTCAAACT ACGAACGAAC CCACAGCCGC ATCATAATCC TCTCTCAAGG
            CTGCCTTGCA AACTCAAACT ACGAACGAAC TCACAGCCGC ATCATAATCC TATCTCGAGG
            CTGCCTAGCC AATTCAAACT ATGAACGCAC TCACAACCGT ACCATACTAC TGTCCCGAGG
            CTGCCTAGCC AATTCAAACT ATGAACGCAC TCACAACCGT ACCATACTAC TGTCCCGGGG
            CTGCTTGGCC AATTCAAACT ATGAGCGCAC TCATAACCGT ACCATACTAC TATCCCGAGG
            CTGCTTGGCC AACTCAAACT ACGAACGCAC CCACAGCCGC ATCATACTAC TATCCCGAGG
            CTGCCTAGCA AACTCAAATT ACGAACGAAT TCACAGCCGA ACAATAACAT TTACTCGAGG

...

A phylip formatted alignment file always starts with a line containing two numbers: the number of taxa and the number of characters. The two numbers are separated by any number of whitespace characters (e.g., spaces or tabs). Then, we have the sequence data in each row. The row starts with the name of the taxon which can be at most 10 characters long.

Section 2.2: The Alignment class

Each class start with the header file to define the class with its methods and variables. Write a new files called Alignment.h with the following content.

#ifndef Alignment_h
#define Alignment_h

#include <stdio.h>
#include <string>
#include <vector>

/**
* \class Alignment
*
* \brief Represents the multiple sequence alignment class holding the data matrix.
*
* This class represent the multiple sequence alignment.
* It's main purpose is to hold and manage the character data matrix.
*
*
* \author Sebastian Höhna
*
*/
class Alignment {

public:

    Alignment(size_t nt, size_t ns);
    ~Alignment(void);

    char                                getNucleotide(size_t i, size_t j) const;
    size_t                              getNumberOfSites(void) const;
    size_t                              getNumberOfTaxa(void) const;
    const std::vector<std::string>&     getTaxonNames(void) const;
    std::vector<int>                    getPossibleNucleotides (char nuc) const;
    void                                print(std::ostream & o) const;
    void                                setMatrix(const std::vector<std::vector<char> > &m);
    void                                setNucleotide(char x, size_t i, size_t j);
    void                                setTaxonNames(const std::vector<std::string> &n);

private:

    size_t                              num_taxa;
    size_t                              num_sites;
    std::vector<std::string>            taxon_names;
    std::vector<std::vector<char> >     data_matrix;

};


#endif /* Alignment_h */

This class has all the definition that we need.

Then, we write a new file called Alignment.cpp. We start implementing each function that we defined in Alignment.h now. First, we load the libraries and other header files needed,

#include "Alignment.h"

#include <iostream>
#include <iomanip>

Then, we continue with the constructor function. The constructor has always the same name as the class. Here the constructor will create an empty data matrix.

/**
 * Constructor of the alignment class.
 * We will construct an empty data matrix of size nt rows and nc columns.
 *
 * @param    nt     number of taxa/rows in the alignment
 * @param    ns     number of sites/characters/columns in the alignment
 */
Alignment::Alignment( size_t nt, size_t ns ) :
    num_taxa( nt ),
    num_sites( ns )
{
    // create an empty data matrix of num_taxa rows and num_sites columns.
    // By default we set every entry to '?', the missing character
    data_matrix = std::vector<std::vector<char> >(num_taxa, std::vector<char>(num_sites, '?'));
}

It is a good style to also always add a destructor. The destructor is always called when the object is destroyed and should free all the memory from an object. So we will do this here that you will learn how to add one.

/**
 * Desctructor of the alignment class.
 *
 * Since we don't have any allocated memory we don't need to free any.
 */
Alignment::~Alignment( void )
{
    // nothing to do in here.
}

Now, our first actual method is the getNucleotide method. This is a faily simple method that exactly does what the name suggests; it returns a nucleotide from the data matrix.

/**
 * Get the character for the i-th taxon/row and j-th site/column.
 *
 *
 * @param   i           the i-th taxa in the alignment
 * @param   j           the j-th site in the alignment
 *
 * @return              the character.
 */
char Alignment::getNucleotide(size_t i, size_t j) const
{
    // simply return
    return data_matrix[i][j];
}

For future use, we might also provide more helper methods that return, for example, the number of sites.

/**
 * Get the number of sites/characters/columns for this alignment.
 *
 * @return              the number of sites.
 */
size_t Alignment::getNumberOfSites( void ) const
{
    // return our private variable for the number of sites
    return num_sites;
}

Similary, we implement a method for the number of taxa.

/**
 * Get the number of taxa/rows for this alignment.
 *
 * @return              the number of taxa.
 */
size_t Alignment::getNumberOfTaxa( void ) const
{
    // return our private variable for the number of taxa
    return num_taxa;
}

And we also implement a method to retrieve the taxon names.

/**
 * Get the taxa names.
 *
 * @return              a vector of strings with the names of the taxa.
 */
const std::vector<std::string>& Alignment::getTaxonNames(void) const
{
    // return our private variable for the taxon names
    return taxon_names;
}

Most Importantly, we want to see what we actually stored in the data matrix. So for our own purpose, we print the data matrix.

/**
 * Print the alignment to the given stream.
 * For simplicity, we will assume that the number of  character
 *
 */
void Alignment::print(std::ostream & o) const
{
    // first, we print some information about the alignment
    // print the number of sites
    o << "Number of sites = " << num_sites << '\n';
    o << "              ";

    // then print the index for each taxon
    for (int i=0; i<num_taxa; i++)
        o << std::setw(3) << i << " ";
    o << '\n';

    // we nicely separate these by '-'
    o << "------------------------";
    for (int i=0; i<num_taxa; i++)
        o << "---";
    o << '\n';

    // now we print on each line the column/site
    for (int j=0; j<num_sites; j++)
    {
        o << std::setw(10) << j+1 << " -- ";
        // so we iterate over all taxa to print the actual data matrix
        for (int i=0; i<num_taxa; i++)
        {
            o << std::setw(3) << data_matrix[i][j] << " ";
        }
        o << '\n';
    }

    // make sure to flush the stream so that it is printed.
    o.flush();
}

As before with the get methods, we provide a set of set methods. In general, it is a good programming style to have your variables to be private. This means, no one can change them from the outside. Only through getters and setters can the variables be used.

First, we add a set method for a specific character.

/**
 * Set the nucleotide for taxon i at site j.
 *
 * @param   x               The new nucleotide.
 * @param   i               The index of the taxon.
 * @param   j               The index of the site.
 */
void Alignment::setNucleotide(char x, size_t i, size_t j)
{
    // set the character of our internal data matrix
    data_matrix[i][j] = x;
}

Then, we also add a set function for the entire matrix, just for convenience.

/**
* Set the data matrix.
*
* @param   m               The new data matrix.
*/
void Alignment::setMatrix(const std::vector<std::vector<char> > &m)
{
    // set the internal variable
    data_matrix = m;
}

And another set function for the taxon names.

/**
* Set the taxon names.
*
* @param   n               A vector of strings with the new taxon names.
*/
void Alignment::setTaxonNames(const std::vector<std::string> &n)
{
    taxon_names = n;
}

This was it for our alignment class. If you finished with this, you can take a coffee break :)

Section 2.3: A phylip format alignment reader

In this next section we write a reader class to read in alignments in phylip format.

#ifndef AlignmentReader_h
#define AlignmentReader_h

#include <string>

/**
* \class AlignmentReader
*
* \brief A reader for alignments from files in different formats.
*
* This class provides the functionality of a reader for alignments.
* The alignments will be read from file. Currently, we support only
* the Phylip data format.
*
*
* \author Sebastian Höhna
*
*/
class AlignmentReader {

public:
    AlignmentReader(void);
    ~AlignmentReader(void);

    Alignment                           readPhylip(const std::string &fn);

};

#endif /* Alignment_h */

This concludes the definition of our AlignmentReader in the file AlignmentReader.h.

#include <iostream>     // for std::cerr
#include <sstream>      // for splitting strings
#include <fstream>      // for reading from a file

#include <iterator>     // std::istream_iterator

#include <string>
#include <vector>

#include "Alignment.h"
#include "AlignmentReader.h"
/**
 * A default constructor that does nothing besides creating an object of this class.
 */
AlignmentReader::AlignmentReader( void )
{
    // nothing to initialize
}
/**
* A default destructor that does nothing because we don't have any allocated variables.
*/
AlignmentReader::~AlignmentReader( void )
{
    // nothing to delete here
}
Alignment AlignmentReader::readPhylip(const std::string& fn)
{

    // first, we open the file stream
    // this uses a ifstream object and needs the filesname
    std::ifstream in_stream(fn);
    // let's see if the stream was opened
    if (!in_stream)
    {
        // it was not opened, so the file did not exist
        // we need to tell the user and terminate
        std::cerr << "Cannot open file \"" + fn + "\"" << std::endl;
        exit(1);
    }

    // now we can initialize some helper variables
    // we a string for the current line of the file
    std::string linestring = "";
    // we also need the current taxon number
    size_t taxon_num = 0;
    // the number of taxa
    size_t num_taxa = 0;
    // the number of sites
    size_t num_sites = 0;
    // and if all the taxon names have been read in
    bool taxon_names_read = false;
    // finally we also need a vector of strings for the taxon names
    std::vector<std::string> taxon_names;

    // first, we need to read the number of taxa and sites
    // this should be the first line, so we load the first line into the linestring variable
    getline(in_stream, linestring);

    // now we split the line into tokens separated by a white space
    // this assumme that the number of taxa and sites are provided in this line with only whitespaces separating them
    // we create stringstream for this
    std::istringstream word_iss(linestring);
    // and then separate the line by whitespaces
    std::vector<std::string> words((std::istream_iterator<std::string>(word_iss)),
                                    std::istream_iterator<std::string>());

    // now we store the number of taxa and sites into our variables
    num_taxa = std::stoi( words[0] );
    num_sites = std::stoi( words[1] );
    // with these two, we can now create our alignment
    Alignment this_alignment = Alignment(num_taxa, num_sites);
    // for the actual data, and for convenience, we create a data matrix that has num_taxa rows but no columns yet
    std::vector<std::vector<char> > matrix = std::vector<std::vector<char> >( num_taxa, std::vector<char>() );

    // then we start to loop over the entire file
    // we do this with an endless loop which we will stop when the end of file has been reached.
    while ( true )
    {
        // first, check if there are more lines to be read
        bool more_lines_in_file = getline(in_stream, linestring).good();
        // then, we skip forward over any empty line
        while ( more_lines_in_file && linestring == "" )
        {
            // check again if we reached the end of file
            more_lines_in_file = getline(in_stream, linestring).good();
        }
        // if we reached the end of the file, then we should break out of endless loop
        if ( more_lines_in_file == false )
        {
            // this break command stops the endless loop
            break;
        }

        // now we split the line by whitespaces using again a stringstream
        std::istringstream iss(linestring);
        std::vector<std::string> tokens((std::istream_iterator<std::string>(iss)),
                                        std::istream_iterator<std::string>());

        // we then iterate over all the tokens/words that we received.
        for ( int i=0; i<tokens.size(); i++ )
        {
            // the first word should be the taxon name
            // this should only be done if we haven't read all taxon names in interleaved file format
            if ( i == 0 && taxon_names_read == false )
            {
                // so we push the first word to the end of the names of the taxa
                taxon_names.push_back(tokens[i]);
            }
            else
            {
                // otherwise, we have an actual sequence
                const std::string &the_sequence = tokens[i];
                // so we add the sequence to the end of the sequence for this taxon
                // thus, we append our data matrix using the insert function
                matrix[taxon_num].insert(matrix[taxon_num].end(), the_sequence.begin(), the_sequence.end());
            }
        }

        // since we finished we this line, we need to move on to the next taxon.
        taxon_num++;
        // if we have finished with all taxa, then we start with the first one again (for interleaved files)
        if ( taxon_num == num_taxa )
        {
            // hence, set the taxon index to 0 (we start counting with 0)
            taxon_num = 0;
            // and remember that we read all taxon names
            taxon_names_read = true;
        }

    } // the end of our while ( true ) loop


    // always remember to close the file
    in_stream.close();

    // now we propagate the taxon names and data matrix into our alignment
    this_alignment.setTaxonNames( taxon_names );
    this_alignment.setMatrix( matrix );

    // and finally return this new alignment
    return this_alignment;
}

Section 2.4: Putting it together

Finally, so that you can test your program, you need to add your new classes to the main function. Change your main class to something like this.

#include <iostream>     // for std::cout

#include "Alignment.h"
#include "AlignmentReader.h"

int main (int argc, char* argv[]) {

    // print header
    std::cout << "BabyBayes v1.0\n";
    std::cout << "Sebastian Höhna\n";
    std::cout << "Ludwig-Maximilians-Universität München\n";
    std::cout << "\n\n";


    // read the data
    AlignmentReader reader;
    Alignment my_alignment = reader.readPhylip( "primates.phy" );
    my_alignment.print(std::cout);

    return 0;
}

Make sure that your Makefile has the correct source and object files:

SRC       = Alignment.cpp AlignmentReader.cpp main.cpp
OBJECTS   = Alignment.o AlignmentReader.o main.o

Then you can compile BabyBayes using your Makefile and run it. See what it does!

Section 2.5: Exercises

Subsection 2.5.1: Reading in files in fasta format

References