Introduction
A DeoxyriboNucleic Acid (DNA) macromolecule can be coded as a string over a four-letter alphabet. The letters A, C, G and T code, respectively, the bases Adenine, Cytosine, Guanine and Thymine. DNA sequencing consists in determining the exact order of these bases in a DNA macromolecule, and DNA sequencing technology has played a key role in the advancement of Molecular Biology. This article discusses the three main steps (preprocessing, processing and postprocessing) of Next-Generation Sequencing (NGS) data assembly.
DNA Sequencing Technologies
The first-generation sequencing technology, also called Sanger sequencing technology, was developed in the mid-1970s. It was based on sequencing machines like Applied Biosystems’ 3130xL-3730xL and Beckman’s GeXP Genetic Analysis System. Sequencing the first human genome with this technology took about ten years, required the cooperation of many laboratories around the world and cost ~3 billion US dollars. A few years after the publication of the first human genome in 2003, the Second-Generation Sequencing (2GS) technology appeared. Compared to Sanger sequencing machines, 2GS machines run much faster, with significantly lower production costs and much higher throughput. Their output takes the form of short reads, i.e., strings coding portions of DNA macromolecules, with lengths varying from ~30 to ~400 base pairs (bps). Available 2GS machines include Roche’s 454 Genome Sequencer FLX, Illumina’s HiSeq 2000, Life Technologies’ SOLiD3 and Helicos’ Heliscope.
By using 2GS technology, biologists can sequence more than five human genomes in a single run and produce the related reads in days, at a cost of less than 5,000 US dollars per genome. The Third-Generation Sequencing (3GS) technology is now appearing, with even lower production costs and even higher throughput. Available 3GS machines include Pacific Biosciences’ PacBio RS, Complete Genomics' cPAL and Ion Torrent's PGM. With Oxford Nanopore's GridION, there is already talk of a Fourth-Generation Sequencing (4GS) technology. In this article, we group the sequencing technologies that appeared after Sanger sequencing, i.e., 2GS, 3GS, 4GS and so on, under the name Next-Generation Sequencing (NGS) technologies. The set of reads obtained for a DNA macromolecule with an NGS technology is called NGS data.
Steps for NGS Data Assembly
NGS data assembly can be achieved in three main steps: preprocessing, processing and postprocessing. In what follows, we introduce each of these steps.
Preprocessing
The preprocessing step consists of error correction, removal of redundant reads and removal of low-quality reads. Indeed, during the sequencing process:
- Sequencing errors may occur, resulting in errors in the reads. There are three types of errors: insertion, deletion and substitution of characters.
- Portions of a DNA macromolecule may be sequenced many times, producing duplicate reads or reads that are substrings of longer ones. Such reads are redundant, and removing them saves computation without any loss of information.
- Reads, or portions of reads, that contain many errors are low-quality data. They must be removed; otherwise, they lead to wrongly assembled strings.
The preprocessing step thus improves the results of the assembly and reduces the computing time and/or the memory space required by the processing and postprocessing steps.
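As an illustration of redundant-read removal, here is a minimal Python sketch. It assumes that a read is redundant when it is an exact copy of another read or an exact substring of a longer one. The function name `remove_redundant_reads` is hypothetical, and the sketch ignores sequencing errors and reverse-complement strands, which a real preprocessing tool would have to handle.

```python
def remove_redundant_reads(reads):
    """Drop exact duplicates and reads fully contained in another kept read.

    Assumption: redundancy means exact string equality or exact containment.
    """
    # Remove exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(reads))
    # Visit the longest reads first, so that each read is only compared
    # against reads of equal or greater length that were already kept.
    unique.sort(key=len, reverse=True)
    kept = []
    for read in unique:
        if not any(read in longer for longer in kept):
            kept.append(read)
    return kept


reads = ["ACGTAC", "GTAC", "ACGTAC", "TTGCA", "CGT"]
print(remove_redundant_reads(reads))  # ['ACGTAC', 'TTGCA']
```

The pairwise containment test is quadratic in the number of reads, which is fine for a toy example but would need an index (for instance, a suffix structure) on real NGS data.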
Processing
The processing step consists of assembling the obtained reads to infer a string that, hopefully, codes a DNA macromolecule, or a part of a DNA macromolecule. Various approaches can be adopted to reach this goal:
Greedy approach: This approach proceeds as follows. First, we identify the reads that overlap with each other and select the two reads that have the highest overlap score.
Then, we remove the two selected reads from the current set of reads, merge them into a single string and add the resulting string to the current set of reads.
We repeat this process until no more reads can be merged or we obtain a single long string coding a DNA macromolecule, or a part of a DNA macromolecule.
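The greedy steps above can be sketched in Python as follows. This is a toy version under stated assumptions: the overlap score is taken to be the length of the longest exact match between a suffix of one read and a prefix of another, and overlaps shorter than `min_len` are ignored. The names `overlap`, `greedy_assemble` and `min_len` are illustrative and do not come from any particular assembler.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # Score every ordered pair of distinct reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no more reads can be merged
        # Merge the best pair and put the result back into the set of reads.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))  # ['ATGCGTACCTTG']
```

Because the choice at each step is purely local, this strategy can merge reads wrongly when the DNA macromolecule contains repeated regions.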
Overlap-Layout-Consensus (OLC) approach: This approach proceeds as follows. First, we construct a graph in which each node represents a read and there is an arc from node i to node j if and only if a suffix of the ith read overlaps a prefix of the jth one.
Then, we identify paths in the built graph. Each path represents a succession of overlapping reads that make up a longer string, called a contig.
Finally, we try to order the contigs by their overlaps to infer a single long string coding a DNA macromolecule, or a part of a DNA macromolecule.
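The overlap and layout stages can be illustrated with the following Python sketch. It assumes exact suffix-prefix overlaps of at least `min_len` characters and uses a deliberately naive layout: start from a read with no incoming arc and always follow the largest overlap to an unvisited read. All function names are hypothetical, the input is assumed to be non-empty, and a real OLC assembler would add a consensus stage that resolves disagreements between overlapping reads.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Arc i -> j, weighted by overlap length, when read i overlaps read j."""
    graph = {i: {} for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[i][j] = n
    return graph


def simple_layout(graph):
    """Naive layout: walk from a source node, taking the largest overlap each time."""
    has_incoming = {j for arcs in graph.values() for j in arcs}
    start = next((i for i in graph if i not in has_incoming), 0)
    path, seen = [start], {start}
    while True:
        candidates = {j: n for j, n in graph[path[-1]].items() if j not in seen}
        if not candidates:
            return path
        nxt = max(candidates, key=candidates.get)
        path.append(nxt)
        seen.add(nxt)


def spell_contig(reads, graph, path):
    """Concatenate the reads along a path, skipping the overlapping prefixes."""
    contig = reads[path[0]]
    for prev, cur in zip(path, path[1:]):
        contig += reads[cur][graph[prev][cur]:]
    return contig


reads = ["CGTACC", "ATGCGT", "ACCTTG"]
graph = build_overlap_graph(reads)
path = simple_layout(graph)
print(path, spell_contig(reads, graph, path))  # [1, 0, 2] ATGCGTACCTTG
```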
De Bruijn graph approach: This approach proceeds as follows. First, we construct a De Bruijn graph in which each node represents a substring of length k, called a k-mer, extracted from an input read, and there is an arc from node i to node j if and only if the last (k-1) characters of the k-mer represented by node i are the first (k-1) characters of the k-mer represented by node j.
Then, we identify an Eulerian path in the constructed graph. This path represents a string coding a DNA macromolecule, or a part of a DNA macromolecule.
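The following Python sketch illustrates this approach on a toy input. It makes one assumption that the description above does not state: an arc is added only between two k-mers that appear consecutively in some read, which is a common restriction in practice. The Eulerian path is found with Hierholzer's algorithm, and the sketch assumes that such a path exists. The function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are k-mers; one arc per distinct pair of consecutive k-mers in a read.

    Assumption: arcs are restricted to k-mer pairs observed side by side in a read.
    """
    arcs = set()
    for read in reads:
        for i in range(len(read) - k):
            arcs.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for src, dst in sorted(arcs):
        graph[src].append(dst)
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming arc, if any.
    start = next(
        (u for u in sorted(nodes) if len(graph.get(u, [])) - in_deg.get(u, 0) == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused arc
        else:
            path.append(stack.pop())  # dead end: emit the node
    return path[::-1]


def spell(path):
    """Rebuild the string: first k-mer, then the last character of each next k-mer."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn_graph(["ATGCGT", "GCGTAC", "GTACCA"], k=3)
print(spell(eulerian_path(graph)))  # ATGCGTACCA
```

On real data, sequencing errors and repeated regions create extra branches in the graph, so an Eulerian path may not exist or may not be unique.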
Postprocessing
The postprocessing step consists mainly of validating the obtained string, which is supposed to code a DNA macromolecule, or a part of a DNA macromolecule. Several techniques can be adopted to do this. For example, we can:
- Map the reads that were used to produce the assembled string back onto it.
- Identify and count highly conserved genes expected to be present in the assembled string.
- Evaluate the quality of the assembly.
- Compare the obtained result with those of other assembly algorithms.
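As an example of the first technique, the following Python sketch maps reads back onto an assembled string and reports how many reads cover each position. It assumes exact matching at the first occurrence only, which is a strong simplification: real read mappers tolerate mismatches and handle repeated regions. The function name `map_reads` is hypothetical.

```python
def map_reads(assembly, reads):
    """Map each read to its first exact occurrence in the assembly.

    Returns the per-position coverage and the list of unmapped reads.
    Assumption: exact matching only, no tolerance for sequencing errors.
    """
    coverage = [0] * len(assembly)
    unmapped = []
    for read in reads:
        pos = assembly.find(read)
        if pos == -1:
            unmapped.append(read)
            continue
        for p in range(pos, pos + len(read)):
            coverage[p] += 1
    return coverage, unmapped


coverage, unmapped = map_reads(
    "ATGCGTACCA", ["ATGCGT", "GCGTAC", "GTACCA", "TTTT"]
)
print(coverage)  # [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
print(unmapped)  # ['TTTT']
```

Positions with zero coverage, or a large share of unmapped reads, suggest that part of the assembled string is not supported by the data.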