Team II Genome Assembly Group

From Compgenomics 2018
Jump to navigation Jump to search

Introduction

Background

Antibiotic resistance has been called one of the world’s most pressing public health problems. Antibiotic resistance is the ability of bacteria to resist the effects of an antibiotic. It occurs when bacteria change in a way that reduces the effectiveness of drugs, chemicals, or other agents designed to cure or prevent infections. The bacteria survive and continue to multiply, causing more harm (Figure 1.a.). Antibiotic resistance can cause illnesses that were once easily treatable with antibiotics to become dangerous infections, prolonging suffering for children and adults. Antibiotic-resistant bacteria can spread to family members, schoolmates, and co-workers, and may threaten your community (Figure 1.b). Antibiotic-resistant bacteria are often more difficult to kill and more expensive to treat and in some cases, can lead to serious disability or even death here.

50

Figure 1: a. How antibiotic resistance happens, b. How antibiotic resistance spreads

Data

Emory Antibiotic Resistance Center (ARC) in Emory University, School of medicine tries to better understand antibiotic resistance to combat this crisis and improve human health. Their goals include learning how antibiotic resistance develops, optimizing the way antibiotics are used to preserve their power, and discovering novel therapeutics and vaccines to directly combat antibiotic-resistant pathogens. Solving the crisis of antibiotic resistance requires a multi-faceted approach that crosses traditional boundaries [2].

They Provided us with a sample of 262 pair-end raw reads sequencing of Klebsiella spp, from illumina MiSeq.

Klebsiella is a genus of nonmotile, Gram-negative, oxidase-negative, rod-shaped bacteria (Figure 2) with a prominent polysaccharide-based capsule. Klebsiella species are found everywhere in nature. The members of the genus Klebsiella are a part of the human and animal's normal flora in the nose, mouth and intestines. The species of Klebsiella are all gram-negative and non-motile. They tend to be shorter and thicker when compared to others in the Enterobacteriaceae family. The cells are rods in shape and generally measures 0.3 to 1.5 µm wide by 0.5 to 5.0 µm long. They can be found singly, in pairs, in chains or linked end to end. Klebsiella can grow on ordinary lab medium and do not have special growth requirements, like the other members of Enterobacteriaceae. Some of Klebsiella types are: K.granulomatis, K. oxytoca, K. michiganensis and K. pneumoniae (type-species: K. p. subsp. ozaenae, K. p. subsp. pneumoniae, K. p. subsp. rhinoscleromatis) [3]. Though, they have been extensively studied, for example, only four complete genomes of K. pneumoniae were available till 2011 [4]. To better understand the multidrug resistance factors in Klibsiella, we need to determine genome DNA sequences of strains.

Figure 2: Scanning electron microscope image of Klebsiella pneumoniae. From: Bioquell.com

Objectives

  • To distinguish between susceptible and heteroresistant strains/species
  • To discover genomic determinants of antibiotic resistance
  • To develop a predictive web server

Genome Assembly Pipeline

Pre-Assembly

Before doing any assembly, we would like to got some basic statistics about our dataset as:

1. Per base sequence quality

2. Per base N content

3. Per base GC content

4. Overrepresented sequences

5. Sequence length distribution

We use FastQC [5] for receiving these information and making some visualization. These kind of information will give us an overall view and help us to improve the quality of our sequences, if it is necessary, as raw inputs for the assemblers . For example, The presence of poor quality or technical sequences such as adapters in next-generation sequencing (NGS) data can easily affect the quality of assembly. We used Trim Galore [6] and Sickle [7] to guarantee the best performance for each assembly method that we will use.

Trim Galore is a wrapper script to automate quality and adapter trimming as well as quality control. It specifically designed for Illumina data and has an Illumina adapter library.

Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads and also determines when the quality is sufficiently high enough to trim the 5'-end of reads (Figure 3).

Figure 3: How Sickle is working

We used the following parameters for Trim Galore and Sickle command line and the results for before and after trimming shown in Figure 4 for one of our samples.

  trim_galore --illumina -q 20 --clip_R1 20 --clip_R2 20 --three_prime_clip_R1 5 --three_prime_clip_R2 5 --paired <Pair1.fastq> <Pair2.fastq> -o <Output>
  sickle pe -q 20 -f <Pair1_adapter_trimmed.fq> -r <Pair2_adapter_trimmed.fq> -t sanger -o <Pair1_Output.fastq> -p <Pair1_Output.fastq> -s <singles_Output.fastq>


Figure 4: Per base sequence quality and adapter content of one of the sample before and after trimming

After trimming the adapter and the low quality reads, we can check the quality of all sequences at one with the MultiQC [8] tool.

MultiQC is a general use tool, perfect for summarising the output from numerous bioinformatics tools .

Figure 5: Summary of trimmed reads from MultiQC

Assembly

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 30000 bases, depending on the technology used [8].

Reference Assembly

A type of genome assembly where the reads are mapped (or compared) to a known version of the organism’s genome. It saves time and is significantly more accurate and assemble the genome with less contigs. But, it has some disadvantage too. We would not discover new genes with this method of assembly, new (completely different) sequences may lost. It requires a reference that is very similar to the sequenced data. If multiple positions on the reference genome are equally likely for a read, then, reads are ignored, placed at multiple locations or placed at the first likely position.

As we don't know the species of the Klebsiella, cannot simply choose a Klebsiella species to use as reference genome. Therefore, we need to find the reference genomes from the 250+ Klebsiella genomes which are closest related to our reads. For this aim, we used 'Mashtree [12]. It creates a tree based on Mash distance for our trimmed reads and 250+ Klebsiella genomes. Then, we identify representative references.

Nest step, is to make a tree of the closely related reads to create a custom database for StrainSeeker with ALFRED-G [13]. It creates distance matrix and then we convert it to Newick Tree (accepted format for StrainSeeker). Ad finally, we use StrainSeeker. It is a program for detecting bacterial strains from raw sequencing reads. It gives us the option of fully customizable database of our own strains of interest. It can detect novel strains that are related to strains in the database [10, 11]. StrainSeeker takes the tree generated from the previous step, and creates a customized database of reference genomes we believe to be closely related to our dataset. Results from StrainSeeker will tell us which Klebsiella strain is related to each single read. Then, we categorize all the reads to different bin based on the references and assemble that batch based on the assigned reference.

De Novo Assembly

De novo sequencing refers to sequencing a novel genome where there is no reference sequence available for alignment. Sequence reads are assembled as contigs, and the coverage quality of de novo sequence data depends on the size and continuity of the contigs (ie, the number of gaps in the data).

We chose assemblers that follow De Bruijn graph based methods for our samples, as we are handling a large volume of samples, speed is of high priority.

De Bruijn graphs:


The original sequence is divided into kmers, and each kmer is an node in the graph. Adjacent kmers are connected based on overlap with neighbors. But a real genome is quite unlike the example above, introductions of repeat regions complicate the problem.



Repeats introduce ambiguity in the path the graph follows, and as the number of repeats increase, the graph becomes increasingly harder to traverse. This can be resolved by taking into account the overlap information and the multiplicity of the reads.


Tool testing:

We tested 10 of the largest samples in our dataset as file size would be the speed-limiting step. We also ran each tool with default parameters as they all have methods in place that determine optimum k-mer lengths.

Rationale: The following tools were chosen based on existing literature on comparisons, GAGE evaluations, and the evaluations done by the 2016 and 2017 CompGenomics assembly groups. They are debruijn graph based assemblers that differ in their algorithms based on how they handle kmers.

1. SPAdes:

SPAdes uses 3 different kmer lengths to find an optimum traversal path. It also has a built in scaffolding tool that it uses to merge contigs. Detailed information on SPAdes can be found here

2. MaSuRCA:

MaSuRCA combines initially constructed kmers into "super reads which tend to be longer than typical Illumina reads. The subsequent steps use these super reads instead of the initial kmers. More information can be found here

3. Unicycler:

Unicycler applies SPAdes assembly, but tests across a wide range of k-mer lengths. It also has several built in steps for repeat resolution, scaffolding and final polishing. This produces high qualitiy assemblies but is most time and resource consuming. More information on Unicycler can be found here

4. Skesa:

Skesa is the most recent tool in our list and is currently used by NCBI. Information about the algorithm is currently unavailable. Skesa does not have inbuilt scaffolding,

Post Assembly

Improvement Scaffolding

After choosing the highest quality draft assemblies, contigs would be unordered mass of stretches of DNA. To improve it, scaffolding is used to bring order and direction to these unordered DNA. The tool called SSPACE is used in this step. Using paired end data of reads, algorithm of scaffolding attempts to join multiple contigs by closing the gaps between contigs and determining the order they are in.

The long reads are usually sequenced at a lower depth, whereas the short reads are sequenced at a higher depth. Both the long reads and the short reads complement each other to produce a better de novo assembly.

Assessment Scoring methods

Since different assembly methods are used, post assembly steps should be done for choosing the tool that generated the best quality. N50 and L50 are some of these scoring methods. By using these scoring methods, we can determine which assembly has the highest quality.


N50 Weighted Score (Contig)

L50 Weighted Score (Scaffold)

Simply, the better score means less number of contigs and longer length.

Results

References