Team II Webserver Group: Difference between revisions

From Compgenomics 2018
Shrey (talk | contribs)
 


==Web Server==
==Introduction==
Under Construction
 
 
===Background And Goal===
We obtained our raw data from Dr. David S. Weiss' lab (ARC, Division of Infectious Diseases). They provided us with sample ID numbers, and we downloaded 258 paired-end raw read sets of ''Klebsiella'' spp., sequenced on an Illumina MiSeq, from the NCBI SRA database. The goal of the class was to provide insight into the major issue of antibiotic resistance.
''Klebsiella'' is a genus of nonmotile, Gram-negative, oxidase-negative, rod-shaped bacteria with a prominent polysaccharide-based capsule. ''Klebsiella'' species are found everywhere in nature. Members of the genus are part of the normal flora of the nose, mouth, and intestines of humans and animals.
Species of ''Klebsiella'' include ''K. granulomatis'', ''K. oxytoca'', ''K. michiganensis'', and ''K. pneumoniae'' (the type species, with subspecies ''K. p.'' subsp. ''ozaenae'', ''K. p.'' subsp. ''pneumoniae'', and ''K. p.'' subsp. ''rhinoscleromatis'').
 
Our job as the '''Web Server Team''' was to provide the following:
 
1. An easy-to-use tool to help distinguish between ''Klebsiella'' phenotypes, by implementing the work of the comparative genomics group.
 
2. A robust and easy-to-use web-based de-novo assembly tool.
 
3. A feature to visualize and download the results of the 258 genomes.
 
Here, we build a web server that ideally predicts a phenotype based on the genetic information it is given.
 
===Design Principles===
1. Minimalistic
 
2. Mobile Friendly
 
3. Short Load Time
 
4. Contrasting Colors
 
===Functionalities Offered===
 
The Predictive Web Server allows the user to predict a number of phylogenetic features based on the input sequence. The user can use the FASTA file from the assembly above or can input their own sequences. Currently, there are two options for prediction:
 
1. Predict just the species and strain of the input sequence - uses StrainSeeker to determine the species and the strain.
 
2. Predict other phylogenetic features -
 
a) Virulence Factors
 
b) AMR Genes
 
c) Resistance Mechanisms
 
d) Drug Classes
 
The second option requires users to enter their email address, since the prediction usually takes a long time. The user receives an email with a link to the predictions. The AMR-related predictions (b to d) are made using the CARD Resistance Gene Identifier (RGI). The Comprehensive Antibiotic Resistance Database (CARD) provides data, models, and algorithms relating to the molecular basis of antimicrobial resistance.
 
 
'''Genome Assembly'''
 
Users will have the option of uploading either an assembled genome or short reads from NGS methods. If short reads are provided, a de novo assembly will be performed using the assembler Skesa. The resulting assembly will be available for download and will be used for downstream processes.
 
'''Strain and Species Identification'''
 
Strain identification is performed using k-mer based approaches in StrainSeeker. Each user-provided input sample is reduced to a pool of unique k-mers that is compared to k-mer pools from samples in the NCBI sequence database. The observed and expected k-mer pool overlaps between user-provided samples and NCBI-samples are used to taxonomically place samples with unknown strain identity.
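 
The idea of comparing k-mer pools can be illustrated with a minimal Python sketch. This is not StrainSeeker's actual algorithm (which compares observed and expected overlaps); it only shows how a sample can be reduced to a set of unique k-mers and scored against reference pools. The function names, the value of <code>k</code>, and the toy sequences are assumptions made for illustration.

```python
def kmer_pool(sequence, k):
    """Return the set of unique k-mers found in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_fraction(sample_pool, reference_pool):
    """Fraction of the reference k-mers that are also present in the sample."""
    if not reference_pool:
        return 0.0
    return len(sample_pool & reference_pool) / len(reference_pool)


def best_match(sample, references, k):
    """Score the sample against every reference and return the top hit.

    `references` maps a (hypothetical) strain name to its sequence.
    """
    sample_pool = kmer_pool(sample, k)
    scores = {
        name: shared_fraction(sample_pool, kmer_pool(seq, k))
        for name, seq in references.items()
    }
    return max(scores, key=scores.get), scores


# Toy example with made-up sequences and strain names.
toy_references = {"strain_A": "ACGTACGTGA", "strain_B": "TTTTTTTTTT"}
top_name, all_scores = best_match("ACGTACGTGA", toy_references, k=4)
```

A real database would hold precomputed k-mer pools rather than rebuilding them for each query.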
 
'''Antibiotic-resistance Profiling'''
 
Each user-provided sample will be compared to the Comprehensive Antibiotic Resistance Database (CARD) using the toolkit RGI. RGI discovers high-confidence homologues of known antibiotic-resistance genes using Diamond homology searches. RGI also incorporates SNP models to predict genetic variants that are likely to confer new antibiotic resistances. The entire antibiotic profile of each strain is visualized in a wheel-chart (provided by RGI) that allows the user to explore results by drug class, mechanism of resistance, and antibiotic target.
 
'''Virulence Factor Profiling'''
 
Each user-provided sample will be blasted against the Virulence Factor Database (VFDB) to identify homologues of known virulence factors. A non-redundant blast output (outfmt 6) will be provided for download by the user.
 
==Tools==
 
===Skesa===
Skesa is the most recent tool in our list and is currently used by NCBI. Information about the algorithm is currently unavailable. De novo sequencing refers to sequencing a novel genome where there is no reference sequence available for alignment. Sequence reads are assembled as contigs, and the coverage quality of de novo sequence data depends on the size and continuity of the contigs (i.e., the number of gaps in the data).
 
To compare the performance of the above tools, we use Quast.
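 
One of the contiguity metrics that Quast reports is N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below shows how that value can be computed from a list of contig lengths. It is an illustration only, not Quast's own code, and the example lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the total
    assembly length. Returns 0 for an empty assembly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    return 0


# Made-up contig lengths: total 280, half 140, reached at the 80 bp contig.
example_n50 = n50([100, 80, 50, 30, 20])
```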
 
===StrainSeeker===
 
 
StrainSeeker is a program for detecting bacterial strains from raw sequencing reads. Compared to other similar programs, it offers the following advantages:
 
1. Fully customizable database - use your own strains of interest or download our database
 
2. Detect novel strains that are related to strains in the database
 
3. Quickly handle large amounts of data
 
4. Results given all the way down to the strain level!
 
'''Input'''
 
Fastq or Fasta sample
 
====Output tree====
 
[[File: StrainSeeker.png |700px| center ]]
 
===CARD AND RGI===
The Comprehensive Antibiotic Resistance Database ("CARD") provides data, models, and algorithms relating to the molecular basis of antimicrobial resistance.
* The CARD provides curated reference sequences and SNPs organized via the Antibiotic Resistance Ontology ("ARO"). These data can be browsed on the website or downloaded in a number of formats.
* These data are additionally associated with detection models, in the form of curated homology cut-offs and SNP maps, for prediction of resistome from molecular sequences.
* These models can be downloaded or can be used for analysis of genome sequences using the Resistance Gene Identifier ("RGI"), either online or as a stand-alone tool.
 
The algorithm used is as follows:
 
* Open Reading Frame (ORF) prediction using Prodigal
* Homolog detection using Diamond
* Strict significance based on CARD curated bitscore cut-offs. Hits of 95% identity or better are automatically listed as Strict.
* All results organized by revised ARO classification: AMR Gene Family, Drug Class, and Resistance Mechanism. Support added for low quality/coverage assemblies, metagenomic merged reads, small plasmids or assembly contigs.
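 
The Strict rule described in the list above can be sketched as a small Python check. This is a simplified illustration of the rule as stated on this page, not RGI's implementation; the function name, parameter names, and example values are hypothetical.

```python
def is_strict(bitscore, curated_cutoff, percent_identity, identity_override=95.0):
    """Decide whether a homology hit counts as Strict.

    A hit is Strict when its bitscore reaches the curated cut-off of the
    matching detection model, or when its identity is at least
    `identity_override` percent (95% by default, as described above).
    """
    return bitscore >= curated_cutoff or percent_identity >= identity_override


# Hypothetical hits: (bitscore, curated cut-off, percent identity).
example_hits = [(500.0, 400.0, 80.0), (300.0, 400.0, 96.0), (300.0, 400.0, 90.0)]
strict_flags = [is_strict(b, c, p) for b, c, p in example_hits]
```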
 
[[File: RGI.png |500px| center ]]
 
''Visualization of RGI results on Web Server''
 
===Virulence Factor Database (VFDB)===
 
'''What Are Virulence Factors?'''
 
Virulence factors refer to the properties (i.e., gene products) that enable a microorganism to establish itself on or within a host of a particular species and enhance its potential to cause disease. Virulence factors include bacterial toxins, cell surface proteins that mediate bacterial attachment, cell surface carbohydrates and proteins that protect a bacterium, and hydrolytic enzymes that may contribute to the pathogenicity of the bacterium.
 
The virulence factor database (VFDB) is dedicated to providing up-to-date knowledge of virulence factors (VFs) of various bacterial pathogens.
 
 
 
'''How to get virulence factors'''
 
1. Blastn against VFDB
 
2. Remove redundant matches, output list of unique VF homologues
 
3. Report accession numbers, blast scores, and positions in the query
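 
The three steps above can be sketched in Python for BLAST tabular output (outfmt 6), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. The sketch keeps the highest-scoring hit per VFDB subject, which is one possible definition of "non-redundant" and is an assumption, as are the function name and the made-up accessions in the example.

```python
import csv
import io

# Default column order of BLAST tabular output (outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def unique_vf_hits(blast_tsv):
    """Keep the best-scoring hit per subject and report where it matched.

    `blast_tsv` is the text of an outfmt 6 file. Returns a list of dicts
    sorted by descending bitscore.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(blast_tsv), fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["sseqid"])
        if current is None or score > current["bitscore"]:
            best[row["sseqid"]] = {
                "accession": row["sseqid"],
                "bitscore": score,
                "qseqid": row["qseqid"],
                "qstart": int(row["qstart"]),
                "qend": int(row["qend"]),
            }
    return sorted(best.values(), key=lambda hit: hit["bitscore"], reverse=True)


# Made-up outfmt 6 rows; the accessions are placeholders, not real VFDB records.
example_tsv = (
    "contig1\tVF_0001\t98.5\t300\t4\t0\t10\t309\t1\t300\t1e-150\t540\n"
    "contig2\tVF_0001\t90.0\t300\t30\t0\t5\t304\t1\t300\t1e-100\t380\n"
    "contig1\tVF_0002\t95.0\t200\t10\t0\t400\t599\t1\t200\t1e-90\t330\n"
)
example_hits = unique_vf_hits(example_tsv)
```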
 
==Our Web Page==
===Home Page===
Basic Layout
[[File: homepage_final.png |700px| center ]]
 
===Assembly Page===
 
====Steps to Run Assembly====
 
1. Click on the Assemble button on the Home Page
 
2. Provide the input sequences to the server in one of the following ways:
 
a) Enter a single SRR ID
 
b) Enter a comma-separated list of SRR IDs
 
c) Upload the sequence reads as a FASTQ file
 
3. Enter an email address
 
4. The user then receives an email with a link to download the finished assemblies.
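 
As a sketch of how a comma-separated list of SRR IDs could be checked before an assembly job is queued, the Python function below splits the input, normalizes it, and rejects anything that does not look like an SRR accession. The function name and the pattern (the prefix "SRR" followed by digits) are assumptions for illustration; this is not the server's actual validation code.

```python
import re

# Assumed shape of an SRR run accession: "SRR" followed by one or more digits.
SRR_PATTERN = re.compile(r"^SRR\d+$")


def parse_srr_ids(raw):
    """Split a comma-separated string into a list of unique SRR IDs.

    Whitespace is trimmed, IDs are upper-cased, empty entries are ignored,
    and duplicates are dropped while keeping the original order.
    Raises ValueError if any entry does not match SRR_PATTERN.
    """
    ids = [token.strip().upper() for token in raw.split(",") if token.strip()]
    invalid = [entry for entry in ids if not SRR_PATTERN.match(entry)]
    if invalid:
        raise ValueError("Not valid SRR IDs: " + ", ".join(invalid))
    return list(dict.fromkeys(ids))


# Made-up IDs used only to show the normalization.
example_ids = parse_srr_ids("SRR1234567, srr7654321 ,SRR1234567")
```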
 
 
[[File:Assembly_Page.png |800px| center ]]
 
===Downloads Page===
This section contains the results of analyses performed by the various groups as mentioned above. The user can download these results and use them according to their needs. Currently, there are two types of files available. For the 258 samples of ''Klebsiella'' sp., the server hosts the assembled sequences and the GFFs of predicted genes along with their annotated functions.
 
 
[[File:Downloads_Page.png |800px| center ]]

Latest revision as of 13:40, 25 April 2018
