Team II Functional Annotation Group: Difference between revisions

Revision as of 12:44, 11 April 2018

Introduction

Functional Annotation

Functional annotation is the process of identifying the locations of genes and all coding regions in a genome and determining the function of these genes. The information attached can describe biological functions, biochemical functions, or the regulation of gene expression.

Data

We were given 258 assembled genomes and predicted genes of Klebsiella spp.

Approach

Our group uses computational methods to functionally annotate the 258 Klebsiella genomes. The approach should be scalable and should reduce the size of the query set. The tools used fall into two types: ab-initio and homology-based. The two main deliverables were a) a wrapper script to run all the tools on given input files and b) one .gff file per genome.

To speed up the process without much loss of accuracy, we clustered the predicted genes with UCLUST, a centroid-based clustering algorithm.

UCLUST outputs two files. The first is a long FASTA file with all the unique genes. The second is a clustering file with details of all the "seeds" and "hits". ("Hits" are the genes having a level of similarity to the "seeds").
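As a rough illustration of how the clustering file can be read back, the sketch below groups hits under their seeds. It assumes the standard tab-separated .uc layout (record type in column 1, query label in column 9, seed label in column 10); the gene names in the example are made up, and this is not the parser used in our pipeline.

```python
def parse_uc(lines):
    """Group UCLUST records into {seed label: [hit labels]}.

    Assumes the tab-separated .uc layout: column 1 is the record type
    (S = seed, H = hit), column 9 the query label, column 10 the seed label.
    """
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # skip malformed lines
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # Register the seed even if it never receives a hit.
            clusters.setdefault(query, [])
        elif record_type == "H":
            clusters.setdefault(target, []).append(query)
        # Other record types (e.g. C cluster summaries) are ignored.
    return clusters


# Hypothetical records for illustration only.
example = [
    "S\t0\t900\t*\t*\t*\t*\t*\tgeneA\t*",
    "H\t0\t900\t100.0\t+\t0\t0\t900M\tgeneB\tgeneA",
    "S\t1\t450\t*\t*\t*\t*\t*\tgeneC\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tgeneA\t*",
]
clusters = parse_uc(example)
print(len(clusters), "clusters")
```

Keeping the seed-to-hit mapping is what lets an annotation made on a seed be copied back to every gene in its cluster.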

Plotting the number of clusters obtained at different identity thresholds gave the following plot:

We thus decided to use 100% identity clustering for the ab-initio tools and 97% for the homology-based tools.

General Tool

Prokka

Prokka[1] is a command-line tool for rapid annotation of bacterial genomes. It is easy to set up and provides standardized input and output file formats. It is known for annotating a bacterial genome rapidly and accurately, in about 10 minutes on a typical desktop computer. Prokka predicts coding sequences (Prodigal), rRNA (RNAmmer), tRNA (Aragorn), signal peptides (SignalP), and non-coding RNA (Infernal), and each of these can be switched on or off. Prokka performs annotation in five steps:

1. An optional search against a user-provided set of annotated proteins using BLAST+.
2. Determination of core bacterial proteins by searching UniProt using BLAST+ (the core set contains ~16,000 proteins).
3. An optional BLAST+ search against all proteins from finished bacterial genomes in RefSeq.
4. A search of the Pfam and TIGRFAMs databases using hmmscan from HMMER. Both are collections of protein families based on multiple sequence alignments and hidden Markov models.
5. Non-matches are labeled as hypothetical proteins.

Command: prokka --force --species <species name> --centre X --outdir <output directory> --prefix <file prefixes> --locustag <locus tag> --norrna --notrna <contig.fa>
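Since one deliverable is a wrapper script, a per-genome Prokka call could be assembled as in the sketch below. It only uses the flags listed in the command above and assumes that --centre and --locustag each take a value; the file names and the locus tag are placeholders, and this is not our actual wrapper.

```python
import shlex


def prokka_command(contigs, outdir, prefix, species, centre, locustag):
    """Return the Prokka argument list for one genome."""
    return [
        "prokka", "--force",
        "--species", species,
        "--centre", centre,
        "--outdir", outdir,
        "--prefix", prefix,
        "--locustag", locustag,   # assumed to take a value
        "--norrna", "--notrna",   # RNA prediction is skipped here
        contigs,
    ]


# Placeholder inputs for illustration only.
cmd = prokka_command("genome1.fa", "prokka_out/genome1", "genome1",
                     "pneumoniae", "X", "G1")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when genome file names contain spaces.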

Prokka produces 12 output files: .err, .faa, .fna, .ffn, .fsa, .gbk, .gff, .log, .sqn, .tbl, .tsv, .txt

We did not use Prokka in our final pipeline but we included its output for the Comparative Genomics group.

Protein-Coding Regions

Signaling Peptides

SignalP is a neural-network-based method that predicts signal peptide cleavage sites in amino acid sequences submitted in FASTA format.

Command: ./signalp -t gram- [-f short/long/all/summary] input_file.faa > output_file.out

Transmembrane Regions and Lipoproteins

LipoP is used for both transmembrane regions and lipoproteins. It is based on a hidden Markov model (HMM).

It predicts these four classes:

SpI: signal peptide (signal peptidase I)

SpII: lipoprotein signal peptide (signal peptidase II)

TMH: N-terminal transmembrane helix. This is generally not a very reliable prediction and should be tested. This part of the model is mainly there to avoid transmembrane helices being falsely predicted as signal peptides.

CYT: cytoplasmic. It really just means all the rest.

Command: perl LipoP -short <input FASTA> > <output GFF>

Lipoproteins

Operons

Operons are clusters of co-regulated genes that are physically close together in the genome. The genes are switched on and off together and are co-transcribed as a single mRNA. Accurate operon prediction can improve the functional annotation of the genes within them.

DOOR2 (Database of prOkaryotic OpeRons)[2] is a comprehensive public operon database and, as of April 2018, the largest available.

The standard workflow we follow is:

Download operon tables for Klebsiella pneumoniae
Extract fasta files based on GI numbers
Makeblastdb and blastp queries
Filter and match hits back with the operon tables

For the blastp results, we require at least 80% for both qcovs (query coverage) and pident (percent identity).
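The last workflow step, filtering the hits and matching them back to the operon tables, can be sketched as follows. The four-column row layout (qseqid, sseqid, pident, qcovs), the GI-to-operon dictionary, and all identifiers are assumptions made for illustration, since the exact table format is not given here.

```python
def operon_hits(blast_rows, gi_to_operon, min_pident=80.0, min_qcovs=80.0):
    """Map each query gene to the operon of its DOOR2 hit.

    blast_rows: iterable of (qseqid, sseqid, pident, qcovs) tuples (assumed layout).
    gi_to_operon: dict from subject identifier to operon ID (assumed lookup
    built from the downloaded operon table).
    """
    annotated = {}
    for qseqid, sseqid, pident, qcovs in blast_rows:
        # Both thresholds must be met.
        if float(pident) < min_pident or float(qcovs) < min_qcovs:
            continue
        operon = gi_to_operon.get(sseqid)
        if operon is not None:
            annotated[qseqid] = operon
    return annotated


# Hypothetical hits and operon table.
rows = [
    ("gene1", "gi1001", "95.2", "100"),  # passes both thresholds
    ("gene2", "gi1002", "70.0", "99"),   # identity too low
    ("gene3", "gi9999", "99.0", "85"),   # subject not in the operon table
]
table = {"gi1001": "operon_1", "gi1002": "operon_1"}
print(operon_hits(rows, table))
```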

The hits obtained from DOOR-BLAST for each genome were as follows:

Pathways and Ontology

eggNOG-mapper

eggNOG is a public database of Orthologous Groups (OGs) of proteins at different taxonomic levels. The recent version (4.5) expands the functional annotation of OGs to KEGG pathways (KO), SMART/Pfam domains, and Gene Ontology (GO) terms from the Gene Ontology Consortium.

The OGs and their annotations come from the eggNOG database (evolutionary genealogy of genes: Non-supervised Orthologous Groups).
It consists of 2031 and 1655 prokaryotic mappings, respectively. In addition to the command-line tool, eggNOG v4.5 is available through both a web interface and RESTful API.

Querying against the entire optimized bacterial database (-d bact) takes over 25 hours to complete. For the 258 genomes we received, we were able to restrict the database to a narrower class of proteins, which sped up the queries.

For general-purpose use, we suggest diamond mode: out of ~1.4 million sequences, our diamond results contained only ~10,000 fewer annotations than hmmer mode. Diamond has two advantages:
1. It does not rely on knowing the species in order to select an optimized, reduced database.
2. It is much faster than hmmer mode for higher sequence counts or larger databases.

Command line:

python emapper.py -i input_file.faa --output output_file -m [diamond|hmmer] -d [database]

Non-coding RNA

rRNA, tRNA, and sRNA

Please refer to "Team II Gene Prediction Group - RNA Prediction" [3]

CRISPR

CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats) are found in approximately 40% of sequenced bacterial genomes. They are reported to play roles in bacterial immunity, cell defense, DNA rearrangement, replication, and regulation. Klebsiella has an unusually high proportion of self-targeting spacers.

Tool

Version: Piler-CR 1.06
PILER-CR is a program designed specifically for the identification and analysis of CRISPR repeats. The program executes rapidly and has both high sensitivity and high specificity.

Input: FASTA

Output: Piler-CR specific report format

Performance on the reference genome:

Run-time: ~5 seconds for a 5 Mb genome
CRISPRs found: 2

Result

A total of 84 hits among 50 genomes.

Others

Antibiotic Resistance

Antimicrobial Resistance (AMR) genes were annotated from the Comprehensive Antibiotic Resistance Database (CARD) using the Resistance Gene Identifier (RGI) toolkit.

RGI uses a BLAST homology search to identify known genes in CARD, and also analyzes individual mutations likely to confer antibiotic resistance. For this reason, protein sequences clustered at 100% identity were used as input.

Command:

rgi -i assembled100_proteins.faa -t protein -n 2 -o OUTPUT

where -i = input file, -t = sequence type, -n = number of cores, -o = output file prefix

RGI outputs a .txt file listing the AMR genes and a .json file that can be used to visualize the results on the CARD website.

The Python script rgi2gff.py then converts the .txt output to GFF3 format.
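For readers unfamiliar with the target format, the sketch below writes one feature as a nine-column GFF3 line. It is not the rgi2gff.py script itself: the function name, the example coordinates, and the attribute values are hypothetical, and reading the RGI columns is left out.

```python
from urllib.parse import quote


def gff3_line(seqid, source, feature_type, start, end, strand, attributes,
              score=".", phase="."):
    """Format one feature as a tab-separated, nine-column GFF3 line."""
    # Characters with special meaning in column 9 (';', '=', '&', ',', '%',
    # tabs, newlines) are percent-encoded; spaces are kept readable.
    attr_text = ";".join(
        f"{key}={quote(str(value), safe=' :/()+-._')}"
        for key, value in attributes.items()
    )
    columns = [seqid, source, feature_type, str(start), str(end),
               str(score), strand, str(phase), attr_text]
    return "\t".join(columns)


# Hypothetical feature; a real file starts with the "##gff-version 3" header.
print("##gff-version 3")
print(gff3_line("contig_1", "RGI", "gene", 100, 960, "+",
                {"ID": "amr_1", "Name": "KPC-2"}))
```

Escaping the attribute values matters because gene descriptions often contain semicolons or commas, which would otherwise split the attribute column.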

On average, 41 AMR genes per genome were detected, with a minimum of 19 and a maximum of 51.

Virulence Factors

Virulence factors were obtained from:
1) VFDB
2) Victor
The Virulence Factor Database (VFDB) is a comprehensive, regularly updated database that curates information on virulence factors. It records structural features of the virulence factors, their functions, and the mechanisms pathogens use to circumvent host defenses and cause disease. We downloaded the full dataset, which covers all genes related to known and predicted virulence factors, rather than the core dataset, which contains only the genes associated with experimentally verified virulence factors.

To annotate, we first built a BLAST database from the combined VFDB and Victor nucleotide sequences and then ran BLAST against it.

Command used: blastn -db [database prepared using VFDB and Victor] -query [97% clustered fna file] -outfmt "6 qseqid qstart qend sseqid evalue sstart send sframe stitle pident qcovs" -evalue 1e-10 -max_hsps 1 -max_target_seqs 1 -out <output.txt>

These hits were then filtered, and only hits with >= 90% identity and >= 90% coverage were retained.
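The filter can be sketched in a few lines of Python. The column order is taken from the -outfmt string in the command above; the function name and the two example hits are hypothetical, and this is not necessarily how our own filtering was implemented.

```python
import csv
import io

# Column order from the -outfmt string of the blastn command.
COLUMNS = ["qseqid", "qstart", "qend", "sseqid", "evalue", "sstart",
           "send", "sframe", "stitle", "pident", "qcovs"]


def filter_hits(handle, min_pident=90.0, min_qcovs=90.0):
    """Yield hits (as dicts) whose identity and coverage meet both cutoffs."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        if float(row["pident"]) >= min_pident and float(row["qcovs"]) >= min_qcovs:
            yield row


# Two made-up hits: the second fails the identity cutoff.
text = (
    "gene1\t1\t900\tVFG0001\t1e-50\t1\t900\t1\texample factor\t98.5\t100\n"
    "gene2\t1\t300\tVFG0002\t1e-12\t5\t304\t1\tanother factor\t85.0\t95\n"
)
kept = list(filter_hits(io.StringIO(text)))
print([row["qseqid"] for row in kept])
```

For a real run, pass an open file handle for the BLAST output instead of the in-memory example.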

The results obtained were as follows:

Prophage Genes

Prophages are bacteriophage genomes integrated into the bacterial chromosome or an existing plasmid. This happens when the phage is in a latent phase, in which the viral genes are present in the bacterium without disrupting the cell. Prophage genes are one of the major sources of new genes and functions in bacterial genomes[4], including antibiotic resistance[5], virulence factors[6], and biofilm formation[7].

PHASTER

PHASTER (PHAge Search Tool Enhanced Release)[8] is a web server for the rapid identification and annotation of prophage sequences in bacterial genomes and plasmids using GLIMMER, BLAST, and DBSCAN[9]. It accepts either an assembled genome (with or without contigs) or a GenBank file and compares it against its viral and bacterial databases[10].

Input: FASTA or GenBank
Output: PHASTER specific report format
Upload Query to PHASTER Server: wget --post-file="Input_File" "http://phaster.ca/phaster_api" -O "Output_Status.txt"
Check Status: wget "http://phaster.ca/phaster_api?acc=query_ID" -O "Output_Status.txt"
Download Results: wget phaster.ca/submissions/query_ID.zip -O "Output_Results.zip"
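The same three requests can be issued from Python inside a wrapper script. The sketch below only mirrors the URLs from the wget commands above; the function names are hypothetical, the response is returned as raw text because its format is not described here, and the code has not been tested against the live server.

```python
import urllib.request

API_URL = "http://phaster.ca/phaster_api"


def status_url(job_id):
    """URL for checking the status of a submitted job."""
    return f"{API_URL}?acc={job_id}"


def results_url(job_id):
    """URL of the zipped results for a finished job."""
    return f"http://phaster.ca/submissions/{job_id}.zip"


def submit_fasta(path):
    """POST a FASTA file to the server and return the raw response text."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(API_URL, data=handle.read(),
                                         method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example with the placeholder job ID used in the commands above.
print(status_url("query_ID"))
print(results_url("query_ID"))
```

A wrapper would call submit_fasta once per genome, then poll status_url at a polite interval before downloading from results_url.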

Performance on the reference genome:

Run-time: ~5 min for a 5 Mb assembled genome
Phage Region Found: 8 using NC_016845.1 Klebsiella pneumoniae reference genome

Final Pipeline

References

- Seemann, Torsten. “Prokka: Rapid Prokaryotic Genome Annotation.” Bioinformatics 30.14 (2014): 2068–2069.

- Petersen, Thomas Nordahl et al. “SignalP 4.0: Discriminating Signal Peptides from Transmembrane Regions.” Nature Methods 8.10 (2011): 785.

- Käll, Lukas, Anders Krogh, and Erik L. L. Sonnhammer. “A Combined Transmembrane Topology and Signal Peptide Prediction Method.” Journal of Molecular Biology 338.5 (2004): 1027–1036.

- Juncker, Agnieszka S. et al. “Prediction of Lipoprotein Signal Peptides in Gram-Negative Bacteria.” Protein Science : A Publication of the Protein Society 12.8 (2003): 1652–1662.

- Mao, Xizeng et al. “DOOR 2.0: Presenting Operons and Their Functions through Dynamic and Integrated Views.” Nucleic Acids Research 42.Database issue (2014): D654–D659.

- Jones, Philip et al. “InterProScan 5: Genome-Scale Protein Function Classification.” Bioinformatics 30.9 (2014): 1236–1240.

- Jensen, Lars Juhl et al. “eggNOG: Automated Construction and Annotation of Orthologous Groups of Genes.” Nucleic Acids Research 36.Database issue (2008): D250–D254.

- Nawrocki, Eric P., Diana L. Kolbe, and Sean R. Eddy. “Infernal 1.0: Inference of RNA Alignments.” Bioinformatics 25.10 (2009): 1335–1337.

- Edgar, Robert C. “PILER-CR: Fast and Accurate Identification of CRISPR Repeats.” BMC Bioinformatics 8 (2007): 18.

- Jia, Baofeng et al. “CARD 2017: Expansion and Model-Centric Curation of the Comprehensive Antibiotic Resistance Database.” Nucleic Acids Research 45.Database issue (2017): D566–D573.

- Chen, Lihong et al. “VFDB: A Reference Database for Bacterial Virulence Factors.” Nucleic Acids Research 33.Database Issue (2005): D325–D328.

- Arndt, David et al. “PHASTER: A Better, Faster Version of the PHAST Phage Search Tool.” Nucleic Acids Research 44.Web Server issue (2016): W16–W21.

@@ Line 46: / Line 46: @@
 ==Protein-Coding Regions==
 ===Signaling Peptides===
+SignalP is a neural-network based method to predict signal peptides cleavage sites in amino acid sequences submitted in FASTA format.
+Command:
+./signalp -t gram- [-f  short/long/all/summary] input_file.faa > output_file.out
+[[File:SignalP.jpg]]
 ===Transmembrane Regions and Lipoproteins===
 LipoP is used for both Transmembrane regions and lipoproteins. It is based on Hidden Markov Model(HMM).