Team II Functional Annotation Group: Difference between revisions
Revision as of 21:34, 9 April 2018
Introduction
Functional Annotation
Functional annotation is the process of identifying the locations of genes and all coding regions in a genome and determining the function of these genes. The resulting biological information can describe biological function, biochemical function, or gene expression regulation.
Data
We are given 258 assembled genomes and predicted genes of Klebsiella spp.
Approach
Functional Annotation utilizes computational methods to functionally annotate 258 Klebsiella genomes. The approach should be scalable, reduce query size and reduce database size. Two classes of methods are generally adopted: ab-initio and homology-based.
General Tool
Prokka
Prokka[1] is a command-line software tool for rapid annotation of bacterial genomes. It is easy to set up and provides standardized input/output file formats. It is well known for annotating a bacterial genome rapidly and accurately, in about 10 minutes on a typical desktop computer. Prokka annotates coding sequences (Prodigal), rRNA (RNAmmer), tRNA (Aragorn), signal peptides (SignalP), and non-coding RNA (Infernal), and each of these features can be included or excluded as needed. Prokka performs annotation in five steps:
1. An optional search against a user provided set of annotated proteins using BLAST+.
2. Determination of core bacterial proteins by searching UniProt (~16,000 proteins) using BLAST+.
3. An optional BLAST+ search against all proteins from finished bacterial genomes in RefSeq.
4. A search of the Pfam and TIGRFAMs databases using hmmscan from HMMER. Both are collections of protein families based on multiple sequence alignments and hidden Markov models.
5. Non-matches are labeled hypothetical proteins.
Command: prokka --force --species <species name> --centre X --outdir <output directory> --prefix <file prefixes> --locustag <locus tag prefix> --norrna --notrna <contig.fa>
Prokka produces 12 output files: .err, .faa, .fna, .ffn, .fsa, .gbk, .gff, .log, .sqn, .tbl, .tsv, .txt
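As a quick check on a Prokka run, the .tsv feature table can be summarized to see how many coding sequences ended up as hypothetical proteins (step 5 above). The sketch below is a minimal illustration, not part of the group's pipeline. It assumes the .tsv file is tab-delimited with a header row containing at least the columns `ftype` and `product`; the function name `summarize_prokka_tsv` is hypothetical.

```python
import csv
from collections import Counter


def summarize_prokka_tsv(path):
    """Count features by type and tally CDS labeled 'hypothetical protein'.

    Assumes a tab-delimited file with a header row that includes
    'ftype' and 'product' columns (an assumption about the Prokka .tsv layout).
    """
    ftype_counts = Counter()
    hypothetical = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            ftype = row["ftype"]
            ftype_counts[ftype] += 1
            product = (row.get("product") or "").strip().lower()
            if ftype == "CDS" and product == "hypothetical protein":
                hypothetical += 1
    return ftype_counts, hypothetical
```

Calling `summarize_prokka_tsv("<output directory>/<file prefixes>.tsv")` returns the per-type feature counts and the number of hypothetical CDS, which gives a rough measure of how much of the genome remains unannotated.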
Protein-Coding Regions
Signaling Peptides
Transmembrane Regions and Lipoproteins
LipoP is used to predict both transmembrane regions and lipoproteins. It is based on a hidden Markov model (HMM).
These 4 classes are predicted:
SpI: signal peptide (signal peptidase I)
SpII: lipoprotein signal peptide (signal peptidase II)
TMH: N-terminal transmembrane helix. This is generally not a very reliable prediction and should be tested. This part of the model is mainly there to avoid transmembrane helices being falsely predicted as signal peptides.
CYT: cytoplasmic. It really just means all the rest.
command: perl LipoP -short <input FASTA> > <output GFF>
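To see how the predictions distribute across the four classes, the short-format output can be tallied. The following is a rough sketch under a stated assumption: each result line is taken to look like `# <sequence id> <class> score=...`, with the class (SpI, SpII, TMH, or CYT) as the second whitespace-separated field. That layout should be checked against the actual LipoP output before use; the function name `count_lipop_classes` is hypothetical.

```python
from collections import Counter

# The four classes LipoP predicts, as listed above.
LIPOP_CLASSES = {"SpI", "SpII", "TMH", "CYT"}


def count_lipop_classes(lines):
    """Tally predicted classes from LipoP short-format lines.

    Assumes (unverified) that each line reads '# <id> <class> score=...'.
    Lines that do not match this shape are ignored.
    """
    counts = Counter()
    for line in lines:
        fields = line.lstrip("#").split()
        if len(fields) >= 2 and fields[1] in LIPOP_CLASSES:
            counts[fields[1]] += 1
    return counts
```

With the output file open as `handle`, `count_lipop_classes(handle)` returns a `Counter` keyed by class name; the SpII count is the number of predicted lipoproteins.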
Lipoproteins
Operons
Operons are clusters of co-regulated genes that are physically close together in the genome. The genes in an operon are switched on or off together and are co-transcribed as a single mRNA. Accurate prediction of operons can improve the functional annotation of the genes within them.
DOOR2 (Database of prOkaryotic OpeRons)[2] is a comprehensive public operon database and the largest one available (as of Apr. 2018).
The standard workflow we follow is:
- Download operon tables for Klebsiella pneumoniae
- Extract fasta files based on GI numbers
- Makeblastdb and blastp queries
- Filter and match hits back with the operon tables
For the blastp results, we apply an 80% threshold to both query coverage (qcovs) and percent identity (pident).
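The filtering step can be sketched as follows. This is a minimal illustration rather than the group's actual script. It assumes blastp was run with the tabular format `-outfmt "6 qseqid sseqid pident qcovs"`, so each row has those four columns in that order, and it treats the 80% thresholds as inclusive; the function name `filter_hits` is hypothetical.

```python
import csv


def filter_hits(path, min_pident=80.0, min_qcovs=80.0):
    """Keep (query, subject) pairs that meet both thresholds.

    Assumes tab-delimited rows in the order: qseqid, sseqid, pident, qcovs.
    Thresholds are inclusive (an assumption; the text only says 80%).
    """
    kept = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 4:
                continue  # skip blank or malformed rows
            qseqid, sseqid, pident, qcovs = row[:4]
            if float(pident) >= min_pident and float(qcovs) >= min_qcovs:
                kept.append((qseqid, sseqid))
    return kept
```

The returned subject IDs can then be looked up in the DOOR2 operon table to assign each query gene to an operon.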
Pathways
eggNOG-mapper
eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a public database of Orthologous Groups (OGs) of proteins at different taxonomic levels. The recent version (4.5) expands the functional annotation of OGs to KEGG pathways (KO), SMART/Pfam domains, and Gene Ontology (GO) terms from the Gene Ontology Consortium. eggNOG-mapper uses these precomputed OGs to assign orthology-based annotations to query sequences.
It consists of 2031 and 1655 prokaryotic mappings, respectively. In addition to the command-line tool, eggNOG v4.5 is available through both a web interface and RESTful API.
Command line:
python emapper.py -i input_file.faa --output output_file -m [diamond|hmmer] -d bact
Non-coding RNA
rRNA, tRNA, and sRNA
Please refer to "Team II Gene Prediction Group - RNA Prediction" [3]
CRISPR
CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats) are found in approximately 40% of sequenced bacterial genomes. They are reported to be related to bacterial immunity regulation, cell defense mechanisms, DNA rearrangement, replication, and regulation. Klebsiella has an unusually high proportion of self-targeting spacers.
Tool
Version: Piler-CR 1.06
PILER-CR is a program designed specifically for the identification and analysis of CRISPR repeats. The program executes rapidly and has both high sensitivity and high specificity.
Input: FASTA
Performance by using reference genome:
Run-time: ~5 seconds for a 5Mb genome
CRISPRs found: 2
Others
Antibiotic Resistance
Virulence Factors
Prophage Genes
Prophage genes are bacteriophage genetic material integrated into the bacterial genome or an existing plasmid. This occurs when the phage is in its latent phase, in which the viral genes are present in the bacterium without disrupting the bacterial cell. Prophage genes are one of the major sources of new genes and functions in bacterial genomes[4], such as antibiotic resistance[5], virulence factors[6], and biofilm formation[7].
PHASTER
PHASTER (PHAge Search Tool Enhanced Release)[8] is a server for the rapid identification and annotation of prophage sequences in bacterial genomes and plasmids using GLIMMER, BLAST, and DBSCAN[9]. It accepts either an assembled genome (with or without contigs) or a GenBank file and compares the input against its viral and bacterial databases[10].
Input: FASTA or GenBank
Output: PHASTER specific report format
Upload Query to PHASTER Server: wget --post-file="Input_File" "http://phaster.ca/phaster_api" -O "Output_Status.txt"
Check Status: wget "http://phaster.ca/phaster_api?acc=query_ID" -O "Output_Status.txt"
Download Results: wget phaster.ca/submissions/query_ID.zip -O "Output_Results.zip"
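When many genomes are submitted, the status file has to be inspected programmatically before results are downloaded. The sketch below parses the contents of "Output_Status.txt" and is only an illustration. It assumes the server replies with a JSON object containing `job_id`, `status`, and, once finished, a `zip` field, and that a finished job reports the status "Complete"; these field names and values are assumptions to verify against the PHASTER API instructions. The function name `parse_phaster_status` is hypothetical.

```python
import json


def parse_phaster_status(text):
    """Extract job ID, status, and result location from a status reply.

    Assumes the reply is a JSON object with 'job_id', 'status', and an
    optional 'zip' field (assumed field names, not confirmed here).
    """
    record = json.loads(text)
    status = str(record.get("status", ""))
    return {
        "job_id": record.get("job_id"),
        "status": status,
        # Assumed convention: a finished job reports status "Complete".
        "complete": status.strip().lower() == "complete",
        "zip": record.get("zip"),
    }
```

A batch script can call this on each status file and run the download command only for jobs where `complete` is true, re-checking the rest later.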
Performance by using reference genome:
Run-time: ~5 min for a 5Mb assembled genome
Phage Region Found: 8 using NC_016845.1 Klebsiella pneumoniae reference genome