Team I Functional Annotation Group

From Compgenomics 2018
Rjplace (talk | contribs)
Revision as of 10:07, 10 April 2018

Introduction

Background

Functional annotation is the process of locating genes and identifying their functions (biochemical functions, regulatory functions, etc.) in the genome.

Objective

  • Fully annotate the 258 genomes from the Gene Prediction group, focusing on antibiotic resistance
  • Provide the Comparative Genomics group with the data required to perform a genome-wide association study (GWAS)

Pipeline

Tools

Homology

Prokka

Command:

prokka --outdir <output_directory> --kingdom <species' kingdom> --genus <species' genus> --gram <neg or pos> --prefix <output_file> --rfam --rnammer <input_file>
  • Runtime: ~16 min/genome
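
With 258 genomes to annotate, the Prokka command is likely to be run once per genome. The following is a minimal Python sketch of how the argument lists could be generated, assuming one FASTA file per genome in a single directory. The function names, the directory layout, and the `*.fasta` pattern are illustrative assumptions and not part of the group's pipeline; only the Prokka flags are taken from the command above.

```python
from pathlib import Path


def prokka_command(fasta, outdir, kingdom, genus, gram):
    """Build the Prokka argument list for one genome (flags as in the command above)."""
    fasta = Path(fasta)
    prefix = fasta.stem  # assumption: reuse the input file name as the output prefix
    return [
        "prokka",
        "--outdir", str(Path(outdir) / prefix),  # one output directory per genome
        "--kingdom", kingdom,
        "--genus", genus,
        "--gram", gram,
        "--prefix", prefix,
        "--rfam", "--rnammer",
        str(fasta),
    ]


def batch_commands(fasta_dir, outdir, kingdom, genus, gram, pattern="*.fasta"):
    """Return one Prokka argument list per FASTA file found in fasta_dir, sorted by name."""
    return [
        prokka_command(fasta, outdir, kingdom, genus, gram)
        for fasta in sorted(Path(fasta_dir).glob(pattern))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the runs one after another.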


eggNOG

eggNOG-mapper performs functional annotation of genes and proteins using orthology assignments from the pre-computed clusters and phylogenies in the eggNOG database. The database contains orthologous groups of proteins at different taxonomic levels.

Command:

python emapper.py -i <input_file> --output <output_file> -m <diamond or hmmer> --usemem -d <database_name>
  • --usemem: loads the database into RAM

PilerCR

PilerCR identifies and analyzes CRISPR repeats.

Command:

pilercr -in <input_file> -out <output_file>
  • Runtime: <5 sec/genome

DeepARG

DeepARG uses deep learning to characterize and annotate antibiotic resistance genes in metagenomes. It contains two models for different inputs: short sequence reads from next-generation sequencing, and gene-like sequences.

Command:

python ../deeparg-ss/deepARG.py --align --type nucl --genes --input <nucleotide fasta> --out <output file>
  • Runtime: 3 min 27 s/genome

InterProScan

InterProScan scans sequences against the InterPro database, which integrates predictive models, known as signatures, provided by its member databases.

Command:

interproscan.sh -appl <application_you_want> -iprlookup -pa -i <input_file> -f <output_format>
  • -iprlookup: include the corresponding InterPro annotation in the TSV and GFF3 output formats
  • -pa: include the corresponding pathway annotation
  • Runtime: ~1 min/genome, depending on the applications chosen
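
When the TSV output format is chosen, the InterPro annotations added by -iprlookup can be collected per protein. The sketch below assumes the standard InterProScan 5 TSV column order, in which column 1 is the protein accession and columns 12 and 13 are the InterPro accession and description; the function name is an illustrative assumption.

```python
import csv
from collections import defaultdict


def interpro_by_protein(tsv_handle):
    """Map protein ID -> set of (InterPro accession, description) from an InterProScan TSV."""
    hits = defaultdict(set)
    # QUOTE_NONE keeps quotation marks inside descriptions from confusing the reader
    for row in csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Columns 12-13 (1-based) exist only when -iprlookup found an entry;
        # depending on the version they are either absent or filled with "-"
        if len(row) < 13 or row[11] in ("", "-"):
            continue
        hits[row[0]].add((row[11], row[12]))
    return dict(hits)
```

For example, `interpro_by_protein(open("genome1.tsv"))` would return a dictionary keyed by protein ID, where `genome1.tsv` is a hypothetical output file.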

DOOR2

DOOR2 (Database of PrOkaryotic OpeRons) is the largest operon database as of April 2018. An operon is a cluster of adjacent genes that are transcribed together as a single unit, which lets the organism regulate synthesis of the encoded proteins according to its needs.

Workflow:

  • Download the operon tables from DOOR2.
  • Download the FASTA files based on GID.
  • Make a BLAST database and run blastp.
  • Filter the results (percent identity and query coverage of 80% or greater) and requery the operon table.

Commands:

makeblastdb -in <fasta file> -dbtype <prot or nucl>
blastp -db <database> -query <input fasta> -out <output file> -max_target_seqs <int> -max_hsps <int> -num_threads <int> -outfmt "<int followed by information you would like presented from search>"
  • Runtime: <5 sec/genome when run on multiple cores
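
The filtering and requery steps of the workflow can be sketched in Python as follows. This is an illustration under stated assumptions, not the group's actual script: it assumes blastp was run with -outfmt "6 qseqid sseqid pident qcovs", so that each tab-separated line holds the query ID, subject ID, percent identity, and query coverage, and it assumes the operon table has already been loaded into a dictionary (here the hypothetical `gene_to_operon`) mapping DOOR2 gene IDs to operon IDs.

```python
def filter_hits(lines, min_identity=80.0, min_coverage=80.0):
    """Keep BLAST hits whose percent identity and query coverage both meet the thresholds."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, qcovs = line.split("\t")[:4]
        if float(pident) >= min_identity and float(qcovs) >= min_coverage:
            kept.append((qseqid, sseqid, float(pident), float(qcovs)))
    return kept


def assign_operons(hits, gene_to_operon):
    """Requery the operon table: map each query to the operon of its first retained hit."""
    operons = {}
    for qseqid, sseqid, _pident, _qcovs in hits:
        if sseqid in gene_to_operon:
            # BLAST lists the best hit first, so keep the first assignment per query
            operons.setdefault(qseqid, gene_to_operon[sseqid])
    return operons
```

Keeping only the first retained hit per query is a design choice of this sketch; a stricter pipeline might instead require that all retained hits agree on the operon.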


Ab Initio

Phobius

Phobius predicts transmembrane topology and signal peptides from amino acid sequences. This is a challenging problem because the hydrophobic region of a transmembrane helix closely resembles that of a signal peptide, which leads to cross-reaction between the two types of prediction.

Phobius is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states, which improves accuracy over predicting the two features separately.

Command:

phobius.pl -<output_format> <input_file> > <output_file>
  • Runtime: 12-16 min/genome

LipoP

LipoP predicts lipoprotein signal peptides in Gram-negative bacteria. Its hidden Markov model (HMM) distinguishes between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.

Command:

LipoP -<output_format> <input_file> > <output_file>
  • Runtime: ~2 min/genome

TMHMM

TMHMM uses a hidden Markov model to predict transmembrane helices in proteins.

Command:

tmhmm -<output_format> <input_file> > <output_file>
  • Runtime: ~6 min/genome
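
TMHMM's one-line-per-protein output can be turned into a table of predicted helix counts. The sketch below assumes the short output format (the `-short` option), in which each line holds the protein ID followed by tab-separated `key=value` fields such as `len`, `PredHel`, and `Topology`; the function name and the returned dictionary layout are illustrative assumptions.

```python
def parse_tmhmm_short(lines):
    """Parse TMHMM short-format lines into {protein_id: summary dict}."""
    records = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Every field after the ID looks like "key=value"
        values = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        records[fields[0]] = {
            "length": int(values["len"]),
            "pred_helices": int(values["PredHel"]),
            "topology": values["Topology"],
        }
    return records
```

Proteins with `pred_helices` greater than zero are the candidates for membrane proteins.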

SignalP

SignalP uses a combination of several trained neural networks to predict the presence and location of signal peptide cleavage sites. SignalP allows the user to specify the type of organism: eukaryotes, Gram-positive prokaryotes, or Gram-negative prokaryotes.

Command:

signalp -t <organism_type> -f <output_format> <input_file>
  • Runtime: ~4 min/genome
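
The SignalP predictions can be summarized per protein with a small parser. The sketch below assumes the SignalP 4.x short output format, where comment lines start with `#` and each data line has the whitespace-separated columns name, Cmax, pos, Ymax, pos, Smax, pos, Smean, D, ?, Dmaxcut, Networks-used. The page does not state which SignalP version was used, so the column positions, like the function name, are assumptions.

```python
def parse_signalp_short(lines):
    """Parse SignalP short-format lines into {protein_id: summary dict}."""
    results = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        results[fields[0]] = {
            "has_signal_peptide": fields[9] == "Y",  # the "?" column: Y or N
            "d_score": float(fields[8]),             # discrimination score D
            "ymax_pos": int(fields[4]),              # position of the Y-score maximum
        }
    return results
```

Combining the `has_signal_peptide` flag with the other tools' predictions is one way to cross-check secreted and membrane protein calls.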

Result

Reference

  • Lukas Käll, Anders Krogh, Erik L. L. Sonnhammer. "A Combined Transmembrane Topology and Signal Peptide Prediction Method". Journal of Molecular Biology, 14 May 2004, pages 1027-1036.