Team I Functional Annotation Group
Latest revision as of 22:38, 11 April 2018
Introduction
Background
Functional annotation is the process of locating genes in the genome and identifying their functions (biochemical functions, regulatory functions, etc.). We used two types of tools for this project: homology-based and ab initio. Homology-based tools search databases of genetic features with known function, so their accuracy depends on the quality of the database. Ab initio tools look for intrinsic characteristics of particular gene feature types, which allows them to predict the presence of particular proteins.
Objective
- Fully annotate 258 genomes from the Gene Prediction group, focusing on antibiotic resistance
- Provide the Comparative Genomics group with the data required to perform a Genome-Wide Association Study (GWAS)
Approach
We added a clustering step to reduce demand on our limited computational resources, shorten processing time, and make the traditional NGS analysis pipeline more scalable. We used UCLUST, which clusters predicted gene sequences according to a similarity threshold (we chose 0.99). By clustering highly similar predicted gene sequences from all 258 genomes, we were able to annotate one representative sequence (the centroid) from each cluster and reliably assign that annotation to the other members of the cluster. In total, there were about 1.5 million predicted gene sequences, each of which needed an accurate functional annotation. Our tools ultimately annotated roughly 50 thousand representative sequences, reducing the number of predicted gene sequences that had to be analyzed by a factor of 30 and allowing us to annotate all 258 genomes in a reasonable amount of time.
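The centroid-to-member assignment described above can be sketched as follows. This is a minimal illustration, not the script the group actually ran: it assumes the clustering result is available in UCLUST's tab-separated `.uc` format (record type in the first column, query label in the ninth, centroid label in the tenth), and the sequence labels, the annotation text, and the function names are hypothetical.

```python
import csv


def read_uc_clusters(uc_lines):
    """Map every clustered sequence label to the label of its centroid.

    Assumes UCLUST .uc records: 'S' marks a seed (centroid) and
    'H' marks a hit whose tenth column names the centroid it joined.
    """
    member_to_centroid = {}
    for row in csv.reader(uc_lines, delimiter="\t"):
        if len(row) < 10 or row[0].startswith("#"):
            continue  # skip comments and malformed lines
        record_type, query, target = row[0], row[8], row[9]
        if record_type == "S":
            member_to_centroid[query] = query   # a centroid represents itself
        elif record_type == "H":
            member_to_centroid[query] = target  # member inherits from centroid
        # 'C' (cluster summary) records carry no new membership information
    return member_to_centroid


def propagate_annotations(member_to_centroid, centroid_annotations):
    """Give every member the annotation of its centroid (None if unannotated)."""
    return {
        member: centroid_annotations.get(centroid)
        for member, centroid in member_to_centroid.items()
    }


# Hypothetical example: two genes cluster together, one is a singleton.
EXAMPLE_UC = [
    "S\t0\t900\t*\t*\t*\t*\t*\tgenomeA_gene1\t*",
    "H\t0\t900\t99.6\t+\t0\t0\t900M\tgenomeB_gene7\tgenomeA_gene1",
    "S\t1\t450\t*\t*\t*\t*\t*\tgenomeA_gene2\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tgenomeA_gene1\t*",
]
EXAMPLE_ANNOTATIONS = {"genomeA_gene1": "beta-lactamase"}

if __name__ == "__main__":
    clusters = read_uc_clusters(EXAMPLE_UC)
    for gene, annotation in propagate_annotations(clusters, EXAMPLE_ANNOTATIONS).items():
        print(gene, annotation)
```

Only the centroids need to pass through the annotation tools; every other sequence receives its annotation through a dictionary lookup, which is where the roughly 30-fold saving comes from.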
Pipeline
Figure 1. The overall pipeline, starting with gene FASTA files, protein FASTA files, and GFFs.
Annotation Tools
The tools used for functional annotation can be broadly split into two groups: homology-based tools and ab initio tools.
Homology
Prokka
Prokka is a command-line tool that uses Prodigal (gene prediction), RNAmmer (rRNA), Aragorn (tRNA), and Infernal (non-coding RNA). Additional databases such as CARD and VFDB can also be added to it. After the genes are predicted with Prodigal, they are searched against multiple databases with BLAST in a hierarchical manner, starting with the smallest, most curated database, moving to larger domain-specific databases, and finally to larger protein-family databases. We did not use Prokka in our final pipeline because it was built to predict and annotate genes from an assembled genome, and the nature of this project meant that we could not accept assemblies as input.
Command:
prokka --outdir <output_directory> --kingdom <species' kingdom> --genus <species' genus> --gram <> --prefix <output_file> --rfam --rnammer <input_file>
- Runtime: ~ 16mins /genome
eggNOG
eggNOG performs functional annotation of genes and proteins using orthology assignments from the pre-computed clusters and phylogenies in the eggNOG database, which contains orthologous groups of proteins at different taxonomic levels. It can be run in two modes, HMM and DIAMOND. HMM mode searches a collection of precompiled hidden Markov models, each associated with an orthologous group. DIAMOND mode searches for the best seed ortholog in the eggNOG protein database. DIAMOND is faster than HMM and is therefore recommended for larger datasets, but HMM is slightly more sensitive, especially if the organism is not well represented in the database.
Command:
python emapper.py -i <input_file> --output <output_file> -m [diamond,hmmer] --usemem -d <database_name>
- --usemem: load the database into memory (RAM)
PilerCR
PilerCR identifies and analyzes CRISPR repeats.
Command:
pilercr -in <input_file> -out <output_file>
- Runtime: <5 sec/genome
DeepARG
DeepARG is a machine learning tool that uses deep learning to characterize and annotate antibiotic resistance genes in metagenomes. It contains two models for different inputs: short sequence reads from next-generation sequencing and gene-like sequences. Its two most useful outputs are an ARG file and a potential ARG file.
Command:
python ../deeparg-ss/deepARG.py --align --type nucl --genes --input <nucleotide.fasta> --out <output file>
python ../deeparg-ss/deepARG.py --align --type nucl --prot --input <protein.fasta> --out <output file>
- Runtime: 3min27s /genome; 44mins for our clustered nucleotide sequences file, whose size is equivalent to ~12 genomes.
Interproscan
InterProScan runs the scanning algorithms from the InterPro database, which integrates predictive models, known as signatures, provided by its member databases. The InterPro consortium is made up of 14 member databases.
Databases used in final pipeline:
- PfamA: Contains many common protein domains.
- CDD: Conserved and ancient domains.
- HAMAP: Conserved protein families and subfamilies that have been manually curated.
- PROSITE: Consists of biologically significant sites, patterns, and profiles.
- SFLD: Hierarchical classification of enzymes that relates sequence-structure data to chemical capabilities.
- SMART: Identification and annotation of genetically mobile domains.
- SUPERFAMILY: Contains structural and functional annotation data for proteins and genomes.
- TIGRFAM: Protein families based on HMMs and annotation.
Command:
interproscan.sh -appl <application_you_want> -iprlookup -pa -i <input_file> -f <output_format>
- -iprlookup: include lookup of corresponding InterPro annotation in the TSV and GFF3 format
- -pa: lookup of corresponding pathway annotation
- Runtime: depends on applications you choose
DOOR2
The Database of PrOkaryotic OpeRons 2.0 (DOOR2) is the largest operon database as of April 2018. Operons are genetic units that are transcribed together and play a role in the regulation of protein synthesis based on the needs of the organism.
Workflow:
- Download operon tables from DOOR2.
- Download FASTA files based on GID.
- Make a BLAST database and perform blastp.
- Filter the results and requery the operon table (percent identity and query coverage were filtered to be 90% or greater).
Command:
makeblastdb -in <fasta file> -dbtype <prot or nucl>
blastp -db <database> -query <input fasta> -out <output file> -max_target_seqs <int> -max_hsps <int> -num_threads <int> -outfmt "<int followed by information you would like presented from search>"
- Runtime: <5sec/genome, with multicore
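The filtering step in the workflow above can be sketched in a few lines. This is an illustration under stated assumptions rather than the group's own script: it assumes blastp was run with tabular output and the column list `-outfmt "6 qseqid sseqid pident qcovs"` (the page does not say which columns were requested), and the function name and example identifiers are hypothetical.

```python
import csv

# Assumed column order, matching -outfmt "6 qseqid sseqid pident qcovs".
COLUMNS = ["qseqid", "sseqid", "pident", "qcovs"]


def filter_hits(blast_lines, min_identity=90.0, min_coverage=90.0):
    """Keep (query, subject) pairs whose percent identity and query
    coverage are both at or above the thresholds."""
    kept = []
    for row in csv.reader(blast_lines, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # skip blank or truncated lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["pident"]) >= min_identity and float(hit["qcovs"]) >= min_coverage:
            kept.append((hit["qseqid"], hit["sseqid"]))
    return kept


# Hypothetical tabular BLAST output: only the first hit passes both filters.
EXAMPLE_HITS = [
    "gene_0001\toperon_protein_A\t98.5\t100",
    "gene_0002\toperon_protein_B\t91.0\t62",
    "gene_0003\toperon_protein_C\t75.3\t99",
]

if __name__ == "__main__":
    for query, subject in filter_hits(EXAMPLE_HITS):
        print(query, subject)
```

The subject identifiers that survive the filter are the ones that would then be looked up again in the operon table.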
CARD
The Comprehensive Antibiotic Resistance Database (CARD) contains information on resistance genes, their products, and their associated phenotypes.
Workflow:
- Download the whole database or the part that is needed.
- Make a local BLAST database from the downloaded file.
- BLAST the query against the local database.
Command:
makeblastdb -in <fasta file> -dbtype <prot or nucl>
blastp -query <input fasta> -db <database> -max_hsps <int> -max_target_seqs <int> -outfmt "<int followed by information you would like presented from search>"
VFDB
The Virulence Factors Database (VFDB) is a reference database for bacterial virulence factors.
Workflow:
- Download the whole database or the part that is needed.
- Make a local BLAST database from the downloaded file.
- BLAST the query against the local database.
Command:
makeblastdb -in <fasta file> -dbtype <prot or nucl>
blastn -query <input fasta> -db <database> -dust <yes/no> -max_hsps <int> -max_target_seqs <int> -outfmt "<int followed by information you would like presented from search>"
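The build-then-search workflow shared by CARD and VFDB can be wrapped in a small script. The sketch below only assembles the command lines from the flags shown above (plus `-out`, which BLAST+ also accepts); the file names, the default values for `-max_hsps` and `-max_target_seqs`, the `-outfmt` column list, and the function names are all assumptions for illustration, not the settings used in the project.

```python
import subprocess


def makeblastdb_command(fasta, dbtype, db_name):
    """Build the makeblastdb call that turns a downloaded FASTA file
    into a local BLAST database. dbtype is 'prot' or 'nucl'."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def blast_command(program, query, db_name, out_file,
                  max_hsps=1, max_target_seqs=1,
                  outfmt="6 qseqid sseqid pident qcovs evalue"):
    """Build a blastp/blastn call against the local database.
    The numeric defaults and the outfmt columns are illustrative."""
    return [
        program, "-query", query, "-db", db_name, "-out", out_file,
        "-max_hsps", str(max_hsps),
        "-max_target_seqs", str(max_target_seqs),
        "-outfmt", outfmt,
    ]


def run(command):
    """Run one command and raise if the tool exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them
    # so the sketch works on a machine without BLAST+ installed.
    steps = [
        makeblastdb_command("vfdb_download.fasta", "nucl", "vfdb_local"),
        blast_command("blastn", "centroids.fasta", "vfdb_local", "vfdb_hits.tsv"),
    ]
    for step in steps:
        print(" ".join(step))
```

Passing each command as a list rather than a single string avoids shell quoting problems with the multi-word `-outfmt` value.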
Ab Initio
Phobius
Phobius predicts transmembrane topology and signal peptides from amino acid sequences. This is a challenging problem because the hydrophobic region of a transmembrane helix is highly similar to that of a signal peptide, which leads to cross-reaction between the two types of predictions.
Phobius is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states, which allows it to have a higher accuracy rate.
Command:
phobius.pl -<output_format> <input_file> > <output_file>
- Runtime: 12-16mins /genome
LipoP
LipoP predicts lipoprotein signal peptides in Gram-negative bacteria. Its hidden Markov model (HMM) can distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.
Command:
LipoP -<output_format> -<input_file> <output_file>
- Runtime: ~2mins /genome
TMHMM
TMHMM uses a hidden Markov model to predict transmembrane helices in proteins. It can discriminate between soluble and membrane proteins, but its accuracy drops for proteins with signal peptides because the hydrophobic region of a signal peptide can be mistaken for a transmembrane helix.
Command:
tmhmm -<output_format> -<input_file> <output_file>
- Runtime: ~6mins /genome
SignalP
SignalP uses a combination of several trained neural networks to predict the presence and location of signal peptide cleavage sites. It allows the user to specify the type of organism: eukaryote, Gram-positive prokaryote, or Gram-negative prokaryote.
Command:
signalp -t <organism_type> -f <output_format> <input_file>
- Runtime: ~ 4mins /genome
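Results
Annotations
Figure 2. Distribution of the number of eggNOG annotations per sample. The average is about 5,300. Two samples have a higher number of annotations (about 7,000).
Figure 3. Distribution of the number of Prodigal annotations per sample. The average is about 5,400. Two samples have a higher number of annotations (about 6,300).
Figure 4. Distribution of the number of GeneMark annotations per sample. The average is about 430. Five samples have a higher number of annotations (about 750), and two samples have an even higher number (about 1,000).
Figure 5. Distribution of the number of SignalP annotations per sample. The average is about 5,800. Two samples have a higher number of annotations (about 8,000).
Figure 6. Distribution of the number of Phobius annotations per sample. The average is about 5,800. Two samples have a higher number of annotations (about 8,000).
Figure 7. Distribution of the number of LipoP annotations per sample. The average is about 5,800. Two samples have a higher number of annotations (about 8,000).
Figure 8. Distribution of the number of TMHMM annotations per sample. The average is about 5,800. Two samples have a higher number of annotations (about 8,000).
Figure 9. Distribution of the number of CARD annotations per sample. The average is about 30.
Figure 10. Distribution of the number of DeepARG annotations per sample. The average is about 150.
Figure 11. Distribution of the number of VFDB annotations per sample. The average is about 50. One sample has a higher number of annotations (about 130).
Figure 12. Distribution of the number of DOOR2 annotations per sample. The average is about 5,200. Two samples have a higher number of annotations (about 6,200).
Overall Statistics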
Reference
- Käll, L., Krogh, A., & Sonnhammer, E. L. L. (2004). A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology, 338(5), 1027-1036.
- Bendtsen, J. D., Nielsen, H., von Heijne, G., & Brunak, S. (2004). Improved prediction of signal peptides: SignalP 3.0. Journal of molecular biology, 340(4), 783-795.
- Petersen, T. N., Brunak, S., von Heijne, G., & Nielsen, H. (2011). SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods, 8(10), 785.
- Nielsen, H. (2017). Predicting secretory proteins with SignalP. Protein Function Prediction: Methods and Protocols, 59-73.
- Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., … McArthur, A. G. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(Database issue), D566–D573. http://doi.org/10.1093/nar/gkw1004
- Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., … Hunter, S. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236–1240. http://doi.org/10.1093/bioinformatics/btu031
- DOOR2 database: http://csbl.bmb.uga.edu/DOOR/index.php
- Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S., Vikesland, P., & Zhang, L. (2018). DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome, 6(1), 23.
- Conesa, A., & Götz, S. (2008). Blast2GO: A comprehensive suite for functional analysis in plant genomics. International journal of plant genomics, 2008.
- Juncker, A. S., Willenbrock, H., Von Heijne, G., Brunak, S., Nielsen, H., & Krogh, A. (2003). Prediction of lipoprotein signal peptides in Gram‐negative bacteria. Protein Science, 12(8), 1652-1662.
- The UniProt Consortium. (2015). UniProt: a hub for protein information. Nucleic Acids Research, 43(Database issue), D204–D212. http://doi.org/10.1093/nar/gku989
- Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3), 567-580.
- Nguyen, M., Brettin, T., Long, S. W., Musser, J. M., Olsen, R. J., Olson, R., … Davis, J. J. (2018). Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. Scientific Reports, 8, 421. http://doi.org/10.1038/s41598-017-18972-w
- Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068–2069. https://doi.org/10.1093/bioinformatics/btu153