Team II Gene Prediction Group: Difference between revisions
Revision as of 18:51, 26 March 2018
Introduction
Gene Prediction
In computational biology, gene prediction refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene prediction is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
Before we move too far into gene prediction, we must understand the biological context. Genes are fragments of DNA that encode a functional molecule, usually a protein. In order to go from a nucleotide sequence to a functional protein, the sequence must be transcribed and then translated. Genes always begin with a "start" codon (a specific sequence of three nucleotides), which serves to denote the beginning of a DNA sequence that encodes a protein. Prokaryotic genomes have a high gene density and do not contain introns in their protein-coding regions, meaning that the prediction of prokaryotic genes tends to be relatively simple.
Data
We are given 258 assembled genomes of Klebsiella spp.
Approach
Finding the location of protein-coding regions by computational methods is one of the essential problems in bioinformatics. There are two basic problems in gene prediction: (1) prediction of protein-coding regions and (2) prediction of the functional sites of genes. Two classes of methods are generally adopted: ab initio prediction and comparative (similarity-based) searches. Both are explained in the following sections.
Ab-initio Prediction
Ab initio prediction is the challenging attempt to predict genes based only on sequence information, without using templates such as previously known genes. It relies on two major features: 1. Gene signals (start and stop codons, intron splice signals, codon structure, etc.) 2. Statistical description of coding regions.
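To make the idea of gene signals concrete, the sketch below scans the forward strand for open reading frames delimited by start and stop codons. It is only an illustration, not how Prodigal, GeneMarkS or Glimmer work internally: the function name find_orfs, the min_len parameter and the choice of ATG/GTG/TTG as start codons are assumptions made for this example, and real gene finders combine such signals with statistical models and also scan the reverse strand.

```python
# Minimal ORF scan on the forward strand (illustrative sketch only).
# Assumed start codons: ATG, GTG, TTG; stop codons: TAA, TAG, TGA.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based and half-open; the stop codon is included.
    Only the first start codon before each stop is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # close the candidate and keep scanning
    return orfs
```

For example, find_orfs("CCATGAAATTTGGGTAACC", min_len=9) reports a single ORF starting at the ATG in frame 2.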
Hidden Markov Model (HMM)
Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states.
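As a rough illustration of how hidden states can be recovered from an observed sequence, the sketch below implements the standard Viterbi algorithm for a toy two-state model. The states "N" (non-coding) and "C" (coding) and all probabilities are invented for this example and are not taken from any of the tools described here.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a non-empty obs.

    Works in log space to avoid numerical underflow on long sequences.
    """
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        cur, ptr = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            cur[s] = (scores[-1][best_prev]
                      + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][o]))
            ptr[s] = best_prev
        scores.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy model (hypothetical numbers): "C" favours G/C, "N" favours A/T.
STATES = ("N", "C")
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
```

With this toy model, "".join(viterbi("AAAAGGGG", STATES, START_P, TRANS_P, EMIT_P)) gives "NNNNCCCC", while a single trailing G in "AAAAG" is not enough to leave the N state.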
Prodigal
Prodigal is extremely fast and lightweight. It is also highly specific, with a false positive rate below 5%. A distinct advantage of Prodigal over other gene finders is that it performs well on genomes with high GC content.
The results from Prodigal could be biased, because it was developed using results from GenBank annotation and a small set of initial genomes. Its recognition of short and atypical genes also needs improvement.
GeneMarkS
GeneMarkS is a gene prediction method that uses a non-supervised training procedure, so it can be applied to a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions, and the Gibbs sampling multiple alignment program. Because GeneMarkS is self-trained, sequences longer than 50 kb should be provided. For shorter sequences, the GeneMark heuristic program can be used with some loss of accuracy. GeneMarkS can also generate multiple output formats, including gff, fna and faa. However, in this case GeneMarkS is very slow.
Glimmer
Glimmer (Gene Locator and Interpolated Markov ModelER) was developed at The Institute for Genomic Research (TIGR). It is a UNIX program that uses the interpolated (variable-order) Markov model (IMM) algorithm to predict potential coding regions. It works in two steps: 1. Model building 2. Computation
Comparative
Comparative (similarity-based or homology-based) gene prediction uses previously sequenced genes and their protein products as templates for recognizing unknown genes in newly sequenced DNA fragments. In short, it uses "known genes" to predict "new genes".
Recently, the number of sequenced genomes has increased drastically. In addition, 99% of genes have a homologous partner, 80% have an orthologous partner, and protein-coding DNA shows 85% identity versus 69% identity for intronic DNA. All of these can be considered motivation for using this method of gene prediction.
The problem can be stated as follows: given a known gene and an unannotated genome sequence, find a set of substrings in the genomic sequence whose concatenation best matches the known gene.
Sequence alignment is a way of arranging sequences to identify regions of similarity that may result from functional, structural or evolutionary relationships between the genomes. Two methods based on similarity search are local alignment and global alignment.
Local alignment tries to match the query with a substring of the reference. The Smith–Waterman algorithm performs local alignment. Global alignment, in contrast, forces the alignment to span the entire length of the sequences. It is most useful when the sequences are similar and of roughly equal size; otherwise, it may end up with many gaps. The Needleman–Wunsch algorithm, which is based on dynamic programming, performs global alignment.
On the left, we have a local alignment with 2 mismatches and 0 gaps. On the right, we have a global alignment with 1 mismatch and 2 gaps of lengths 4 and 2.
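The difference between the two alignment modes can be sketched with a pair of small dynamic programming routines that return only the optimal score. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for illustration, and the function names are hypothetical. Production tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are floored at 0 so an alignment may start anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

For instance, local_score("TTTGATTACATTT", "CCGATTACACC") is 7, because the shared core GATTACA is found regardless of the unrelated flanks, whereas the global score of the same pair is lower because the flanks must be aligned as well.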
There are some tools for gene prediction based on the comparative method, such as "SGP2", "TwinScan" and "GenomeScan". However, they have been developed for only a limited number of species. For example, TwinScan is currently available for mammals, Caenorhabditis (worm), dicot plants, and Cryptococci. Therefore, we cannot use them for our dataset.
Blast
BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.
RNA Prediction
Non-coding RNA is the term used for RNA that is transcribed from a DNA template but not translated into a protein. There are three main classes in bacteria: tRNA/tmRNA, rRNA and sRNA. Non-coding RNAs in bacterial genomes have been found to play a role in protein synthesis/translation (tRNA and rRNA) and gene regulation (sRNA), and both roles can be related to antibiotic resistance.
Rfam
Version: Input: Command: output:
Aragorn
Version: aragorn1.2.38
Aragorn identifies tRNA and tmRNA genes. The program employs heuristic algorithms to predict tRNA secondary structure, based on homology with recognized tRNA consensus sequences and ability to form a base-paired cloverleaf.
Output: either FASTA or ARAGORN specific format
Performance by using reference genome:
Sensitivity: 98.4
Precision: 70.5
tRNAscan-SE 2.0
Version: tRNAscan-SE-2.0 and infernal-1.1.2
tRNAscan-SE is written in the Perl (version 5.0) scripting language. It searches for transfer RNAs in genomic sequence files using three separate methods to achieve a combination of speed, sensitivity, and selectivity not available with each program individually.
Output: standard tabular, ACeDB-compatible or an extended format
Performance by using reference genome:
Sensitivity: 98.4
Precision: 68.1
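The sensitivity and precision figures above follow the usual definitions: sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), where TP, FN and FP are counted against the reference annotation. The sketch below shows one way to compute them. It assumes that a prediction counts as correct only when its coordinates match a reference entry exactly, which may differ from the matching rule actually used for the numbers reported here. The function name and the tuple layout are hypothetical.

```python
def sensitivity_precision(predicted, reference):
    """Return (sensitivity, precision) in percent.

    predicted and reference are iterables of hashable gene records,
    e.g. (contig, start, end, strand). Exact matching is an assumption.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and in the reference
    fn = len(reference - predicted)   # in the reference but missed
    fp = len(predicted - reference)   # predicted but not in the reference
    sensitivity = 100.0 * tp / (tp + fn) if reference else 0.0
    precision = 100.0 * tp / (tp + fp) if predicted else 0.0
    return sensitivity, precision
```

A tool that finds 3 of 4 reference tRNAs and also reports 2 extra predictions would score a sensitivity of 75.0 and a precision of 60.0.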
Merge
We use bedtools intersect to merge the two corresponding files from Prodigal and GeneMark. The entries in the GeneMark prediction file that do not overlap with any entry in the Prodigal prediction file are concatenated to the latter file. We say that two entries do not overlap if they do not satisfy the 80% overlap criterion.
Figure 5: A ∪ B = A + B − (A ∩ B)
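The merge logic can be sketched in plain Python as follows. This is not the bedtools command used by the group. It is a minimal re-implementation of the idea under stated assumptions: entries are (contig, start, end) tuples with half-open coordinates, the overlap fraction is measured relative to the length of the GeneMark entry, and strand is ignored. The function names and the min_overlap parameter are hypothetical.

```python
def overlap_fraction(a, b):
    """Fraction of entry a covered by entry b (0.0 on different contigs).

    Entries are (contig, start, end) with half-open coordinates.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(0, shared) / (a[2] - a[1])


def merge_predictions(prodigal, genemark, min_overlap=0.8):
    """Union of both prediction sets: all Prodigal entries, plus every
    GeneMark entry not covered at >= min_overlap by a Prodigal entry."""
    merged = list(prodigal)
    for g in genemark:
        if not any(overlap_fraction(g, p) >= min_overlap for p in prodigal):
            merged.append(g)  # GeneMark-only prediction, keep it
    return merged
```

With Prodigal entries ("c1", 0, 100) and ("c1", 200, 300), a GeneMark entry ("c1", 5, 100) is dropped as a duplicate, while ("c1", 250, 400) is kept because only a third of it is covered.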