Team II Gene Prediction Group

Introduction

Gene Prediction

In computational biology, gene prediction refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, and may also include the prediction of other functional elements such as regulatory regions. Gene prediction is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

Before we move too far into gene prediction, we must understand the biological context. Genes are fragments of DNA that encode a functional molecule, usually a protein. To go from a nucleotide sequence to a functional protein, the sequence must be transcribed and then translated. Protein-coding genes begin with a "start" codon (a specific sequence of three nucleotides), which marks the beginning of the DNA sequence that encodes the protein. Prokaryotic genomes have a high gene density and do not contain introns in their protein-coding regions, so the prediction of prokaryotic genes tends to be relatively simple.

Figure 1: Prokaryotic Gene Structure
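The gene structure described above can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under stated assumptions, not how Prodigal, GeneMarkS, or Glimmer work: it scans the forward strand only, treats ATG, GTG, and TTG as start codons and TAA, TAG, and TGA as stop codons, and uses a hypothetical function name (find_orfs) and length cutoff (min_codons) chosen for illustration.

```python
# Assumed codon sets for illustration (common bacterial starts, standard stops).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Only the first start codon before each stop is reported per frame.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open a candidate gene at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the candidate and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAAGG"  # made-up sequence, not from the dataset
    for begin, end in find_orfs(toy):
        print(begin, end, toy[begin:end])
```

A real predictor would also scan the reverse complement and score candidates with sequence statistics; the scan above only shows why the absence of introns makes the prokaryotic case simpler, since each gene is one uninterrupted run of codons.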

Data

We are given 258 assembled genomes of Klebsiella spp.

Approach

Computational gene prediction, that is, finding the location of protein-coding regions, is one of the essential problems in bioinformatics. There are two basic problems in gene prediction: (1) prediction of protein-coding regions and (2) prediction of the functional sites of genes. Two classes of methods are generally adopted: ab-initio prediction and comparative (homology-based) searches, both of which are described in the following sections.

Ab-initio Prediction

Prodigal

GeneMarkS

Glimmer

Comparative

Comparative (also called similarity-based or homology-based) gene prediction uses previously sequenced genes and their protein products as templates for recognizing unknown genes in newly sequenced DNA fragments. In short, it uses "known genes" to predict "new genes".

Recently, the number of sequenced genomes has increased drastically. Reported figures indicate that 99% of genes have a homologous partner and 80% have an orthologous partner, and that protein-coding DNA is more conserved than intronic DNA (85% versus 69% identity). These observations motivate this method of gene prediction.

Figure 2: Given a known gene and an unannotated genome sequence, find a set of substrings in the genomic sequence whose concatenation best matches the known gene

Sequence alignment is a way of arranging sequences to identify regions of similarity that may result from functional, structural, or evolutionary relationships between them. Two similarity-search methods are local alignment and global alignment.

Local alignment tries to match the query with a substring of the reference; the Smith–Waterman algorithm performs local alignment. Global alignment, in contrast, forces the alignment to span the entire length of all query sequences. It is most useful when the sequences are similar and of roughly equal size; otherwise, the alignment may end up with many gaps. The Needleman–Wunsch algorithm, which is based on dynamic programming, performs global alignment.

Figure 3: (Left) Local alignment with 2 mismatches and 0 gaps. (Right) Global alignment with 1 mismatch and 2 gaps, of length 4 and 2.
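The difference between the two alignment modes can be made concrete with a small dynamic-programming sketch. It computes scores only (no traceback), and the scoring values (match = 1, mismatch = -1, linear gap penalty = -1) and function names are assumptions chosen for illustration; production aligners use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: all of a must be aligned with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: leading gaps in a
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: leading gaps in b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    query, reference = "AAAACGT", "ACGTCCC"  # made-up sequences
    print("global:", global_score(query, reference))
    print("local: ", local_score(query, reference))
```

The only structural differences are the initialization (gap penalties versus zeros) and the clamp at zero, which is why global alignment accumulates gap penalties when the sequences share only a short common region.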

Several comparative gene prediction tools exist, such as SGP2, TwinScan, and GenomeScan, but they were developed for only a limited number of species. For example, TwinScan is currently available for mammals, Caenorhabditis (worm), dicot plants, and Cryptococci. Therefore, we cannot use them for our dataset.

BLAST

BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences and identify library sequences that resemble the query sequence above a certain threshold.
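The thresholding step can be sketched as a filter over BLAST+ tabular output (the default 12-column layout produced with -outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The cutoffs (90% identity, E-value 1e-5), the Hit type, and the parse_hits name are assumptions for illustration, not the settings used in this project, and the sample rows are made up.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity (column 3)
    length: int      # alignment length (column 4)
    evalue: float    # column 11
    bitscore: float  # column 12


def parse_hits(lines, min_identity=90.0, max_evalue=1e-5):
    """Keep tabular BLAST hits that pass the (assumed) thresholds."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        f = line.split("\t")
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]),
                  float(f[10]), float(f[11]))
        if hit.identity >= min_identity and hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Made-up example rows in the 12-column tabular layout.
    sample = [
        "geneA\tcontig1\t98.50\t300\t4\t0\t1\t300\t1200\t1499\t2e-150\t520",
        "geneB\tcontig7\t72.10\t150\t40\t2\t1\t150\t88\t237\t3e-04\t61.2",
    ]
    for hit in parse_hits(sample):
        print(hit.query, hit.subject, hit.identity, hit.evalue)
```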

RNA Prediction

Non-coding RNA is RNA that is transcribed from a DNA template but not translated into a protein. The three main classes in bacteria are tRNA/tmRNA, rRNA, and sRNA. Non-coding RNAs in bacterial genomes play a role in protein synthesis and translation (tRNA and rRNA) and in gene regulation (sRNA), and both roles can be related to antibiotic resistance.

Rfam

Version: Input: Command: output:

Aragorn

Version: aragorn1.2.38

Aragorn identifies tRNA and tmRNA genes. The program employs heuristic algorithms to predict tRNA secondary structure, based on homology with recognized tRNA consensus sequences and the ability to form a base-paired cloverleaf.

Input: FASTA

Output: either FASTA or ARAGORN-specific format

Performance on the reference genome:

Run-time: 1 s per genome

Sensitivity: 98.4%

Precision: 70.5%
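For readers unfamiliar with these two metrics, the sketch below shows how sensitivity and precision are conventionally computed when predicted genes are compared with a reference annotation. It assumes the standard definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)) and exact coordinate matching; the source does not state which matching criterion was used, and the function name and example coordinates are hypothetical.

```python
def evaluate(predicted, reference):
    """Return (sensitivity, precision) for two collections of gene calls.

    A prediction counts as a true positive only if it is identical to a
    reference entry (an assumption; looser overlap criteria are also common).
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and in the reference
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # in the reference but missed
    sens = tp / (tp + fn) if reference else 0.0
    prec = tp / (tp + fp) if predicted else 0.0
    return sens, prec


if __name__ == "__main__":
    # Made-up (contig, start, end, strand) tuples for illustration only.
    reference = {("c1", 100, 175, "+"), ("c1", 900, 976, "-")}
    predicted = {("c1", 100, 175, "+"), ("c1", 900, 976, "-"),
                 ("c1", 2000, 2072, "+")}
    sens, prec = evaluate(predicted, reference)
    print(f"Sensitivity: {sens:.1%}  Precision: {prec:.1%}")
```

High sensitivity with lower precision, as reported above, corresponds to a tool that recovers nearly all reference genes but also makes additional calls that are absent from the reference.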

tRNAscan-SE 2.0

Version: tRNAscan-SE-2.0 and infernal-1.1.2

tRNAscan-SE is written in Perl (version 5.0). It searches for transfer RNAs in genomic sequence files using three separate methods to achieve a combination of speed, sensitivity, and selectivity not available with any one of the programs individually.

Input: FASTA format

Output: standard tabular, ACeDB-compatible, or an extended format

Performance on the reference genome:

Run-time: 1 min 1 s per genome

Sensitivity: 98.4%

Precision: 68.1%