I.J. Information Technology and Computer Science,
2012, 8, 22-36
Published Online July 2012 in MECS ()
DOI: 10.5815/ijitcs.2012.08.03
Copyright © 2012 MECS
Genomic Analysis and Classification of Exon and
Intron Sequences Using DNA Numerical
Mapping Techniques
Mohammed Abo-Zahhad, Sabah M. Ahmed, Shimaa A. Abd-Elrahman
Electrical and Electronics Engineering Department, Faculty of Engineering, Assiut University, Assiut, Egypt
([email protected], [email protected], and [email protected])
Abstract
—
Digital signal processing (DSP) is key to solving many problems in genomics, such as predicting gene locations in a genomic sequence and identifying defective regions in a DNA sequence. DSP can be applied only if the symbol sequences are first mapped into numbers. Many techniques for the numerical representation of DNA sequences have been developed in the literature. They can be classified into two types: Fixed Mapping (FM) and Physico Chemical Property Based Mapping (PCPBM). The open question is which of these numerical representation techniques should be used. Answering it requires understanding these representations, bearing in mind that the suitability of each mapping depends on the particular application. This paper addresses that question and compares the techniques in terms of their precision in exon and intron classification. Simulations are carried out using short sequences of the human genome (GRCh37/hg19). The final results indicate that the classification performance depends on the numerical representation method.
Index Terms
—
Genomic Signal Processing; DNA and Protein Sequences; Numerical Mapping; Codon; Exons and Introns; Short Time Fourier Transform
I. Introduction
Genomic Signal Processing (GSP) is defined as the analysis and use of genomic signals to gain biological knowledge, and the translation of that knowledge into systems-based applications. Genomic information is digital in a very real sense. It is represented in the form of sequences in which each element is one of a finite number of entities. Such sequences, like DNA and proteins, have been represented by character strings, in which each character is a letter of an alphabet. In the case of DNA, the alphabet is of size 4 (for proteins it is 20) and consists of the letters A, T, C and G (e.g. ...ATCGCTGA...). If numerical values are assigned to these characters, the resulting numerical sequences are readily amenable to DSP applications such as gene prediction, which refers to locating the protein-coding regions (exons) of genes in a long DNA sequence [1]. Therefore, it is necessary to map the symbols into numerical sequences. Ideally, the mapping should be such that the period-3 component of the DNA sequence is independent of the nucleotide mapping, which is possible only through symmetric mapping [2]-[3]. Once the mapping is done, signal processing techniques can be used to identify period-3 regions in the DNA sequence. The average
