Alignment and Alignment Score
We explain the notion of alignment and give a precise definition for
optimal alignment and alignment score. (Here "alignment score" is
"matching score" used in the definition of the sequence identity in
problemE.html.)
Remark: The program scoreout.c is designed for computing the
alignment score of a given pair of amino acid sequences. But if
there is any difference between the score it computes and the one
defined below, then we regard (only for this contest) the one
computed by scoreout.c as the correct one.
1. Protein (Amino Acid Sequence)
A protein is a sequence of amino acids. Amino acids are expressed by 22
symbols {A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, Y, Z}.
(Though the number of amino acid types is 20, two more symbols have been
used for expressing some special amino acid types.) Thus, each protein
that we will analyze consists of these 22 symbols.
2. Optimal Alignment
When align two amino acid sequences, we can adjust a place of each symbol
of the sequences by
- inserting some number of blanks in front of one sequence, and
- inserting some number of "-" symbols (which is called a "gap")
at some points in the sequences.
Note that insertion is allowed before the last symbol of the other
sequence. For example, the following insertions (a) and (b) are not
allowed.
(a) AABB (b) AABBCCCCDD
CCDD AA**CC****EE
It is possible to insert blanks or gaps at the same place to both
sequences; but such insertions do not make sense when computing alignment
score, and we ignore them.
3. Score for a given alignment
When an alignment is fixed, i.e., all blank and gap insertions are fixed,
the correspondence of symbols in the both sequences is determined. Then
we can compute the score for this alignment in the following way:
score for a given alignment
= symbol-wise score total + gap penalty total
Here a symbol-wise score is defined by a table given in Appendix for each
pair of corresponding symbols.
For example, consider the following alignment.
A A B B C C D D - - E E F
A A - - - - D D K K K E F G G
4+4 +6+6 +1+5+6 = 32
Then 32 is the total of symbol-wise scores. On the other hand, for each
insertion of a gap (i.e., "-" symbol), a penalty is given as follows:
gap penalty =
-10, if the gap appears first (i.e., leftmost) in a block of gaps, and
-1, otherwise.
A gap penalty total is the total of these penalties. For example, the gap
penalty total for the previous alignment -24 from the following.
A A B B C C D D - - E E F
A A - - - - D D K K K E F G G
-10-1-1-1 -10-1 = -24
Therefore, the score for this alignment is 32 - 24 = 8.
4. Optimal alignment and alignment score
An optimal alignment is an alignment giving the highest score, and
alignment score is this highest score. That is,
the alignment score of X and Y
= the score of X and Y under an optimal alignment.
For example, the alignment score of the following X and Y is 36.
X: CBADDKDDBBTPPEP
Y: CGAEEDDKDDBBTEEFGKKK
is score is achieved, for example, by the following alignment.
X: CBA--DDKDDBBTPPEP
Y: CGAEEDDKDDBBTEEFGKKK
5. How to compute alignment score
Computing the alignment score of a given pair of amino acid sequences
is not so difficult computation task. We can define a recurrence
formula for computing alignment score; then from this formula, by
using a standard dynamic programming technique, we can design an
algorithm for computing alignment score.
Let us first consider a simple case; we assume that the gap penalty
is always -1. Consider any amino sequences X and Y. We assume that
they are given in two arrays X[1..m] and Y[1..n]. Below we will use
i (1 <= i <= m) to denote the position (from the left) of the sequence
X, and j (1 <= j <= n) to denote that of the sequence Y. Now consider
the following function.
score0(i, j) = the alignment score of X[i..m] and Y[j..n].
Then score0 can be computed by the following recurrence formula.
score0(i, j) = max( TBL[X[i],Y[j]] + score0(i+1, j+1), -- (i)
-1 + score0(i, j+1), -- (ii)
-1 + score0(i+1, j) ) -- (iii)
Here (i) is for the case when X[i] is corresponded to Y[j], (ii) is
for the case when a gap is inserted at the ith position of X, and (iii)
is for the case when a gap is inserted at the jth position of Y. Note
that TBL[X[i], Y[j]] is the symbol-wise score of X[i] and Y[j] given
in the matrix of Appendix.
The boundary condition is also important. For our case, we need to
consider the cases i = m and j = n. Since both cases are symmetric,
we give an explanation for the case i = m. Recall that we cannot
skip X[n] by inserting a gap in Y at the corresponding position.
Hence, the recurrence formula becomes as follows.
score0(m, j) = max( TBL[X[m],Y[j]], -1 + score0(m, j+1) )
Finally, alignment score (under our simplified gap penalty) is computed
as follows.
(*) score(X, Y) = max( score0(1, 1), score0(1, 2), ..., score0(1, n),
score0(2, 1), ..., score0(m, 1) )
Note that score0(1, k) is the largest score for all alignments
matching X[1] and Y[k].
For computing the real alignment score, we need to distinguish a
leftmost gap (of some sequence of gaps) from the other gaps. For this
purpose, we consider the following three functions:
score0(i, j) = the score of X[i..m] and Y[j..n] with no gap before
X[i] or Y[j],
score1(i, j) = the score of X[i..m] and Y[j..n] under the situation
where gap(s) have been inserted in the seq. X, and
score2(i, j) = the score of X[i..m] and Y[j..n] under the situation
where gap(s) have been inserted in the seq. Y.
By a similar idea, we can define recurrence formulas computing these
functions. For example, score0 is now computed as follows.
score0(i, j) = max( TBL[X[i],Y[j]] + score0(i+1, j+1), -- (0-i)
-10 + score1(i, j+1), -- (0-ii)
-10 + score2(i+1, j) ), and -- (0-iii)
score0(m, j) = max( TBL[X[m],Y[j]], -10 + score1(m, j+1) ).
On the other hand, score1 (similarly score2) is computed as follows.
score1(i, j) = max( TBL[X[i],Y[j]] + score0(i+1, j+1), -- (1-i)
-1 + score1(i, j+1) ), and -- (1-ii)
score1(m, j) = max( TBL[X[m],Y[j]], -1 + score1(m, j+1) ).
Note that the score(X, Y) is computed in exactly the same way as the
formula (*); that is, only score0 is used to compute the alignment score.
================================
Appendix: Substitution matrix between amino acids
From http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html. (Precisely
speaking this is a table for BLOSUM62.)
A R N D C Q E G H I L K M F P S T W Y V B Z X
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1
Note: X denotes an unidentified/other symbol.
END OF alingmentE.txt
|