Alignment and Alignment Score

We explain the notion of alignment and give a precise definition for optimal alignment and alignment score. (Here "alignment score" is "matching score" used in the definition of the sequence identity in problemE.html.)

Remark: The program scoreout.c is designed for computing the alignment score of a given pair of amino acid sequences. But if there is any difference between the score it computes and the one defined below, then we regard (only for this contest) the one computed by scoreout.c as the correct one.

1. Protein (Amino Acid Sequence)

A protein is a sequence of amino acids. Amino acids are expressed by 22 symbols {A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, Y, Z}. (Though the number of amino acid types is 20, two more symbols have been used for expressing some special amino acid types.) Thus, each protein that we will analyze consists of these 22 symbols.

2. Optimal Alignment

When align two amino acid sequences, we can adjust a place of each symbol of the sequences by
  1. inserting some number of blanks in front of one sequence, and
  2. inserting some number of "-" symbols (which is called a "gap") at some points in the sequences.
Note that insertion is allowed before the last symbol of the other sequence. For example, the following insertions (a) and (b) are not allowed.
       (a) AABB         (b) AABBCCCCDD
               CCDD         AA**CC****EE
It is possible to insert blanks or gaps at the same place to both sequences; but such insertions do not make sense when computing alignment score, and we ignore them.

3. Score for a given alignment

When an alignment is fixed, i.e., all blank and gap insertions are fixed, the correspondence of symbols in the both sequences is determined. Then we can compute the score for this alignment in the following way:
	score for a given alignment
	= symbol-wise score total + gap penalty total
Here a symbol-wise score is defined by a table given in Appendix for each pair of corresponding symbols.

For example, consider the following alignment.

	A A B B C C D D - - E E F 
	A A - - - - D D K K K E F G G
	4+4        +6+6    +1+5+6 = 32
Then 32 is the total of symbol-wise scores. On the other hand, for each insertion of a gap (i.e., "-" symbol), a penalty is given as follows:
gap penalty =
    -10, if the gap appears first (i.e., leftmost) in a block of gaps, and
     -1, otherwise.
A gap penalty total is the total of these penalties. For example, the gap penalty total for the previous alignment -24 from the following.
	A A B B C C D D - - E E F 
	A A - - - - D D K K K E F G G
	  -10-1-1-1   -10-1 = -24
Therefore, the score for this alignment is 32 - 24 = 8.

4. Optimal alignment and alignment score

An optimal alignment is an alignment giving the highest score, and alignment score is this highest score. That is,
	the alignment score of X and Y
	= the score of X and Y under an optimal alignment.
For example, the alignment score of the following X and Y is 36.
	X: CBADDKDDBBTPPEP
	Y: CGAEEDDKDDBBTEEFGKKK
is score is achieved, for example, by the following alignment.
	X: CBA--DDKDDBBTPPEP
	Y: CGAEEDDKDDBBTEEFGKKK

5. How to compute alignment score

Computing the alignment score of a given pair of amino acid sequences is not so difficult computation task. We can define a recurrence formula for computing alignment score; then from this formula, by using a standard dynamic programming technique, we can design an algorithm for computing alignment score.

Let us first consider a simple case; we assume that the gap penalty is always -1. Consider any amino sequences X and Y. We assume that they are given in two arrays X[1..m] and Y[1..n]. Below we will use i (1 <= i <= m) to denote the position (from the left) of the sequence X, and j (1 <= j <= n) to denote that of the sequence Y. Now consider the following function.

	score0(i, j) = the alignment score of X[i..m] and Y[j..n].
Then score0 can be computed by the following recurrence formula.
   score0(i, j) = max( TBL[X[i],Y[j]] + score0(i+1, j+1),  -- (i)
                       -1 + score0(i, j+1),                -- (ii)
                       -1 + score0(i+1, j) )               -- (iii)
Here (i) is for the case when X[i] is corresponded to Y[j], (ii) is for the case when a gap is inserted at the ith position of X, and (iii) is for the case when a gap is inserted at the jth position of Y. Note that TBL[X[i], Y[j]] is the symbol-wise score of X[i] and Y[j] given in the matrix of Appendix.

The boundary condition is also important. For our case, we need to consider the cases i = m and j = n. Since both cases are symmetric, we give an explanation for the case i = m. Recall that we cannot skip X[n] by inserting a gap in Y at the corresponding position. Hence, the recurrence formula becomes as follows.

	score0(m, j) = max( TBL[X[m],Y[j]], -1 + score0(m, j+1) )
Finally, alignment score (under our simplified gap penalty) is computed as follows.
(*)  score(X, Y) = max( score0(1, 1), score0(1, 2), ..., score0(1, n),
                                      score0(2, 1), ..., score0(m, 1) )
Note that score0(1, k) is the largest score for all alignments matching X[1] and Y[k].

For computing the real alignment score, we need to distinguish a leftmost gap (of some sequence of gaps) from the other gaps. For this purpose, we consider the following three functions:

   score0(i, j) = the score of X[i..m] and Y[j..n] with no gap before
                  X[i] or Y[j],
   score1(i, j) = the score of X[i..m] and Y[j..n] under the situation
                  where gap(s) have been inserted in the seq. X, and
   score2(i, j) = the score of X[i..m] and Y[j..n] under the situation
                  where gap(s) have been inserted in the seq. Y.

By a similar idea, we can define recurrence formulas computing these functions. For example, score0 is now computed as follows.

   score0(i, j) = max( TBL[X[i],Y[j]] + score0(i+1, j+1),  -- (0-i)
                       -10 + score1(i, j+1),               -- (0-ii)
                       -10 + score2(i+1, j) ), and         -- (0-iii)

   score0(m, j) = max( TBL[X[m],Y[j]], -10 + score1(m, j+1) ).
On the other hand, score1 (similarly score2) is computed as follows.
   score1(i, j) = max( TBL[X[i],Y[j]] + score0(i+1, j+1),  -- (1-i)
                       -1 + score1(i, j+1) ), and          -- (1-ii)

   score1(m, j) = max( TBL[X[m],Y[j]], -1 + score1(m, j+1) ).
Note that the score(X, Y) is computed in exactly the same way as the formula (*); that is, only score0 is used to compute the alignment score.
================================

Appendix: Substitution matrix between amino acids

From http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html.  (Precisely
speaking this is a table for BLOSUM62.)

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X 
 A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 
 R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 
 N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 
 D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 
 C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 
 Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 
 E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 
 G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 
 H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 
 I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 
 L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 
 K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 
 M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 
 F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 
 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 
 S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 
 T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 
 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 
 Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 
 V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 
 B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 
 Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 
 X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 

Note: X denotes an unidentified/other symbol.

END OF alingmentE.txt