Genome Analysis: Search for a Largest Protein Family

SuperCon2003 Finals Problem: Genome Analysis: Search for a Largest Protein Family

A genome, a body of genes, is a DNA (deoxyribonucleic acids) string, which is a sequence of nucleotides. Based on the information coded in a genome, proteins are synthesized, from which cytoskeleton, enzymes, etc. are made of.

A protein is a sequence of amino acids; roughly speaking, a genome encodes these amino acid sequences. For example, Escherichia coli, one of the commonly studied genome, encodes 4300 proteins, each of which consists of 100 to 1000 amino acids. Similar proteins (amino acid sequences) are considered to have similar functions. Thus, for identifying a type of proteins characterizing a given species, it is important to know a set of similar protein (which we call a "protein family").

[Problem]
For a given set of proteins (i.e., a set of amino sequences), find a large (lager is better) protein family w.r.t. sequence identity >= 25.

(*1): See below for the notions of "sequence identity" and "family"
(*2): Make a program that keeps producing protein families whenever a large one is found within a given time limit. (You do not have to worry about the time limit; each program is terminated by the referee program when the time limit comes.) It is not allowed to produce a family having a protein protein pairs with sequence identity < 25. Evaluation is made based on the last output within the time limit. We will announce later a specific way to make an output.

[Example Files and Example Programs]

The following example data and programs are provided:

  - proteins.txt: a set of 1200 proteins,
  - proteins-smallexample.txt: a small set of proteins,
  - scoreoutE.c:
       a program for computing the alignment score and an optimal
       alignment achieving this score, for the first two proteins
       in the file "proteins-smallexample.txt", and
  - seqidE.c:
       a program for computing a table of seq. id. for proteins
       given in the file "proteins.txt".

For the evaluation, we will use one protein set of similar size, but it may have a different feature. An example protein set that is similar to the one used for the evaluation is given on the first day of the final (i.e., July 28th).

[Evaluation]

We evaluate contestants' programs by running them one by one within a certain time limit. Programs are ranked based on the size of produced protein family; if two programs produces families with the same size, they are ranked by elapsed time until the largest family (i.e., the last one) is produced.

[Super Computer and MPI Programming]

We will use SGI Origin2000 (UNIX OS) system, which has 256 processors. Each program can make use of at most 64 processors. For parallel programming, we can use the MPI library. An introduction to parallel programming using the MPI library can be found in MPI tutorial

Some Explanations

1. Sequence identity of a protein pair

There are 20 amino types. Here, only for this explanation, let us denote them A to T. Then proteins are, for example, the following sequences X and Y. (Note: These are sequences defined artificially for this explanation.)

	X: CBADDKDDBBTPPEP  (length |X| = 15)
	Y: CGAEEDDKDDBBTEEFGKKK   (length |Y| = 20)

The sequence identity (seq. id. for short) is defined based on the number of places where two sequences have the same symbol. More precisely, it is defined as follows:

: matching score = the (max) number of places where X and Y match, and seq. id. of X and Y = ( matching score / min(|X|,|Y|) ) * 100
(Note: For this contest, we consider only integral part of seq. id. That is, seq. id. is a rounded value.)

Now we explain the notion of matching score, i.e., the number of places where X and Y match. (This is a very simplified version of "alignment score" and it is not the one used for the contest. See alignmentE.txt for the correct and precise definition.)

Let us consider the above example. By comparing symbols from the top (i.e., the left) of each sequence, we have

	X: CBADDKDDBBTPPEP
	Y: CGAEEDDKDDBBTEEFGKKK
	   * *   *

Hence, X and Y match at three places, which are indicated by "*" symbols. In this sense, the number of places where X and Y matches is 3. But if we align these sequences by shifting X slightly to the right, we have

	X:   CBADDKDDBBTPPEP 
	Y: CGAEEDDKDDBBTEEFGKKK
        	********

Thus, the number of matching places is increased to 8. We allow such a shift. Furthermore, it is also allowed to insert some "-" symbols (which are called "gaps") in the sequences to shift some parts of the sequences. For example, the following alignment yields 11 matching places, which is in fact maximum.

	X: CBA--DDKDDBBTPPEP 
	Y: CGAEEDDKDDBBT--EEFGKKK
	   * *  ********  *

For evaluating "similarity" of a pair of amino sequences, we align them (more precisely, each symbol in the sequences) in the above way to maximize the number of matching places, and the obtained maximum is our "matching score".

Note that this is a simplified explanation. For the real one, which is called a "alignment score" instead of "matching score", some other factors are also considered. For example, inserting a gap (i.e., "-" symbol) creates a less natural alignment; so, for expressing this unnaturalness, some sort of penalty is given each time a gap is inserted. See alignmentE.html for the precise definition of "alignment score".

2. Protein Family

In this contest, we consider a pair of proteins is "similar" if it has a sequence identity >= 25. A set of proteins is called a "protein family" (w.r.t. seq. id >= 25) if every pair of proteins in the set has a sequence identity >= 25. A protein family size is just the number of elements in the family. Our problem is to find such a protein family of large (if possible, the largest) size.

END OF problemE.txt