To compare two or more sequences, it is necessary to align them. At each position of the alignment there are three possibilities:

- The bases match: there has been no change at this position since the sequences diverged.
- The bases mismatch: a substitution has occurred since their divergence.
- There is a base in one sequence and no base in the other: an insertion or a deletion has occurred since their divergence.
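
The three cases above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_columns` and the two short sequences are made up for the example, and a column in which either sequence has a gap character (`-`) is assumed to count as an insertion or deletion.

```python
def classify_columns(seq_a: str, seq_b: str, gap: str = "-") -> dict:
    """Count matching, mismatching and gapped columns of a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            counts["gap"] += 1        # base in one sequence only: insertion or deletion
        elif a == b:
            counts["match"] += 1      # same base: no change
        else:
            counts["mismatch"] += 1   # different bases: substitution
    return counts


# Hypothetical aligned pair, used only to exercise the function.
print(classify_columns("ACGT-A", "ACCTGA"))
```

Note that an alignment alone cannot tell an insertion from a deletion, which is why both fall into the same category here.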

Figure 12: The comparison of sequences.

A good alignment is important for the next step: the construction of phylogenetic trees. The alignment affects the distances computed between two species, and this in turn influences the inferred phylogeny. Several programs for aligning sequences are available on the net. They are based on different mathematical models, but each tries to find the alignment with the optimal score: as many matching bases as possible with a minimum number of inserted gaps. The number of gaps has to be limited because, with enough gaps inserted, any two sequences can be aligned without a single mismatch.
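
To make the idea of an "optimal score" concrete, here is a minimal sketch of one classic approach, global alignment by dynamic programming in the style of Needleman and Wunsch. It is not claimed to be the model used by any particular program. The function name `min_alignment_cost` and the default costs (1 per mismatch, 2 per gapped position) are assumptions chosen for illustration, and the sketch returns only the lowest total cost, not the alignment itself.

```python
def min_alignment_cost(a: str, b: str, mismatch: int = 1, gap: int = 2) -> int:
    """Lowest total cost over all global alignments of a and b.

    Assumed cost model: 0 per match, `mismatch` per substitution,
    `gap` per gapped position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] = lowest cost of aligning a[:i] with b[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        cost[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = cost[i - 1][j] + gap   # a[i-1] paired with a gap
            gap_in_a = cost[i][j - 1] + gap   # b[j-1] paired with a gap
            cost[i][j] = min(substitution, gap_in_b, gap_in_a)
    return cost[-1][-1]


# Hypothetical sequences, used only to exercise the function.
print(min_alignment_cost("GATTACA", "GATCA"))
print(min_alignment_cost("GATTACA", "GATCA", gap=0))  # free gaps: cost drops to 0
```

The second call shows why gaps must carry a penalty: when gaps are free, the cheapest "alignment" simply avoids every mismatch by inserting gaps, and the score no longer says anything about how similar the sequences are.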

Example: two sequences:

    TCAGACGATTG
    TCGGAGCTG

How can we get the best alignment? There are several possibilities:

1. Reduce the number of mismatches:

    TCAG-ACG-ATTG
    || | | |  | |
    TC-GGA-GC-T-G

    0 mismatches, 7 matches, 6 gaps

2. Reduce the number of gaps:

    TCAGACGATTG
    || ||
    TCGGAGCTG--

    5 mismatches, 4 matches, 2 gaps

3. Reduce neither the number of gaps nor the number of mismatches:

    TCAG-ACGATTG
    || | | | |
    TC-GGA-GCTG-

    2 mismatches, 6 matches, 4 gaps

4. Same as 3, but with one base (or gap) moved:

    TCAG-ACGATTG
    || | | | | |
    TC-GGA-GCT-G

    1 mismatch, 7 matches, 4 gaps

Which of these is the best alignment? There are several alignment algorithms for choosing the best alignment. Let's use a simple one in this example:

$$D = y + \sum_{k} w_{k} z_{k}$$

with:

- $D$: distance
- $y$: number of mismatches
- $w_{k}$: penalty for a gap of length $k$
- $z_{k}$: number of gaps of length $k$

Take the penalty for a gap of length 1 to be $w_{1} = 2$ and the penalty for a gap of length 2 to be $w_{2} = 6$ (short gaps occur more frequently than long gaps, so a long gap is penalized more heavily).

Alignment 1: D = 0 + (2 × 6) + (6 × 0) = 12

Alignment 2: D = 5 + (2 × 0) + (6 × 1) = 11

Alignment 3: D = 2 + (2 × 4) + (6 × 0) = 10

Alignment 4: D = 1 + (2 × 4) + (6 × 0) = 9

We choose alignment 4 because it has the minimum distance.
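
The calculation above can be reproduced with a short Python sketch. The names `GAP_PENALTY`, `gap_lengths` and `distance` are hypothetical. The sketch assumes that a gap is a run of consecutive `-` characters within one sequence, and it only knows the two penalties given in the text, so a gap longer than 2 raises a `KeyError` rather than guessing a value.

```python
import re

# w_k from the text: gap length k -> penalty
GAP_PENALTY = {1: 2, 2: 6}


def gap_lengths(aligned: str) -> list:
    """Length of every run of consecutive '-' characters in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def distance(seq_a: str, seq_b: str) -> int:
    """D = y + sum over k of w_k * z_k for one pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # y: columns where both sequences have a base and the bases differ
    mismatches = sum(
        1 for a, b in zip(seq_a, seq_b) if a != "-" and b != "-" and a != b
    )
    # Adding the penalty of each gap individually equals sum(w_k * z_k)
    penalty = sum(GAP_PENALTY[k] for k in gap_lengths(seq_a) + gap_lengths(seq_b))
    return mismatches + penalty


# The four candidate alignments from the example
alignments = {
    1: ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),
    2: ("TCAGACGATTG", "TCGGAGCTG--"),
    3: ("TCAG-ACGATTG", "TC-GGA-GCTG-"),
    4: ("TCAG-ACGATTG", "TC-GGA-GCT-G"),
}
scores = {n: distance(a, b) for n, (a, b) in alignments.items()}
best = min(scores, key=scores.get)
print(scores)  # {1: 12, 2: 11, 3: 10, 4: 9}
print(best)    # 4
```

Changing the values in `GAP_PENALTY` is a quick way to see how strongly the choice of "best" alignment depends on the gap penalties.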

Figure 1: The alignment of sequences. This alignment was made with ClustalW 1.74 and, as you can see, the more variable areas are not optimally aligned (indicated with red boxes). It is therefore usually necessary to improve the alignment by hand. In this case it is obvious how to improve the alignment, but in other cases it can be more difficult.
