Maximal unique match

Script error: No such module "Unsubst". A maximal unique match or MUM, for short, is part of a key step ^[1] in the multiple sequence alignment of genomes in computational biology. Identification of MUMs and other potential anchors is the first step in larger alignment systems such as MUMmer. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we each word in the acronym can be broken down individually. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.

Formal definition

"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d = 20) such that
it is maximal, that is, it cannot be extended on either end without incurring a mismatch and

it is unique in both sequences"^[2]

Algorithm

Identifying the set of MUMs in two very long genome sequences is not computationally trivial.^[1] There are several algorithmic ways to approach identifying MUMs in multiple sequence alignment. The simplest and slowest method is using brute force where for every index $i$ Script error: No such module "Check for unknown parameters". in genome $A$ Script error: No such module "Check for unknown parameters". and every index $j$ Script error: No such module "Check for unknown parameters". in genome $B$ Script error: No such module "Check for unknown parameters". , you calculate the longest common prefix $(P)$ Script error: No such module "Check for unknown parameters". of $A[i...n]$ Script error: No such module "Check for unknown parameters". and $B[j...m]$ Script error: No such module "Check for unknown parameters".. Next you must guarantee that the length of $P$ Script error: No such module "Check for unknown parameters". is at least $d$ Script error: No such module "Check for unknown parameters". where $d$ Script error: No such module "Check for unknown parameters". is the minimum MUM size specified. Finally you must ensure that $P$ Script error: No such module "Check for unknown parameters". is unique in both genomes. The complexity of the brute force method is thus $O(mn)$ Script error: No such module "Check for unknown parameters"..^[2]

In actuality though MUMs are identified by building a generalized suffix tree for $A$ Script error: No such module "Check for unknown parameters". and $B$ Script error: No such module "Check for unknown parameters". . A list is then created for all internal nodes with exactly one child from each genome sequence. For each of these nodes(we will identify children from genome $A$ Script error: No such module "Check for unknown parameters". as $i$ Script error: No such module "Check for unknown parameters". and children from genome $B$ Script error: No such module "Check for unknown parameters". as $j$ Script error: No such module "Check for unknown parameters". ) check that $A[i-1] \neq B[j-1]$ Script error: No such module "Check for unknown parameters". and for those where this conditions holds, we know this is a MUM. In this case the complexity is reduced to $O(m+n)$ Script error: No such module "Check for unknown parameters"..^[2]

In the illustration below given the initial strings $S$ Script error: No such module "Check for unknown parameters". and $T$ Script error: No such module "Check for unknown parameters". and a $d$ Script error: No such module "Check for unknown parameters". of 1, the MUMs should be G and TA. A red leaf denotes that the leaf came from string S and a blue leaf denotes string $T$ Script error: No such module "Check for unknown parameters".. The internal node at A was discarded because $A[i-1] = B[j-1]$ Script error: No such module "Check for unknown parameters". meaning the character that comes before both A's is identical (T), this is condition where the sequences belongs to a larger unique sequence. The internal node at C is discarded because it has two children from S. This leaves us with the MUMs of G and TA.

MUM identification using a suffix tree

Example

To illustrate MUMs further, we take the following example. Let's say that d=3 and our two genomes are as follows:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#
T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Working through the sequence our first MUM that satisfies the conditions occurs here:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#
T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

TGT is at least d in length, occurs only once in both sequences, and the characters to the left and right differ between genomes. To illustrate an example where expansion is needed to ensure that our MUM is not part of a larger sequence and unique, take the following:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#
T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

There are two problems in the case of testing ATG as a MUM because ATG is not unique and it is also part of a larger subsequence as illustrated here:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#
T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Therefore expansion of this sequence is required in an attempt to satisfy the conditions for being a MUM as shown here.

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#
T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Using this method iteratively produces a final product where each MUM is identified for clarity each MUM is colored-coded:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#
T = ACTTTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Maximal Exact Match (MEM)

MUMs are a subset of a larger set referred to as Maximal Exact Matches or MEMS. In a MEM the uniqueness condition of MUMs is relaxed. MEMs are like local alignment but in this case only identify sequences where there are no gaps.

References

↑ ^a ^b Script error: No such module "Citation/CS1".
↑ ^a ^b ^c Script error: No such module "citation/CS1".

Script error: No such module "Check for unknown parameters".

[Alignment_of_whole_genomes-1] Script error: No such module "Citation/CS1".

[Algorithms_in_Bioinformatics-2] Script error: No such module "citation/CS1".

[1]

[2]

Maximal unique match

Contents

Formal definition

Algorithm

Example

Maximal Exact Match (MEM)

See also

References

Navigation menu

Maximal unique match

Formal definition

Algorithm

Example

Maximal Exact Match (MEM)

See also

References

Navigation menu

Search