FM-index

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

Template:Short description In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,[1] who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space.[2]

It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. The query time, as well as the required storage space, has a sublinear complexity with respect to the size of the input data.

The original authors have devised improvements to their original approach and dubbed it "FM-Index version 2".[3] A further improvement, the alphabet-friendly FM-index, combines the use of compression boosting and wavelet trees[4] to significantly reduce the space usage for large alphabets.

The FM-index has found use in, among other places, bioinformatics.[5]

Background

Using an index is a common strategy to efficiently search a large body of text. When the text is larger than what reasonably fits within a computer's main memory, there is a need to compress not only the text but also the index. When the FM-index was introduced, there were several suggested solutions that were based on traditional compression methods and tried to solve the compressed matching problem. In contrast, the FM-index is a compressed self-index, which means that it compresses the data and indexes it at the same time.

FM-index data structure

An FM-index is created by first taking the Burrows–Wheeler transform (BWT) of the input text. For example, the BWT of the string <templatestyles src="Mono/styles.css" />T = "abracadabra$" is "ard$rcaaaabb", and here it is represented by the matrix <templatestyles src="Mono/styles.css" />M where each row is a rotation of the text, and the rows have been sorted lexicographically. The transform corresponds to the concatenation of the characters from the last column (labeled <templatestyles src="Mono/styles.css" />L).

<templatestyles src="Mono/styles.css" />I <templatestyles src="Mono/styles.css" />FL
1 <templatestyles src="Mono/styles.css" />$abracadabra
2 <templatestyles src="Mono/styles.css" />a$abracadabr
3 <templatestyles src="Mono/styles.css" />abra$abracad
4 <templatestyles src="Mono/styles.css" />abracadabra$
5 <templatestyles src="Mono/styles.css" />acadabra$abr
6 <templatestyles src="Mono/styles.css" />adabra$abrac
7 <templatestyles src="Mono/styles.css" />bra$abracada
8 <templatestyles src="Mono/styles.css" />bracadabra$a
9 <templatestyles src="Mono/styles.css" />cadabra$abra
10 <templatestyles src="Mono/styles.css" />dabra$abraca
11 <templatestyles src="Mono/styles.css" />ra$abracadab
12 <templatestyles src="Mono/styles.css" />racadabra$ab

The BWT in itself allows for some compression with, for instance, move to front and Huffman encoding, but the transform has even more uses. The rows in the matrix are essentially the sorted suffixes of the text and the first column F of the matrix shares similarities with suffix arrays. How the suffix array relates to the BWT lies at the heart of the FM-index.

<templatestyles src="Mono/styles.css" />C[c] of "<templatestyles src="Mono/styles.css" />ard$rcaaaabb"
<templatestyles src="Mono/styles.css" />c $ a b c d r
<templatestyles src="Mono/styles.css" />C[c] 0 1 6 8 9 10
<templatestyles src="Mono/styles.css" />Occ(c, k) of "<templatestyles src="Mono/styles.css" />ard$rcaaaabb"
a r d $ r c a a a a b b
1 2 3 4 5 6 7 8 9 10 11 12
$ 0 0 0 1 1 1 1 1 1 1 1 1
a 1 1 1 1 1 1 2 3 4 5 5 5
b 0 0 0 0 0 0 0 0 0 0 1 2
c 0 0 0 0 0 1 1 1 1 1 1 1
d 0 0 1 1 1 1 1 1 1 1 1 1
r 0 1 1 1 2 2 2 2 2 2 2 2

It is possible to make a last-to-first column mapping <templatestyles src="Mono/styles.css" />LF(i) from an index <templatestyles src="Mono/styles.css" />i to an index <templatestyles src="Mono/styles.css" />j, such that <templatestyles src="Mono/styles.css" />F[j] = <templatestyles src="Mono/styles.css" />L[i], with the help of a table <templatestyles src="Mono/styles.css" />C[c] and a function <templatestyles src="Mono/styles.css" />Occ(c, k).

  • <templatestyles src="Mono/styles.css" />C[c] is a table that, for each character <templatestyles src="Mono/styles.css" />c in the alphabet, contains the number of occurrences of lexically smaller characters in the text.
  • The function <templatestyles src="Mono/styles.css" />Occ(c, k) is the number of occurrences of character <templatestyles src="Mono/styles.css" />c in the prefix <templatestyles src="Mono/styles.css" />L[1..k]. Ferragina and Manzini showed[1] that it is possible to compute <templatestyles src="Mono/styles.css" />Occ(c, k) in constant time.

The last-to-first mapping can now be defined as <templatestyles src="Mono/styles.css" />LF(i) = C[L[i]] + Occ(L[i], i). For instance, on row 9, <templatestyles src="Mono/styles.css" />L is <templatestyles src="Mono/styles.css" />a and the same <templatestyles src="Mono/styles.css" />a can be found on row 5 in the first column <templatestyles src="Mono/styles.css" />F, so <templatestyles src="Mono/styles.css" />LF(9) should be 5 and <templatestyles src="Mono/styles.css" />LF(9) = C[a] + Occ(a, 9) = 5. For any row <templatestyles src="Mono/styles.css" />i of the matrix, the character in the last column <templatestyles src="Mono/styles.css" />L[i] precedes the character in the first column <templatestyles src="Mono/styles.css" />F[i] also in T. Finally, if <templatestyles src="Mono/styles.css" />L[i] = T[k], then <templatestyles src="Mono/styles.css" />L[LF(i)] = T[k - 1], and using the equality it is possible to extract a string of <templatestyles src="Mono/styles.css" />T from <templatestyles src="Mono/styles.css" />L.

The FM-index itself is a compression of the string <templatestyles src="Mono/styles.css" />L together with <templatestyles src="Mono/styles.css" />C and <templatestyles src="Mono/styles.css" />Occ in some form, as well as information that maps a selection of indices in <templatestyles src="Mono/styles.css" />L to positions in the original string <templatestyles src="Mono/styles.css" />T.

Count

The operation count takes a pattern <templatestyles src="Mono/styles.css" />P[1..p] and returns the number of occurrences of that pattern in the original text <templatestyles src="Mono/styles.css" />T. Since the rows of matrix <templatestyles src="Mono/styles.css" />M are sorted, and it contains every suffix of <templatestyles src="Mono/styles.css" />T, the occurrences of pattern <templatestyles src="Mono/styles.css" />P will be next to each other in a single continuous range. The operation iterates backwards over the pattern. For every character in the pattern, the range that has the character as a suffix is found. For example, the count of the pattern "bra" in "abracadabra" follows these steps:

  1. The first character we look for is <templatestyles src="Mono/styles.css" />a, the last character in the pattern. The initial range is set to <templatestyles src="Mono/styles.css" />[C[a] + 1 .. C[a+1]] = [2..6]. This range over <templatestyles src="Mono/styles.css" />L represents every character of <templatestyles src="Mono/styles.css" />T that has a suffix beginning with a.
  2. The next character to look for is <templatestyles src="Mono/styles.css" />r. The new range is <templatestyles src="Mono/styles.css" />[C[r] + Occ(r, start-1) + 1 .. C[r] + Occ(r, end)] = <templatestyles src="Mono/styles.css" />[10 + 0 + 1 .. 10 + 2] = <templatestyles src="Mono/styles.css" />[11..12], if <templatestyles src="Mono/styles.css" />start is the index of the beginning of the range and <templatestyles src="Mono/styles.css" />end is the end. This range over <templatestyles src="Mono/styles.css" />L is all the characters of <templatestyles src="Mono/styles.css" />T that have suffixes beginning with ra.
  3. The last character to look at is <templatestyles src="Mono/styles.css" />b. The new range is <templatestyles src="Mono/styles.css" />[C[b] + Occ(b, start-1) + 1 .. C[b] + Occ(b, end)] = <templatestyles src="Mono/styles.css" />[6 + 0 + 1 .. 6 + 2] = <templatestyles src="Mono/styles.css" />[7..8]. This range over <templatestyles src="Mono/styles.css" />L is all the characters that have a suffix that begins with bra. Now that the whole pattern has been processed, the count is the same as the size of the range: <templatestyles src="Mono/styles.css" />8 - 7 + 1 = 2.

If the range becomes empty or the range boundaries cross each other before the whole pattern has been looked up, the pattern does not occur in <templatestyles src="Mono/styles.css" />T. Because <templatestyles src="Mono/styles.css" />Occ(c, k) can be performed in constant time, count can complete in linear time in the length of the pattern: <templatestyles src="Mono/styles.css" />O(p) time.

Locate

The operation locate takes as input an index of a character in <templatestyles src="Mono/styles.css" />L and returns its position <templatestyles src="Mono/styles.css" />i in <templatestyles src="Mono/styles.css" />T. For instance <templatestyles src="Mono/styles.css" />locate(7) = 8. To locate every occurrence of a pattern, first the range of character is found whose suffix is the pattern in the same way the count operation found the range. Then the position of every character in the range can be located.

To map an index in <templatestyles src="Mono/styles.css" />L to one in <templatestyles src="Mono/styles.css" />T, a subset of the indices in <templatestyles src="Mono/styles.css" />L are associated with a position in <templatestyles src="Mono/styles.css" />T. If <templatestyles src="Mono/styles.css" />L[j] has a position associated with it, <templatestyles src="Mono/styles.css" />locate(j) is trivial. If it's not associated, the string is followed with <templatestyles src="Mono/styles.css" />LF(i) until an associated index is found. By associating a suitable number of indices, an upper bound can be found. Locate can be implemented to find occ occurrences of a pattern <templatestyles src="Mono/styles.css" />P[1..p] in a text <templatestyles src="Mono/styles.css" />T[1..u] in O(p + occ logε u)Script error: No such module "Check for unknown parameters". time with O(Hk(T)+loglogulogϵu) bits per input symbol for any k ≥ 0Script error: No such module "Check for unknown parameters"..[1]

Applications

DNA read mapping

FM index with backtracking has been successfully (>2000 citations) applied to approximate string matching/sequence alignment, See Bowtie http://bowtie-bio.sourceforge.net/index.shtml

See also

References

<templatestyles src="Reflist/styles.css" />

  1. a b c Paolo Ferragina and Giovanni Manzini (2000). "Opportunistic Data Structures with Applications". Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390.
  2. Paolo Ferragina and Giovanni Manzini (2005). "Indexing Compressed Text". Journal of the ACM, 52, 4 (Jul. 2005). p. 553
  3. Script error: No such module "citation/CS1".
  4. P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. An Alphabet-Friendly FM-index. In Proc. SPIRE'04, pages 150-160. LNCS 3246.
  5. Script error: No such module "Citation/CS1".

Script error: No such module "Check for unknown parameters".

Script error: No such module "Navbox".