SST: An Algorithm for Finding Near-exact Sequence Matches
in Time Proportional to the Logarithm of the Database Size
Eldar Giladi, Michael G. Walker, James Z. Wang and Wayne Volkmuth
Abstract:
Motivation: Searches for near exact sequence matches are performed
frequently in large-scale sequencing projects and in comparative
genomics. The time and cost of performing these large-scale
sequence-similarity searches is prohibitive using even the fastest of
the extant algorithms. Faster algorithms are desired. Results: We have
developed an algorithm, called SST (Sequence Search Tree), that
searches a database of DNA sequences for near-exact matches, in time
propor-tional to the logarithm of the database size n.In SST,we
partition each sequence into fragments of fixed length called windows
using multiple offsets. Each window is mapped into a vector of
dimension 4 k which contains the frequency of occurrence of its
component k-tuples, with k a parameter typically in the range 4
6. Then we create a tree-structured index of the windows in vector
space, with tree-structured vector quantization (TSVQ). We identify
the nearest neighbors of a query sequence by partitioning the query
into windows and searching the tree-structured index for
nearest-neighbor windows in the database. When the tree is balanced
this yields an O log n complexity for the search. This complexity was
observed in our compu-tations. SST is most effective for applications
in which the target sequences show a high degree of similarity to the
query sequence, such as assembling shotgun sequences or matching ESTs
to genomic sequence. The algorithm is also an effective filtration
method. Specifically, it can be used as a preprocessing step for other
search methods to reduce the complexity of searching one large
database against another. For the problem of identifying overlapping
fragments in the assembly of 120 000 fragments from a 1.5 megabase
genomic sequence, SST is 15 times faster than BLAST when we consider
both building and searching the tree. For searching alone (i.e. after
building the tree index), SST 27 times faster than BLAST.
Full Paper in Color
(PDF)
Citation:
Eldar Giladi, Michael G. Walker, James Z. Wang and Wayne Volkmuth,
``SST: An Algorithm for Finding Near-Exact Sequence Matches in Time
Proportional to the Logarithm of the Database Size,'' Bioinformatics,
vol. 18, no. 6, pp. 873-879, 2002.
Copyright 2002 Oxford University Press.
Personal use of this
material is permitted. However, permission to reprint/republish this
material for advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists, or
to reuse any copyrighted component of this work in other works, must
be obtained from the publisher.
Last Modified:
March 1 2002
© 2002, James Z. Wang