Suffix trees with Ukkonen’s algorithm
Danny Yoo <dyoo@hashcollision.org>
This is an implementation of suffix trees and their linear-time construction with the Ukkonen algorithm. This implementation is based on notes from Gusfield, "Algorithms on Strings, Trees, and Sequences".
1 Example
Let’s rush into a minimal example:
> (require suffixtree) > (define tree (make-tree)) > (tree-add! tree (string->label "00010010$")) > (define root (tree-root tree)) > (node-children root) '(#<node> #<node> #<node>)
> (label->string (node-up-label (car (node-children root)))) "$"
2 Introduction
Suffix trees encode all the nonempty suffixes of a string. For example, the string "0100101$" corresponds to the following suffix tree.
root |
| |
V |
+--- $ |
| |
+--- 1 --- $ |
| | |
| +---- 0 --- 0101$ |
| | |
| +---- 1$ |
| |
+--- 0 --- 0101$ |
| |
+---- 1 ---- $ |
| |
+----- 0 --- 1$ |
| |
+---- 0101$ |
Every path from the root to any leaf spells out a suffix of the string "0100101$", and every suffix is accounted for. This in itself might not sound too sexy, but by preprocessing a string as a suffix tree, we can then do some amazing things.
For example, we can see if a substring is present in a suffix tree in time bounded by the length of the substring by following characters starting from the root. Suffix trees also allow us to find the longest common substring between strings in linear time. Dan Gusfield’s book "Algorithms on Strings, Trees, and Sequences" sings praises about suffix trees, and deservedly so.
Constructing a suffix tree can be done in linear-time; the algorithm used here is Ukkonen’s algorithm, since it’s one of the simplest to code. That being said, the algorithm is not quite simple; for more information on the construction algorithm, see the References section below.
3 API
(require suffixtree) | package: suffixtree |
The API consists of the main suffix tree algorithm, and auxillary utilities and applications.
The main structures are trees, nodes, and labels.
3.1 Trees
A suffix tree consists of a root. This implementation allows multiple labels to be added to the tree.
procedure
(make-tree) → tree
procedure
(tree? datum) → boolean
datum : any
procedure
(tree-root a-tree) → node
a-tree : tree
> (define a-tree (make-tree)) > (tree-add! a-tree (string->label "supercalifragilisticexpialidocious"))
procedure
(tree-walk a-tree a-label succeed-f fail-f) → (or/c A B)
a-tree : tree a-label : label succeed-f : (node number -> A) fail-f : (node number number -> B)
If the label matched completely, calls succeed-f with the position where the matching had succeeded.
If the label mismatched, calls fail-f with the tree position where the matching had failed.
The return value from tree-walk will either be A or B.
> (define a-tree (make-tree)) > (tree-add! a-tree (string->label "banana"))
> (define (success-f node up-label-offset) (list node up-label-offset))
> (define (fail-f node up-label-offset input-label-offset) (list node up-label-offset input-label-offset)) > (tree-walk a-tree (string->label "banana") success-f fail-f) '(#<node> 6)
> (tree-walk a-tree (string->label "ban") success-f fail-f) '(#<node> 3)
> (tree-walk a-tree (string->label "ana") success-f fail-f) '(#<node> 3)
> (tree-walk a-tree (string->label "apple") success-f fail-f) '(#<node> 1 1)
procedure
(tree-contains? a-tree a-label) → boolean
a-tree : tree a-label : label
> (define a-tree (make-tree)) > (tree-add! a-tree (string->label "0100101$")) > (tree-contains? a-tree (string->label "01")) #t
> (tree-contains? a-tree (string->label "001")) #t
> (tree-contains? a-tree (string->label "011")) #f
> (tree-contains? a-tree (string->label "0101")) #t
tree-contains? is an application of tree-walk:
(define (tree-contains? tree label) (tree-walk tree label (lambda (node up-label-offset) #t) (lambda (node up-label-offset input-label-offset) #f)))
3.2 Nodes
Nodes form the structure of the suffix tree, and link up children nodes as well. Every internal node I of a suffix tree will also have a suffix-node whose path-label is the immediate suffix of node I.
> (define a-tree (make-tree))
> (tree-add! a-tree (string->label "peter piper"))
> (for ([i (in-naturals)] [c (node-children (tree-root a-tree))]) (printf "~a: ~a\n" i (label->string (node-up-label c))))
0: iper
1: p
2: piper
3: r piper
4: e
5: ter piper
> (define p-node (node-find-child (tree-root a-tree) #\p))
> (for ([i (in-naturals)] [c (node-children p-node)]) (printf "~a: ~a\n" i (label->string (node-up-label c))))
0: e
1: iper
procedure
(node-up-label a-node) → label
a-node : node
procedure
(node-parent a-node) → (or/c node #f)
a-node : node
procedure
(node-suffix-link a-node) → node
a-node : node
procedure
(node-find-child a-node a-label-element) → (or/c node #f)
a-node : node a-label-element : label-element
procedure
(node-children a-node) → (listof node)
a-node : node
3.3 Labels
Labels represent an immutable sequence of label-elements. Label-elements can be anything that compare with equal?, but the most common label-elements will be characters. Labels can be sublabeled with efficiency.
procedure
(string->label a-string) → label
a-string : string
procedure
(string->label/with-sentinel a-str) → label
a-str : string
Note that label->string can’t be directly used on a label with a sentinel.
procedure
(label->string a-label) → string
a-label : label
procedure
(vector->label a-vec) → label
a-vec : vector
procedure
(vector->label/with-sentinel a-vec) → label
a-vec : vector
procedure
(label->vector a-label) → vector
a-label : label
procedure
(sublabel a-label left-offset [right-offset]) → label
a-label : label left-offset : number right-offset : number = (label-length label)
If right-offset is omitted, it defaults to (label-length label).
(<= left-offset right-offset) should be #t.
procedure
(label-ref a-label a-offset) → label-element
a-label : label a-offset : number
procedure
(label-length a-label) → number
a-label : label
procedure
(label-equal? label-1 label-2) → boolean
label-1 : label label-2 : label
Warning: two labels may have equal content, but come from different sources.
procedure
(label-source-eq? label-1 label-2) → boolean
label-1 : label label-2 : label
procedure
(label-source-id a-label) → number
a-label : label
(label-source-eq? label-1 label-2)
logically implies:
(= (label-source-id label-1) (label-source-id label-2))
3.4 Other utilities
This module provides some example applications of suffix trees.
procedure
(longest-common-substring string-1 string-2) → string string-1 : string string-2 : string
> (longest-common-substring "Lambda: the Ultimate Imperative" "Procedure Call Implementations Considered Harmful, or, Lambda: the Ultimate GOTO") "Lambda: the Ultimate "
procedure
(longest-common-sublabel label-1 label-2) → label
label-1 : label label-2 : label
procedure
(path-label a-node) → label
a-node : node
4 Extended example: keyword search
#lang racket/base (require suffixtree) ;; String reference (struct sref (str)) ;; string->label*: string -> label ;; Creates a label out of a string, but with a reference ;; to the string at the end. (define (string->label* s) (vector->label (list->vector (append (string->list s) (list (sref s)))))) ;; leaves: node -> (listof node) ;; Get the leaves of the tree rooted at node. (define (leaves a-node) (let loop ([a-node a-node] [acc '()]) (define children (node-children a-node)) (cond [(eq? children '()) (cons a-node acc)] [else (foldl loop acc children)]))) ;; leaf-str: node -> string ;; Given a leaf node, get back the string stored in the ;; terminating sref element. (define (leaf-str a-leaf) (define a-label (node-up-label a-leaf)) (sref-str (label-ref a-label (sub1 (label-length a-label))))) ;; find-matches: tree string -> (listof string) (define (find-matches a-tree str) (define (on-success node up-label-offset) (map leaf-str (leaves node))) (define (on-fail node up-label-offset input-label-offset) '()) (tree-walk a-tree (string->label str) on-success on-fail)) (define a-tree (make-tree)) (tree-add! a-tree (string->label* "peter")) (tree-add! a-tree (string->label* "piper")) (tree-add! a-tree (string->label* "pepper")) (find-matches a-tree "per") (find-matches a-tree "ter")
5 Caveats
The code in tree-add! assumes that the construction of the full suffix tree on its input string is possible. Certain strings don’t have an full explicit suffix tree, such as "foo".
+-- "foo" |
| |
+-- "oo" |
In this case, when we try to construct a suffix tree out of "foo", we have an implicit suffix tree, where not every leaf corresponds to a suffix of the input string. The suffix "o" is implicit in this tree.
In order to guarantee that all suffixes will have a place in the suffix tree, we’ll often add a sentinel character at the end the string to make sure all suffixes have a unique path in the suffix tree. For example, assuming that we use "$" as our sentinel:
+-- "foo$" |
| |
+-- "o" -- "o$" |
| | |
| +---- "$" |
| |
+-- "$" |
The API has the function string->label/with-sentinel to automatically add a unique sentinel character at the end of a string.
(let ([tree (make-tree)] [label (string->label/with-sentinel "foo")]) (tree-add! label))
so be sure to use this if you need to ensure the representation of all suffixes in the suffix tree.
6 References
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, 1997.
Lloyd Allison. Suffix Trees. http://www.allisons.org/ll/AlgDS/Tree/Suffix/
Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb’s Journal, August, 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm
Mummer: Ultra-fast alignment of large-scale DNA and protein sequences. http://mummer.sourceforge.net/