panthema / 2012 / 1119-eSAIS-Inducing-Suffix-and-LCP-Arrays-in-External-Memory
Example of the Inducing Process

eSAIS - Inducing Suffix and LCP Arrays in External Memory

Posted on 2012-11-19 15:49 by Timo Bingmann at Permlink with 2 Comments. Tags: research stringology stxxl c++

This web page accompanies our conference paper "Inducing Suffix and LCP Arrays in External Memory", which we presented at the Workshop on Algorithm Engineering and Experiments (ALENEX 2013). A PDF of the publication is available from this site as alenex13esais.pdf alenex13esais.pdf or from the online proceedings. The paper was joint work with my colleagues Johannes Fischer and Vitaly Osipov.

Download alenex13esais.pdf

The slides to my presentation of the paper on January 7th, 2013 in New Orleans, LA, USA is available: alenex13esais-slides.pdf alenex13esais-slides.pdf. They contain little text and an example of the eSAIS algorithm with a simplified PQ.

Download alenex13esais-slides.pdf

We have also submitted a full version of the eSAIS paper to a journal. Due to long publication cycles, we make a pre-print of the journal article is available here: esais-preprint.pdf esais-preprint.pdf. The full paper contains more details on the inducing algorithm for the LCP array and additional experimental details.

Download esais-preprint.pdf

Our implementations of eSAIS, the eSAIS-LCP variants, DC3 and DC3-LCP algorithms as described in the paper are available below under the GNU General Public License v3 (GPL).

eSAIS and DC3 with LCP version 0.5.4 (current) updated 2013-12-13
Source code archive:
(includes STXXL 1.4.0)
eSAIS-DC3-LCP-0.5.4.tar.bz2 eSAIS-DC3-LCP-0.5.4.tar.bz2 (1.37 MiB) Browse online
Git repositories Suffix and LCP construction algorithms
git clone https://github.com/bingmann/eSAIS
cd eSAIS; git submodule init; git submodule update
STXXL 1.4.0
git clone https://github.com/stxxl/stxxl

For more information about compiling and testing the implementation, please refer to the README included in the source.

We have also collected the real-world input samples used to test our algorithms. Note that these are files are very large and hosted externally:

Wikipedia XML Dump from June 2012 download (11.1 GiB) Creative Commons License
Gutenberg Concatenation from September 2012 download (5.1 GiB) Project Gutenberg License
UCSC Human Genome Assembly "hg19" download (610 MiB) Free by UCSC
First 8 Gi of Decimal Digits of Pi download (3.8 GiB) generated by y-cruncher

Notice that most of these files are compressed with xz and must be recompressed with gzip for the esactest test program to be able to automatically decompress them on-the-fly.

Abstract

We consider text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory. Practical tests show that this outperforms the previous best EM suffix sorter [Dementiev et al., ALENEX 2005] by a factor of about two in time and I/O-volume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments).

Older Versions

eSAIS and DC3 with LCP version 0.5.2 released 2013-03-30
Source code archive:
(includes STXXL)
eSAIS-DC3-LCP-0.5.2.tar.bz2 eSAIS-DC3-LCP-0.5.2.tar.bz2 (989 KiB) Browse online

eSAIS and DC3 with LCP version 0.5.0 released 2012-11-26
Source code archive:
(includes STXXL)
eSAIS-DC3-LCP-0.5.0.tar.bz2 eSAIS-DC3-LCP-0.5.0.tar.bz2 (979 KiB) Browse online

Comment by Seppi at 2012-12-03 11:04 UTC
this is great stuff!
Comment by homolog.us at 2013-02-06 19:38 UTC- http://homolog.us
Very good algorithm. We featured it in our blog.

http://www.homolog.us/blogs/2013/01/28/on-constructing-suffix-arrays-and-lcp-arrays-in-external-memory/
Post Comment
Name:
E-Mail or Homepage:
 

URLs (http://...) are displayed, e-mails are hidden and used for Gravatar.

Many common HTML elements are allowed in the text, but no CSS style.
RSS 2.0 Weblog Feed Atom 1.0 Weblog Feed Valid XHTML 1.1 Valid CSS (2.1)
Copyright 2005-2017 Timo Bingmann - Impressum