panthema / 2012 / 1119-eSAIS-Inducing-Suffix-and-LCP-Arrays-in-External-Memory / eSAIS-DC3-LCP-0.5.4 / README (Download File)
This is a short README for our implementations of eSAIS and DC3 including LCP
variants as described in the paper "Suffix and LCP Array in External Memory"
presented at ALENEX'13 and submitted to JEA's Special Edition.

--- Source Code Overview

The main algorithm implementations are in the directory "src/external/", inputs
are generated by the classes in "src/input/" and various auxiliary algorithms
are encapsulated in "src/tools/" header files. The STXXL library contained in
"stxxl/" is a modified version of 1.4.1 and contains many customizations we
added while implementing eSAIS.

Only one binary program is built when compiling with standard options. This
program is called "esactest" and includes all algorithms via the header
files. It is used to run speed tests on the various input instances and
automatically checks the calculated arrays.

--- Using esactest

This except from the esactest.cc describes how to call the test program:

 * This main program is designed to quickly test different external memory
 * suffix sorters on a wide variety of inputs. The <input> is specified on the
 * command line and can be:
 *
 * - plain data files located under $HOME/sac-corpus/
 *
 * - gzip compressed data files in $HOME/sac-corpus/, which are decompressed
 *   on-the-fly. These must be named datafile.123456.gz, where 123456 is the
 *   _decompressed_ size of the data file.
 *
 * - artificial inputs generated by functions in the input/ sub-directory.
 *
 * - verbatim strings can be processed as "simple/<string>"
 * 
 * Call the program without arguments for a list of available inputs. The input
 * is loaded into a stxxl vector and then processed by the external memory
 * suffix sorters that this main program was compiled with.
 *
 * Size of the input can be limited via the -s command line parameter.

The default external memory suffix sorter is eSAIS with LCP construction. The
distributed source code is by default configured to use 4 GiB of RAM and the
40-bit position data type. The STXXL library requires further configuration to
specify where temporary files are created; this .stxxl file is described below.

--- Building esactest

To build the program "cmake" version 2.8 or higher and a recent gcc C++
compiler are required. Having "eSAIS-DC3-LCP-X.Y.Z.tar" unpacked in the current
directory, the following commands will compile and run the esactest program,
compiled with the included custom STXXL library:

$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make
$ cd src/
$ ./esactest

If the cmake configuration routine does not finish successfully, you may need
to install some development packages via your Linux distribution's package
manager.

To build separate binaries for each algorithm variant add -DBUILD_ALL=ON to the
cmake configuration line.

Here is a simple example to generate suffix and LCP array of a random input:

$ mkdir artificial
$ ./esactest -s 10mb --writeinput --writeoutput artificial/rand10
$ ls -l artifical/

--- Configuring STXXL Temporary Files

The external memory library STXXL creates (huge) temporary files when
processing large data. It is configured by the file ".stxxl" in the current
directory. Each line specifies one temporary file, its maximum size, and access
method. We generally use

disk=/data01/stxxl,950000,syscall_unlink
disk=/data02/stxxl,950000,syscall_unlink

for two 1 TiB disks mounted at /data01 and /data02. The method "syscall_unlink"
creates a deleted file on the disk, which _does not appear_ in directory
listings and is automatically freed by the kernel when the program ends.

--- Further Information

We have collected source code, some test instances, and further information at

http://panthema.net/2012/1119-eSAIS-Inducing-Suffix-and-LCP-Arrays-in-External-Memory/

For information about STXXL see

http://stxxl.sourceforge.net/

--- Exits

Written 2012-11-26 by Timo Bingmann, updated 2013-03-30.