
NVMe "Disk" Bandwidth and Latency for Batched Block Requests

Posted on 2019-03-22 16:00 by Timo Bingmann. Tags: c++ stxxl thrill

Last week I had the pleasure of being invited to the Dagstuhl seminar 19111 on Theoretical Models of Storage Systems. I gave a talk on the history of STXXL and Thrill, but also wanted to include some current developments. What I found most interesting is how the gap between RAM and disk bandwidth is closing due to the (relatively) new Non-Volatile Memory Express (NVMe) storage devices.

Since I am involved in many projects using external memory, I decided to perform a simple set of fundamental experiments to compare rotational disks and newer solid-state drives (SSDs). The results were interesting enough to warrant this blog article.

Among the tools of STXXL/FOXXLL there are two benchmarks which perform two distinct access patterns: Scan (benchmark_disks) and Random (benchmark_disks_random).

The Scan experiment is probably the fastest access method, as it reads or writes the disk (actually: the storage device) sequentially. The Random experiment is a good way to determine the access latency of the disk, as the device first has to seek to the block and then transfer the data. Notice that the Random experiment performs batched block accesses, as one would in a query/answer system where the next set of random blocks depends on calculations performed with the preceding blocks (as in a B-tree). This differs from the experiment performed by most "throughput" measurement tools, which issue a continuous stream of random block accesses.
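
To make the difference between the two access patterns concrete, here is a minimal C++ sketch. It is not the code of benchmark_disks or benchmark_disks_random. All function names are made up for illustration, and the read_block callback is a hypothetical stand-in for the real I/O call. The sketch only generates the block offsets each pattern would touch and shows where a batched benchmark has to wait.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Scan pattern: byte offsets of block 0, 1, 2, ... in sequential order.
std::vector<std::uint64_t> scan_offsets(std::uint64_t num_blocks,
                                        std::uint64_t block_size) {
    std::vector<std::uint64_t> offsets;
    offsets.reserve(num_blocks);
    for (std::uint64_t i = 0; i < num_blocks; ++i)
        offsets.push_back(i * block_size);
    return offsets;
}

// Random pattern: one batch of block-aligned offsets, drawn uniformly.
std::vector<std::uint64_t> random_batch(std::uint64_t num_blocks,
                                        std::uint64_t block_size,
                                        std::size_t batch_size,
                                        std::mt19937_64& rng) {
    assert(num_blocks > 0);
    std::uniform_int_distribution<std::uint64_t> pick(0, num_blocks - 1);
    std::vector<std::uint64_t> offsets(batch_size);
    for (auto& off : offsets) off = pick(rng) * block_size;
    return offsets;
}

// Batched random access: all requests of one batch are issued together, and
// the next batch is generated only after the current one has been handled.
// `read_block(offset, size)` is a hypothetical callback standing in for I/O.
// Returns the total number of requests issued.
template <typename ReadBlock>
std::size_t run_batched_random(std::uint64_t num_blocks,
                               std::uint64_t block_size,
                               std::size_t batch_size, std::size_t num_batches,
                               std::mt19937_64& rng, ReadBlock read_block) {
    std::size_t requests = 0;
    for (std::size_t b = 0; b < num_batches; ++b) {
        auto batch = random_batch(num_blocks, block_size, batch_size, rng);
        for (std::uint64_t off : batch) {
            read_block(off, block_size);
            ++requests;
        }
        // With asynchronous I/O, a real benchmark would wait here until every
        // request of the batch has completed before starting the next batch.
    }
    return requests;
}
```

The wait between batches is the point of the design: a tool that streams random requests continuously never pays that synchronization cost, so it measures throughput rather than the per-batch latency.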

This blog entry continues on the next page ...


Presentation "Scalable Construction of Text Indexes with Thrill" at IEEE Big Data 2018

Posted on 2018-12-12 16:00 by Timo Bingmann. Tags: talk thrill

Today, I gave a presentation of our paper "Scalable Construction of Text Indexes with Thrill" at the IEEE International Conference on Big Data 2018 in Seattle, WA, USA.

The slides of the presentation at the IEEE conference are available here:
slides-Scalable-Construction-of-Text-Indexes-with-Thrill.pdf.

The full paper is available from this webpage: paper-Scalable-Construction-of-Text-Indexes-with-Thrill.pdf,
or refer to the longer version in my dissertation on scalable suffix array construction.


Abstract

The suffix array is the key to efficient solutions for myriads of string processing problems in different application domains, like data compression, data mining, or bioinformatics. With the rapid growth of available data, suffix array construction algorithms have to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows scalable processing of input sizes on distributed systems in orders of magnitude that have not been considered before.
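
For readers unfamiliar with the data structure: the suffix array of a text lists the start positions of all its suffixes in lexicographic order. The following C++ sketch builds it by plainly sorting the suffixes. This is a naive textbook construction for illustration only, not one of the five algorithms from the paper, and the function names are made up here. It needs far more time than the scalable algorithms on large inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort the start positions of all suffixes of `text`
// by the lexicographic order of the suffixes they denote.
std::vector<std::size_t> naive_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        // compare() is negative if the suffix at a precedes the suffix at b
        return text.compare(a, std::string::npos,
                            text, b, std::string::npos) < 0;
    });
    return sa;
}

// Example: the sorted suffixes of "banana" are
// "a", "ana", "anana", "banana", "na", "nana".
inline void suffix_array_example() {
    const std::vector<std::size_t> expected{5, 3, 1, 0, 4, 2};
    assert(naive_suffix_array("banana") == expected);
}
```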



Presentation "Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++" at IEEE Big Data 2016

Posted on 2016-12-06 16:00 by Timo Bingmann. Tags: talk thrill

Today, I gave a presentation of our paper "Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++" at the IEEE International Conference on Big Data 2016 in Washington D.C., USA. An extended technical report of our paper is also available on this website or on arXiv.

The slides of the presentation at the IEEE conference are available here:
slides-Thrill-High-Performance-Algorithmic-Distributed-Batch-Data-Processing-with-CPP-TalkAsGiven.pdf.

A longer version of the slides is also available for download:
slides-Thrill-High-Performance-Algorithmic-Distributed-Batch-Data-Processing-with-CPP.pdf.
These slides contain additional figures which are useful to understand the DIA operations in Thrill, along with many extra design slides omitted from shorter talks.



Presentation "STXXL and Thrill (Parallel Batch Processing)" at STXXL Workshop in DFG SPP 1736

Posted on 2016-09-21 20:00 by Timo Bingmann. Tags: talk thrill stxxl

Today, I gave a technical presentation comparing STXXL and Project Thrill at the STXXL Workshop organized within the DFG SPP 1736. The main topic of the workshop was to determine the future development course of STXXL, and the biggest question in this regard was how to bring more multi-core parallelization into STXXL. Thrill or an adaptation of its ideas may be the solution to this challenge: 2016-09-21 STXXL and Thrill Slides.pdf.



Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

Posted on 2016-08-20 09:54 by Timo Bingmann. Tags: research c++ thrill

Our technical report on "Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++" is now available on arXiv as 1608.05634 or locally: 1608.05634v1.pdf with source 1608.05634v1.tar.gz (780 KiB).

This report is the first technical documentation about our new distributed computing prototype called Thrill. Thrill is written in modern C++14 and is open source under the BSD-2 license. More information on Thrill is available from the project homepage.

Thrill's source is available from GitHub.


Abstract

We present the design and a first performance evaluation of Thrill -- a prototype of a general purpose big data processing framework with a convenient data-flow style programming interface. Thrill is somewhat similar to Apache Spark and Apache Flink with at least two main differences. First, Thrill is based on C++ which enables performance advantages due to direct native code compilation, a more cache-friendly memory layout, and explicit memory management. In particular, Thrill uses template meta-programming to compile chains of subsequent local operations into a single binary routine without intermediate buffering and with minimal indirections. Second, Thrill uses arrays rather than multisets as its primary data structure which enables additional operations like sorting, prefix sums, window scans, or combining corresponding fields of several arrays (zipping).

We compare Thrill with Apache Spark and Apache Flink using five kernels from the HiBench suite. Thrill is consistently faster and often several times faster than the other frameworks. At the same time, the source codes have a similar level of simplicity and abstraction.
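
The two design points named in the abstract can be illustrated with a small sketch in plain C++14 using only the standard library. This is not Thrill's API. The helpers chain, map_items, prefix_sum and zip are hypothetical names chosen for this example. The first part shows how templates let the compiler merge a chain of local operations into one routine without an intermediate buffer. The second part shows two operations that rely on items having an index, which an array provides and a multiset does not.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

// Compose two local operations at compile time. Both callables are known to
// the compiler, so the chain can be inlined into a single routine.
template <typename F, typename G>
auto chain(F f, G g) {
    return [f, g](auto x) { return g(f(x)); };
}

// Apply a (possibly chained) operation to every item in one pass. No buffer
// is materialized between the steps of the chain.
template <typename T, typename Op>
auto map_items(const std::vector<T>& in, Op op) {
    using R = std::decay_t<decltype(op(std::declval<const T&>()))>;
    std::vector<R> out;
    out.reserve(in.size());
    for (const T& item : in) out.push_back(op(item));
    return out;
}

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
std::vector<long> prefix_sum(const std::vector<long>& in) {
    std::vector<long> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Zip: combine the items at the same index of two equally long arrays.
template <typename A, typename B, typename Combine>
auto zip(const std::vector<A>& a, const std::vector<B>& b, Combine combine) {
    assert(a.size() == b.size());
    using R = std::decay_t<decltype(combine(a.front(), b.front()))>;
    std::vector<R> out;
    out.reserve(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out.push_back(combine(a[i], b[i]));
    return out;
}
```

For the input {1, 2, 3}, chaining "add one" with "double" and mapping it gives {4, 6, 8}, the prefix sum gives {1, 3, 6}, and zipping the input with its prefix sum by multiplication gives {1, 6, 18}. A distributed framework has to do considerably more work for these operations, since the items are spread over many machines.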



Presentation "Massive Suffix Array Construction with Thrill" at DFG SPP 1736 Annual Colloquium

Posted on 2015-10-01 19:40 by Timo Bingmann. Tags: c++ talk thrill

Today, we gave an overview presentation of the vision behind Project Thrill, its current state, and how it will be used to implement suffix and LCP array construction, and many other distributed algorithms: 2015-10-01 Massive Suffix Array Construction with Thrill.pdf.



Presentation of DALKIT (work in progress) in Berlin

Posted on 2015-03-27 21:00 by Timo Bingmann. Tags: c++ talk thrill

Today, I presented our work in progress on a distributed computation platform for Big Data algorithms at the LSDMA All-Hands-Meeting in Berlin. One of the currently proposed names is DALKIT. The talk covers the current state of our student project, which consists mainly of the design of the framework's interface, architecture, and future components.

The slides of the presentation, 2015-03-27 Project DALKIT.pdf, are available online. However, as usual, my slides are very difficult to understand without the audio track. For future "final" version presentations there will probably be more videos.


Copyright 2005-2019 Timo Bingmann - Impressum