Example Application: CSV-File Record Filter

The second example is a CSV-file filter.

It will read a tab-delimited csv data file with column headers and apply the given filter expression to each line. Each line is taken as a record of variables named by the column header. For each line the expression is evaluated and if it returns a "true" boolean value, then the line is copied to the output; if it returns a "false" boolean, the line is skipped. In case the expression value is not a boolean, the value is printed on stderr.

This example program is not supposed to be an "industrial-strength" CSV file parser. It cannot handle escaped delimiters or quoted fields. Beware to use it directly on Excel-generated files.

CSV sample files

The distribution contains three sample CSV files for the csvfilter. Two are taken from MySQL's world sample database, which can be found at http://dev.mysql.com/doc/world-setup/en/world-setup.html . The data contained in the world database is originally Copyright Statistics Finland, http://www.stat.fi/worldinfigures . The third CSV dataset contains some key statistic features extracted from the CIA World Factbook (https://www.cia.gov/library/publications/the-world-factbook/).

The mysql-world-city.csv file starts like this:

ID	Name	CountryCode	District	Population
1	Kabul	AFG	Kabol	1780000
2	Qandahar	AFG	Qandahar	237500
3	Herat	AFG	Herat	186800
4	Mazar-e-Sharif	AFG	Balkh	127800
5	Amsterdam	NLD	Noord-Holland	731200
6	Rotterdam	NLD	Zuid-Holland	593321
7	Haag	NLD	Zuid-Holland	440900
8	Utrecht	NLD	Utrecht	234323
9	Eindhoven	NLD	Noord-Brabant	201843
10	Tilburg	NLD	Noord-Brabant	193238
...

The csvfilter operates on tab (or otherwise) delimited input streams. The first line of the file is taken to be the column header; the header names can be used within the filter expression as variables. For each of the following data rows the expression is evaluated with variables set to the corresponding data fields.

An example run of the csvfilter could be:

This will output the following debug information on stderr:
Expression string: Population > 1000000 && CountryCode = "USA"
Parsed expression: ((Population > 1000000) && (CountryCode = "USA"))
Reading CSV column headers from input
Read 5 column headers.
Processed 4079 lines, copied 9 and skipped 4070 lines

and the filtered CSV lines on stdout:

ID      Name    CountryCode     District        Population
3793    New York        USA     New York        8008278
3794    Los Angeles     USA     California      3694820
3795    Chicago USA     Illinois        2896016
3796    Houston USA     Texas   1953631
3797    Philadelphia    USA     Pennsylvania    1517550
3798    Phoenix USA     Arizona 1321045
3799    San Diego       USA     California      1223400
3800    Dallas  USA     Texas   1188580
3801    San Antonio     USA     Texas   1144646

Detailed Example Code Guide

#include "ExpressionParser.h"

#include <iostream>
#include <string>
#include <vector>
#include <map>

First we include the file ExpressionParser.h, which contains the required classes, and four Standard Template Library files.

// use this as the delimiter. this can be changed to ';' or ',' if needed
const char delimiter = '\t';

The example program uses this constant as the delimiter character. It may be changed if required.

// read one line from instream and split it into tab (or otherwise) delimited
// columns. returns the number of columns read, 0 if eof.
unsigned int read_csvline(std::istream &instream,
                          std::vector<std::string> &columns)
{
    columns.clear();

    // read one line from the input stream
    std::string line;
    if (!std::getline(instream, line, '\n').good()) {
        return 0;
    }

    // parse line into tab separated columns, start with inital column
    columns.push_back("");

    for (std::string::const_iterator si = line.begin();
         si != line.end(); ++si)
    {
        if (*si == delimiter)
            columns.push_back("");
        else // add non-delimiter to last column
            columns.back() += *si;
    }

    return columns.size();
}

As state above this example program is not supposed to be an "industrial-strength" CSV file reader. The above function reads one line for the input file stream and splits it into a std::vector of field strings. No escaped delimiters or quoted fields are recognized.

// subclass stx::BasicSymbolTable and return variable values from the current
// csv row. the variable names are defined by the map containing the column
// header.
class CSVRowSymbolTable : public stx::BasicSymbolTable
{
public:
    // maps the column variable name to the vector index
    const std::map<std::string, unsigned int> &headersmap;

    // refernce to the reused data row vector.
    const std::vector<std::string> &datacolumns;

    CSVRowSymbolTable(const std::map<std::string, unsigned int> &_headersmap,
                      const std::vector<std::string> &_datacolumns)
        : stx::BasicSymbolTable(),
          headersmap(_headersmap),
          datacolumns(_datacolumns)
    {
    }

    virtual stx::AnyScalar lookupVariable(const std::string &varname) const
    {
        // look if the variable name is defined by the CSV file
        std::map<std::string, unsigned int>::const_iterator
            varfind = headersmap.find(varname);

        if (varfind == headersmap.end()) {
            // if not, let BasicSymbolTable check if it knows it
            return stx::BasicSymbolTable::lookupVariable(varname);
        }

        // return the variable value from the current vector. convert it to a
        // stx::AnyScalar but use the automatic type recognition for input
        // strings.
        if(varfind->second < datacolumns.size())
        {
            return stx::AnyScalar().setAutoString( datacolumns[varfind->second] );
        }
        else
        {
            return "";  // happens when a data row has too few delimited
                        // fields.
        }
    }
};

This is the main meat of the example: the CSVRowSymbolTable is subclassed from stx::BasicSymbolTable. This symbol table is used to represent the possible variables in the expression formula. It is constructed from the header row of the CSV file and a vector containing the current data row. However the symbol table object only contains references to the actual vector and map. This way the main program will be able to modify the data row and re-evaluate the parse tree using the new row.

The idea behind this set up is to minimize the number of times the input CSV data fields are copied: the vector containing the string fields is filled once and then only referenced via the symbol table object if the variable is actually used.

Another way to implement the symbol table would be to used an stx::BasicSymbolTable and set the variables using setVariable(). However this would be much slower than the reference symbol table approach shown above.

int main(int argc, char *argv[])
{
    // collect expression by joining all remaining input arguments
    std::string args;
    for(int i = 1; i < argc; i++) {
        if (!args.empty()) args += " ";
        args += argv[i];
    }

    std::cerr << "Expression string: " << args << "\n";

    // parse expression into a parse tree
    stx::ParseTree pt;
    try
    {
        pt = stx::parseExpression(args);
        std::cerr << "Parsed expression: " << pt.toString() << "\n";
    }
    catch (stx::ExpressionParserException &e)
    {
        std::cerr << "ExpressionParserException: " << e.what() << "\n";
        return 0;
    }

As shown in the first example application all command line arguments are collected into a single string. Then the parser in put into action on the expression string and produces the stx::ParseTree. This time the parse tree will be used multiple times.

    // read first line of CSV input as column headers
    std::cerr << "Reading CSV column headers from input\n";
    std::vector<std::string> headers;

    if (read_csvline(std::cin, headers) == 0) {
        std::cerr << "Error read column headers: no input\n";
        return 0;
    }

    std::cerr << "Read " << headers.size() << " column headers.\n";

    // create a header column lookup map for CSVRowSymbolTable and output the
    // column header line to std::cout
    std::map<std::string, unsigned int> headersmap;
    for(unsigned int headnum = 0; headnum < headers.size(); ++headnum)
    {
        headersmap[ headers[headnum] ] = headnum;

        if (headnum != 0) std::cout << delimiter;
        std::cout << headers[headnum];
    }
    std::cout << "\n";

The csvfilter reads the CSV file from the stdin stream. The first line is read and saved. It is considered to contain the column headers. This header row is inserted into the header map for faster lookup by the symbol table. The header row is also outputted to std::cout.

The following loop then iterates over the data rows read from the CSV input. The symbol table is constructed only once and for each row the referenced vector "datacolumns" is refilled. Using this symbol table the parse tree is re-evaluated for each data row.

The evaluation result is checked for a boolean type. In this case a filter expression was given and the row is either copied to std::cout or skipped depending on the filter's result.

If the expression returned a non-boolean type, it is taken to be some calculation and the result is printed on std::cerr.

    // iterate over the data lines of the CSV input
    unsigned int linesprocessed = 0, linesskipped = 0;
    std::vector<std::string> datacolumns;
    CSVRowSymbolTable csvsymboltable(headersmap, datacolumns);

    while( read_csvline(std::cin, datacolumns) > 0 )
    {
        // evaluate the expression for each row using the headers/datacolumns
        // as variables
        try
        {
            linesprocessed++;
            stx::AnyScalar val = pt.evaluate( csvsymboltable );

            if (val.isBooleanType())
            {
                if (!val.getBoolean()) {
                    linesskipped++;
                    continue;
                }
            }
            else {
                std::cerr << "evaluated: " << val << "\n";
            }

            // output this data row to std::cout
            for(std::vector<std::string>::const_iterator
                    coliter = datacolumns.begin();
                coliter != datacolumns.end(); ++coliter)
            {
                if (coliter != datacolumns.begin()) std::cout << delimiter;
                std::cout << *coliter;
            }
            std::cout << "\n";
        }
        catch (stx::UnknownSymbolException &e)
        {
            std::cerr << "evaluated: UnknownSymbolException: " << e.what() << "\n";
        }
        catch (stx::ExpressionParserException &e)
        {
            std::cerr << "evaluated: ExpressionParserException: " << e.what() << "\n";
        }
    }
    std::cerr << "Processed " << linesprocessed << " lines, "
              << "copied " << (linesprocessed - linesskipped) << " and "
              << "skipped " << linesskipped << " lines" << "\n";
}

Complete Example Source Code

The example can be found in the distribution in examples/csvfilter/. That directory also contains three CSV files to test the csvfilter: mysql-world-city.csv, mysql-world-country.csv and cia-world-factbook.csv

// $Id: csvfilter.cc 59 2007-07-17 14:43:23Z tb $

/*
 * STX Expression Parser C++ Framework v0.7
 * Copyright (C) 2007 Timo Bingmann
 *
 * This library is free software; you can redistribute it and/or modify it
 * under the terms of the GNU Lesser General Public License as published by the
 * Free Software Foundation; either version 2.1 of the License, or (at your
 * option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
 * for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this library; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

// CSV Parser and Filter using the Expression Parser
 
#include "ExpressionParser.h"

#include <iostream>
#include <string>
#include <vector>
#include <map>

// use this as the delimiter. this can be changed to ';' or ',' if needed
const char delimiter = '\t';

// read one line from instream and split it into tab (or otherwise) delimited
// columns. returns the number of columns read, 0 if eof.
unsigned int read_csvline(std::istream &instream,
                          std::vector<std::string> &columns)
{
    columns.clear();

    // read one line from the input stream
    std::string line;
    if (!std::getline(instream, line, '\n').good()) {
        return 0;
    }

    // parse line into tab separated columns, start with inital column
    columns.push_back("");

    for (std::string::const_iterator si = line.begin();
         si != line.end(); ++si)
    {
        if (*si == delimiter)
            columns.push_back("");
        else // add non-delimiter to last column
            columns.back() += *si;
    }

    return columns.size();
}

// subclass stx::BasicSymbolTable and return variable values from the current
// csv row. the variable names are defined by the map containing the column
// header.
class CSVRowSymbolTable : public stx::BasicSymbolTable
{
public:
    // maps the column variable name to the vector index
    const std::map<std::string, unsigned int> &headersmap;

    // refernce to the reused data row vector.
    const std::vector<std::string> &datacolumns;

    CSVRowSymbolTable(const std::map<std::string, unsigned int> &_headersmap,
                      const std::vector<std::string> &_datacolumns)
        : stx::BasicSymbolTable(),
          headersmap(_headersmap),
          datacolumns(_datacolumns)
    {
    }

    virtual stx::AnyScalar lookupVariable(const std::string &varname) const
    {
        // look if the variable name is defined by the CSV file
        std::map<std::string, unsigned int>::const_iterator
            varfind = headersmap.find(varname);

        if (varfind == headersmap.end()) {
            // if not, let BasicSymbolTable check if it knows it
            return stx::BasicSymbolTable::lookupVariable(varname);
        }

        // return the variable value from the current vector. convert it to a
        // stx::AnyScalar but use the automatic type recognition for input
        // strings.
        if(varfind->second < datacolumns.size())
        {
            return stx::AnyScalar().setAutoString( datacolumns[varfind->second] );
        }
        else
        {
            return "";  // happens when a data row has too few delimited
                        // fields.
        }
    }
};

int main(int argc, char *argv[])
{
    // collect expression by joining all remaining input arguments
    std::string args;
    for(int i = 1; i < argc; i++) {
        if (!args.empty()) args += " ";
        args += argv[i];
    }

    std::cerr << "Expression string: " << args << "\n";

    // parse expression into a parse tree
    stx::ParseTree pt;
    try
    {
        pt = stx::parseExpression(args);
        std::cerr << "Parsed expression: " << pt.toString() << "\n";
    }
    catch (stx::ExpressionParserException &e)
    {
        std::cerr << "ExpressionParserException: " << e.what() << "\n";
        return 0;
    }

    // read first line of CSV input as column headers
    std::cerr << "Reading CSV column headers from input\n";
    std::vector<std::string> headers;

    if (read_csvline(std::cin, headers) == 0) {
        std::cerr << "Error read column headers: no input\n";
        return 0;
    }

    std::cerr << "Read " << headers.size() << " column headers.\n";

    // create a header column lookup map for CSVRowSymbolTable and output the
    // column header line to std::cout
    std::map<std::string, unsigned int> headersmap;
    for(unsigned int headnum = 0; headnum < headers.size(); ++headnum)
    {
        headersmap[ headers[headnum] ] = headnum;

        if (headnum != 0) std::cout << delimiter;
        std::cout << headers[headnum];
    }
    std::cout << "\n";

    // iterate over the data lines of the CSV input
    unsigned int linesprocessed = 0, linesskipped = 0;
    std::vector<std::string> datacolumns;
    CSVRowSymbolTable csvsymboltable(headersmap, datacolumns);

    while( read_csvline(std::cin, datacolumns) > 0 )
    {
        // evaluate the expression for each row using the headers/datacolumns
        // as variables
        try
        {
            linesprocessed++;
            stx::AnyScalar val = pt.evaluate( csvsymboltable );

            if (val.isBooleanType())
            {
                if (!val.getBoolean()) {
                    linesskipped++;
                    continue;
                }
            }
            else {
                std::cerr << "evaluated: " << val << "\n";
            }

            // output this data row to std::cout
            for(std::vector<std::string>::const_iterator
                    coliter = datacolumns.begin();
                coliter != datacolumns.end(); ++coliter)
            {
                if (coliter != datacolumns.begin()) std::cout << delimiter;
                std::cout << *coliter;
            }
            std::cout << "\n";
        }
        catch (stx::UnknownSymbolException &e)
        {
            std::cerr << "evaluated: UnknownSymbolException: " << e.what() << "\n";
        }
        catch (stx::ExpressionParserException &e)
        {
            std::cerr << "evaluated: ExpressionParserException: " << e.what() << "\n";
        }
    }
    std::cerr << "Processed " << linesprocessed << " lines, "
              << "copied " << (linesprocessed - linesskipped) << " and "
              << "skipped " << linesskipped << " lines" << "\n";
}

Generated on Tue Jul 17 16:51:58 2007 for STX Expression Parser by  doxygen 1.5.2