Welcome to parquet-writer’s documentation!

parquet-writer is a C++ library that allows for easily writing Parquet files containing columns of (mostly) whatever type and structure you wish.

View me on GitHub

GitHub Actions Status: CI GitHub Actions Status: check-format Documentation Status

Easily declare and write Parquet files

The idea is for parquet-writer to make it simple to both specify the desired layout of a Parquet file (i.e. the number and structure of data columns) and to subsequently write your data to that file.

In summary, parquet-writer provides support for:

  • Specifying the layout of Parquet files using JSON

  • Storing numeric and boolean data types to output Parquet files

  • Storing struct objects (think: C/C++ structs) having any number of arbitrarily-typed fields

  • Storing 1, 2, and 3 dimensional lists of the supported data types

  • A simple interface for writing the supported data types to Parquet files

The Basics

parquet-writer provides users with the parquetwriter::Writer class, which they provide with a JSON object specifying the desired “layout” of their Parquet files and then fill accordingly.

An example JSON layout, stored in the file layout.json, could be:

{
    "fields": [
        {"name": "foo", "type": "float"},
        {"name": "bar", "type": "uint32"},
        {"name": "baz", "type": "list1d", "contains": {"type": "float"}}
    ]
}

The above describes an output Parquet file containing three data columns named foo, bar, and baz which contain data of types float (32-bit precision float), uint32 (32-bit unsigned integer), and list[float] (variable-lengthed 1-dimensional list of elements of type float), respectively.

The basics of initializing a parquetwriter::Writer instance with the above layout, writing some values to a single row, and storing the output is below:

#include "parquet_writer.h"

namespace pw = parquetwriter;
pw::Writer writer;
std::ifstream layout_file("layout.json"); // file containing JSON layout spec
writer.set_layout(layout_file);
writer.set_dataset("my_dataset"); // must give a name to the output
writer.initialize();

// generate some data for each of the columns
float foo_data{42.0};
uint32_t bar_data{42};
std::vector<float> baz_data{42.0, 42.1, 42.2, 42.3};

// call "fill" for each of the columns, giving the associated data
writer.fill("foo", foo_data);
writer.fill("bar", bar_data);
writer.fill("baz", baz_data);

// signal the end of a row
writer.end_row();

// call "finish" when done writing to the file
writer.finish();

The above would generate an output file called my_dataset.parquet. We can use parquet-tools to quickly dump the contents of the Parquet file:

$ parquet-tools show my_dataset.parquet
+------+------+--------------------------+
| foo  | bar  | baz                      |
|------+------+--------------------------|
| 42.0 | 42   | [42.0, 42.1, 42.2, 42.3] |
+------+------+--------------------------+

Indices and tables