DataFrame

The class DataFrame is an in-memory data storage providing essential operations for data processing, filtering and analysis. A DataFrame is composed of a set of columns with each column having a name and a type. The type must be a plain data type, i.e. no reference or cv-qualified type. Custom data types however are supported if they are default, move and copy constructible.

using Column = dacr::Column<"name", type>;

using DataFrame = dacr::DataFrame<
    Column1,
    Column2,
    ...
    ColumnN
>;

The individual elements of a column are stored consecutively in memory using a std::vector as underlying storage container. All columns are guaranteed to have the same size with all column values of some index forming a row in the DataFrame.

Given the data layout, the DataFrame represents an in-memory column-store database offering the following API known from database systems:

  • select columns by name
  • create a new column with values based on the existing column values
  • filter rows by user-provided conditions
  • join two dataframes by common columns
  • sort rows by multiple columns
  • aggregate colummns with group-by semantics
  • printing the contents in a tabular representation

The subsequent sections explain the API of the DataFrame in detail.

NOTE: Due to the early development state of the library, the DataFrame is not yet memory efficient in the sense that data copy operations are minimized. Instead all operations of the DataFrame perform copies of the data. Future versions will optimize the data access to reduce data copies.

Construction

A default constructor is provided with any special functionality.

NOTE: More advanced constructors, e.g. providing initial column values, will be provided in future releases.

using DataFrame = dacr::DataFrame<dacr::Column<"a", int>>;

DataFrame df {};

Insertion

Column data is inserted into the DataFrame by either using insert or insert_ranges.

Row-Wise Insertion

template <typename ...T>
void insert (T&& ...values);

The function insert adds a single new row to the DataFrame accepting one compatible value for each column in column-definition-order.

using DataFrameInsert = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", double>,
    dacr::Column<"c", std::string>
>;

DataFrameInsert df{};

df.insert(10, 20.0, "abc");

Column-Wise Insertion

template <RangeWithSize ...Ranges>
std::size_t insert_ranges (Ranges&& ...ranges);

The function insert_ranges adds multiple values for each column at once. It accepts ranges for each column with one range for each column in column-definition order. Due to the invariant of the DataFrame that all columns must have an equal size, this function inserts only as many elements as are provided by the smallest range. The number of elements inserted is returned.

std::vector<int> range_a {1, 2, 3};
std::set<double> range_b {3.0, 4.0, 5.0};
std::array<std::string, 3> range_c {"a", "b"};

df.insert_ranges(range_a, range_b, range_c); // only two elements are inserted from each range

Column Filter

template <FixedString ...SelectNames>
NewDataFrame select ();

The function select filters out complete columns from the DataFrame returning a new DataFrame with only the selected columns as data. This operation currently performs a deep copy of the selected column data.

using DataFrameSelect = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", double>
>;

DataFrameSelect df{};

auto df_select = df.select<"a">();
// decltype(df_select) == DataFrame<Column<"a", int>>

Row Filter

template <typename SelectNames = SelectAll, typename Func>
DataFrame query (Func&& query_function);

The function query filters rows by a user-defined lambda function. The argument passed to the lambda function is a NamedTuple with the field names and types corresponding to the column names and types of the DataFrame. The lambda function is expected to return a bool value indicating if a row shall be kept (true) or filtered out (false).

using DataFrameQuery = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", double>
>;

DataFrameQuery df{};

auto df_query = df.query([](dacr_param) {
    return dacr_value("a") > dacr_value("b");
});

Query Lambda Function

In plain C++, the lambda function with access to the NamedTuple would look like this:

[](const auto& data) {
    return data.template get<"bool_column">();
}

The DataFrame API however provides two macros as syntactic sugar

[](dacr_param) {
    return dacr_value("bool_column");
}

which correspond to the plain C++ version above.

Column Selection for Query

If the query function should be executed on a reduced list of columns only, it is possible for slight performance improvements to use an additonal dacr::Select. In this case, the NamedTuple passed to the lambda function only contains the fields from the dacr::Select list.

auto df_query = df.query<dacr::Select<"a">>([](dacr_param) {
    return dacr_value("a") > 10; // access to dacr_value("b") is invalid
});

Column Extension

template <FixedString NewColumnName, typename SelectNames = SelectAll, typename Func>
NewDataFrame apply (Func&& apply_function);

The function apply creates a new column with name NewColumnName by invoking the passed lambda function for each row. The type for the new column is deduced from the return value of the passed function. The function returns a new DataFrame with an added column dacr::Column<NewColumnname, DeducedColumnType>. The data from the already existing columns is copied to the new instance.

NOTE: Future versions of this API will provide move-semantics for performance improvement.

using DataFrameApply = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", double>
>;

DataFrameApply df{};

// new column type is: double
auto df_apply = df.apply<"c">([](dacr_param) {
    return dacr_value("a") * dacr_value("b");
});

Column Selection for Apply

If the apply function should be executed on a reduced list of columns only, it is possible for slight performance improvements to use an additonal dacr::Select. In this case, the NamedTuple passed to the lambda function only contains the fields from the dacr::Select list.

auto df_apply = df.apply<dacr::Select<"a">>([](dacr_param) {
    return dacr_value("a") > 10; // access to dacr_value("b") is invalid
});

Join Operation

template <Join JoinType, FixedString ...JoinNames, typename OtherDataFrame>
NewDataFrame join (const OtherDataFrame& otherDataFrame);

The function join merges two DataFrames together by a set of common columns using the to be specified JoinType. The common columns must be identical in name and type. The result is a new DataFrame consisting of the common columns and the union of the remaining columns of both DataFrames. The union of the remaining columns must be unique.

This API currently supports the following JoinTypes:

JoinType Description
Inner An inner join taking only rows where the common columns match by equality. Other columns get dropped.
using DataFrameJoin1 = dacr::DataFrame<
    dacr::Column<"id1", int>,
    dacr::Column<"id2", char>,
    dacr::Column<"value_left", double>
>;

using DataFrameJoin2 = dacr::DataFrame<
    dacr::Column<"id1", int>,
    dacr::Column<"id2", char>,
    dacr::Column<"value_right", std::string>
>;

DataFrameJoin1 df1{};
DataFrameJoin2 df2{};

auto df_joined = df1.join<dacr::Join::Inner, "id1", "id2">(df2);
// decltype(df_joined) == dacr::DataFrame<
//     dacr::Column<"id1", int>,
//     dacr::Column<"id2", char>,
//     dacr::Column<"value_left", double>,
//     dacr::Column<"value_right", std::string>
// >

Aggregation

template <typename GroupBySpec, typename ...Operations>
NewDataFrame summarize ();

The function summarize performs an aggregation of columns by applying a set pre-defined operations with an optional group-by semantic.

The GroupBySpec is either dacr::GroupByNone or dacr::GroupBy<ColumnNames>. If dacr::GroupByNone is specified, the aggregation is performed for all values of a column. If dacr::GroupBy is used, the aggregation is performed for each distinct set of column values as identifier by ColumnNames.

The Operation has the general syntax:

using Operation = dacr::OpName<ColumnName, AggregationColumName>;

The aggregated value for column ColumnName is stored in column AggregationColumnName.

The provided operations (OpNames) are:

OpName Applicable To Result Type Description
Sum Types with operator+ Column Type Compute the summation of all column values.
Min Types with operator< Column Type Determine the minimal column value.
Max Types with operator> Column Type Determine the maximal column value.
Avg Arithmetic Types double Compute the average of all column values.
StdDev Arithmetic Types double Compute the standard deviation of all column values. This operation currently stores all column values (higher memory usage) to compute firstly the average and then, secondly, the standard deviation.
CountIf Boolean Types int Count the true elements of a boolean column.
CountIfNot Boolean Types int Count the false elements of a boolean column.
using DataFrameSummarize = dacr::DataFrame<
    dacr::Column<"country", std::string>,
    dacr::Column<"continent", std::string>,
    dacr::Column<"age", int>,
    dacr::Column<"weight", int>,
    dacr::Column<"female", bool>
>;

DataFrameSummarize df{};
auto df_summarize = df.summarize<
    dacr::GroupBy<"country", "continent">,
    dacr::Avg<"age", "age_avg">,
    dacr::Max<"weight", "weight_max">,
    dacr::CountIf<"female", "number_women">
>;
// decltype(df_summarize) == dacr::DataFrame<
//     dacr::Column<"country", std::string>,
//     dacr::Column<"continent", std::string>,
//     dacr::Column<"age_avg", double>,
//     dacr::Column<"weight_max", int>,
//     dacr::Column<"number_women", int> 
// >;

Sorting

template <SortOrder Order, FixedString ...SortByNames>
DataFrame sort ();

The function sortBy sorts the DataFrame row-wise by multiple columns. The comparison between rows is performed by operator<. The precendence of comparisons between the columns is defined by the sequence of SortByNames, meaning:

  • check if rows are sorted by SortByName1
  • check if rows are sorted by SortByName2
  • check if rows are sorted by SortBynameN

The SortOrder is either Ascending or Descending.

using DataFrameSort = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", double>,
    dacr::Column<"c", std::string>
>;

DataFrameSort df{};
auto df_sorted = df.sort<dacr::SortOrder::Ascending, "a", "b">();

Appending

DataFrame append (const DataFrame& other);

The function append adds the rows of a DataFrame of the same type.

using DataFrameAppend = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", double>
>;

DataFrameAppend df1{}, df2{};
auto df_append = df1.append(df2);

Printing

void print (const dacr::PrintOptions& print_options, std::ostream& stream = std::cout);

The function print dumps the content of the DataFrame to a stream. It requires the operator<< to be implemented for custom types. It accepts two parameters as input: a set of PrintOptions to customize the output and the stream to dump to. The PrintOptions contain the following configuration options:

Option Default Description
int64_width 10 The maximum width for large integer types (at least 64-bit).
fixedpoint_precision 2 The precision of floating point types.
fixedpoint_width 10 The width of floating point types.
custom_width 10 The maximum width of custom types.
string_width 10 The maximum width of string-based types.
max_rows all The maximum number of rows to display.
using DataFramePrint = dacr::DataFrame<
    dacr::Column<"a", int>,
    dacr::Column<"b", std::string>
>;
data
DataFramePrint df{};
df.print({
    .string_width = 20,
});

Important: the print options must be specified in the order of the table if multiple parameters are set at once. This is due to the language rules for aggregate initialization of structs.

An additional dacr::Select may be specified to print selected columns only.

df.print<dacr::Select<"a">>();

Raw Data Access

Size of DataFrame

std::size_t getSize () const;

The function getSize returns the number of rows in the DataFrame.

Column Access

template <FixedString ColumnName>
std::vector<Type>& getColumn ();

The function getColumn returns an instance-qualified std::vector reference to the column data.

NOTE: This will later be changed into a DataSeries type in later versions once the DataSeries type is introduced.

Function Parameters

The DataFrame class may be passed into functions via template functions:

template <typename ... Columns>
void function (const DataFrame<Columns...>& df) {
}

The disadvantage of this approach is that the contract between caller and callee is implicit by how the function accesses the DataFrame. In future, a mode shall be supported that allows an easier way to express the contract via function arguments, e.g. by:

```cpp // NOT YET SUPPORTED using DataFrameInput = DataFrame< Column<”a”, int>, Column<”b”, float>, Column<”c”, double>

;

void callee (const DataFrame<Column<”a”, int»& df) { }

void caller () { callee(DataFrameInput{}); }