Raku Data::Reshapers

This Raku package has data reshaping functions for different data structures that are coercible to full arrays.

The supported data structures are:

  • Positional-of-hashes

  • Positional-of-arrays

The five data reshaping operations provided by the package over those data structures are:

  • Cross tabulation: cross-tabulate

  • Long format conversion: to-long-format

  • Wide format conversion: to-wide-format

  • Joining: join-across

  • Transposing: transpose

The first four operations are fundamental in data wrangling and data analysis; see [AA1, Wk1, Wk2, AAv1-AAv2].

(Transposing of tabular data is, of course, also fundamental, but it also can be seen as a basic functional programming operation.)
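Besides the operations demonstrated in the usage examples below, the package provides a join-across function for joining two datasets over key columns (see the TODO section). Here is a minimal, hedged sketch; the exact signature and supported named arguments may differ, so consult the package documentation:

```raku
use Data::Reshapers;

my @left  = ({id => 1, name => 'a'}, {id => 2, name => 'b'});
my @right = ({id => 1, score => 10}, {id => 3, score => 30});

# Join the two datasets over the column "id".
# (The call shape here is an assumption; check the function's signature
#  for the join type and key-to-key pair options.)
my @joined = join-across(@left, @right, 'id');
.say for @joined;
```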


Usage examples

Cross tabulation

Making contingency tables -- or cross tabulation -- is a fundamental statistics and data analysis operation, [Wk1, AA1].

Here is an example using the Titanic dataset (provided by this package through the function get-titanic-dataset):

use Data::Reshapers;

my @tbl = get-titanic-dataset();
my $res = cross-tabulate( @tbl, 'passengerSex', 'passengerClass');
say $res;
# {female => {1st => 144, 2nd => 106, 3rd => 216}, male => {1st => 179, 2nd => 171, 3rd => 493}}
to-pretty-table($res);
# +--------+-----+-----+-----+
# |        | 3rd | 1st | 2nd |
# +--------+-----+-----+-----+
# | female | 216 | 144 | 106 |
# | male   | 493 | 179 | 171 |
# +--------+-----+-----+-----+
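cross-tabulate can also take a third column name; instead of counting records, the values of that column are summed per group, mirroring the CrossTabulate function referenced below. The call shape here is a sketch:

```raku
use Data::Reshapers;

my @tbl = get-titanic-dataset();

# Coerce the age column to numbers so the values can be summed.
@tbl = @tbl.map({ $_<passengerAge> = $_<passengerAge>.Numeric; $_ }).Array;

# Sum passenger ages per sex-class group instead of counting records.
# (The three-column-name form is an assumption; verify against the docs.)
my $resAge = cross-tabulate(@tbl, 'passengerSex', 'passengerClass', 'passengerAge');
say to-pretty-table($resAge);
```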

Long format

Conversion to long format allows column names to be treated as data.

(More precisely, when converting to long format the specified column names of a tabular dataset become values in a dedicated column, e.g. "Variable".)

my @tbl1 = @tbl.roll(3);
.say for @tbl1;
# {id => 822, passengerAge => 30, passengerClass => 3rd, passengerSex => male, passengerSurvival => died}
# {id => 684, passengerAge => 40, passengerClass => 3rd, passengerSex => male, passengerSurvival => died}
# {id => 1243, passengerAge => -1, passengerClass => 3rd, passengerSex => male, passengerSurvival => died}
.say for to-long-format( @tbl1 );
# {AutomaticKey => 0, Value => died, Variable => passengerSurvival}
# {AutomaticKey => 0, Value => 3rd, Variable => passengerClass}
# {AutomaticKey => 0, Value => male, Variable => passengerSex}
# {AutomaticKey => 0, Value => 30, Variable => passengerAge}
# {AutomaticKey => 0, Value => 822, Variable => id}
# {AutomaticKey => 1, Value => died, Variable => passengerSurvival}
# {AutomaticKey => 1, Value => 3rd, Variable => passengerClass}
# {AutomaticKey => 1, Value => male, Variable => passengerSex}
# {AutomaticKey => 1, Value => 40, Variable => passengerAge}
# {AutomaticKey => 1, Value => 684, Variable => id}
# {AutomaticKey => 2, Value => died, Variable => passengerSurvival}
# {AutomaticKey => 2, Value => 3rd, Variable => passengerClass}
# {AutomaticKey => 2, Value => male, Variable => passengerSex}
# {AutomaticKey => 2, Value => -1, Variable => passengerAge}
# {AutomaticKey => 2, Value => 1243, Variable => id}
my @lfRes1 = to-long-format( @tbl1, 'id', [], variablesTo => "VAR", valuesTo => "VAL2" );
.say for @lfRes1;
# {VAL2 => male, VAR => passengerSex, id => 1243}
# {VAL2 => -1, VAR => passengerAge, id => 1243}
# {VAL2 => died, VAR => passengerSurvival, id => 1243}
# {VAL2 => 3rd, VAR => passengerClass, id => 1243}
# {VAL2 => male, VAR => passengerSex, id => 684}
# {VAL2 => 40, VAR => passengerAge, id => 684}
# {VAL2 => died, VAR => passengerSurvival, id => 684}
# {VAL2 => 3rd, VAR => passengerClass, id => 684}
# {VAL2 => male, VAR => passengerSex, id => 822}
# {VAL2 => 30, VAR => passengerAge, id => 822}
# {VAL2 => died, VAR => passengerSurvival, id => 822}
# {VAL2 => 3rd, VAR => passengerClass, id => 822}

Wide format

Here we transform the long format result @lfRes1 above into wide format -- the result has the same records as @tbl1:

to-pretty-table( to-wide-format( @lfRes1, 'id', 'VAR', 'VAL2' ) );
# +----------------+-------------------+--------------+------+--------------+
# | passengerClass | passengerSurvival | passengerSex |  id  | passengerAge |
# +----------------+-------------------+--------------+------+--------------+
# |      3rd       |        died       |     male     | 1243 |      -1      |
# |      3rd       |        died       |     male     | 684  |      40      |
# |      3rd       |        died       |     male     | 822  |      30      |
# +----------------+-------------------+--------------+------+--------------+

Transpose

Using the cross tabulation result above:

my $tres = transpose( $res );

to-pretty-table($res, title => "Original");
# +--------------------------+
# |         Original         |
# +--------+-----+-----+-----+
# |        | 2nd | 1st | 3rd |
# +--------+-----+-----+-----+
# | female | 106 | 144 | 216 |
# | male   | 171 | 179 | 493 |
# +--------+-----+-----+-----+
to-pretty-table($tres, title => "Transposed");
# +---------------------+
# |      Transposed     |
# +-----+--------+------+
# |     | female | male |
# +-----+--------+------+
# | 1st |  144   | 179  |
# | 2nd |  106   | 171  |
# | 3rd |  216   | 493  |
# +-----+--------+------+
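Transposing a hash-of-hashes twice should recover the original structure; a quick sanity check along those lines:

```raku
use Data::Reshapers;

my @tbl = get-titanic-dataset();
my $res = cross-tabulate(@tbl, 'passengerSex', 'passengerClass');

# Transposing twice is expected to give back the original hash-of-hashes.
say transpose(transpose($res)) eqv $res;
```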

Type system

There is a type "deduction" system in place. The type system conventions follow those of Mathematica's Dataset -- see the presentation "Dataset improvements".

Here we get the Titanic dataset, change the "passengerAge" column values to be numeric, and show the dataset's dimensions:

my @dsTitanic = get-titanic-dataset(headers => 'auto');
@dsTitanic = @dsTitanic.map({$_<passengerAge> = $_<passengerAge>.Numeric; $_}).Array;
dimensions(@dsTitanic)
# (1309 5)

Here is a sample of the dataset's records:

to-pretty-table(@dsTitanic.pick(5), field-names => <id passengerAge passengerClass passengerSex passengerSurvival>)
# +------+--------------+----------------+--------------+-------------------+
# |  id  | passengerAge | passengerClass | passengerSex | passengerSurvival |
# +------+--------------+----------------+--------------+-------------------+
# | 1305 |      10      |      3rd       |    female    |        died       |
# | 684  |      40      |      3rd       |     male     |        died       |
# | 721  |      20      |      3rd       |     male     |        died       |
# |  40  |      50      |      1st       |     male     |        died       |
# | 399  |      10      |      2nd       |     male     |      survived     |
# +------+--------------+----------------+--------------+-------------------+

Here is the type of a single record:

deduce-type(@dsTitanic[12])
# Struct([id, passengerAge, passengerClass, passengerSex, passengerSurvival], [Str, Int, Str, Str, Str])

Here is the type of a single record's values:

deduce-type(@dsTitanic[12].values.List)
# Tuple([Atom((Str)), Atom((Str)), Atom((Str)), Atom((Str)), Atom((Int))])

Here is the type of the whole dataset:

deduce-type(@dsTitanic)
# Vector(Struct([id, passengerAge, passengerClass, passengerSex, passengerSurvival], [Str, Int, Str, Str, Str]), 1309)
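deduce-type can also be applied to plain lists. The calls below are sketches; whether a plain list is detected as a homogeneous vector may depend on the package version (cf. the "Homogeneous list detection" entry in the TODO section), so no expected output is shown:

```raku
use Data::Reshapers;

# A homogeneous list of integers -- plausibly detected as a vector of Int atoms.
say deduce-type((1, 2, 3));

# A mixed list -- plausibly reported as a tuple of per-element types.
say deduce-type((1, 'a', 2.5));
```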

TODO

  1. Simpler, more convenient interface.

    • Currently, a user has to specify four different namespaces in order to be able to use all package functions.
  2. More extensive long format tests.

  3. More extensive wide format tests.

  4. Implement verifications for

    • Positional-of-hashes

    • Positional-of-arrays

    • Positional-of-key-to-array-pairs

    • Positional-of-hashes, each record of which has:

      • Same keys
      • Same type of values of corresponding keys
    • Positional-of-arrays, each record of which has:

      • Same length
      • Same type of values of corresponding elements
  5. Implement "nice tabular visualization" using Pretty::Table and/or Text::Table::Simple.

  6. Document examples using pretty tables.

  7. Implement transposing operation for:

    • hash of hashes
    • hash of arrays
    • array of hashes
    • array of arrays
    • array of key-to-array pairs
  8. Implement to-pretty-table for:

    • hash of hashes
    • hash of arrays
    • array of hashes
    • array of arrays
    • array of key-to-array pairs
  9. Implement join-across:

    • inner, left, right, outer
    • single key-to-key pair
    • multiple key-to-key pairs
    • optional fill-in of missing values
    • handling collisions
  10. Implement to long format conversion for:

    • hash of hashes
    • hash of arrays
  11. Speed/performance profiling.

    • Come up with profiling tests
    • Comparison with R
    • Comparison with Python
  12. Type system.

    • Base type (Int, Str, Numeric)
    • Homogeneous list detection
    • Association detection
    • Struct discovery
    • Enumeration detection
    • Dataset detection
      • List of hashes
      • Hash of hashes
      • List of lists
  13. "Simple" or fundamental functions

    • flatten
    • take-drop
    • tally
      • Currently in "Data::Summarizers".

References

Articles

[AA1] Anton Antonov, "Contingency tables creation examples", (2016), MathematicaForPrediction at WordPress.

[Wk1] Wikipedia entry, Contingency table.

[Wk2] Wikipedia entry, Wide and narrow data.

Functions, repositories

[AAf1] Anton Antonov, CrossTabulate, (2019), Wolfram Function Repository.

[AAf2] Anton Antonov, LongFormDataset, (2020), Wolfram Function Repository.

[AAf3] Anton Antonov, WideFormDataset, (2021), Wolfram Function Repository.

[AAf4] Anton Antonov, RecordsSummary, (2019), Wolfram Function Repository.

Videos

[AAv1] Anton Antonov, "Multi-language Data-Wrangling Conversational Agent", (2020), YouTube channel of Wolfram Research, Inc. (Wolfram Technology Conference 2020 presentation.)

[AAv2] Anton Antonov, "Data Transformation Workflows with Anton Antonov, Session #1", (2020), YouTube channel of Wolfram Research, Inc.

[AAv3] Anton Antonov, "Data Transformation Workflows with Anton Antonov, Session #2", (2020), YouTube channel of Wolfram Research, Inc.