ML::AssociationRuleLearning

zef:antononcube

Raku ML::AssociationRuleLearning

This repository has the code of a Raku package for Association Rule Learning (ARL) functions, [Wk1].

The ARL framework includes the algorithms Apriori and Eclat, and the measures confidence, lift, conviction, and others.

For a computational introduction to using ARL (in Mathematica), see the article "Movie genre associations", [AA1].

The examples below use the packages "Data::Generators", "Data::Reshapers", and "Data::Summarizers", described in the article "Introduction to data wrangling with Raku", [AA2].


Installation

Via zef-ecosystem:

zef install ML::AssociationRuleLearning

From GitHub:

zef install https://github.com/antononcube/Raku-ML-AssociationRuleLearning

Frequent sets finding

Here we get the Titanic dataset (from "Data::Reshapers") and summarize it:

use Data::Reshapers;
use Data::Summarizers;
my @dsTitanic = get-titanic-dataset();
records-summary(@dsTitanic);
# +-----------------+-------------------+----------------+---------------+----------------+
# | id              | passengerSurvival | passengerClass | passengerSex  | passengerAge   |
# +-----------------+-------------------+----------------+---------------+----------------+
# | 972     => 1    | died     => 809   | 3rd => 709     | male   => 843 | 20      => 334 |
# | 546     => 1    | survived => 500   | 1st => 323     | female => 466 | -1      => 263 |
# | 896     => 1    |                   | 2nd => 277     |               | 30      => 258 |
# | 512     => 1    |                   |                |               | 40      => 190 |
# | 802     => 1    |                   |                |               | 50      => 88  |
# | 47      => 1    |                   |                |               | 60      => 57  |
# | 1227    => 1    |                   |                |               | 0       => 56  |
# | (Other) => 1302 |                   |                |               | (Other) => 63  |
# +-----------------+-------------------+----------------+---------------+----------------+

Problem: Find all combinations of values of the variables "passengerAge", "passengerClass", "passengerSex", and "passengerSurvival" that appear at least 200 times in the Titanic dataset.

Here is how we use the function frequent-sets to give an answer:

use ML::AssociationRuleLearning;
my @freqSets = frequent-sets(@dsTitanic, min-support => 200, min-number-of-items => 2, max-number-of-items => Inf):counts;
@freqSets.elems
# 11

The function frequent-sets returns the frequent sets together with their support; with the adverb :counts, as used above, the support is reported as an absolute count.

Here we tabulate the result:

say to-pretty-table(@freqSets.map({ %( Frequent-set => $_.key.join(' '), Count => $_.value) }), align => 'l');
# +-------+-------------------------------------------------------------+
# | Count | Frequent-set                                                |
# +-------+-------------------------------------------------------------+
# | 208   | passengerAge:-1 passengerClass:3rd                          |
# | 206   | passengerAge:20 passengerClass:3rd                          |
# | 207   | passengerAge:20 passengerSex:male                           |
# | 207   | passengerAge:20 passengerSurvival:died                      |
# | 200   | passengerClass:1st passengerSurvival:survived               |
# | 216   | passengerClass:3rd passengerSex:female                      |
# | 493   | passengerClass:3rd passengerSex:male                        |
# | 418   | passengerClass:3rd passengerSex:male passengerSurvival:died |
# | 528   | passengerClass:3rd passengerSurvival:died                   |
# | 339   | passengerSex:female passengerSurvival:survived              |
# | 682   | passengerSex:male passengerSurvival:died                    |
# +-------+-------------------------------------------------------------+

We can verify the result by looking into these group counts, [AA2]:

my $obj = group-by( @dsTitanic, <passengerClass passengerSex>);
.say for $obj>>.elems.grep({ $_.value >= 200 });
$obj = group-by( @dsTitanic, <passengerClass passengerSurvival passengerSex>);
.say for $obj>>.elems.grep({ $_.value >= 200 });
# 3rd.female => 216
# 3rd.male => 493
# 3rd.died.male => 418

Or these contingency tables:

my $obj = group-by( @dsTitanic, "passengerClass") ;
$obj = $obj.map({ $_.key => cross-tabulate( $_.value, "passengerSex", "passengerSurvival" ) });
.say for $obj.Array;
# 3rd => {female => {died => 110, survived => 106}, male => {died => 418, survived => 75}}
# 1st => {female => {died => 5, survived => 139}, male => {died => 118, survived => 61}}
# 2nd => {female => {died => 12, survived => 94}, male => {died => 146, survived => 25}}

Remark: For datasets -- i.e. arrays of hashes -- frequent-sets preprocesses the data by concatenating column names with corresponding column values. This is done in order to prevent "collisions" of same values coming from different columns. If that concatenation is not desired then manual preprocessing like this can be used:

@dsTitanic.map({ $_.values.List }).Array
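The difference between the two preprocessing styles can be seen on a tiny example. The following is a minimal sketch in plain Raku, not the package's own code: the two records are hypothetical, and the ":" separator is taken from the frequent sets tabulated above.

```raku
# Two hypothetical records shaped like rows of the Titanic dataset.
my @records =
    { passengerClass => '3rd', passengerSex => 'male' },
    { passengerClass => '1st', passengerSex => 'female' };

# With concatenation: each value is prefixed by its column name.
my @tagged = @records.map(-> %r {
    %r.sort(*.key).map(-> $p { $p.key ~ ':' ~ $p.value }).List
});

# Without concatenation: only the values are kept.
my @plain = @records.map(-> %r { %r.sort(*.key).map(*.value).List });

.say for @tagged;
.say for @plain;
```

With the tagged form, a value such as "1" coming from two different columns yields two distinct items; with the plain form it would be counted as the same item.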

Remark: frequent-sets's argument min-support can take both integers greater than 1 and frequencies between 0 and 1. (If an integer greater than one is given, then the corresponding frequency is derived.)
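The relation between the two forms of min-support can be sketched as follows. The helper to-frequency is hypothetical (it is not part of the package's API), and the total of 1309 transactions is taken from the summary above (809 died plus 500 survived).

```raku
# Hypothetical helper, not part of ML::AssociationRuleLearning:
# values greater than 1 are treated as counts, the rest as frequencies.
sub to-frequency($min-support, Int $n-transactions) {
    $min-support > 1 ?? $min-support / $n-transactions !! $min-support
}

say to-frequency(200, 1309).round(0.0001);
# 0.1528
say to-frequency(0.3, 1309);
# 0.3
```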

Remark: By default frequent-sets uses the Eclat algorithm. The functions apriori and eclat call frequent-sets with the option settings method=>'Apriori' and method=>'Eclat' respectively.


Association rules finding

Here we find association rules with min support 0.3 and min confidence 0.7:

association-rules(@dsTitanic, min-support => 0.3, min-confidence => 0.7)
==> to-pretty-table
# +------------------------+-------+----------+----------+------------+------------+----------+-------------------------------------------+
# |       consequent       | count | leverage |   lift   | conviction | confidence | support  |                 antecedent                |
# +------------------------+-------+----------+----------+------------+------------+----------+-------------------------------------------+
# | passengerSurvival:died |  528  | 0.068615 | 1.204977 |  1.496229  |  0.744711  | 0.403361 |             passengerClass:3rd            |
# | passengerSurvival:died |  682  | 0.122996 | 1.309025 |  2.000009  |  0.809015  | 0.521008 |             passengerSex:male             |
# |   passengerSex:male    |  682  | 0.122996 | 1.309025 |  2.267729  |  0.843016  | 0.521008 |           passengerSurvival:died          |
# | passengerSurvival:died |  418  | 0.086564 | 1.371894 |  2.510823  |  0.847870  | 0.319328 |    passengerClass:3rd passengerSex:male   |
# |   passengerSex:male    |  418  | 0.059562 | 1.229290 |  1.708785  |  0.791667  | 0.319328 | passengerClass:3rd passengerSurvival:died |
# +------------------------+-------+----------+----------+------------+------------+----------+-------------------------------------------+
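The measure columns can be checked by hand for one rule. The sketch below uses the standard textbook definitions of support, confidence, lift, leverage, and conviction (it is not the package's code) and the counts shown earlier, for the rule passengerSex:male => passengerSurvival:died.

```raku
# Counts read off the outputs above: 1309 passengers in total,
# 843 male (antecedent), 809 died (consequent), 682 both male and died.
my ($n, $n-ante, $n-cons, $n-both) = 1309, 843, 809, 682;

my $support    = $n-both / $n;                                  # P(A and C)
my $confidence = $n-both / $n-ante;                             # P(C | A)
my $lift       = $confidence / ($n-cons / $n);                  # P(C | A) / P(C)
my $leverage   = $support - ($n-ante / $n) * ($n-cons / $n);    # P(A and C) - P(A) P(C)
my $conviction = (1 - $n-cons / $n) / (1 - $confidence);        # (1 - P(C)) / (1 - P(C | A))

for (:$support, :$confidence, :$lift, :$leverage, :$conviction) -> $m {
    say $m.key, ' => ', $m.value.round(0.000001);
}
```

Rounded to six decimals, these values agree with the passengerSex:male row of the table above.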

Reusing found frequent sets

The function frequent-sets takes the adverb ":object" that makes frequent-sets return an object of type ML::AssociationRuleLearning::Apriori or ML::AssociationRuleLearning::Eclat, which can be "pipelined" to find association rules.

Here we find frequent sets, return the corresponding object, and retrieve the result:

my $eclatObj = frequent-sets(@dsTitanic.map({ $_.values.List }).Array, min-support => 0.12, min-number-of-items => 2, max-number-of-items => 6):object;
$eclatObj.result.elems
# 23

Here we find association rules and pretty-print them:

$eclatObj.find-rules(min-confidence=>0.7)
==> to-pretty-table 
# +------------+------------+----------+----------+------------+-------+----------+------------+
# | antecedent | consequent | leverage | support  | conviction | count |   lift   | confidence |
# +------------+------------+----------+----------+------------+-------+----------+------------+
# |     -1     |    male    | 0.011938 | 0.141329 |  1.200349  |  185  | 1.092265 |  0.703422  |
# |    died    |    male    | 0.122996 | 0.521008 |  2.267729  |  682  | 1.309025 |  0.843016  |
# |    male    |    died    | 0.122996 | 0.521008 |  2.000009  |  682  | 1.309025 |  0.809015  |
# |  20 died   |    male    | 0.032122 | 0.134454 |  2.313980  |  176  | 1.313897 |  0.846154  |
# |  20 male   |    died    | 0.036249 | 0.134454 |  2.482811  |  176  | 1.369117 |  0.846154  |
# |  -1 died   |    male    | 0.027990 | 0.121467 |  2.181917  |  159  | 1.299438 |  0.836842  |
# |  -1 male   |    died    | 0.034121 | 0.121467 |  2.717870  |  159  | 1.390646 |  0.859459  |
# |  3rd died  |    male    | 0.059562 | 0.319328 |  1.708785  |  418  | 1.229290 |  0.791667  |
# |  3rd male  |    died    | 0.086564 | 0.319328 |  2.510823  |  418  | 1.371894 |  0.847870  |
# |   female   |  survived  | 0.122996 | 0.258976 |  2.267729  |  339  | 1.904511 |  0.727468  |
# |     -1     |    3rd     | 0.050076 | 0.158900 |  2.191819  |  208  | 1.460162 |  0.790875  |
# |     -1     |    died    | 0.020977 | 0.145149 |  1.376142  |  190  | 1.168931 |  0.722433  |
# |    3rd     |    died    | 0.068615 | 0.403361 |  1.496229  |  528  | 1.204977 |  0.744711  |
# |   -1 3rd   |    died    | 0.022498 | 0.120703 |  1.588999  |  158  | 1.229093 |  0.759615  |
# |  -1 died   |    3rd     | 0.042085 | 0.120703 |  2.721543  |  158  | 1.535313 |  0.831579  |
# +------------+------------+----------+----------+------------+-------+----------+------------+

Remark: Note that because of the specified min confidence, the number of association rules is "contained" -- a (much) larger number of rules would be produced with, say, min-confidence=>0.2.


Implementation considerations

UML diagram

Here is a UML diagram that shows the package's structure:

The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package "UML::Translators", [AAp6].

Here we get the PlantUML spec:

to-uml-spec ML::AssociationRuleLearning > ./resources/class-diagram.puml

Here we get the diagram:

to-uml-spec ML::AssociationRuleLearning | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png

Remark: Maybe it is a good idea to have an abstract class named, say, ML::AssociationRuleLearning::AbstractFinder that is a parent of both ML::AssociationRuleLearning::Apriori and ML::AssociationRuleLearning::Eclat, but I have not found it necessary at this point of development.

Eclat

We can say that Eclat uses a "vertical database representation" of the transactions.

Eclat is based on Raku's sets, bags, and mixes functionalities.

Eclat represents the transactions as a hash of sets: the keys are the items, and each value is the set of transactions that contain that item.

(In other words, for each item an inverted index is made.)

This representation allows the support of item combinations to be computed quickly, by intersecting the corresponding sets.
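The vertical representation can be illustrated with Raku's built-in sets. This is a minimal sketch on four hypothetical transactions, not the package's implementation; the names %index and support-count are made up for the illustration.

```raku
# Four hypothetical transactions; the array index serves as the transaction id.
my @transactions = <male 3rd died>, <female 1st survived>,
                   <male 3rd survived>, <male 1st died>;

# Collect, for every item, the ids of the transactions that contain it.
my %ids;
for @transactions.kv -> $id, @items {
    %ids{$_}.push($id) for @items;
}

# Vertical representation: item => Set of transaction ids.
my %index = %ids.kv.map(-> $item, @tids { $item => @tids.Set });

# The support count of an item combination is the size of the
# intersection of the id sets of its items.
sub support-count(%index, *@items) {
    ([(&)] %index{@items}).elems
}

say support-count(%index, 'male', '3rd');
# 2
say support-count(%index, 'male', '3rd', 'died');
# 1
```

No transaction is rescanned when a combination is extended by one item; only one more set intersection is needed.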

Apriori

Apriori uses the standard, horizontal database transactions representation.

We can say that Apriori:

Apriori is usually (much) slower than Eclat. Historically, Apriori is the first ARL method, and its implementation in the package is didactic.
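For contrast with Eclat, here is a textbook-style sketch of the first two levels of a level-wise search over the horizontal representation. It is not the package's Apriori implementation: the transactions are hypothetical, and scan-count is a helper made up for the illustration.

```raku
# Horizontal representation: one Set of items per (hypothetical) transaction.
my @transactions = (<male 3rd died>, <female 1st survived>,
                    <male 3rd survived>, <male 1st died>).map(*.Set);

# Support count by a full scan: how many transactions contain
# every item of the candidate?
sub scan-count(@transactions, @candidate) {
    @transactions.grep(-> $t { @candidate (<=) $t }).elems
}

my $min-count = 2;

# Level 1: frequent single items.
my @items = @transactions.map(*.keys.Slip).unique.sort;
my @frequent-items = @items.grep(-> $i { scan-count(@transactions, [$i]) >= $min-count });

# Level 2: candidate pairs are built only from frequent items, then scanned.
my @frequent-pairs = @frequent-items.combinations(2).grep(-> @c {
    scan-count(@transactions, @c) >= $min-count
});

.say for @frequent-pairs;
# (3rd male)
# (died male)
```

Every candidate costs a pass over all transactions, which is one reason a scan-based approach tends to be slower than intersecting precomputed id sets.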

Association rules

The association rule finding function is a general one, but it requires fast computation of confidence, lift, etc. Hence Eclat's representation of the transactions is used.

Finding association rules with Apriori works the same way as with Eclat. The package function association-rules with the option setting method=>'Apriori' simply sends the frequent sets found with Apriori to the Eclat-based association rule finding.


References

Articles

[Wk1] Wikipedia entry, "Association Rule Learning".

[AA1] Anton Antonov, "Movie genre associations", (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, "Introduction to data wrangling with Raku", (2021), RakuForPrediction at WordPress.

Packages

[AAp1] Anton Antonov, Implementation of the Apriori algorithm in Mathematica, (2014-2016), MathematicaForPrediction at GitHub/antononcube.

[AAp1a] Anton Antonov, Implementation of the Apriori algorithm via Tries in Mathematica, (2022), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Implementation of the Eclat algorithm in Mathematica, (2022), MathematicaForPrediction at GitHub/antononcube.

[AAp3] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp5] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp6] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.

[AAp7] Anton Antonov, ML::TrieWithFrequencies Raku package, (2021), GitHub/antononcube.