
ML::Clustering

zef:antononcube

Raku ML::Clustering


This repository contains a Raku package with Machine Learning (ML) clustering (cluster analysis) functions, [Wk1].

The Clustering framework includes:

The data in the examples below is generated and manipulated with the packages "Data::Generators", "Data::Reshapers", and "Data::Summarizers", [AAp2, AAp3, AAp4], described in the article "Introduction to data wrangling with Raku", [AA1].

The plots are made with the package "Text::Plot", [AAp6].


Installation

Via zef-ecosystem:

zef install ML::Clustering

From GitHub:

zef install https://github.com/antononcube/Raku-ML-Clustering

Usage example

Here we generate a set of random points and summarize it:

use Data::Generators;
use Data::Summarizers;
use Text::Plot;

my $n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), $n) X random-variate(NormalDistribution.new(5,1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), $n) X random-variate(NormalDistribution.new(10,1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)
# +------------------------------+------------------------------+
# | 0                            | 1                            |
# +------------------------------+------------------------------+
# | Min    => 1.9152332258517637 | Min    => 3.5534718877004092 |
# | 1st-Qu => 5.981652120865826  | 1st-Qu => 5.688429919222849  |
# | Mean   => 8.134616163671051  | Mean   => 8.277055937453321  |
# | Median => 8.921578933301301  | Median => 9.365632049881459  |
# | 3rd-Qu => 10.002478575801664 | 3rd-Qu => 10.36445713798808  |
# | Max    => 12.12409031158045  | Max    => 11.905774375487244 |
# +------------------------------+------------------------------+

Here we plot the points:

use Text::Plot;
text-list-plot(@data3)
# +---+---------+---------+---------+----------+---------+---+       
# +                                                          +  12.00
# |                                    *  **** * *  **   *   |       
# |                                      *   **  *           |       
# +                                    **   *******   *      +  10.00
# |                                   **** **** *  *  *      |       
# |                                     *   * **  *          |       
# +                                                          +   8.00
# |                        *                                 |       
# +          *      *** *     *  *                           +   6.00
# |              *  * * **  * *               *              |       
# |        *  * * ***   *  *    *                            |       
# +   *    *                 *   *                           +   4.00
# |                                                          |       
# +---+---------+---------+---------+----------+---------+---+       
#     2.00      4.00      6.00      8.00       10.00     12.00

Problem: Group the points in such a way that each group has close (or similar) points.

Here is how we use the function find-clusters to give an answer:

use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems
# (50 30)

Remark: The first argument is the data points, given as a list of numeric lists. The second argument is the number of clusters to find. (Automatic determination of the number of clusters is on the TODO list; it is not currently implemented.)

Remark: The function find-clusters can return results of different types controlled with the named argument "prop". Using prop => 'All' returns a hash with all properties of the cluster finding result.

Here are sample points from each found cluster:

.say for %res<Clusters>>>.pick(3);
# ((8.730149280196386 8.689860840806768) (9.199160845916436 11.23129146924298) (9.296283411759815 11.905774375487244))
# ((4.167575475531873 5.118250052000011) (4.856378776838952 6.519994510725237) (2.941219217209155 3.6148455159938666))

Here are the centers of the clusters (the mean points):

%res<MeanPoints>
# [(10.033388803123739 10.0788732614687) (6.382563428067344 6.117153830280937)]
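As an illustration of what "mean point" means here, the following minimal sketch computes the coordinate-wise mean of a small list of points. The helper name mean-point is hypothetical and is not claimed to be part of the package; the package may compute its centers differently.

```raku
# Hypothetical helper: coordinate-wise mean of a non-empty list of points,
# where each point is a list of numbers of the same length.
sub mean-point(@points) {
    my $dim = @points[0].elems;
    return (^$dim).map(-> $j { @points.map(*[$j]).sum / @points.elems }).List;
}

my @cluster = (1, 2), (3, 4), (5, 9);
say mean-point(@cluster);
# (3 5)
```

Applied to each element of %res<Clusters>, a helper like this should give values close to the corresponding entries of %res<MeanPoints>, assuming the centers are plain arithmetic means.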

We can verify the result by looking at the plot of the found clusters:

text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>, title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
# ▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers    
# +---+---------+----------+---------+----------+---------+--+       
# +                                         ▽ ▽ ▽ ▽   ▽▽     +  12.00
# |                                     ▽   ▽▽  ▽      ▽   ▽ |       
# |                                       ▽   ▽ ▽▽ ▽▽   ▽    |       
# +                                     ▽▽▽  ▽ ▽●▽▽     ▽    +  10.00
# |                                    ▽▽▽▽ ▽▽▽ ▽   ▽        |       
# |                                      ▽   ▽  ▽  ▽         |       
# +                                                          +   8.00
# |                         ☐                                |       
# +                  ☐☐ ☐    ●☐   ☐                          +   6.00
# |          ☐   ☐   ☐☐  ☐   ☐☐                              |       
# |       ☐   ☐ ☐☐☐      ☐  ☐    ☐             ☐             |       
# +  ☐             ☐ ☐        ☐                              +   4.00
# |        ☐                      ☐                          |       
# +---+---------+----------+---------+----------+---------+--+       
#     2.00      4.00       6.00      8.00       10.00     12.00

Remark: By default find-clusters uses the K-means algorithm. The functions k-means and k-medoids call find-clusters with the option settings method=>'K-means' and method=>'K-medoids' respectively.
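For readers unfamiliar with K-means, here is a minimal, self-contained sketch of the standard Lloyd iteration in plain Raku. It is not the package's implementation: the names sq-dist, mean-point, and k-means-sketch are hypothetical, the initial centers are simply picked at random from the data, and the returned hash only mimics the "Clusters" and "MeanPoints" keys used above.

```raku
# Squared Euclidean distance between two points of equal dimension.
sub sq-dist(@a, @b) { (@a Z- @b).map(* ** 2).sum }

# Coordinate-wise mean of a non-empty list of points.
sub mean-point(@points) {
    (^@points[0].elems).map(-> $j { @points.map(*[$j]).sum / @points.elems }).Array
}

# Lloyd's iteration: assign each point to its nearest center, then move each
# center to the mean of its assigned points; stop when the labels stop changing.
sub k-means-sketch(@points, Int $k, Int :$max-steps = 100) {
    my @centers = @points.pick($k);           # random initial centers (assumption)
    my @labels;
    for ^$max-steps {
        my @new-labels = @points.map(-> $p {
            @centers.keys.min(-> $i { sq-dist($p, @centers[$i]) })
        });
        last if @new-labels eqv @labels;       # converged
        @labels = @new-labels;
        for ^$k -> $i {
            my @members = (^@points.elems).grep({ @labels[$_] == $i }).map({ @points[$_] });
            @centers[$i] = mean-point(@members) if @members;   # keep old center if empty
        }
    }
    my @clusters = (^$k).map(-> $i {
        (^@points.elems).grep({ @labels[$_] == $i }).map({ @points[$_] }).Array
    });
    return %( Clusters => @clusters, MeanPoints => @centers );
}

my @points = (1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9);
my %res = k-means-sketch(@points, 2);
say %res<Clusters>>>.elems;
say %res<MeanPoints>;
```

Because the initial centers are random, the order of the clusters can differ between runs. K-medoids follows the same assign-and-update pattern but restricts each center to be one of the data points.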


More interesting looking data

Here is more interesting looking two-dimensional data, data2D5:

use Data::Reshapers;
my $pointsPerCluster = 200;
my @data2D5 = [[10,20,4],[20,60,6],[40,10,6],[-30,0,4],[100,100,8]].map({ 
    random-variate(NormalDistribution.new($_[0], $_[2]), $pointsPerCluster) Z random-variate(NormalDistribution.new($_[1], $_[2]), $pointsPerCluster)
   }).Array;
@data2D5 = flatten(@data2D5, max-level=>1).pick(*);
@data2D5.elems
# 1000

Here is a plot of that data:

text-list-plot(@data2D5)
# +---------------+---------------+--------------+-----------+        
# |                                                          |        
# |                                           ******** *     |        
# +                                       * ************     +  100.00
# |                                       * ************ *   |        
# |                    *   *                 *   *   *       |        
# |                *********                                 |        
# +                ***********  *                            +   50.00
# |                    ****                                  |        
# |                ***** *   *                               |        
# |               *****************                          |        
# |   ******       ****   **********                         |        
# +   ******             ** *******                          +    0.00
# |                                                          |        
# +---------------+---------------+--------------+-----------+        
#                 0.00            50.00          100.00

Here we find clusters and plot them together with their mean points:

srand(32);
my %clRes = find-clusters(@data2D5, 5, prop=>'All');
text-list-plot([|%clRes<Clusters>, %clRes<MeanPoints>], point-char=><1 2 3 4 5 ●>)
# +--------------+----------------+---------------+----------+        
# +                                                  1       +  120.00
# |                                           11111111111    |        
# +                                        1 111111●11111    +  100.00
# |                                       1  11111111111 1 1 |        
# +                 2 2   2                   1   1    1     +   80.00
# |               2222●25555                                 |        
# +              22225555●555  5                             +   60.00
# +                   5555                                   +   40.00
# |                 4 4                                      |        
# +              44444444444444444                           +   20.00
# |3 33333       444444 ●44444444444                         |        
# +333●3333             4 444444444                          +    0.00
# | 333333                      4                            |        
# +--------------+----------------+---------------+----------+        
#                0.00             50.00           100.00

Detailed function pages

Detailed parameter explanations and usage examples for the functions provided by the package are given in:


Implementation considerations

UML diagram

Here is a UML diagram that shows the package's structure:

The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package "UML::Translators", [AAp5].

Here we get the PlantUML spec:

to-uml-spec ML::Clustering > ./resources/class-diagram.puml
# 

Here we get the diagram:

to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png
# 

Remark: Maybe it is a good idea to have an abstract class named, say, ML::Clustering::AbstractFinder that is a parent of ML::Clustering::KMeans, ML::Clustering::KMedoids, ML::Clustering::BiSectionalKMeans, etc., but I have not found it to be necessary at this point of development.

Remark: It seems it is better to have a separate package for the distance functions, named, say, "ML::DistanceFunctions". (Although distance functions are not just for ML...) After thinking over package and function names I will make such a package.
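To make the idea of separate, pluggable distance functions concrete, here is a small sketch in plain Raku. The sub names euclidean-distance, manhattan-distance, and nearest-center are hypothetical; they do not describe the API of this package or of any future "ML::DistanceFunctions" package.

```raku
# Hypothetical distance functions over points given as numeric lists.
sub euclidean-distance(@a, @b) { sqrt( (@a Z- @b).map(* ** 2).sum ) }
sub manhattan-distance(@a, @b) { (@a Z- @b).map(*.abs).sum }

# Index of the center nearest to a point, with the distance passed in as a
# parameter so that the clustering logic does not depend on a particular metric.
sub nearest-center(@point, @centers, &dist = &euclidean-distance) {
    @centers.keys.min(-> $i { dist(@point, @centers[$i]) })
}

say euclidean-distance((0, 0), (3, 4));   # 5
say manhattan-distance((0, 0), (3, 4));   # 7
say nearest-center((1, 1), [(0, 0), (10, 10)], &manhattan-distance);   # 0
```

Keeping the metric as an argument is one way such functions could be shared across clustering and non-ML code alike.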


TODO


References

Articles

[Wk1] Wikipedia entry, "Cluster Analysis".

[AA1] Anton Antonov, "Introduction to data wrangling with Raku", (2021), RakuForPrediction at WordPress.

Packages

[AAp1] Anton Antonov, Bi-sectional K-means algorithm in Mathematica, (2020), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp5] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.

[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.