zef:antononcube

# Raku ML::Clustering

This repository has the code of a Raku package for Machine Learning (ML) Clustering (or Cluster analysis) functions, [Wk1].

The Clustering framework includes:

• The algorithms K-means and K-medoids, and others

• The distance functions Euclidean, Cosine, Hamming, Manhattan, and others, and their corresponding similarity functions

The data in the examples below is generated and manipulated with the packages "Data::Generators", "Data::Reshapers", and "Data::Summarizers", described in the article "Introduction to data wrangling with Raku", [AA1].

The plots are made with the package "Text::Plot", [AAp6].

## Installation

Via zef-ecosystem:

``````zef install ML::Clustering
``````

From GitHub:

``````zef install https://github.com/antononcube/Raku-ML-Clustering
``````

## Usage example

Here we derive a set of random points, and summarize it:

```use Data::Generators;
use Data::Summarizers;
use Text::Plot;

my \$n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), \$n) X random-variate(NormalDistribution.new(5,1), \$n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), \$n) X random-variate(NormalDistribution.new(10,1), \$n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)
```
``````# +------------------------------+------------------------------+
# | 0                            | 1                            |
# +------------------------------+------------------------------+
# | Min    => 1.9418286393831807 | Min    => 2.5537453527288423 |
# | 1st-Qu => 5.23355791998377   | 1st-Qu => 5.802659698503382  |
# | Mean   => 7.812329106122415  | Mean   => 8.221053043444616  |
# | Median => 8.529233471757092  | Median => 8.859544342245552  |
# | 3rd-Qu => 9.74722507929462   | 3rd-Qu => 10.392817343154189 |
# | Max    => 12.161509848446896 | Max    => 11.851232468041157 |
# +------------------------------+------------------------------+
``````

Here we plot the points:

```use Text::Plot;
text-list-plot(@data3)
```
``````# +---+---------+---------+---------+---------+---------+----+
# +                                                          +  12.00
# |                                    *  *  *        *      |
# |                                 ****  * ** **   ***      |
# +                                   * *   ** **        *   +  10.00
# |                                   *   ****   **          |
# +                                  *    *  *        *      +   8.00
# |      *       *         *                                 |
# +      *   *        *   **                                 +   6.00
# |          * ** ****    *         *                        |
# |   *          *  *    *                                   |
# +             **   **        *                             +   4.00
# |         *                 *                              |
# +                                                          +   2.00
# +---+---------+---------+---------+---------+---------+----+
#     2.00      4.00      6.00      8.00      10.00     12.00
``````

Problem: Group the points in such a way that each group has close (or similar) points.

Here is how we use the function `find-clusters` to give an answer:

```use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems
```
``````# (30 50)
``````

Remark: The first argument is data points that is a list-of-numeric-lists. The second argument is a number of clusters to be found. (It is in the TODO list to have the number clusters automatically determined -- currently they are not.)

Remark: The function `find-clusters` can return results of different types controlled with the named argument "prop". Using `prop => 'All'` returns a hash with all properties of the cluster finding result.

Here are sample points from each found cluster:

```.say for %res<Clusters>>>.pick(3);
```
``````# ((6.8442730684339805 2.5537453527288423) (5.988371242806578 6.690825577391333) (3.9052242620581974 5.826205768330279))
# ((10.302490764954882 10.91125277165973) (8.821337333605817 9.715938302825638) (8.278089768928224 8.666124184959127))
``````

Here are the centers of the clusters (the mean points):

```%res<MeanPoints>
```
``````# [(4.693753655533249 4.977193937166397) (9.411502770521118 9.768925531106525)]
``````

We can verify the result by looking at the plot of the found clusters:

```text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>, title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
```
``````# ▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers
# +--+----------+---------+----------+---------+----------+--+
# +                                     ☐                    +  12.00
# |                                   ☐    ☐ ☐☐ ☐☐    ☐ ☐    |
# |                                 ☐ ☐☐☐    ☐☐ ☐    ☐  ☐    |
# +                                   ☐☐ ☐  ●☐☐ ☐☐         ☐ +  10.00
# |                                   ☐    ☐☐ ☐   ☐☐         |
# +              ▽                    ☐    ☐  ☐        ☐     +   8.00
# |     ▽                  ▽                                 |
# +     ▽   ▽         ▽   ▽▽                                 +   6.00
# |         ▽ ▽▽ ▽ ▽▽     ▽          ▽                       |
# |  ▽           ▽ ●▽    ▽                                   |
# +            ▽ ▽  ▽ ▽         ▽                            +   4.00
# |                                                          |
# +        ▽                   ▽                             +   2.00
# +--+----------+---------+----------+---------+----------+--+
#    2.00       4.00      6.00       8.00      10.00      12.00
``````

Remark: By default `find-clusters` uses the K-means algorithm. The functions `k-means` and `k-medoids` call `find-clusters` with the option settings `method=>'K-means'` and `method=>'K-medoids'` respectively.

## More interesting looking data

Here is more interesting looking two-dimensional data, `data2D2`:

```use Data::Reshapers;
my \$pointsPerCluster = 200;
my @data2D5 = [[10,20,4],[20,60,6],[40,10,6],[-30,0,4],[100,100,8]].map({
random-variate(NormalDistribution.new(\$_[0], \$_[2]), \$pointsPerCluster) Z random-variate(NormalDistribution.new(\$_[1], \$_[2]), \$pointsPerCluster)
}).Array;
@data2D5 = flatten(@data2D5, max-level=>1).pick(*);
@data2D5.elems
```
``````# 1000
``````

Here is a plot of that data:

```text-list-plot(@data2D5)
```
``````# +---------------+---------------+---------------+----------+
# |                                                          |
# |                                            ***** ***     |
# +                                        ***************   +  100.00
# |                                         **************   |
# |                    * *  *                   ***          |
# |                 ***********                              |
# +                 ***********                              +   50.00
# |                  ***  * *                                |
# |                 *****                                    |
# |               *******  *********                         |
# |   *******       **** * **********                        |
# +   *******            *  ** ***                           +    0.00
# |                                                          |
# +---------------+---------------+---------------+----------+
#                 0.00            50.00           100.00
``````

Here we find clusters and plot them together with their mean points:

```srand(32);
my %clRes = find-clusters(@data2D5, 5, prop=>'All');
text-list-plot([|%clRes<Clusters>, %clRes<MeanPoints>], point-char=><1 2 3 4 5 ●>)
```
``````# +--------------+-----------------+----------------+--------+
# +                                             3   33       +  120.00
# |                                            3 33333 5555  |
# +                                         333333●355●55555 +  100.00
# |                                            3335555555555 |
# +                   1 11  1                                +   80.00
# |                111111111 11                              |
# +                11111●111111                              +   60.00
# +                 1 11 11 1                                +   40.00
# |               2 2                                        |
# +              22222222 22 222222                          +   20.00
# | 444444        2222222 ●2222222222                        |
# +4444●444              2  222222222                        +    0.00
# | 44444 4                    2                             |
# +--------------+-----------------+----------------+--------+
#                0.00              50.00            100.00
``````

## Detailed function pages

Detailed parameter explanations and usage examples for the functions provided by the package are given in:

## Implementation considerations

### UML diagram

Here is a UML diagram that shows package's structure (in Mermaid-JS):

``````to-uml-spec ML::Clustering --format=mermaid
``````
``````classDiagram
class k_means {
<<routine>>
}
k_means --|> Routine
k_means --|> Block
k_means --|> Code
k_means --|> Callable

class find_clusters {
<<routine>>
}
find_clusters --|> Routine
find_clusters --|> Block
find_clusters --|> Code
find_clusters --|> Callable

class ML_Clustering_KMeans {
+BUILDALL()
+args-check()
+bray-curtis-distance()
+canberra-distance()
+chessboard-distance()
+cosine-distance()
+euclidean-distance()
+find-clusters()
+get-distance-function()
+hamming-distance()
+known-distance-function-specs()
+manhattan-distance()
+norm()
+squared-euclidean-distance()
}
ML_Clustering_KMeans --|> Math_DistanceFunctionish
``````

Remark: Maybe it is a good idea to have an abstract class named, say, `ML::Clustering::AbstractFinder` that is a parent of `ML::Clustering::KMeans`, `ML::Clustering::KMedoids`, `ML::Clustering::BiSectionalKMeans`, etc., but I have not found to be necessary. (At this point of development.)

Remark: It seems it is better to have a separate package for the distance functions, named, say, "ML::DistanceFunctions". (Although distance functions are not just for ML...) After thinking over package and function names I will make such a package.

## TODO

• DONE Factor-out the distance functions in a separate package.

• TODO Implement Bi-sectional K-means algorithm, [AAp1].

• TODO Implement K-medoids algorithm.

• TODO Automatic determination of the number of clusters.

• TODO Allow data points to be `Pair` objects the keys of which are point labels.

• Hence, the returned clusters consist of those labels, not points themselves.
• TODO Implement Agglomerate algorithm.

## References

### Articles

[Wk1] Wikipedia entry, "Cluster Analysis".

[AA1] Anton Antonov, "Introduction to data wrangling with Raku", (2021), RakuForPrediction at WordPress.

### Packages

[AAp1] Anton Antonov, Bi-sectional K-means algorithm in Mathematica, (2020), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp5] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.

[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.