The uploading author of cpan:ANTONOV does not match the META author of github:antononcube.

cpan:ANTONOV

# Raku ML::TriesWithFrequencies

This Raku package has functions for creation and manipulation of Tries (Prefix trees) with frequencies.

The package provides Machine Learning (ML) functionalities, not "just" a Trie data structure.

This Raku implementation closely follows the Java implementation [AAp3].

The system of function names follows the one used in the Mathematica package [AAp2].

Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.

Remark: There is a Raku package with an alternative implementation, [AAp6], made mostly for comparison studies. (See the implementation notes below.) The package in this repository, `ML::TriesWithFrequencies`, is my primary Tries-with-frequencies package.

## Usage

Consider a trie (prefix tree) created over a list of words:

```use ML::TriesWithFrequencies;
my \$tr = trie-create-by-split( <bar bark bars balm cert cell> );
trie-say(\$tr);
```
``````# TRIEROOT => 6
# ├─b => 4
# │ └─a => 4
# │   ├─l => 1
# │   │ └─m => 1
# │   └─r => 3
# │     ├─k => 1
# │     └─s => 1
# └─c => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─r => 1
#       └─t => 1
``````

Here we convert the trie with frequencies above into a trie with probabilities:

```my \$ptr = trie-node-probabilities( \$tr );
trie-say(\$ptr);
```
``````# TRIEROOT => 1
# ├─b => 0.6666666666666666
# │ └─a => 1
# │   ├─l => 0.25
# │   │ └─m => 1
# │   └─r => 0.75
# │     ├─k => 0.3333333333333333
# │     └─s => 0.3333333333333333
# └─c => 0.3333333333333333
#   └─e => 1
#     ├─l => 0.5
#     │ └─l => 1
#     └─r => 0.5
#       └─t => 1
``````

Here we shrink the trie with probabilities above:

```trie-say(trie-shrink(\$ptr));
```
``````# TRIEROOT => 1
# ├─ba => 0.6666666666666666
# │ ├─lm => 0.25
# │ └─r => 0.75
# │   ├─k => 0.3333333333333333
# │   └─s => 0.3333333333333333
# └─ce => 0.3333333333333333
#   ├─ll => 0.5
#   └─rt => 0.5
``````

Here we retrieve a sub-trie with a key:

```trie-say(trie-retrieve(\$ptr, 'bar'.comb))
```
``````# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333
``````

## Representation

Each trie is a tree of objects of the class `ML::TriesWithFrequencies::Trie`. Such trees can be nicely represented as hash-maps. For example:

```my \$tr = trie-shrink(trie-create-by-split(<core cort>));
say \$tr.gist;
```
``````# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}
``````

The function `trie-say` uses that Hash-representation:

```trie-say(\$tr)
```
``````# TRIEROOT => 2
# └─cor => 2
#   ├─e => 1
#   └─t => 1
``````

### JSON

The JSON-representation follows the inherent object-tree representation with `ML::TriesWithFrequencies::Trie`:

```say \$tr.JSON;
```
``````# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"t", "value":1, "children":[]}, {"key":"e", "value":1, "children":[]}]}]}
``````

### XML

The XML-representation follows (resembles) the Hash-representation (and output from `trie-say`):

```say \$tr.XML;
```
``````# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <cor>
#   <TRIEVALUE>2</TRIEVALUE>
#   <t>
#    <TRIEVALUE>1</TRIEVALUE>
#   </t>
#   <e>
#    <TRIEVALUE>1</TRIEVALUE>
#   </e>
#  </cor>
# </TRIEROOT>
``````

Using the XML representation allows for XPath searches, say, using the package `XML::XPath`. Here is an example:

```use XML::XPath;
my \$tr0 = trie-create-by-split(<bell best>);
trie-say(\$tr0);
```
``````# TRIEROOT => 2
# └─b => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─s => 1
#       └─t => 1
``````

Convert to XML:

```say \$tr0.XML;
```
``````# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <b>
#   <TRIEVALUE>2</TRIEVALUE>
#   <e>
#    <TRIEVALUE>2</TRIEVALUE>
#    <s>
#     <TRIEVALUE>1</TRIEVALUE>
#     <t>
#      <TRIEVALUE>1</TRIEVALUE>
#     </t>
#    </s>
#    <l>
#     <TRIEVALUE>1</TRIEVALUE>
#     <l>
#      <TRIEVALUE>1</TRIEVALUE>
#     </l>
#    </l>
#   </e>
#  </b>
# </TRIEROOT>
``````

Search for `<b e l>`:

```say XML::XPath.new(xml=>\$tr0.XML).find('//b/e/l');
```
``````# <l>
#     <TRIEVALUE>1</TRIEVALUE>
#     <l>
#      <TRIEVALUE>1</TRIEVALUE>
#     </l>
#    </l>
``````

### WL

The Hash-representation is used in the Mathematica package [AAp2]. Hence, such WL format is provided by the Raku package:

```say \$tr.WL;
```
``````# <|\$TrieRoot -> <|\$TrieValue -> 2, "cor" -> <|\$TrieValue -> 2, "t" -> <|\$TrieValue -> 1|>, "e" -> <|\$TrieValue -> 1|>|>|>|>
``````

## Implementation notes

This package is a Raku re-implementation of the Java Trie package [AAp3].

The initial implementation was:

• ≈ 5-6 times slower than the Mathematica implementation [AAp2]
• ≈ 100 times slower than the Java implementation [AAp3]

The initial implementation used:

• General types for Trie nodes, i.e. `Str` for the key and `Numeric` for the value
• Argument type verification with `where` statements in the signatures of the `trie-*` functions

After reading [RAC1] I refactored the code to use native types (`num`, `str`) and moved the `where` verifications inside the functions.

I also refactored the function `trie-merge` to use less copying of data and to take into account which of the two tries has smaller number of children.

After those changes the current Raku implementation is:

• ≈ 2.5 times slower than the Mathematica implementation [AAp2]
• ≈ 40 times slower than the Java implementation [AAp3]

After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated in standard Rakudo releases (September/October 2021) additional 20% speed-up was obtained. Currently this package is:

• ≈ 2.0 times slower than the Mathematica implementation [AAp2]
• ≈ 30 times slower than the Java implementation [AAp3]

These speed improvements are definitely not satisfactory. I strongly consider:

1. Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.

• (It turned out option 1 does not produce better results; see [AAp6].)
2. Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.

## TODO

In the following list the most important items are placed first.

• Implement "get words" and "get root-to-leaf paths" functions.

• See `trie-words` and `trie-root-to-leaf-paths`.
• Convert most of the WL unit tests in [AAp5] into Raku tests.

• Implement Trie traversal functions.

• The general `trie-map` function is in a separate role.

• A concrete traversal functionality is a class that does the role and provides additional context.
• Implement (sub-)trie removal functions.

• By threshold (below and above)

• By Pareto principle adherence (top and bottom)

• By regex over the keys

• Implement optional ULP spec argument for relevant functions:

• `trie-root-to-leaf-paths`

• `trie-words`

• Membership test functions?

• Design and code refactoring so trie objects to have OOP interface.

• Instead of just having `trie-words(\$tr, <c>)` we should be also able to say `\$tr.trie-words(<c>)`.
• Implement `trie-prune` function.

• Implement Trie-based classification.

• Investigate faster implementations.

• Re-implement the Trie functionalities using hash representation (instead of a tree of Trie-node objects.)

• See [AAp6].
• Make a C or C++ implementation and hook it up to Raku.

• Document examples of doing Trie-based text mining or data-mining.

• Program a trie-form visualization that is "wide", i.e. places the children nodes horizontally.

## References

### Articles

[AA1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, "Removal of sub-trees in tries", (2013), MathematicaForPrediction at WordPress.

[AA3] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at WordPress. GitHub Markdown.

[JW1] Jonathan Worthington, "The new MoarVM dispatch mechanism is here!", (2021), 6guts at WordPress.

[RAC1] Tib, "Day 10: My 10 commandments for Raku performances", (2020), Raku Advent Calendar.

[WK1] Wikipedia entry, Trie.

### Packages

[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.

[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.

[AAp6] Anton Antonov, ML::HashTriesWithFrequencies Raku package, (2021), GitHub/antononcube.

### Videos

[AAv1] Anton Antonov, "Prefix Trees with Frequencies for Data Analysis and Machine Learning", (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.