Raku ML::TriesWithFrequencies

This Raku package has functions for creation and manipulation of Tries (Prefix trees) with frequencies.

The package provides Machine Learning (ML) functionalities, not "just" a Trie data structure.

This Raku implementation closely follows the Java implementation [AAp3].

The system of function names follows the one used in the Mathematica package [AAp2].

Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.

Remark: There is a Raku package with an alternative implementation, [AAp6], made mostly for comparison studies. (See the implementation notes below.) The package in this repository, ML::TriesWithFrequencies, is my primary Tries-with-frequencies package.

Usage

Consider a trie (prefix tree) created over a list of words:

use ML::TriesWithFrequencies;
my $tr = trie-create-by-split( <bar bark bars balm cert cell> );
trie-say($tr);

# TRIEROOT => 6
# ├─b => 4
# │ └─a => 4
# │   ├─l => 1
# │   │ └─m => 1
# │   └─r => 3
# │     ├─k => 1
# │     └─s => 1
# └─c => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─r => 1
#       └─t => 1

Here we convert the trie with frequencies above into a trie with probabilities:

my $ptr = trie-node-probabilities( $tr );
trie-say($ptr);

# TRIEROOT => 1
# ├─b => 0.6666666666666666
# │ └─a => 1
# │   ├─l => 0.25
# │   │ └─m => 1
# │   └─r => 0.75
# │     ├─k => 0.3333333333333333
# │     └─s => 0.3333333333333333
# └─c => 0.3333333333333333
#   └─e => 1
#     ├─l => 0.5
#     │ └─l => 1
#     └─r => 0.5
#       └─t => 1

Here we shrink the trie with probabilities above:

trie-say(trie-shrink($ptr));

# TRIEROOT => 1
# ├─ba => 0.6666666666666666
# │ ├─lm => 0.25
# │ └─r => 0.75
# │   ├─k => 0.3333333333333333
# │   └─s => 0.3333333333333333
# └─ce => 0.3333333333333333
#   ├─ll => 0.5
#   └─rt => 0.5

Here we retrieve a sub-trie with a key:

trie-say(trie-retrieve($ptr, 'bar'.comb))

# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333

Representation

Each trie is a tree of objects of the class ML::TriesWithFrequencies::Trie. Such trees can be nicely represented as hash-maps. For example:

my $tr = trie-shrink(trie-create-by-split(<core cort>));
say $tr.gist;

# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}

The function trie-say uses that Hash-representation:

trie-say($tr)

# TRIEROOT => 2
# └─cor => 2
#   ├─e => 1
#   └─t => 1

JSON

The JSON-representation follows the inherent object-tree representation with ML::TriesWithFrequencies::Trie:

say $tr.JSON;

# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"t", "value":1, "children":[]}, {"key":"e", "value":1, "children":[]}]}]}

XML

The XML-representation follows (resembles) the Hash-representation (and output from trie-say):

say $tr.XML;

# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <cor>
#   <TRIEVALUE>2</TRIEVALUE>
#   <t>
#    <TRIEVALUE>1</TRIEVALUE>
#   </t>
#   <e>
#    <TRIEVALUE>1</TRIEVALUE>
#   </e>
#  </cor>
# </TRIEROOT>

Using the XML representation allows for XPath searches, say, using the package XML::XPath. Here is an example:

use XML::XPath;
my $tr0 = trie-create-by-split(<bell best>);
trie-say($tr0);

# TRIEROOT => 2
# └─b => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─s => 1
#       └─t => 1

Convert to XML:

say $tr0.XML;

# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <b>
#   <TRIEVALUE>2</TRIEVALUE>
#   <e>
#    <TRIEVALUE>2</TRIEVALUE>
#    <s>
#     <TRIEVALUE>1</TRIEVALUE>
#     <t>
#      <TRIEVALUE>1</TRIEVALUE>
#     </t>
#    </s>
#    <l>
#     <TRIEVALUE>1</TRIEVALUE>
#     <l>
#      <TRIEVALUE>1</TRIEVALUE>
#     </l>
#    </l>
#   </e>
#  </b>
# </TRIEROOT>

Search for <b e l>:

say XML::XPath.new(xml=>$tr0.XML).find('//b/e/l');

# <l>
#     <TRIEVALUE>1</TRIEVALUE> 
#     <l>
#      <TRIEVALUE>1</TRIEVALUE> 
#     </l> 
#    </l>

WL

The Hash-representation is used in the Mathematica package [AAp2]. Hence, such WL format is provided by the Raku package:

say $tr.WL;

# <|$TrieRoot -> <|$TrieValue -> 2, "cor" -> <|$TrieValue -> 2, "t" -> <|$TrieValue -> 1|>, "e" -> <|$TrieValue -> 1|>|>|>|>

Implementation notes

This package is a Raku re-implementation of the Java Trie package [AAp3].

The initial implementation was:

≈ 5-6 times slower than the Mathematica implementation [AAp2]
≈ 100 times slower than the Java implementation [AAp3]

The initial implementation used:

General types for Trie nodes, i.e. Str for the key and Numeric for the value
Argument type verification with where statements in the signatures of the trie-* functions

After reading [RAC1] I refactored the code to use native types (num, str) and moved the where verifications inside the functions.

I also refactored the function trie-merge to use less copying of data and to take into account which of the two tries has smaller number of children.

After those changes the current Raku implementation is:

≈ 2.5 times slower than the Mathematica implementation [AAp2]
≈ 40 times slower than the Java implementation [AAp3]

After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated in standard Rakudo releases (September/October 2021) additional 20% speed-up was obtained. Currently this package is:

≈ 2.0 times slower than the Mathematica implementation [AAp2]
≈ 30 times slower than the Java implementation [AAp3]

These speed improvements are definitely not satisfactory. I strongly consider:

Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.
- (It turned out option 1 does not produce better results; see [AAp6].)
Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.

TODO

In the following list the most important items are placed first.

Implement "get words" and "get root-to-leaf paths" functions.
- See trie-words and trie-root-to-leaf-paths.
Convert most of the WL unit tests in [AAp5] into Raku tests.
Implement Trie traversal functions.
- The general trie-map function is in a separate role.
  - A concrete traversal functionality is a class that does the role and provides additional context.
Implement (sub-)trie removal functions.
- By threshold (below and above)
- By Pareto principle adherence (top and bottom)
- By regex over the keys
Implement optional ULP spec argument for relevant functions:
- trie-root-to-leaf-paths
- trie-words
- Membership test functions?
Design and code refactoring so trie objects to have OOP interface.
- Instead of just having trie-words($tr, <c>) we should be also able to say $tr.trie-words(<c>).
Implement trie-prune function.
Implement Trie-based classification.
Investigate faster implementations.
- Re-implement the Trie functionalities using hash representation (instead of a tree of Trie-node objects.)
  - See [AAp6].
- Make a C or C++ implementation and hook it up to Raku.
Document examples of doing Trie-based text mining or data-mining.
Program a trie-form visualization that is "wide", i.e. places the children nodes horizontally.