Rand Stats

The uploading author of cpan:ANTONOV does not match the META author of github:antononcube.

ML::TriesWithFrequencies

cpan:ANTONOV

Raku ML::TriesWithFrequencies

This Raku package has functions for creation and manipulation of Tries (Prefix trees) with frequencies.

The package provides Machine Learning (ML) functionalities, not "just" a Trie data structure.

This Raku implementation closely follows the Java implementation [AAp3].

The system of function names follows the one used in the Mathematica package [AAp2].

Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.

Remark: There is a Raku package with an alternative implementation, [AAp6], made mostly for comparison studies. (See the implementation notes below.) The package in this repository, ML::TriesWithFrequencies, is my primary Tries-with-frequencies package.


Usage

Consider a trie (prefix tree) created over a list of words:

use ML::TriesWithFrequencies;
my $tr = trie-create-by-split( <bar bark bars balm cert cell> );
trie-say($tr);
# TRIEROOT => 6
# ├─b => 4
# │ └─a => 4
# │   ├─l => 1
# │   │ └─m => 1
# │   └─r => 3
# │     ├─k => 1
# │     └─s => 1
# └─c => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─r => 1
#       └─t => 1

Here we convert the trie with frequencies above into a trie with probabilities:

my $ptr = trie-node-probabilities( $tr );
trie-say($ptr);
# TRIEROOT => 1
# ├─b => 0.6666666666666666
# │ └─a => 1
# │   ├─l => 0.25
# │   │ └─m => 1
# │   └─r => 0.75
# │     ├─k => 0.3333333333333333
# │     └─s => 0.3333333333333333
# └─c => 0.3333333333333333
#   └─e => 1
#     ├─l => 0.5
#     │ └─l => 1
#     └─r => 0.5
#       └─t => 1

Here we shrink the trie with probabilities above:

trie-say(trie-shrink($ptr));
# TRIEROOT => 1
# ├─ba => 0.6666666666666666
# │ ├─lm => 0.25
# │ └─r => 0.75
# │   ├─k => 0.3333333333333333
# │   └─s => 0.3333333333333333
# └─ce => 0.3333333333333333
#   ├─ll => 0.5
#   └─rt => 0.5

Here we retrieve a sub-trie with a key:

trie-say(trie-retrieve($ptr, 'bar'.comb))
# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333

Representation

Each trie is a tree of objects of the class ML::TriesWithFrequencies::Trie. Such trees can be nicely represented as hash-maps. For example:

my $tr = trie-shrink(trie-create-by-split(<core cort>));
say $tr.gist;
# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}

The function trie-say uses that Hash-representation:

trie-say($tr)
# TRIEROOT => 2
# └─cor => 2
#   ├─e => 1
#   └─t => 1

JSON

The JSON-representation follows the inherent object-tree representation with ML::TriesWithFrequencies::Trie:

say $tr.JSON;
# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"t", "value":1, "children":[]}, {"key":"e", "value":1, "children":[]}]}]}

XML

The XML-representation follows (resembles) the Hash-representation (and output from trie-say):

say $tr.XML;
# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <cor>
#   <TRIEVALUE>2</TRIEVALUE>
#   <t>
#    <TRIEVALUE>1</TRIEVALUE>
#   </t>
#   <e>
#    <TRIEVALUE>1</TRIEVALUE>
#   </e>
#  </cor>
# </TRIEROOT>

Using the XML representation allows for XPath searches, say, using the package XML::XPath. Here is an example:

use XML::XPath;
my $tr0 = trie-create-by-split(<bell best>);
trie-say($tr0);
# TRIEROOT => 2
# └─b => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─s => 1
#       └─t => 1

Convert to XML:

say $tr0.XML;
# <TRIEROOT>
#  <TRIEVALUE>2</TRIEVALUE>
#  <b>
#   <TRIEVALUE>2</TRIEVALUE>
#   <e>
#    <TRIEVALUE>2</TRIEVALUE>
#    <s>
#     <TRIEVALUE>1</TRIEVALUE>
#     <t>
#      <TRIEVALUE>1</TRIEVALUE>
#     </t>
#    </s>
#    <l>
#     <TRIEVALUE>1</TRIEVALUE>
#     <l>
#      <TRIEVALUE>1</TRIEVALUE>
#     </l>
#    </l>
#   </e>
#  </b>
# </TRIEROOT>

Search for <b e l>:

say XML::XPath.new(xml=>$tr0.XML).find('//b/e/l');
# <l>
#     <TRIEVALUE>1</TRIEVALUE> 
#     <l>
#      <TRIEVALUE>1</TRIEVALUE> 
#     </l> 
#    </l>

WL

The Hash-representation is used in the Mathematica package [AAp2]. Hence, such WL format is provided by the Raku package:

say $tr.WL;
# <|$TrieRoot -> <|$TrieValue -> 2, "cor" -> <|$TrieValue -> 2, "t" -> <|$TrieValue -> 1|>, "e" -> <|$TrieValue -> 1|>|>|>|>

Implementation notes

This package is a Raku re-implementation of the Java Trie package [AAp3].

The initial implementation was:

The initial implementation used:

After reading [RAC1] I refactored the code to use native types (num, str) and moved the where verifications inside the functions.

I also refactored the function trie-merge to use less copying of data and to take into account which of the two tries has smaller number of children.

After those changes the current Raku implementation is:

After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated in standard Rakudo releases (September/October 2021) additional 20% speed-up was obtained. Currently this package is:

These speed improvements are definitely not satisfactory. I strongly consider:

  1. Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.

    • (It turned out option 1 does not produce better results; see [AAp6].)
  2. Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.


TODO

In the following list the most important items are placed first.


References

Articles

[AA1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, "Removal of sub-trees in tries", (2013), MathematicaForPrediction at WordPress.

[AA3] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at WordPress. GitHub Markdown.

[JW1] Jonathan Worthington, "The new MoarVM dispatch mechanism is here!", (2021), 6guts at WordPress.

[RAC1] Tib, "Day 10: My 10 commandments for Raku performances", (2020), Raku Advent Calendar.

[WK1] Wikipedia entry, Trie.

Packages

[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.

[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.

[AAp6] Anton Antonov, ML::HashTriesWithFrequencies Raku package, (2021), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, "Prefix Trees with Frequencies for Data Analysis and Machine Learning", (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.