Raku ML::TriesWithFrequencies
This Raku package has functions for creation and manipulation of Tries (Prefix trees) with frequencies.
The package provides Machine Learning (ML) functionalities, not "just" a Trie data structure.
This Raku implementation closely follows the Java implementation [AAp3].
The system of function names follows the one used in the Mathematica package [AAp2].
Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.
Remark: There is a Raku package with an alternative implementation, [AAp6],
made mostly for comparison studies. (See the implementation notes below.)
The package in this repository, ML::TriesWithFrequencies
, is my primary
Tries-with-frequencies package.
Usage
Consider a trie (prefix tree) created over a list of words:
use ML::TriesWithFrequencies; my $tr = trie-create-by-split( <bar bark bars balm cert cell> ); trie-say($tr);
# TRIEROOT => 6
# ├─b => 4
# │ └─a => 4
# │ ├─l => 1
# │ │ └─m => 1
# │ └─r => 3
# │ ├─k => 1
# │ └─s => 1
# └─c => 2
# └─e => 2
# ├─l => 1
# │ └─l => 1
# └─r => 1
# └─t => 1
Here we convert the trie with frequencies above into a trie with probabilities:
my $ptr = trie-node-probabilities( $tr ); trie-say($ptr);
# TRIEROOT => 1
# ├─b => 0.6666666666666666
# │ └─a => 1
# │ ├─l => 0.25
# │ │ └─m => 1
# │ └─r => 0.75
# │ ├─k => 0.3333333333333333
# │ └─s => 0.3333333333333333
# └─c => 0.3333333333333333
# └─e => 1
# ├─l => 0.5
# │ └─l => 1
# └─r => 0.5
# └─t => 1
Here we shrink the trie with probabilities above:
trie-say(trie-shrink($ptr));
# TRIEROOT => 1
# ├─ba => 0.6666666666666666
# │ ├─lm => 0.25
# │ └─r => 0.75
# │ ├─k => 0.3333333333333333
# │ └─s => 0.3333333333333333
# └─ce => 0.3333333333333333
# ├─ll => 0.5
# └─rt => 0.5
Here we retrieve a sub-trie with a key:
trie-say(trie-retrieve($ptr, 'bar'.comb))
# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333
Representation
Each trie is a tree of objects of the class ML::TriesWithFrequencies::Trie
.
Such trees can be nicely represented as hash-maps. For example:
my $tr = trie-shrink(trie-create-by-split(<core cort>)); say $tr.gist;
# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}
The function trie-say
uses that Hash-representation:
trie-say($tr)
# TRIEROOT => 2
# └─cor => 2
# ├─e => 1
# └─t => 1
JSON
The JSON-representation follows the inherent object-tree
representation with ML::TriesWithFrequencies::Trie
:
say $tr.JSON;
# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"t", "value":1, "children":[]}, {"key":"e", "value":1, "children":[]}]}]}
XML
The XML-representation follows (resembles) the Hash-representation
(and output from trie-say
):
say $tr.XML;
# <TRIEROOT>
# <TRIEVALUE>2</TRIEVALUE>
# <cor>
# <TRIEVALUE>2</TRIEVALUE>
# <t>
# <TRIEVALUE>1</TRIEVALUE>
# </t>
# <e>
# <TRIEVALUE>1</TRIEVALUE>
# </e>
# </cor>
# </TRIEROOT>
Using the XML representation allows for
XPath
searches, say, using the package
XML::XPath
.
Here is an example:
use XML::XPath; my $tr0 = trie-create-by-split(<bell best>); trie-say($tr0);
# TRIEROOT => 2
# └─b => 2
# └─e => 2
# ├─l => 1
# │ └─l => 1
# └─s => 1
# └─t => 1
Convert to XML:
say $tr0.XML;
# <TRIEROOT>
# <TRIEVALUE>2</TRIEVALUE>
# <b>
# <TRIEVALUE>2</TRIEVALUE>
# <e>
# <TRIEVALUE>2</TRIEVALUE>
# <s>
# <TRIEVALUE>1</TRIEVALUE>
# <t>
# <TRIEVALUE>1</TRIEVALUE>
# </t>
# </s>
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# </l>
# </l>
# </e>
# </b>
# </TRIEROOT>
Search for <b e l>
:
say XML::XPath.new(xml=>$tr0.XML).find('//b/e/l');
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# </l>
# </l>
WL
The Hash-representation is used in the Mathematica package [AAp2]. Hence, such WL format is provided by the Raku package:
say $tr.WL;
# <|$TrieRoot -> <|$TrieValue -> 2, "cor" -> <|$TrieValue -> 2, "t" -> <|$TrieValue -> 1|>, "e" -> <|$TrieValue -> 1|>|>|>|>
Implementation notes
This package is a Raku re-implementation of the Java Trie package [AAp3].
The initial implementation was:
- ≈ 5-6 times slower than the Mathematica implementation [AAp2]
- ≈ 100 times slower than the Java implementation [AAp3]
The initial implementation used:
- General types for Trie nodes, i.e.
Str
for the key andNumeric
for the value - Argument type verification with
where
statements in the signatures of thetrie-*
functions
After reading [RAC1] I refactored the code to use native types (num
, str
)
and moved the where
verifications inside the functions.
I also refactored the function trie-merge
to use less copying of data and
to take into account which of the two tries has smaller number of children.
After those changes the current Raku implementation is:
- ≈ 2.5 times slower than the Mathematica implementation [AAp2]
- ≈ 40 times slower than the Java implementation [AAp3]
After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated in standard Rakudo releases (September/October 2021) additional 20% speed-up was obtained. Currently this package is:
- ≈ 2.0 times slower than the Mathematica implementation [AAp2]
- ≈ 30 times slower than the Java implementation [AAp3]
These speed improvements are definitely not satisfactory. I strongly consider:
Re-implementing in Raku the Mathematica package [AAp2], i.e. to move into Tries that are hashes.
- (It turned out option 1 does not produce better results; see [AAp6].)
Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.
TODO
In the following list the most important items are placed first.
Implement "get words" and "get root-to-leaf paths" functions.
- See
trie-words
andtrie-root-to-leaf-paths
.
- See
Convert most of the WL unit tests in [AAp5] into Raku tests.
Implement Trie traversal functions.
The general
trie-map
function is in a separate role.- A concrete traversal functionality is a class that does the role and provides additional context.
Implement (sub-)trie removal functions.
By threshold (below and above)
By Pareto principle adherence (top and bottom)
By regex over the keys
Implement optional ULP spec argument for relevant functions:
trie-root-to-leaf-paths
trie-words
Membership test functions?
Design and code refactoring so trie objects to have OOP interface.
- Instead of just having
trie-words($tr, <c>)
we should be also able to say$tr.trie-words(<c>)
.
- Instead of just having
Implement
trie-prune
function.Implement Trie-based classification.
Investigate faster implementations.
Re-implement the Trie functionalities using hash representation (instead of a tree of Trie-node objects.)
- See [AAp6].
Make a C or C++ implementation and hook it up to Raku.
Document examples of doing Trie-based text mining or data-mining.
Program a trie-form visualization that is "wide", i.e. places the children nodes horizontally.
References
Articles
[AA1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, "Removal of sub-trees in tries", (2013), MathematicaForPrediction at WordPress.
[AA3] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at WordPress. GitHub Markdown.
[JW1] Jonathan Worthington, "The new MoarVM dispatch mechanism is here!", (2021), 6guts at WordPress.
[RAC1] Tib, "Day 10: My 10 commandments for Raku performances", (2020), Raku Advent Calendar.
[WK1] Wikipedia entry, Trie.
Packages
[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.
[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.
[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.
[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.
[AAp6] Anton Antonov, ML::HashTriesWithFrequencies Raku package, (2021), GitHub/antononcube.
Videos
[AAv1] Anton Antonov, "Prefix Trees with Frequencies for Data Analysis and Machine Learning", (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.