Math::DistanceFunctions::Edit
Raku package of fast Damerau-Levenshtein distance functions based on C code via "NativeCall".
For a pure Raku implementation see "Text::Levenshtein::Damerau", [NLp1].
Usage examples
The main function provided by this package is edit-distance
.
Here is comparison invocation with dld
from "Text::Levenshtein::Damerau"
over two string arguments:
use Math::DistanceFunctions::Edit;
use Text::Levenshtein::Damerau;
my ($w1, $w2) = ('examples', 'samples');
say 'edit-distance : ', edit-distance($w1, $w2);
say 'dld : ', dld($w1, $w2);
# edit-distance : 2
# dld : 2
Vectors of integers, booleans, or strings can be also used:
edit-distance(<bark alma area arc>, <Arc alma area>):ignore-case;
# 2
edit-distance([True, False, False, True], [True, False, False]);
# 1
Remark: Currently, elements of integer lists are converted to int32
.
If larger integers are used then convert to Str
first.
Motivation
The motivation for making this package was the slow performance of the DSL translation functions in the package
"DSL::Translators", [AAp1].
After profiling, it turned out about 50% of the time is spent in the function dld
by "Text::Levenshtein::Demerau".
That is the case because of the fuzzy marching which "DSL::Translators" does:
use DSL::Translators;
dsl-translation('use @dfTitanic; group by sex; show couns;', to => 'Raku')<CODE>
#ERROR: Possible misspelling of 'count' as 'couns'.
# $obj = @dfTitanic ;
# $obj = group-by($obj, "sex") ;
# say "counts: ", $obj>>.elems
The slowdown effect of the "expensive" to compute results by dld
can be addressed by:
- Certain clever checks can be made before invoking
dld
. - Create a new function called
edit-distance
in C and set up a "NativeCall" connection to it.
So, at this point, both approaches were taken: the first in "DSL::Shared", [AAp2], the second by "Math::DistanceFunctions::Edit".
Implementation
The design of "NativeCall" hook-up is taken from "Algorithm::KdTree", [ITp1].
The actual C-implementation was made by several iterations of LLM code generation.
I considered re-programming to C the Raku code of dld
in [NLp1], but since
Damerau-Levenshtein distance is a
very well known, popular topic
LLM generations with simple prompts were used.
(And, yes, I read the code and tested it.)
Since the speed is the most important reason for this package, after its complete initial version,
profiling was done each refactoring step. See the file "faster-word-distances.raku".
- For ASCII (non-UTF-8) strings
edit-distance
is ≈70 times faster than dld
. - For UTF-8 strings ≈5 times faster.
Here is en example output of the normalized profiling times done with the script "faster-word-distances.raku":
StrDistance => 1
dld => 0.847204294559419
edit-distance => 0.011560672845434399
rosetta => 2.5342606961356466
sift => 0.021171925438510746
Remark: The timing of Raku's built-in StrDistance
is used to normalize the rest of the timings.
Remark: In the profiling also sift4
from "Text::Diff::Sift4", [MDp1], was used.
(NQP-based implementation.)
References
[AAp1] Anton Antonov,
DSL::Translators Raku package,
(2020-2024),
GitHub/antononcube.
[AAp2] Anton Antonov,
DSL::Shared Raku package,
(2020-2024),
GitHub/antononcube.
[ITp1] Itsuki Toyota,
Algorithm::KdTree Raku package,
(2016-2024),
GitHub/titsuki.
[MDp1] MaterDuke17,
Text::Diff::Sift4 Raku package,
(2016-2021),
GitHub/MaterDuke17.
[NLp1] Nick Logan,
Text::Levenshtein::Damerau Raku package,
(2016-2022),
GitHub/ugexe.