Rand Stats

Math::DistanceFunctions::Edit

zef:antononcube

Math::DistanceFunctions::Edit

Actions Status Actions Status

Raku package of fast Damerau-Levenshtein distance functions based on C code via "NativeCall".

For a pure Raku implementation see "Text::Levenshtein::Damerau", [NLp1].


Usage examples

The main function provided by this package is edit-distance. Here is comparison invocation with dld from "Text::Levenshtein::Damerau" over two string arguments:

use Math::DistanceFunctions::Edit;
use Text::Levenshtein::Damerau;

my ($w1, $w2) = ('examples', 'samples');
say 'edit-distance : ', edit-distance($w1, $w2);
say 'dld           : ', dld($w1, $w2);
# edit-distance : 2
# dld           : 2

Vectors of integers, booleans, or strings can be also used:

edit-distance(<bark alma area arc>, <Arc alma area>):ignore-case;
# 2
edit-distance([True, False, False, True], [True, False, False]);
# 1

Remark: Currently, elements of integer lists are converted to int32. If larger integers are used then convert to Str first.


Motivation

The motivation for making this package was the slow performance of the DSL translation functions in the package "DSL::Translators", [AAp1]. After profiling, it turned out about 50% of the time is spent in the function dld by "Text::Levenshtein::Demerau".

That is the case because of the fuzzy marching which "DSL::Translators" does:

use DSL::Translators;

dsl-translation('use @dfTitanic; group by sex; show couns;', to => 'Raku')<CODE>
#ERROR: Possible misspelling of 'count' as 'couns'.
# $obj = @dfTitanic ;
# $obj = group-by($obj, "sex") ;
# say "counts: ", $obj>>.elems

The slowdown effect of the "expensive" to compute results by dld can be addressed by:

So, at this point, both approaches were taken: the first in "DSL::Shared", [AAp2], the second by "Math::DistanceFunctions::Edit".


Implementation

The design of "NativeCall" hook-up is taken from "Algorithm::KdTree", [ITp1].

The actual C-implementation was made by several iterations of LLM code generation.

I considered re-programming to C the Raku code of dld in [NLp1], but since Damerau-Levenshtein distance is a very well known, popular topic LLM generations with simple prompts were used.

(And, yes, I read the code and tested it.)


Profiling and performance

Since the speed is the most important reason for this package, after its complete initial version, profiling was done each refactoring step. See the file "faster-word-distances.raku".

Here is en example output of the normalized profiling times done with the script "faster-word-distances.raku":

StrDistance => 1
dld => 0.847204294559419
edit-distance => 0.011560672845434399
rosetta => 2.5342606961356466
sift => 0.021171925438510746

Remark: The timing of Raku's built-in StrDistance is used to normalize the rest of the timings.

Remark: In the profiling also sift4 from "Text::Diff::Sift4", [MDp1], was used. (NQP-based implementation.)


References

[AAp1] Anton Antonov, DSL::Translators Raku package, (2020-2024), GitHub/antononcube.

[AAp2] Anton Antonov, DSL::Shared Raku package, (2020-2024), GitHub/antononcube.

[ITp1] Itsuki Toyota, Algorithm::KdTree Raku package, (2016-2024), GitHub/titsuki.

[MDp1] MaterDuke17, Text::Diff::Sift4 Raku package, (2016-2021), GitHub/MaterDuke17.

[NLp1] Nick Logan, Text::Levenshtein::Damerau Raku package, (2016-2022), GitHub/ugexe.