NAME
String::FuzzyIndex - Fuzzy index routine based on Smith-Waterman algorithm
SYNOPSIS
use String::FuzzyIndex;
my @fzmatch = fzindex($haystack, $needle);
my @fzmatch = fzindex($haystack, $needle, :s_id(6), :s_ocr(3), :s_gap(-1), :s_nid(-1));
my @fzmatch = fzindex($haystack, $needle, :ignore_diacritics);
my @fzmatch = fzindex($haystack, $needle, :glossover_ocrerrors);
DESCRIPTION
String::FuzzyIndex
is a pure Raku module for fuzzy string matchingi based on the Smith-Waterman algorithm. It provides a single subroutine named fzindex
.
SUBROUTINES
fzindex
The subroutine locates the best match(es) of a needle string in a haystack string and returns the best-match data in the form of a List
. In case no best match was found, the List
is empty. Otherwise the returned List
contains, for each best match, an eight-element sub-List
describing that match.
The structure of a sub-List
is as follows:
Index | Type | Description |
:------|:----------|:------------------------------------------------------------ | | |
0 | Rat | Match score as fraction of maximum score for perfect match |
1 | Int | Index of start position of match in haystack |
2 | Int | Index of end position of match in haystack |
3 | Int | Index of start position of match in needle |
4 | Int | Index of end position of match in needle |
5 | List[Str] | Traceback buffer for match in haystack |
6 | List[Str] | Traceback buffer for match in needle |
7 | List[Int] | Traceback buffer comprising the match's similarity scores |
Similarity scoring scheme
The operation of fzindex
is based on the Smith-Waterman algorithm. This algorithm is used to identify the best match(es) by performing so-called 'local sequence alignment', i.e. by comparing segments of the haystack and the needle of all possible lengths, and optimizing a similarity measure.
In the process, individual characters from the haystack and the needle are pairwise compared and scored based on how similar they are. If two characters are identical, their alignment is assigned the similarity score s_id
. If they are not identical, their alignment is assigned the similarity score s_nid
, unless the options ignore_diacritics
and/or :glossover_ocrerrors
are used (see below).
The alignment of haystack and needle segments may also call for the acknowledgement of gaps: positions in an aligned segment that correspond to neither identical nor non-identical elements. To this end, fzindex
implements a linear gap penalty scheme according to which each such gap is assigned a gap penalty s_gap
. In the traceback buffers reported for a match, gaps are denoted with a dash ('-').
When working with strings extracted from OCRed data, mismatches in the alignment of segments may arise from the improper recognition of diacritics and/or the mixup of two characters. A 'ç' may, for instance, have been recognized as a 'c' and so frustrate alignment. Similarly, an 'I' is commonly OCRed as an 'l', or an '|'. fzindex
offers two options to help mitigate such OCR-related issues:
Causes diacritics to be ignored in assessing whether two characters are identical. Hence, the pairwise comparison of 'ç' and 'c' results in assignment of the similarity score s_id
rather than the score s_nid
.
Causes the alignment of two characters that are commonly confused in OCR to be assigned the score s_ocr
rather than the score s_nid
.
The default scores, which may be changed using correspondingly named arguments, are as follows:
Score | Default value | Description |
:------|:--------------:|:-------------------------------------------------------------- | | |
s_id | +6 | Score awarded to identical/matching characters |
s_ocr | +3 | Score awarded to possibly identical characters in OCR context |
s_gap | -1 | Penalty for a gap (a dash in the traceback buffer) |
s_nid | -1 | Score awarded to non-identical characters |
Intended use The Raku built-in index
localizes the first perfect match of a needle within a haystack. It is fast, but when it comes up empty, one may be left wondering if either the haystack or the needle perhaps contains a typo, spelling variation, OCR error or the like that caused index
to report the absence of the needle from the haystack. In such a case, fzindex
may be used as a backup for index
.
However, since index
is comparably fast, fzindex
internally executes an index
-based (actually: indices
-based) match attempt before starting a local sequence alignment attempt.
Because of this, there is generally no need to implement the use of fzindex
as an actual backup to the use of the index
built-in. Instead, the fzindex
routine may be used instead of the built-in index
. Such use of fzindex
obviously comes at a performance penalty, but has the advantages that (i) all perfect matches are reported at once, and (ii) any best fuzzy matches are reported in case no perfect matches exist.
Optimization
The Raku implementation of the Smith-Waterman algorithm is naturally relatively slow, and in the present version of String::FuzzyIndex
no significant optimization attempt has been made. Should you need fuzzy string matching for a large number of lengthy strings, you are probably better off with an implementation in a low level language like C.
That said, some performance enhancing code has already been tested and is likely to appear in future versions.
Example
use String::FuzzyIndex;
my $haystack = 'The squeeky weel gets the grease.';
my $needle = 'squeaky wheel';
my @r = fzindex($haystack, $needle);
say "{@r.elems} match(es) found.";
for @r -> ($score, $hsp, $hep, $nsp, $nep, $htb, $ntb, $stb) {
say "Match score : ", $score;
say "Matching haystack portion : ", $haystack.substr: $hsp, ($hep-$hsp+1);
say "Matching needle portion : ", $needle.substr: $nsp, ($nep-$nsp+1);
say "Scoring traceback buffer : ", $stb;
say "Haystack traceback buffer : ", $htb;
say "Needle traceback buffer : ", $ntb;
}
1 match(es) found.
Match score : 0.820513
Matching haystack portion : squeeky weel
Matching needle portion : squeaky wheel
Scoring traceback buffer : (6 12 18 24 23 29 35 41 47 46 52 58 64)
Haystack traceback buffer : (s q u e e k y w - e e l)
Needle traceback buffer : (s q u e a k y w h e e l)
AUTHOR
threadless-screw
COPYRIGHT AND LICENSE
Copyright 2019 threadless-screw
Copyright 2024 Raku Community
This library is free software; you can redistribute it and/or modify it under the Artistic License 2.0.