Text::Homoglyph::ASCII
A Raku module that replaces look-alike Unicode characters (homoglyphs) with their ASCII equivalents.
Description
This module detects and replaces Unicode characters that visually resemble ASCII characters but have different code points. This is useful for:
- Security: Preventing homograph attacks in URLs and identifiers
- Data normalization: Ensuring consistent text representation
- Text processing: Cleaning user input that may contain fancy Unicode characters
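To see why such characters are easy to miss, it helps to print their code points. The sketch below uses only Raku built-ins (`.ord`, `.uniname`), not this module; the helper name `describe-char` is made up for illustration.

```raku
# Hypothetical helper: show the code point and Unicode name of a character.
sub describe-char(Str $ch --> Str) {
    sprintf('U+%04X %s', $ch.ord, $ch.uniname)
}

# Latin o, Cyrillic o, Greek omicron: near-identical glyphs, different code points.
for 'o', "\x[043E]", "\x[03BF]" -> $ch {
    say describe-char($ch);
}
# U+006F LATIN SMALL LETTER O
# U+043E CYRILLIC SMALL LETTER O
# U+03BF GREEK SMALL LETTER OMICRON
```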
Installation
zef install Text::Homoglyph::ASCII
Usage
Basic Usage
use Text::Homoglyph::ASCII;
# Clean text by replacing homoglyphs with ASCII
my $text = 'Неllo Wоrld'; # Contains Cyrillic 'Н' and 'о'
my $cleaned = clean-ascii($text);
say $cleaned; # Output: Hello World
Detecting Homoglyphs
# Detect homoglyphs without modifying the text
my $text = 'Неllo Wоrld';
my @homoglyphs = detect-ascii-homoglyphs($text);
for @homoglyphs -> $h {
say "Found '{$h<char>}' at position {$h<position>}, maps to '{$h<ascii>}'";
}
# Output:
# Found 'Н' at position 0, maps to 'H'
# Found 'о' at position 7, maps to 'o'
Verbose Cleaning
# Get detailed information about the cleaning process
my %result = clean-ascii-verbose('Неllo Wоrld');
say "Original: {%result<original>}";
say "Cleaned: {%result<cleaned>}";
say "Changed: {%result<changed>}";
say "Replacements: {%result<replacements>.elems}";
Supported Character Sets
The module recognizes and converts homoglyphs from:
- Cyrillic: А, В, Е, К, М, Н, О, Р, С, Т, У, Х, а, е, о, р, с, у, х, etc.
- Greek: Α, Β, Ε, Η, Ι, Κ, Μ, Ν, Ο, Ρ, Τ, α, β, ε, η, ι, κ, μ, ν, ο, ρ, τ, etc.
- Cherokee: Ꭺ, Ᏼ, Ꮯ, Ꭰ, Ꭼ, Ꮐ, Ꮋ, Ꭸ, etc.
- Armenian: Ա, Մ, Օ, Տ, ա, օ, ս, etc.
- Georgian: Ⴍ, Ⴎ, Ⴐ, Ⴝ, ო, ი, etc.
- Mathematical Alphanumeric Symbols: 𝐀-𝐙, 𝐚-𝐳, 𝟎-𝟗, 𝔸-ℤ, etc.
- Fullwidth Forms: Ａ-Ｚ, ａ-ｚ, ０-９
- Roman Numerals: Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, etc.
- Various Unicode blocks: Including bold, italic, script, fraktur, sans-serif variants
- Ligatures: ﬁ→fi, ﬀ→ff, ﬂ→fl, etc.
Examples
Security: URL Homograph Detection
my $suspicious-url = 'https://gооgle.com'; # Contains Cyrillic 'о'
my $safe-url = clean-ascii($suspicious-url);
say $safe-url; # https://google.com
if $suspicious-url ne $safe-url {
say "Warning: URL contains homoglyphs!";
}
Data Normalization
# Normalize fancy social media text
my $fancy = '𝐇𝐞𝐥𝐥𝐨 𝕎𝕠𝕣𝕝𝕕! 🎉 Look at this!';
my $normal = clean-ascii($fancy);
say $normal; # Hello World! 🎉 Look at this!
Batch Processing
# Use a comma-separated list: inside < > the quote marks would become part of each string
my @texts = (
    'Ｆｕｌｌｗｉｄｔｈ',
    '𝐁𝐨𝐥𝐝',
    'Ⅻ',
    'Неllo',
);
my @cleaned = @texts.map({ clean-ascii($_) });
say @cleaned; # [Fullwidth Bold XII Hello]
Enhanced Cleaning with Accent Removal
# Handle accented characters and diacritics
my $text = 'café naïve résumé';
my $cleaned = clean-ascii-more($text);
say $cleaned; # cafe naive resume
# Works with homoglyphs AND accents, preserves emoji
my $mixed = 'Ｈéllo Wörld! 👋😊'; # Fullwidth H + accented letters + emoji
my $cleaned-mixed = clean-ascii-more($mixed);
say $cleaned-mixed; # Hello World! 👋😊
# Combining marks are removed
my $combining = "e\x[0301]"; # e + combining acute accent
say clean-ascii-more($combining); # e
Pure ASCII Output
# Force everything to ASCII
my $mixed = 'café 👋 €100 →next←';
my $pure = clean-ascii-pure($mixed);
say $pure; # cafe _ _100 _next_
# Useful for filenames
my $filename = 'My Résumé (2024).pdf';
my $safe = clean-ascii-pure($filename);
say $safe; # My Resume (2024).pdf
# Handles all types of characters
my $complex = 'Héllo™ café 中文 👍';
my $ascii-only = clean-ascii-pure($complex);
say $ascii-only; # Hello_ cafe __ _
API Reference
clean-ascii(Str $text --> Str)
Replaces all homoglyphs in the text with their ASCII equivalents.
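As a rough mental model only, a replacement of this kind can be pictured as a table lookup per character. The sketch below is not the module's implementation: the three-entry table `%lookalike` and the sub `sketch-clean` are hypothetical, and the real module covers far more characters (including multi-character results such as ligatures).

```raku
# Hypothetical mini-table: Cyrillic look-alikes mapped to ASCII letters.
my %lookalike =
    "\x[041D]" => 'H',   # CYRILLIC CAPITAL LETTER EN
    "\x[0435]" => 'e',   # CYRILLIC SMALL LETTER IE
    "\x[043E]" => 'o';   # CYRILLIC SMALL LETTER O

# Replace each character found in the table; leave all others untouched.
sub sketch-clean(Str $text --> Str) {
    $text.comb.map({ %lookalike{$_} // $_ }).join
}

say sketch-clean("\x[041D]ello W\x[043E]rld");   # Hello World
```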
detect-ascii-homoglyphs(Str $text --> Array)
Returns an array of hashes describing each homoglyph found:
- char: The homoglyph character
- ascii: The ASCII replacement
- position: Character position in the string
- length: Length of the homoglyph (usually 1, but can be more for ligatures)
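To make the record shape concrete, here is a small self-contained sketch that builds hashes with the same four keys. It is an illustration under assumptions, not the module's code: `%lookalike` and `sketch-detect` are hypothetical names, and `length` is fixed at 1 because this toy table holds only single characters.

```raku
# Hypothetical mini-table of look-alikes (the module's real table is much larger).
my %lookalike =
    "\x[041D]" => 'H',   # CYRILLIC CAPITAL LETTER EN
    "\x[043E]" => 'o';   # CYRILLIC SMALL LETTER O

# Collect one hash per look-alike found, keyed like the documented result.
sub sketch-detect(Str $text) {
    my @found;
    for $text.comb.kv -> $position, $char {
        with %lookalike{$char} -> $ascii {
            @found.push: %( :$char, :$ascii, :$position, length => 1 );
        }
    }
    return @found;
}

for sketch-detect("\x[041D]ello W\x[043E]rld") -> $h {
    say "{$h<position>}: maps to {$h<ascii>}";
}
# 0: maps to H
# 7: maps to o
```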
clean-ascii-verbose(Str $text --> Hash)
Returns a hash with:
- original: The original text
- cleaned: The cleaned text
- replacements: Array of replacement details (same as detect-ascii-homoglyphs)
- changed: Boolean indicating if any replacements were made
clean-ascii-more(Str $text --> Str)
An enhanced cleaning function that:
- Replaces all homoglyphs with ASCII equivalents (same as clean-ascii)
- Decomposes Unicode characters (uses Raku's built-in normalization, similar to NFKD)
- Removes combining marks/diacritics from decomposable characters
This function preserves non-decomposable Unicode characters (like emoji) while converting accented characters to their base forms (é→e, ñ→n, etc.). Use this when you want to normalize accented text while keeping other Unicode symbols intact.
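The decompose-then-strip step can be approximated with Raku built-ins alone. The sketch below assumes plain NFKD followed by removal of nonspacing marks (General_Category "Mn"); the module's exact behavior may differ, and `strip-accents-sketch` is a hypothetical name.

```raku
# A sketch of accent removal, assuming NFKD plus dropping nonspacing marks.
sub strip-accents-sketch(Str $text --> Str) {
    # NFKD splits "é" into "e" followed by U+0301 (combining acute accent).
    my @codepoints = $text.NFKD.list;
    # Keep every code point that is not a nonspacing mark.
    my @kept = @codepoints.grep({ uniprop($_, 'General_Category') ne 'Mn' });
    # Turn the remaining code points back into a string.
    return @kept.chrs;
}

say strip-accents-sketch('café naïve résumé');   # cafe naive resume
say strip-accents-sketch('Wörld 👋');             # World 👋
```

Emoji pass through unchanged because they have no decomposition and are not combining marks.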
clean-ascii-pure(Str $text --> Str)
The most aggressive cleaning function that ensures pure ASCII output:
- Applies all transformations from clean-ascii-more
- Replaces any remaining non-ASCII characters with underscore (_)
This is useful for systems that require strict ASCII-only text, such as:
- Legacy file systems with ASCII-only filenames
- URLs or identifiers that must be pure ASCII
- Systems that cannot handle any Unicode characters
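The final fallback step, replacing whatever is still non-ASCII, can be sketched as follows. This covers only the underscore substitution, not the earlier homoglyph and accent handling, and `force-ascii-sketch` is a hypothetical name rather than part of the module.

```raku
# A sketch of the underscore fallback: one "_" per non-ASCII character.
sub force-ascii-sketch(Str $text --> Str) {
    # .comb yields user-perceived characters; keep one only if all of its
    # code points are below U+0080.
    $text.comb.map({ .ords.all < 0x80 ?? $_ !! '_' }).join
}

say force-ascii-sketch('cafe 👋 €100 →next←');   # cafe _ _100 _next_
```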
Contributing
https://github.com/slavenskoj/Raku-Text-Homoglyph-ASCII
Author
Danslav Slavenskoj
License
This module is distributed under the Artistic License 2.0.
See Also