Text::Homoglyph::ASCII

zef:slavenskoj

A Raku module for replacing look-alike Unicode characters (homoglyphs) in text with their ASCII equivalents.

Description

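This module detects and replaces Unicode characters that visually resemble ASCII characters but have different code points. This is useful for tasks such as detecting homograph attacks in URLs and normalizing stylized text, both illustrated under Examples.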

Installation

zef install Text::Homoglyph::ASCII

Usage

Basic Usage

use Text::Homoglyph::ASCII;

# Clean text by replacing homoglyphs with ASCII
my $text = 'Неllo Wоrld';  # Contains Cyrillic 'Н' and 'о'
my $cleaned = clean-ascii($text);
say $cleaned;  # Output: Hello World

Detecting Homoglyphs

# Detect homoglyphs without modifying the text
my $text = 'Неllo Wоrld';
my @homoglyphs = detect-ascii-homoglyphs($text);

for @homoglyphs -> $h {
    say "Found '{$h<char>}' at position {$h<position>}, maps to '{$h<ascii>}'";
}
# Output:
# Found 'Н' at position 0, maps to 'H'
# Found 'о' at position 7, maps to 'o'

Verbose Cleaning

# Get detailed information about the cleaning process
my %result = clean-ascii-verbose('Неllo Wоrld');

say "Original: {%result<original>}";
say "Cleaned: {%result<cleaned>}";
say "Changed: {%result<changed>}";
say "Replacements: {%result<replacements>.elems}";

Supported Character Sets

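The module recognizes and converts homoglyphs from several Unicode character sets. Those appearing in the examples in this document include Cyrillic look-alikes, fullwidth forms, mathematical bold and double-struck letters, and Roman numeral characters.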

Examples

Security: URL Homograph Detection

my $suspicious-url = 'https://gооgle.com';  # Contains Cyrillic 'о'
my $safe-url = clean-ascii($suspicious-url);
say $safe-url;  # https://google.com

if $suspicious-url ne $safe-url {
    say "Warning: URL contains homoglyphs!";
}

Data Normalization

# Normalize fancy social media text
my $fancy = '𝐇𝐞𝐥𝐥𝐨 𝕎𝕠𝕣𝕝𝕕! 🎉 Look at this!';
my $normal = clean-ascii($fancy);
say $normal;  # Hello World! 🎉 Look at this!

Batch Processing

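# A comma-separated list is used here: inside < > the quote marks
# would become part of each string.
my @texts = (
    'Ｆｕｌｌｗｉｄｔｈ',
    '𝐁𝐨𝐥𝐝',
    'Ⅻ',
    'Неllo',
);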

my @cleaned = @texts.map({ clean-ascii($_) });
say @cleaned;  # [Fullwidth Bold XII Hello]

Enhanced Cleaning with Accent Removal

# Handle accented characters and diacritics
my $text = 'café naïve résumé';
my $cleaned = clean-ascii-more($text);
say $cleaned;  # cafe naive resume

# Works with homoglyphs AND accents, preserves emoji
my $mixed = 'Ｈéllo Wörld! 👋😊';  # Fullwidth H + accented letters + emoji
my $cleaned-mixed = clean-ascii-more($mixed);  # new name avoids redeclaring $cleaned
say $cleaned-mixed;  # Hello World! 👋😊

# Combining marks are removed
my $combining = "e\x[0301]";  # e + combining acute accent
say clean-ascii-more($combining);  # e

Pure ASCII Output

# Force everything to ASCII
my $mixed = 'café 👋 €100 →next←';
my $pure = clean-ascii-pure($mixed);
say $pure;  # cafe _ _100 _next_

# Useful for filenames
my $filename = 'My Résumé (2024).pdf';
my $safe = clean-ascii-pure($filename);
say $safe;  # My Resume (2024).pdf

# Handles all types of characters
my $complex = 'Héllo™ café 中文 👍';
my $ascii-only = clean-ascii-pure($complex);
say $ascii-only;  # Hello_ cafe __ _

API Reference

clean-ascii(Str $text --> Str)

Replaces all homoglyphs in the text with their ASCII equivalents.
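As a rough illustration of the idea (not the module's actual implementation), per-character replacement can be sketched with a small lookup table. The %lookalikes table and the replace-lookalikes name are hypothetical and exist only for this example; the module's real mapping is much larger and is not shown here.

```raku
# Hypothetical two-entry table, for illustration only.
my %lookalikes =
    "\x[041D]" => 'H',   # CYRILLIC CAPITAL LETTER EN
    "\x[043E]" => 'o';   # CYRILLIC SMALL LETTER O

# Replace each character found in the table; leave everything else untouched.
sub replace-lookalikes(Str $text --> Str) {
    $text.comb.map({ %lookalikes{$_} // $_ }).join
}

say replace-lookalikes("\x[041D]ello W\x[043E]rld");  # Hello World
```

Text that contains no mapped characters comes back unchanged.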

detect-ascii-homoglyphs(Str $text --> Array)

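Returns an array of hashes, one per homoglyph found. As the Detecting Homoglyphs example shows, each hash carries the keys char (the homoglyph), position (its index in the text, counted from 0), and ascii (the ASCII replacement).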
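The shape of such a result can be sketched in a few lines of core Raku. This is an assumption-laden illustration, not the module's code: the %lookalikes table and the find-lookalikes name are hypothetical, and only the key names char, position, and ascii are taken from the usage example in this document.

```raku
# Hypothetical two-entry table, for illustration only.
my %lookalikes = "\x[041D]" => 'H', "\x[043E]" => 'o';

# Walk the text character by character and record every table hit.
sub find-lookalikes(Str $text --> Array) {
    my @found;
    for $text.comb.kv -> $position, $char {
        next unless %lookalikes{$char}:exists;
        @found.push: %( :$char, :$position, ascii => %lookalikes{$char} );
    }
    @found
}

for find-lookalikes("\x[041D]ello W\x[043E]rld") -> $h {
    say "{$h<char>} at {$h<position>} maps to {$h<ascii>}";
}
```

Positions here count characters from zero, which matches the output shown in the Detecting Homoglyphs example.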

clean-ascii-verbose(Str $text --> Hash)

Returns a hash with the keys original, cleaned, changed, and replacements, as used in the Verbose Cleaning example.

clean-ascii-more(Str $text --> Str)

An enhanced cleaning function that:

  1. Replaces all homoglyphs with ASCII equivalents (same as clean-ascii)
  2. Uses NFKD (Unicode Normalization Form KD) to decompose accented characters
  3. Removes combining marks/diacritics from decomposable characters

This function preserves non-decomposable Unicode characters (like emoji) while converting accented characters to their base forms (é→e, ñ→n, etc.). Use this when you want to normalize accented text while keeping other Unicode symbols intact.

Technical Note: Characters such as ß, æ, and œ are handled by the homoglyph mapping (step 1), not by NFKD decomposition, because Unicode defines no decomposition for them. NFKD handles characters carrying diacritical marks (accents, tildes, umlauts, etc.) by decomposing them into a base character plus combining marks, which are then filtered out.
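Steps 2 and 3 can be approximated with core Raku alone. The sketch below is an assumption about how such a step might look, not the module's actual code; strip-marks is a hypothetical name. It works on code points because converting the NFKD result straight back to a Str would recompose the accents.

```raku
# Decompose with NFKD, then drop every code point whose General_Category
# starts with "M" (Mn, Mc, Me: the combining marks).
sub strip-marks(Str $text --> Str) {
    my @codes = $text.NFKD.list;
    my @kept  = @codes.grep({ !uniprop($_, 'General_Category').starts-with('M') });
    @kept.map(*.chr).join
}

say strip-marks('café naïve résumé');   # cafe naive resume
say strip-marks("e\x[0301]");           # e
```

Emoji pass through untouched because they have no decomposition and are not combining marks.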

clean-ascii-pure(Str $text --> Str)

The most aggressive cleaning function that ensures pure ASCII output:

  1. Applies all transformations from clean-ascii-more
  2. Replaces any remaining non-ASCII characters with underscore (_)

This is useful wherever strict ASCII-only text is required, for example when generating filenames.
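Step 2 on its own can be sketched as follows. The underscore-non-ascii name is hypothetical and the code is a minimal illustration, not the module's implementation: any character containing a code point above 127 becomes a single underscore.

```raku
# Keep characters made entirely of ASCII code points; replace the rest with "_".
sub underscore-non-ascii(Str $text --> Str) {
    $text.comb.map({ .ords.max < 128 ?? $_ !! '_' }).join
}

say underscore-non-ascii('cafe 👋 €100 →next←');  # cafe _ _100 _next_
```

Each non-ASCII character yields exactly one underscore, so the character count of the text is preserved.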

Contributing

https://github.com/slavenskoj/Raku-Text-Homoglyph-ASCII

Author

Danslav Slavenskoj

License

This module is distributed under the Artistic License 2.0.

See Also