Raku-Lingua-Stem-Bulgarian
Introduction
This Raku package is for stemming Bulgarian words. It implements the BulStem algorithm presented in [PN1].
Usage examples
The BulStem function is used to find stems:
use Lingua::Stem::Bulgarian;
say BulStem('покълване')
# покълв
BulStem also works with lists of words:
say BulStem('Покълването на посевите се очаква с търпение, пиене и сланина.'.words)
# (Покълване на посеви се очакв с търпени пи и слани)
The function bg-word-stem can be used as a synonym of BulStem.
Command Line Interface (CLI)
The package provides the CLI function BulStem. Here is its usage message:
> BulStem --help
Usage:
  BulStem [--splitter=<Str>] [--format=<Str>] <text> -- Finds stems of Bulgarian words in text.
  BulStem [--format=<Str>] [<words> ...] -- Finds stems of Bulgarian words.
  BulStem [--format=<Str>] -- Finds stems of Bulgarian words in (pipeline) input.

    <text>              Text to spilt and its words stemmed.
    --splitter=<Str>    String to make a split regex with. [default: '\W+']
    --format=<Str>      Output format one of 'text', 'lines', or 'raku'. [default: 'text']
    [<words> ...]       Words to be stemmed.
Here are example shell commands that use the CLI function BulStem:
> BulStem Какви
Какв
> BulStem --format=raku "Какви са стъблата на тези думи"
# ["Какв", "с", "стъблат", "н", "тез", "дум"]
> BulStem Какви са стъблата на тези думи
# Какв с стъблат н тез дум
Here is a pipeline example using the CLI function GetTokens of the package "Grammar::TokenProcessing", [AAp1]:
GetTokens ./RecommenderPhrases-template | BulStem --format=raku
# ("colnames", "rownames", "агрегац", "агрегир", "ан", "аномал", "аномал", "близос",
# "взем", "внуш", "глобал", "глобалн", "гъстот", "докаж", "доказателств", "доказателств",
# "елемент", "етик", "заред", "изполва", "индексира", "истор", "консума", "латент",
# "латентн", "локал", "локалн", "матриц", "матриц", "напречн", "нещ", "нещ",
# "номализац", "нормализато", "нормализира", "обед", "обедин", "обработ", "обрат",
# "обратн", "обясн", "обясн", "обясн", "повечет", "подход", "подход", "пр", "пр",
# "препор", "препор", "препоръч", "препоръч", "препоръча", "препоръч", "препоръч",
# "препоръчителк", "препоръчк", "препоръчк", "при", "проф", "размер", "разреденос",
# "редиц", "свидетелств", "свидетелств", "свойств", "свойств", "семантич", "съсед",
# "терм", "функци", "характеристи", "характеристи", "честот", "чре")
Remark: These kinds of token (literal) transformations are used in the package "DSL::Bulgarian", [AAp2].
Other implementations
C#, GATE plugin (Java), Java (JDK 1.4), Perl (Original), Python2, Python3, Ruby
Implementation notes
- The resource files are essential for the implementation of BulStem.
  - I had problems ingesting the stem-rules files in [PNp1] with my OS/IDE setup, so I used the files in [MHp1].
- The resource files are used to make the Bulgarian stemming rules.
- The stemming rules Hash object is made at compile time.
- There are 120765 stemming rules with frequencies (counts) ≥ 1.
  - By default, only rules with count ≥ 2 are loaded and used.
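The notes above can be illustrated with a minimal sketch of a rules Hash filtered by count and applied to word endings. It is only a toy under stated assumptions: the rule records, the helper names make-rules and toy-stem, and the matching strategy are invented for illustration and are not the package's actual rule format or code.

```raku
# Toy rule records: (word ending, replacement, count).
# Made up for illustration; NOT taken from the package's resource files.
my @rule-records =
    ('ване', 'в', 5),
    ('те',   '',  3),
    ('ията', 'и', 1);

# Hypothetical helper: build a Hash of rules, keeping those with count >= $min-count.
sub make-rules(@records, Int :$min-count = 2) {
    my %rules;
    for @records -> ($ending, $replacement, $count) {
        %rules{$ending} = $replacement if $count >= $min-count;
    }
    return %rules;
}

# Hypothetical helper: apply the rule with the longest matching ending, if any.
sub toy-stem(Str $word, %rules) {
    for 1 ..^ $word.chars -> $i {
        my $ending = $word.substr($i);
        return $word.substr(0, $i) ~ %rules{$ending} if %rules{$ending}:exists;
    }
    return $word;
}

my %rules = make-rules(@rule-records);
say %rules.elems;                                     # 2
say make-rules(@rule-records, min-count => 1).elems;  # 3
say toy-stem('покълване', %rules);                    # покълв
say toy-stem('посевите', %rules);                     # посеви
```

Raising the minimum count shrinks the Hash and drops rarely observed rules, which is the trade-off behind a default threshold of 2.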
TODO
DONE Respect the word case in the returned result.
- BulStem('ТАБЛА') should return 'ТАБЛ'.
- (Not 'табл' as it currently does.)
DONE CLI that can be inserted in UNIX pipelines.
DONE (Re-)ingestion of stem rules with different min counts.
TODO Performance statistics.
TODO More detailed documentation.
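One way the case-handling item above could be approached is to stem a lower-cased copy of the word and then copy the case pattern of the input onto the result. The helper restore-case below is hypothetical and is not claimed to be how the package does it; it covers only all-upper-case and leading-capital words.

```raku
# Hypothetical helper, not part of the package:
# copy the case pattern of the input word onto a lower-case stem.
sub restore-case(Str $original, Str $stem) {
    return $stem.uc if $original eq $original.uc;   # all upper case
    return $stem.tc if $original eq $original.tc;   # leading capital
    return $stem;                                   # anything else is left as is
}

say restore-case('ТАБЛА', 'табл');             # ТАБЛ
say restore-case('Покълването', 'покълване');  # Покълване
say restore-case('посевите', 'посеви');        # посеви
```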
References
Articles
[PN1] Preslav Nakov, "BulStem: Design and evaluation of inflectional stemmer for Bulgarian", In Workshop on Balkan Language Resources and Tools (Balkan Conference in Informatics).
Packages
[AAp1] Anton Antonov, Grammar::TokenProcessing Raku package, (2022), GitHub/antononcube.
[AAp2] Anton Antonov, DSL::Bulgarian Raku package, (2022), GitHub/antononcube.
[MHp1] Momchil Hardalov, bulstem-py Python package, (2020), (Release: v0.3.3), GitHub/mhardalov.
[PNp1] Preslav Nakov, BulStem: Inflectional Stemmer for Bulgarian, (2002), http://lml.bas.bg/~nakov.