Lingua::Stem::Portuguese Raku package
Introduction
This Raku package is for stemming Portuguese words.
It implements the Snowball algorithm presented in [SNa1].
Usage examples
The PortugueseStem function is used to find stems:
use Lingua::Stem::Portuguese;
say PortugueseStem('brotação')
# brot
PortugueseStem also works with lists of words:
say PortugueseStem('Os brotos são aguardados com paciência, bebida e bacon.'.words)
# (Os brot sao aguard com paciencia, beb e bacon.)
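In the output above, punctuation stays attached to the words ("paciencia," and "bacon."), because .words splits on whitespace only. The following is a minimal sketch of stripping punctuation before stemming. It assumes that splitting on runs of non-word characters, as the default splitter of the CLI does, is acceptable for the text at hand. The variable names are illustrative only.

```raku
use Lingua::Stem::Portuguese;

my $text = 'Os brotos são aguardados com paciência, bebida e bacon.';

# Split on runs of non-word characters so that punctuation is not
# passed to the stemmer; :skip-empty drops the empty piece left
# after the final full stop.
my @words = $text.split(/\W+/, :skip-empty);

# PortugueseStem accepts a list of words, as in the example above.
my @stems = PortugueseStem(@words);

# Print each word next to its stem.
for @words Z @stems -> ($word, $stem) {
    say "$word => $stem";
}
```

Raku's \W is Unicode-aware, so accented letters such as "ã" and "ê" count as word characters and the words are not broken at them.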
The function portuguese-word-stem can be used as a synonym of PortugueseStem.
Command Line Interface (CLI)
The package provides the CLI function PortugueseStem. Here is its usage message:
PortugueseStem --help
# Usage:
# PortugueseStem <text> [--splitter=<Str>] [--format=<Str>] -- Finds stems of Portuguese words in text.
# PortugueseStem [<words> ...] [--format=<Str>] -- Finds stems of Portuguese words.
# PortugueseStem [--format=<Str>] -- Finds stems of Portuguese words in (pipeline) input.
#
# <text> Text to spilt and its words stemmed.
# --splitter=<Str> String to make a split regex with. [default: '\W+']
# --format=<Str> Output format one of 'text', 'lines', or 'raku'. [default: 'text']
# [<words> ...] Words to be stemmed.
Here are example shell commands that use the CLI function PortugueseStem:
PortugueseStem Boataria
# Boat
PortugueseStem --format=raku "Módulo Raku que fornece um procedimento para a língua portuguesa."
# ["Modul", "Raku", "que", "fornec", "um", "proced", "par", "a", "lingu", "portugu", ""]
PortugueseStem Verificar a exatidão da seleção usando dicionários e regras
# Verific a exatid da selec us dicion e regr
Here is a pipeline example using the CLI function get-tokens of the package "Grammar::TokenProcessing", [AAp1]:
get-tokens ./DataQueryPhrases-template | PortugueseStem --format=raku
Remark: This kind of token (literal) transformation is used in the packages "DSL::Bulgarian", [AAp2], "DSL::Portuguese", [AAp3], and "DSL::Russian", [AAp4].
Implementation notes
TODO
TODO Respect the word case in the returned result. PortugueseStem('TABLADO') should return 'TABL', not 'tabl' as it currently does.
DONE CLI that can be inserted in UNIX pipelines.
TODO Galician stemmer.
TODO Performance statistics.
TODO More detailed documentation.
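Until the word-case item in the list above is addressed, the desired behaviour for all-caps words can be approximated outside the package. The following is a minimal sketch, not the package's implementation: stem-keeping-case is a hypothetical helper, and it handles only words written entirely in upper case.

```raku
use Lingua::Stem::Portuguese;

# Hypothetical helper, not part of the package: stem a word and, when the
# input is written entirely in upper case, upper-case the stem as well.
# The stemmer is a parameter so the wrapper can be tried with any routine.
sub stem-keeping-case(Str:D $word, &stemmer = &PortugueseStem --> Str) {
    my $stem = stemmer($word).Str;
    # An all-caps word equals its upper-cased form and differs from
    # its lower-cased form (the second check excludes digits-only input).
    my $all-caps = $word eq $word.uc && $word ne $word.lc;
    $all-caps ?? $stem.uc !! $stem;
}

say stem-keeping-case('TABLADO');
say stem-keeping-case('tablado');
```

Mixed-case input is passed through unchanged here; mapping the case letter by letter would be more involved, since the stemmer also replaces accented characters.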
References
Articles
[SNa1] Snowball Team,
Portuguese stemming algorithm,
(2002),
snowball.tartarus.org.
Packages
[AAp1] Anton Antonov,
Grammar::TokenProcessing Raku package,
(2022),
GitHub/antononcube.
[AAp2] Anton Antonov,
DSL::Bulgarian Raku package,
(2022),
GitHub/antononcube.
[AAp3] Anton Antonov,
DSL::Portuguese Raku package,
(2023),
GitHub/antononcube.
[AAp4] Anton Antonov,
DSL::Russian Raku package,
(2022),
GitHub/antononcube.