LLM::Chat

Simple framework for LLM inferencing in Raku. Supports multiple backends (OpenAI-compatible, KoboldCpp), chat templates (ChatML, Llama 3/4, Mistral, Gemma 2, and any HuggingFace Jinja2 template), conversation management with context shifting, and token counting.

Synopsis

use LLM::Chat::Backend::KoboldCpp;
use LLM::Chat::Template::ChatML;
use LLM::Chat::Conversation;

my $backend = LLM::Chat::Backend::KoboldCpp.new(
    api_url  => 'http://localhost:5001/v1',
    template => LLM::Chat::Template::ChatML.new,
);

my $conv = LLM::Chat::Conversation.new;
$conv.add-message('user', 'Hello!');

my $response = $backend.text-completion($conv.messages);

Templates

Built-in Templates

use LLM::Chat::Template::ChatML;
use LLM::Chat::Template::Llama3;
use LLM::Chat::Template::Llama4;
use LLM::Chat::Template::MistralV7;
use LLM::Chat::Template::Gemma2;

my $template = LLM::Chat::Template::ChatML.new;
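
For reference, ChatML wraps each message in <|im_start|> / <|im_end|> sentinels and leaves an open assistant turn for the model to complete. A rendered two-message prompt looks roughly like this (exact whitespace handling depends on the template):

<|im_start|>system
You are helpful.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant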

Jinja2 Templates (HuggingFace)

Load any HuggingFace chat_template directly from a tokenizer_config.json:

use LLM::Chat::Template::Jinja2;

# From tokenizer_config.json
my $json = 'tokenizer_config.json'.IO.slurp;
my $template = LLM::Chat::Template::Jinja2.from-tokenizer-config($json);

# Or provide the template string directly
my $template = LLM::Chat::Template::Jinja2.new(
    template  => $jinja2-string,
    bos-token => '<s>',
    eos-token => '</s>',
);

The Jinja2 template support is powered by Template::Jinja2, a complete Jinja2 engine for Raku with byte-identical output to Python Jinja2.

Backends

KoboldCpp

use LLM::Chat::Backend::KoboldCpp;

my $backend = LLM::Chat::Backend::KoboldCpp.new(
    api_url    => 'http://localhost:5001/v1',
    template   => $template,  # for text completions
    max_tokens => 200,
);

OpenAI-compatible

Any OpenAI-compatible API (vLLM, Ollama, etc.):

use LLM::Chat::Backend::OpenAICommon;

my $backend = LLM::Chat::Backend::OpenAICommon.new(
    api_url => 'http://localhost:8000/v1',
    model   => 'my-model',
);
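
Completion calls take the message list from a Conversation, as in the synopsis. A minimal sketch, assuming the same chat-completion method shown in the Mock example below:

use LLM::Chat::Conversation;

my $conv = LLM::Chat::Conversation.new;
$conv.add-message('user', 'Hello!');

# Returns an LLM::Chat::Backend::Response (see Response below)
my $resp = $backend.chat-completion($conv.messages);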

Mock (for tests)

Canned-response backend for unit and integration tests. Returns pre-configured responses in order, records every call for assertions, and can be scripted to fail on specific calls to exercise retry / fallback paths in downstream consumers.

use LLM::Chat::Backend::Mock;
use LLM::Chat::Backend::Settings;

my $mock = LLM::Chat::Backend::Mock.new(
    settings  => LLM::Chat::Backend::Settings.new,
    responses => ['first', 'second', 'third'],
    # Optional: script per-call failures by index. Returning a defined
    # hash fails that call; returning Nil proceeds normally.
    error-producer => -> $i {
        given $i {
            when 0  { %( class => 'http', status => 503,
                         message => 'bad gateway' ) }
            default { Nil }
        }
    },
);

my $resp = $mock.chat-completion(@messages);
# $mock.recorded-calls[0]<messages>, <response>, <error>, <call-index>, ...
# $mock.call-index — monotonic counter, bumped on every call

See LLM::Chat::Backend::Mock for the full attribute list and recording contract.
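
A sketch of using the mock in a test file; the assertions only touch the call-index counter and recorded-calls keys shown above:

use Test;
use LLM::Chat::Backend::Mock;
use LLM::Chat::Backend::Settings;
use LLM::Chat::Conversation;

my $mock = LLM::Chat::Backend::Mock.new(
    settings  => LLM::Chat::Backend::Settings.new,
    responses => ['pong'],
);

my $conv = LLM::Chat::Conversation.new;
$conv.add-message('user', 'ping');

# The canned 'pong' comes back through the normal Response path (see Response below)
my $resp = $mock.chat-completion($conv.messages);

is $mock.call-index, 1, 'counter bumped on the call';
ok $mock.recorded-calls[0]<messages>:exists, 'messages captured for later assertions';

done-testing;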

Response

Every completion method returns an LLM::Chat::Backend::Response (or ::Stream for streaming calls). Callers poll .is-done, read .msg on success, and inspect .err on failure.

Responses also carry structured error metadata on the failure path so consumers can classify errors without regex-parsing raw messages:

until $resp.is-done { sleep 0.01 }

if $resp.is-success {
    say $resp.msg;
}
else {
    say "failed: {$resp.err}";
    say "  class:  {$resp.error-class  // '(none)'}";   # 'http' / 'timeout' /
                                                        # 'connection' /
                                                        # 'response' / 'unknown'
    say "  status: {$resp.error-status // '(none)'}";   # HTTP code when
                                                        # error-class eq 'http'
}

error-class values are 'http', 'timeout', 'connection', 'response', and 'unknown'; error-status carries the HTTP status code when error-class is 'http'.

LLM::Data::Inference::Task reads these fields to decide between abort / retry-same / advance in its model-fallback policy. Consumers that want the same behaviour without depending on that module can implement it against the Response .error-class / .error-status pair directly, as in the sketch below.
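
The mapping here (retry transient failures and HTTP 5xx, advance past other HTTP errors, abort on the rest) is illustrative, not necessarily the policy LLM::Data::Inference::Task ships:

# Illustrative policy only; tune the mapping to your own fallback needs.
sub next-action($resp) {
    return 'ok' if $resp.is-success;
    given $resp.error-class {
        when 'timeout' | 'connection' { 'retry-same' }
        when 'http' {
            ($resp.error-status // 0) >= 500 ?? 'retry-same' !! 'advance'
        }
        default { 'abort' }   # 'response' / 'unknown'
    }
}

until $resp.is-done { sleep 0.01 }
say next-action($resp);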

Provider-reported usage is also available on the Response when the backend emits it:

$resp.prompt-tokens;       # Int, undefined on backends that don't emit usage
$resp.completion-tokens;   # Int
$resp.total-tokens;        # Int
$resp.cost;                # Num (credits)
$resp.model-used;          # Str, provider-reported routed model
$resp.provider-id;         # Str, provider-assigned request id
$resp.finish-reason;       # Str ('stop' / 'length' / 'content_filter' / ...)
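
Since none of these are guaranteed to be populated, a with guard keeps reporting code safe on backends that send no usage block:

with $resp.total-tokens -> $total {
    say "usage: $total tokens "
        ~ "({$resp.prompt-tokens} prompt, {$resp.completion-tokens} completion)";
}
else {
    say 'no usage reported by this backend';
}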

Conversation Management

use LLM::Chat::Conversation;

my $conv = LLM::Chat::Conversation.new;
$conv.add-message('system', 'You are helpful.');
$conv.add-message('user', 'Hello!');
$conv.add-message('assistant', 'Hi there!');

# Access messages
say $conv.messages;

Token Counting

use LLM::Chat::TokenCounter;

my $counter = LLM::Chat::TokenCounter.new(
    tokenizer-path => 'path/to/tokenizer.json',
    template       => $template,
);

my $count = $counter.count-messages(@messages);
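
A sketch of keeping a conversation inside a context window, reusing the $counter from above; the 4096-token budget is illustrative:

use LLM::Chat::Conversation;

my $budget = 4096;                       # illustrative context size
my $conv   = LLM::Chat::Conversation.new;
$conv.add-message('system', 'You are helpful.');
$conv.add-message('user', 'Explain context shifting in one paragraph.');

my $used = $counter.count-messages($conv.messages);
warn "prompt is $used tokens, over the budget of $budget tokens"
    if $used > $budget;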

Author

Matt Doughty

License

Artistic-2.0