LLM::Chat

Introduction

LLM::Chat is a module for inferencing large language models.

It automatically manages pruning old messages, retaining the system prompt (:sysprompt) & other sticky (:sticky) messages, and inserting messages at depth (:depth).

Example Usage

This is an implementation of a terminal-based conversational loop with LLM::Chat:

#!/usr/bin/env raku

use lib 'lib';

use LLM::Chat::Backend::OpenAICommon;
use LLM::Chat::Backend::Settings;
use LLM::Chat::Conversation;
use LLM::Chat::Conversation::Message;
use LLM::Chat::Template::MistralV7;
use LLM::Chat::TokenCounter;
use Tokenizers;

## EDIT THESE TO MATCH YOUR ENVIRONMENT
constant $API_URL     = 'http://192.168.1.193:5001/v1';
constant $MAX_TOKENS  = 1024;
constant $MAX_CONTEXT = 32768;

my @conversation = (
	LLM::Chat::Conversation::Message.new(
		role      => 'system',
		content   => 'You are a helpful assistant.',
		sysprompt => True
	);
);

my $template      = LLM::Chat::Template::MistralV7.new;

my $tokenizer     = Tokenizers.new-from-json(
	slurp('t/fixtures/tokenizer.json')
);

my $token-counter = LLM::Chat::TokenCounter.new(
	tokenizer => $tokenizer,
	template  => $template,
);

my $settings = LLM::Chat::Backend::Settings.new(
	max_tokens => $MAX_TOKENS,
	max_context => $MAX_CONTEXT,
);

my $con = LLM::Chat::Conversation.new(
	token-counter  => $token-counter,
	context-budget => $MAX_CONTEXT - $MAX_TOKENS,
);

my $backend = LLM::Chat::Backend::OpenAICommon.new(
	api_url  => $API_URL,
	settings => $settings,
);

loop {
	my @lines;
	say "Enter your input. Type 'DONE' on a line by itself when finished:\n";

	loop {
		print "> ";
		my $line = $*IN.get // last;
		last if $line.trim eq 'DONE';
		@lines.push: $line;
	}

	last if @lines.elems == 0;
	@conversation.push: LLM::Chat::Conversation::Message.new(
		role    => 'user',
		content => @lines.join("\n"),
	);
	my @prompt = $con.prepare-for-inference(@conversation);
	my $resp   = $backend.chat-completion-stream(@prompt);

	my $last;
	loop {
		my $new = $resp.latest.subst(/^$last/, '');
		$last   = $resp.latest;

		print $new if $new ne "";
		last if $resp.is-done;
		sleep(0.1);
	}

	print "\n";

	print "ERROR: {$resp.err}\n" if !$resp.is-success;
}

See examples/* and t/* for more usage examples.

Current Support

Inference Types

Chat completion (with or without streaming)
Text completion (with or without streaming)

API Types

OpenAI compatible (most backends) - LLM::Chat::Backend::OpenAICommon
KoboldCpp (additional samplers & cancel function) - LLM::Chat::Backend::KoboldCpp

To implement more API types, just extend LLM::Chat::Backend.

Chat Templates

ChatML (LLM::Chat::Template::ChatML)
Gemma 2 (LLM::Chat::Template::Gemma2)
Llama 3 (LLM::Chat::Template::Llama3)
Llama 4 (LLM::Chat::Template::Llama4)
Mistral V7 (LLM::Chat::Template::MistralV7)

To implement more chat templates, just extend LLM::Chat::Template. You will need a correct chat template for accurate context shifting and/or text completion.

Planned

Tool Calling
VLM Capabilities
More APIs & templates
Automatically fetching tokenizers & chat template from HF model identifiers

Contributing

Pull requests and issues welcome.

License

It is extracted from Mistral Nemo, which is an Apache 2.0 licensed model.