ML::FindTextualAnswer Raku package

In brief

This package provides function(s) for finding sub-strings in texts that appear to be answers to given questions according to certain Machine Learning (ML) algorithms or Large Language Models (LLMs).

Remark: Currently only LLMs are used, via the packages "WWW::OpenAI", [AAp1], and "WWW::PaLM", [AAp2].

Remark: The LLMs are utilized via the packages "LLM::Functions", [AAp3], and "Text::SubParsers", [AAp4].

Remark: One of the primary motivations for implementing this package is to provide the fundamental functionality of extracting parameter values from (domain specific) texts needed for the implementation of "ML::NLPTemplateEngine", [AAp5].

Installation

Package installations from both sources use the zef installer (which should be bundled with the "standard" Rakudo installation file).

To install the package from the Zef ecosystem use the shell command:

zef install ML::FindTextualAnswer

To install the package from the GitHub repository use the shell command:

zef install https://github.com/antononcube/Raku-ML-FindTextualAnswer.git

Usage examples

Here is an example of finding textual answers:

use ML::FindTextualAnswer;

my $text = "Lake Titicaca is a large, deep lake in the Andes 
on the border of Bolivia and Peru. By volume of water and by surface 
area, it is the largest lake in South America";

find-textual-answer($text, "Where is Titicaca?")

# Lake Titicaca is on the border of Bolivia and Peru.

By default find-textual-answer tries to give short answers. If the option "request" is Whatever then, depending on the number of questions, the request is one of these phrases:

"give the shortest answer of the question:"
"list the shortest answers of the questions:"

In the example above the full query given to the LLM is:

Given the text "Lake Titicaca is a large, deep lake in the Andes on the border of Bolivia and Peru. By volume of water and by surface area, it is the largest lake in South America" give the shortest answer of the question:
Where is Titicaca?
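
The query composition described above can be sketched in a few lines of Raku. This is only an illustration, not the package's actual code: the helper name compose-query is hypothetical, and the exact wording and whitespace handling of the real query are assumptions based on the example shown above.

```raku
# Hypothetical helper; not part of the package's API.
sub compose-query(Str $text, @questions, :$request is copy = Whatever) {
    # With Whatever, pick the default phrase by the number of questions.
    if $request.isa(Whatever) {
        $request = @questions.elems == 1
                ?? 'give the shortest answer of the question:'
                !! 'list the shortest answers of the questions:';
    }
    # Assumption: line breaks in the text are collapsed into single spaces.
    my $flat = $text.subst(/\s+/, ' ', :g).trim;
    # The text and the request go on the first line, followed by one question per line.
    "Given the text \"$flat\" $request\n" ~ @questions.join("\n")
}

say compose-query("Lake Titicaca is a large, deep lake\nin the Andes.", ['Where is Titicaca?']);
```

Passing request => "answer the question:" to this sketch replaces the default phrase, which mirrors how the "request" option is used in the next example.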

Here we get a longer answer by changing the value of "request":

find-textual-answer($text, "Where is Titicaca?", request => "answer the question:")

# Titicaca is located in the Andes, on the border of Bolivia and Peru.

Remark: The function find-textual-answer is inspired by the Mathematica function FindTextualAnswer, [WRI1]; see [JL1] for details. Unfortunately, at this time implementing the full signature of FindTextualAnswer with APIs of OpenAI and PaLM is not easy.

Multiple answers

Consider the text:

my $textCap = q:to/END/;
Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree,
gaining practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry.

In 1884 he emigrated to the United States, where he became a naturalized citizen.
He worked for a short time at the Edison Machine Works in New York City before he struck out on his own.
With the help of partners to finance and market his ideas,
Tesla set up laboratories and companies in New York to develop a range of electrical and mechanical devices.
His alternating current (AC) induction motor and related polyphase AC patents, licensed by Westinghouse Electric in 1888,
earned him a considerable amount of money and became the cornerstone of the polyphase system which that company eventually marketed.
END

$textCap.chars

# 861

Here we ask a single question and request 3 answers:

find-textual-answer($textCap, 'Where lived?', 3, finder => 'PaLM')

# 1. Austrian Empire
# 2. United States
# 3. New York

Here is a rerun without the number-of-answers argument:

find-textual-answer($textCap, 'Where lived?', finder => 'PaLM')

# United States

Multiple questions

If several questions are given to the function find-textual-answer then all questions are spliced with the given text into one query (that is sent to the LLM).

For example, consider the following text and questions:

my $query = 'Make a classifier with the method RandomForest over the data dfTitanic; show precision and accuracy.';

my @questions =
        ['What is the dataset?',
         'What is the method?',
         'Which metrics to show?'
        ];

# [What is the dataset? What is the method? Which metrics to show?]

Then the query sent to the LLM (ChatGPT/PaLM/YandexGPT) is:

Given the text: "Make a classifier with the method RandomForest over the data dfTitanic; show precision and accuracy." list the shortest answers of the questions:
What is the dataset?
What is the method?
Which metrics to show?

The answers are assumed to be given in the same order as the questions, each answer on a separate line. Hence, by splitting the LLM result into lines we get the answers corresponding to the questions.

If the questions are missing question marks, the result is likely to have a completion as its first line, followed by the answers. In that situation the answers are not parsed and a warning message is given.
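
The line-splitting heuristic can be illustrated with the following minimal sketch. The helper assign-answers is hypothetical and is not the package's implementation. It assumes that a mismatch between the number of non-empty lines and the number of questions is the signal for "do not parse, warn instead".

```raku
# Hypothetical helper illustrating the line-splitting heuristic.
sub assign-answers(Str $llm-result, @questions) {
    # Split the LLM result into lines and drop the empty ones.
    my @answers = $llm-result.lines.map(*.trim).grep(*.chars);
    # Assumption: a count mismatch (e.g. an extra completion line) means
    # the answers cannot be assigned reliably.
    if @answers.elems != @questions.elems {
        warn "Got {@answers.elems} answer lines for {@questions.elems} questions; returning the raw result.";
        return $llm-result;
    }
    # Pair each question with the answer on the corresponding line.
    (@questions Z=> @answers).List
}

my @qs = 'What is the dataset?', 'What is the method?';
my $llm-output = "dfTitanic\nRandomForest";   # stand-in for an LLM response
.say for assign-answers($llm-output, @qs);

# What is the dataset? => dfTitanic
# What is the method? => RandomForest
```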

Here is an example of requesting answers to multiple questions and specifying that the result should be a list of pairs:

my %res = find-textual-answer($query, @questions, finder => 'PaLM', :pairs);

.say for %res;

# Which metrics to show? => [precision accuracy]
# What is the method? => RandomForest
# What is the dataset? => dfTitanic

LLM functions

This package, "ML::FindTextualAnswer", uses LLMs via the package "LLM::Functions".

Upon installation the package "LLM::Functions" knows how to access the LLMs ChatGPT and PaLM. (I.e. "LLM::Functions" depends on "WWW::OpenAI" and "WWW::PaLM".)

In some situations it would be preferable to have a pre-configured LLM function for finding the textual answers. Such functions can be obtained with llm-textual-answer-function. Here is an example:

my &fta = llm-textual-answer-function(llm-evaluator => 'PaLM'):pairs;

&fta($query, @questions)

# {dataset => dfTitanic, method => RandomForest, metrics => [precision accuracy]}

That is roughly equivalent to making the following LLM function:

use LLM::Functions;
use Text::SubParsers;
use ML::FindTextualAnswer::LLM::TextualAnswer;

my &fta2 =
        llm-function(
        { "Given the text: $^a \nAnswer the following questions:\n$^b." },
                llm-evaluator => llm-configuration('PaLM', prompts => default-prompt),
                form => sub-parser('JSON'));

# -> **@args, *%args { #`(Block|3613591096848) ... }

Command Line Interface

The package provides a CLI script for finding textual answers:

find-textual-answer --help

# Usage:
#   find-textual-answer [<words> ...] -q|--questions=<Str> [--llm=<Str>] [--mt|--max-tokens[=UInt]] [--temp|--temperature[=Real]] [-r|--request=<Str>] [-p|--pairs] [-a|--auth-key=<Str>] [--timeout[=UInt]] [--echo] [-f|--format=<Str>] [--method=<Str>] -- Command given as a sequence of words.
#   
#     [<words> ...]                  Text to be questioned.
#     -q|--questions=<Str>           Questions separated with '?' or ';'.
#     --llm=<Str>                    Large Language Model, one of 'openai', 'palm', or 'Whatever'. [default: 'Whatever']
#     --mt|--max-tokens[=UInt]       The maximum number of tokens to generate in the completion. [default: 300]
#     --temp|--temperature[=Real]    Temperature. [default: 0.7]
#     -r|--request=<Str>             Request. [default: 'Whatever']
#     -p|--pairs                     Should question-answer pairs be returned or not? [default: False]
#     -a|--auth-key=<Str>            Authorization key (to use OpenAI API.) [default: 'Whatever']
#     --timeout[=UInt]               Timeout. [default: 10]
#     --echo                         Should the query, result, answer be echoed or not? [default: False]
#     -f|--format=<Str>              Format of the result; one of "json", "hash", "values", or "Whatever". [default: 'values']
#     --method=<Str>                 Method for the HTTP POST query; one of "tiny" or "curl". [default: 'tiny']

Here is an example invocation:

find-textual-answer 'Colors in preference order: blue, red, green, white, pink, cherry, light brown.' -q='What is the favorite color?'

# The favorite color is blue.
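
The help text says that the value of --questions holds questions separated with '?' or ';'. A minimal sketch of such splitting is given below. The helper split-questions is hypothetical, and re-appending the question mark to every piece is an assumption (motivated by the remark above that questions without question marks tend to produce unparsable results), not confirmed behavior of the CLI script.

```raku
# Hypothetical sketch of splitting a --questions value; not the script's actual code.
sub split-questions(Str $spec) {
    $spec.split(/ <[ \? \; ]> /, :skip-empty)   # split on '?' or ';'
         .map(*.trim)                           # remove surrounding whitespace
         .grep(*.chars)                         # drop pieces that are empty after trimming
         .map(* ~ '?')                          # assumption: every question ends with '?'
         .List
}

say split-questions('What is the dataset; What is the method?');

# (What is the dataset? What is the method?)
```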

Mermaid diagram

The following flowchart corresponds to the conceptual steps in the package function find-textual-answer with a finder spec that is an LLM::Functions::Evaluator object ("LLM" stands for "Large Language Model"):

graph TD
	UI[/"1) Natural language text<br>2) Questions"/]
	TO[/"Answers"/]
	WR[[Web request]]
	OpenAI{{OpenAI}}
	PaLM{{PaLM}}
	PJ[Parse JSON]
	Q{Return<br>hash?}
	MSTC[Compose query]
	MURL[[Make URL]]
	TTC[Process]
	QAK{Auth key<br>supplied?}
	EAK[["Try to find<br>API key<br>in %*ENV"]]
	QEAF{Auth key<br>found?}
	NAK[/Cannot find auth key/]
	UI --> QAK
	QAK --> |yes|MSTC
	QAK --> |no|EAK
	EAK --> QEAF
	MSTC --> TTC
	QEAF --> |no|NAK
	QEAF --> |yes|TTC
	TTC -.-> MURL -.-> WR -.-> TTC
	WR -.-> |URL|OpenAI 
	WR -.-> |URL|PaLM 
	OpenAI -.-> |JSON|WR
	PaLM -.-> |JSON|WR
	TTC --> Q 
	Q --> |yes|PJ
	Q --> |no|TO
	PJ --> TO

At this point, for "LLM finders" the function find-textual-answer uses the function ML::FindTextualAnswer::LLM::TextualAnswer::Fetch, which, in turn, is based on the packages "LLM::Functions" and "Text::SubParsers".
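
The "Auth key supplied?" branch of the flowchart can be sketched as follows. The helper resolve-auth-key is hypothetical and the environment variable name OPENAI_API_KEY is an assumption used for illustration; the actual lookup is done by the underlying LLM access packages.

```raku
# Hypothetical helper mirroring the auth key branch of the flowchart.
sub resolve-auth-key($auth-key = Whatever, Str :$env-name = 'OPENAI_API_KEY') {
    # A key supplied as a string is used as-is.
    return $auth-key if $auth-key ~~ Str:D;
    # Otherwise try to find the key in %*ENV.
    with %*ENV{$env-name} { return $_ }
    # Corresponds to the "Cannot find auth key" node.
    die "Cannot find auth key: set $env-name or pass a key explicitly.";
}
```

For example, resolve-auth-key('my-key') returns the given key, while resolve-auth-key() falls back to the environment variable and dies if it is not set.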

TODO

TODO LLM implementation

DONE Heuristic for splitting and assigning multiple answers
DONE Separate functions:
- DONE llm-find-textual-answer
- DONE llm-find-textual-answer-function
- DONE llm-classify
DONE Refactor using "LLM::Functions"
- DONE Fetch
- DONE llm-textual-answer
- DONE llm-classify
TODO Post-processing
- DONE Implement post-processing of sub-parser('JSON') LLM function calls.
- TODO Implement grammar-based post processing
  - This requires investigating a fair amount of cases.
TODO CLI
- DONE find-textual-answer
- TODO llm-classify
TODO Documentation
- TODO Documentation of all parameters
  - TODO number of answers per question
  - TODO pairs
  - TODO prelude
  - TODO request
  - TODO strip-with
- TODO More detailed primary use cases
- TODO Classification over a large set of DSL commands
  - TODO DSL commands from previous work
  - TODO Precision and recall llm-classify