LLM::RetrievalAugmentedGeneration

Raku package for doing LLM Retrieval Augmented Generation (RAG).


Motivation and general procedure

Assume we have a large (or largish) collection of (Markdown) documents, and we want to interact with it as if a certain LLM had been specially trained on that collection.

Here is one way to achieve this (a code sketch follows the list):

  1. The "data wrangling problem" is the conversion of the a collection of documents into Markdown files, and then partitioning those files into text chunks.
    • There are several packages and functions that can do the conversion.
    • It is not trivial to partition texts into reasonable text chunks.
      • Certain text paragraphs might be too big for certain LLMs to make embeddings for.
  2. Each of the text chunks is "vectorized" via LLM embedding.
  3. Then the vectors are put in a vector database or "just" into a "nearest neighbors" finding function object.
  4. When a user query is given:
    • The query's LLM embedding vector is computed.
    • The closest text chunk vectors are found.
  5. The corresponding closest text chunks are given to the LLM to formulate a response to the user's query.
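
Here is a minimal sketch of that procedure (the code sketch referenced above). It uses `openai-embeddings` from "WWW::OpenAI", [AAp1], and `llm-synthesize` from "LLM::Functions", [AAp3]; the nearest neighbors are found with a plain cosine-distance scan. The exact shape of the values returned by `openai-embeddings` is an assumption to be checked against the package documentation.

```raku
use WWW::OpenAI;
use LLM::Functions;

# Step 1 (assumed done): the collection is already split into text chunks
my @chunks =
    'Raku has gradual typing: both dynamic and static typing are supported.',
    'Raku grammars are a built-in mechanism for parsing structured text.';

# Step 2: one embedding vector per chunk
# (assumption: format => 'values' gives one numeric array per input text)
my @vectors = openai-embeddings(@chunks, format => 'values');

# Step 3: a simple "nearest neighbors" finding function object
sub cosine-distance(@u, @v) {
    1e0 - sum(@u Z* @v) / sqrt(sum(@u Z* @u) * sum(@v Z* @v))
}
my &find-nearest = -> $qvec, $k {
    @vectors.pairs.sort({ cosine-distance($qvec, $_.value) }).head($k)».key
};

# Step 4: embed the user query and find the closest chunk vectors
my $query = 'Does Raku have static types?';
my $qvec  = openai-embeddings([$query,], format => 'values').head;
my @top   = find-nearest($qvec, 2);

# Step 5: give the corresponding closest chunks to the LLM
say llm-synthesize([
    'Answer the question using only the passages below.',
    "Question: $query",
    'Passages:',
    @chunks[@top].join("\n")
]);
```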

Workflow

The Retrieval Augmented Generation (RAG) workflow we consider is shown in the component diagram below.

Component diagram

Here is a Mermaid-JS component diagram that shows the components of the Retrieval Augmented Generation (RAG) workflow:

```mermaid
flowchart TD
    subgraph LocalVDB[Local Folder]
        A(Vector Database 1)
        B(Vector Database 2)
        C(Vector Database N)
    end
    ID[Ingest document collection]
    SD[Split Documents]
    EV[Get LLM Embedding Vectors]
    CD[Create Vector Database]
    ID --> SD --> EV --> CD

    EV <-.-> LLMs
    
    CD -.- CArray[[CArray<br>representation]]

    CD -.-> |export| LocalVDB

    subgraph Creation
        ID
        SD
        EV
        CD
    end

    LocalVDB -.- JSON[[JSON<br>representation]]

    LocalVDB -.-> |import|D[Ingest Vector Database]
 
    D -.- CArray
    F -.- |nearest neighbors<br>distance function|CArray
    D --> E
    E[/User Query/] --> F[Retrieval]
    F --> G[Document Selection]
    G -->|Top K documents| H(Model Fine-tuning)
    H --> I[[Generation]]
    I <-.-> LLMs
    I -->J[/Output Answer/]
    G -->|Top K passages| K(Model Fine-tuning)
    K --> I

    subgraph RAG[Retrieval Augmented Generation]
        D 
        E
        F
        G
        H
        I
        J
        K
    end
    
    subgraph LLMs
        direction LR
        OpenAI{{OpenAI}}
        Gemini{{Gemini}}
        MistralAI{{MistralAI}}
        LLaMA{{LLaMA}}
    end
```

In this diagram:

  • The "Creation" subgraph shows how a document collection is ingested, split into chunks, "vectorized" via LLM embeddings, and turned into a vector database.
  • The vector databases are exported to (and imported from) a local folder using a JSON representation.
  • The vector databases keep a native CArray representation of the embedding vectors, over which the nearest-neighbors distance functions are computed.
  • The "Retrieval Augmented Generation" subgraph shows how a user query is used to retrieve and select the top K documents or passages, which are given to the LLM to generate the output answer.
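
The sketch below traces the diagram in code. The function `create-vector-database`, the class `LLM::RetrievalAugmentedGeneration::VectorDatabase`, and the methods `export`, `import`, and `nearest` are assumed names, not verified against the package's actual interface; `llm-synthesize` is from "LLM::Functions", [AAp3].

```raku
use LLM::RetrievalAugmentedGeneration;
use LLM::Functions;

# Creation: ingest the document collection, split it into chunks,
# get the embedding vectors, and create a vector database
# (create-vector-database is an assumed function name)
my @chunks = slurp('notes.md').split(/\n\n+/);
my $vdb = create-vector-database(texts => @chunks, name => 'notes');

# Export: write the database to a local folder (JSON representation)
$vdb.export('notes.json');

# ... later, in another session: import the database
# (class and method names are assumed)
my $vdb2 = LLM::RetrievalAugmentedGeneration::VectorDatabase.new.import('notes.json');

# Retrieval: the top K passages for a user query
my $query = 'What is said about grammars?';
my @passages = $vdb2.nearest($query, 3);

# Generation: the top K passages are given to the LLM
say llm-synthesize([
    'Answer the query using the following passages.',
    "Query: $query",
    @passages.join("\n")
]);
```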


Implementation notes

Fast nearest neighbors

As the diagram indicates, the vector databases keep a native CArray representation of the embedding vectors, and the nearest-neighbors distance computations (cf. "Math::Nearest", [AAp6], and "Math::DistanceFunctions::Native", [AAp7]) are done over that representation, which is considerably faster than operating on plain Raku arrays.
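
A sketch of the idea: convert the embedding vectors into native `CArray[num32]` arrays once, at creation time, so that C-implemented distance functions can be applied to them. That `euclidean-distance` of "Math::DistanceFunctions::Native", [AAp7], accepts `CArray` arguments is an assumption here.

```raku
use NativeCall;
use Math::DistanceFunctions::Native;

# Plain Raku vectors converted into native float arrays
my @vectors = [0.1, 0.2, 0.3], [0.9, 0.8, 0.7];
my @native  = @vectors.map({ CArray[num32].new($_».Num) });

# The distance is then computed in C over the native arrays
# (the routine name and signature are assumed)
say euclidean-distance(@native[0], @native[1]);
```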

Smaller export files, faster imports

The vector databases are exported to (and imported from) a local folder in a JSON representation (see the diagram); keeping that representation compact makes the export files smaller and the imports faster.
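
One way to get smaller files and faster imports, sketched below, is to round the embedding values before serialization; the JSON representation is per the diagram, but that the package does exactly this rounding is an assumption.

```raku
use JSON::Fast;

my %vdb =
    name    => 'notes',
    texts   => ['chunk 1', 'chunk 2'],
    vectors => [[0.123456789, -0.987654321], [0.5, 0.25]];

# Rounding to ~4 decimals shrinks the JSON file considerably,
# with little effect on the nearest-neighbors distances (assumed optimization)
%vdb<vectors> = %vdb<vectors>.map({ $_.map(*.round(1e-4)).Array }).Array;

spurt 'notes.json', to-json(%vdb);

# Importing is then just a JSON read
my %vdb2 = from-json(slurp('notes.json'));
```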


References

Packages

[AAp1] Anton Antonov, WWW::OpenAI Raku package, (2023), GitHub/antononcube.

[AAp2] Anton Antonov, WWW::PaLM Raku package, (2023), GitHub/antononcube.

[AAp3] Anton Antonov, LLM::Functions Raku package, (2023-2024), GitHub/antononcube.

[AAp4] Anton Antonov, LLM::Prompts Raku package, (2023-2024), GitHub/antononcube.

[AAp5] Anton Antonov, ML::FindTextualAnswer Raku package, (2023-2024), GitHub/antononcube.

[AAp6] Anton Antonov, Math::Nearest Raku package, (2024), GitHub/antononcube.

[AAp7] Anton Antonov, Math::DistanceFunctions::Native Raku package, (2024), GitHub/antononcube.

[AAp8] Anton Antonov, ML::StreamsBlendingRecommender Raku package, (2021-2023), GitHub/antononcube.