Sunday, January 21, 2024

Retrieval Augmented Generation (RAG)

Retrieve relevant data and add it to your prompt to augment the answer an LLM generates.

When you're asking an LLM like ChatGPT a question, you get an answer back that you hope is correct and relevant. It might not be: the model has a huge amount of data to reference, and LLMs can be prone to “hallucinations”. In one example, a lawyer referenced a fake case from ChatGPT.

This may be fine in the case of ChatGPT, as it is a general question-and-answer model. But to make answers more helpful and specific for, say, a lawyer, it would help if the model could reference a source of known cases, which would likely give a better answer. There are two ways to achieve this.

The first is to create your own model through fine-tuning. Platforms like OpenAI allow you to fine-tune your own model based on one of their existing pre-trained models.

Much like prompting, though, fine-tuning is less an exact science and more of an art form. You need to generate a fairly large set of data to train your model on, as well as a separate set to validate the model with. This can be very time-consuming, going back and forth adjusting your dataset through trial and error.

The second is retrieval augmented generation: rather than fine-tuning your own model, use a general LLM and augment the prompt with your own data. You are limited by the input token limit of your LLM, meaning you’ll need to be clever about how you split up your text and retrieve it.

Here’s how we can create an app to “Chat with your PDF documents”.

The steps to create a simple RAG implementation are:

Extract our source text and split it into chunks
Create embeddings for each chunk and store them alongside the raw text
Take the user input and create an embedding
Use the embedding to query our source embeddings and find matches
Use these matched results as additional context for our LLM prompt

First we need to read our PDF and extract the text. There are many expensive API options, but I found pdfjs worked very well on most PDFs I tried.
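As a rough sketch of the extraction step, the function below walks a document object shaped like the one pdfjs resolves from `getDocument(...).promise` and collects the text of each page. The function name `extractPages` and the returned `{ page, text }` shape are my own assumptions for illustration, not necessarily how the app does it.

```javascript
// Sketch: collect the text of every page from a pdfjs-style document.
// `pdf` is assumed to be the object resolved by pdfjsLib.getDocument(...).promise,
// which exposes `numPages` and `getPage(n)` (pages are 1-indexed).
async function extractPages(pdf) {
    const pages = [];
    for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
        const page = await pdf.getPage(pageNumber);
        // getTextContent() resolves to { items: [...] }; text items carry a `str` field.
        const content = await page.getTextContent();
        const text = content.items
            .map(item => item.str ?? '') // items without text contribute nothing
            .join(' ')
            .replace(/\s+/g, ' ')        // collapse runs of whitespace
            .trim();
        pages.push({ page: pageNumber, text });
    }
    return pages;
}
```

Keeping the page number next to the text at this stage makes it easy to cite the source page later.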

Now that we have our source text, we need to break it into chunks to create embeddings. Some research and trial and error suggested a chunk size of approximately 350 tokens was a sweet spot. We need enough text to give enough context to match our question, but not so much that a chunk captures too much data.

Embeddings: “Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text”
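Relatedness between two embeddings is commonly measured with cosine similarity. Here is a minimal sketch using tiny made-up vectors; real embeddings have hundreds or thousands of dimensions, and the vector names are purely illustrative.

```javascript
// Cosine similarity: 1 means same direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a, b) {
    if (a.length !== b.length) {
        throw new Error('Vectors must have the same length');
    }
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors, invented for illustration only.
const question = [1, 0, 1];
const chunkA = [2, 0, 2]; // same direction as the question
const chunkB = [0, 1, 0]; // orthogonal to the question

console.log(cosineSimilarity(question, chunkA)); // close to 1
console.log(cosineSimilarity(question, chunkB)); // 0
```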

js-tiktoken can be used to count the token length of a string as you figure out the correct amount of data to embed.
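One way to pack sentences into chunks of roughly 350 tokens is sketched below. The `countTokens` parameter is where a js-tiktoken based counter would plug in. The default here is a crude estimate of about four characters per token, an assumption that keeps the example dependency-free, and the sentence split is a naive regex rather than a proper NLP tokenizer.

```javascript
// Rough stand-in for a real tokenizer: about 4 characters per token (assumption).
const estimateTokens = text => Math.ceil(text.length / 4);

// Greedily pack whole sentences into chunks of at most `maxTokens` tokens.
// A single sentence longer than `maxTokens` is kept as its own oversized chunk.
function chunkText(text, maxTokens = 350, countTokens = estimateTokens) {
    // Naive sentence split: break after ., ! or ? followed by whitespace.
    const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
    const chunks = [];
    let current = '';
    for (const sentence of sentences) {
        const candidate = current ? `${current} ${sentence}` : sentence;
        if (current && countTokens(candidate) > maxTokens) {
            chunks.push(current); // current chunk is full, start a new one
            current = sentence;
        } else {
            current = candidate;
        }
    }
    if (current) chunks.push(current);
    return chunks;
}
```

Swapping in a real token counter only means passing a different function as the third argument, so the packing logic stays the same while you tune the chunk size.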

In our case this is roughly one page of text. We can store this alongside the page number, allowing us to reference the source page when returning an answer.

We don’t blindly store each page. We use a sliding window method, taking some text from the previous and next pages when building the sentences. Because we are splitting our text into sentences, a plain page-by-page split could cut off sentences that run across pages; this method gives us a better chance of catching whole sentences.
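A minimal sketch of the sliding window idea, assuming `pages` is an array of page strings. The function name, the `overlap` value, and the output shape are illustrative choices, and the overlap is measured in characters for simplicity where a real implementation would more likely cut on sentence boundaries.

```javascript
// Build one chunk per page, padded with the tail of the previous page and
// the head of the next page, so a sentence that straddles a page break is
// more likely to appear in full in at least one chunk.
function windowPages(pages, overlap = 200) {
    return pages.map((text, index) => {
        const hasPrevious = index > 0 && overlap > 0;
        const hasNext = index < pages.length - 1 && overlap > 0;
        const previous = hasPrevious ? pages[index - 1].slice(-overlap) : '';
        const next = hasNext ? pages[index + 1].slice(0, overlap) : '';
        return {
            page: index + 1, // 1-indexed page number, kept for citing the source
            text: [previous, text, next].filter(part => part.length > 0).join(' '),
        };
    });
}
```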

Using NLP we can be clever about how we split this text depending on our use case; tools like wink can help with this.

We now store the text and embeddings, together or separately depending on the datastore used. We can use a shared id, allowing us to retrieve the original text when an embedding is matched.
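The shared id approach might look like the sketch below, which uses two in-memory `Map`s as stand-ins for a vector store and a text store. The id scheme and record shapes are assumptions for illustration; a real datastore will have its own API.

```javascript
// In-memory stand-ins for a vector store and a text store (illustration only).
const vectorStore = new Map(); // id -> { values, metadata }
const textStore = new Map();   // id -> { text, page }

// Save a chunk's embedding and raw text under the same id.
function storeChunk(docId, page, text, embedding) {
    const id = `${docId}-${page}`; // hypothetical id scheme: document id plus page number
    vectorStore.set(id, { values: embedding, metadata: { doc_id: docId } });
    textStore.set(id, { text, page });
    return id;
}

// Given an id returned by a vector match, look up the original text.
function getChunkText(id) {
    return textStore.get(id);
}
```

Storing `doc_id` as metadata on the vector is what later allows a query to be filtered to a single document.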

We have our retrieval data stored and ready for querying.

Taking a question string as input, we create an embedding and use that to query our vector store, looking for a match.
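Creating the question embedding could look something like this sketch. It assumes the embedding comes from OpenAI's embeddings endpoint, which matches the `embeddings.data[0].embedding` shape used in the query that follows. The model name and the injectable `fetchImpl` parameter are my assumptions, the latter only so the function can be tried without network access.

```javascript
// Sketch: request an embedding for the question from OpenAI's embeddings endpoint.
async function createEmbedding(input, apiKey, fetchImpl = fetch) {
    const response = await fetchImpl('https://api.openai.com/v1/embeddings', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${apiKey}`,
        },
        // The model name is an assumption; use the same model that embedded the chunks.
        body: JSON.stringify({ model: 'text-embedding-ada-002', input }),
    });
    if (!response.ok) {
        throw new Error(`Embedding request failed with status ${response.status}`);
    }
    // Response shape: { data: [{ embedding: [...] }], ... }
    return response.json();
}
```

Whatever provider you use, the question must be embedded with the same model as the stored chunks, otherwise the similarity scores are meaningless.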

Using Cloudflare’s vector store:

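// Query the vector store with the question embedding, restricted to one document.
const vectorQuery = await c.env.VECTOR.query(embeddings.data[0].embedding, {
    topK: 1,               // maximum number of matches to return
    returnMetadata: true,  // include stored metadata with each match
    filter: { doc_id: parseInt(docid, 10) }  // always parse the id as base 10
});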

The topK param limits the number of matches returned. Depending on the token input limit of your model and your dataset, you may want to increase topK so you can provide more context in the prompt.

We can then filter the results based on the similarity score:

// Keep only matches above the similarity threshold and collect their ids.
const vecIds = vectorQuery.matches
    .filter(vec => vec.score > 0.7)
    .map(vec => vec.vectorId);

I found a score of 0.7 worked well for my scenario.
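To make topK and the score threshold concrete, here is a brute-force stand-in for a vector query over an in-memory array. It is only meant to show the semantics, not how Cloudflare’s vector store works internally, and the function names and the `{ id, values }` record shape are assumptions.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] ** 2;
        normB += b[i] ** 2;
    }
    return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Score every stored vector, keep the best `topK`, then drop weak matches.
function queryVectors(vectors, queryValues, { topK = 1, minScore = 0.7 } = {}) {
    return vectors
        .map(({ id, values }) => ({ id, score: cosine(queryValues, values) }))
        .sort((x, y) => y.score - x.score) // highest score first
        .slice(0, topK)
        .filter(match => match.score > minScore);
}
```

Raising `topK` returns more candidates, while the threshold stops loosely related chunks from being added to the prompt just because there was room for them.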

Now that we have a match, we retrieve the original text and use that as the context for our prompt.

There are many good guides that will help you create good prompts: DeepLearning, OpenAI

const systemPrompt = `
    When answering the question or responding, use the context in triple quotes
    '''${context}'''`;
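To show how the retrieved text reaches the model, here is a sketch that turns the matched chunks into the context and wraps it in chat messages. The `{ role, content }` message format is the one used by chat-style APIs such as OpenAI's; the function name, the `[page N]` labels, and the `{ page, text }` chunk shape are assumptions for illustration.

```javascript
// Assemble chat messages from the question and the retrieved chunks.
// `chunks` is assumed to be an array of { page, text } records.
function buildMessages(question, chunks) {
    // Label each chunk with its page so the answer can cite the source.
    const context = chunks
        .map(chunk => `[page ${chunk.page}] ${chunk.text}`)
        .join('\n\n');
    const systemPrompt = `
    When answering the question or responding, use the context in triple quotes
    '''${context}'''`;
    return [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: question },
    ];
}
```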

There you have it: your RAG system is now working, and hopefully providing better and more accurate answers using your own data.

Want to give it a try? https://pdf.sheboygin.com/