Intro
The LLM revolution is still going strong. With the plethora of tools and frameworks available, as well as new and improved models being released weekly, almost anyone can build simple yet impressive prototypes. Unfortunately, frameworks tend to obscure and hide what happens under the hood. In this blog post, we’ll explore what exactly happens during RAG-based question answering, even before the first byte of a prompt is sent to GPT.
What is RAG?
RAG (Retrieval Augmented Generation) is a question-answering technique in which:
relevant pieces of information are injected into the prompt
a large language model is used to predict the continuation of the prompt
Of course, this simplified definition does not help us much on its own. Natural questions arise:
where can we find this information?
how do we know that information is relevant?
what does it mean to inject this information?
This post will walk through the official LangChain tutorial for RAG and demystify what happens under the hood.
Knowledge base
First of all, we need to establish the knowledge base.
The tutorial takes a very simplistic approach and uses a single document as the knowledge base:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Keep only the post title, header, and content from the fetched HTML
loader = WebBaseLoader(
    web_paths=[
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
    ],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()
# [Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n LLM Powered Autonomous Agents\n \nDate: June 23, 2023 | Es...")]
Remember that in practical problems you’ll likely ingest multiple documents, probably from various sources. Each source and each document may require a slightly different approach to fetch it and parse it into a proper plain-text form that the LLM can use.
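As an illustration, here is a hedged sketch of what a multi-source setup could look like; the TextLoader and the local file path below are hypothetical examples, not part of the tutorial:
from langchain_community.document_loaders import TextLoader, WebBaseLoader

# Each source gets the loader that fits its format; the results are merged
web_docs = WebBaseLoader(web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"]).load()
local_docs = TextLoader("internal_notes.txt").load()  # hypothetical local file
docs = web_docs + local_docs  # downstream chunking and embedding stay the same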
The second step is to chunk the indexed information. Chunking helps the embedding model encode the relevant information in the output vector: longer texts tend to average out vector weights and make the resulting vectors closer to each other. At the same time, smaller and more precise pieces of information injected into the LLM prompt help the model come up with meaningful predictions later.
In principle, after chunking we get one or more chunks for each input document. Overlap is used to avoid situations where a complete piece of information is never available in a single chunk because it’s split across two of them.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# len(splits) is 66 in our case
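A quick sanity check (not part of the tutorial) makes the overlap visible: the end of one chunk should largely reappear at the start of the next.
# Up to chunk_overlap=200 characters are shared between neighbouring chunks,
# so a fact sitting near a split boundary is still available in one piece.
print(splits[0].page_content[-100:])
print(splits[1].page_content[:100])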
Semantic search
Once we have our split documents, we load them into the vector database. Here, Chroma is used as a simple in-memory database.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
During data indexing we execute a step that is critical for the whole RAG flow: we compute (and store) an embedding vector for each of our chunks. An embedding is a high-dimensional vector that represents the meaning of a given text. The vector database is the component responsible for finding relevant context for user questions. Relevance is treated as a vector similarity problem, typically solved via approximate nearest neighbor search. Note that the database itself does not dictate which embedding model you use: here we use embedding models provided by OpenAI, but many self-hosted models can be found on Hugging Face.
In terms of actual database operations, 66 chunks are of course a negligible amount for any modern database. Once you get into the millions, database selection becomes a critical decision for fulfilling functional and non-functional requirements. If you are interested specifically in vector databases, I recommend the tech talk I did for CodiLime.
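To make the “relevance as vector similarity” idea concrete, here is a minimal sketch that embeds two texts and compares them with cosine similarity. The helper function is purely illustrative; the vector database performs the same kind of comparison at scale using approximate nearest neighbor indexes.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question_vector = embeddings.embed_query("What is Task Decomposition?")
chunk_vector = embeddings.embed_query("Task decomposition can be done by LLM with simple prompting.")
print(cosine_similarity(question_vector, chunk_vector))  # higher score = semantically closer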
When executing the RAG question-answering flow, the vector database is used to retrieve relevant context (a minimal sketch follows the list). In particular:
the user input is read and vectorized
the database is queried using the computed vector
the top N documents are returned and injected into the prompt
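Here is what that retrieval step can look like, assuming the vectorstore built above; similarity_search embeds the question and compares it against the stored chunk vectors:
question = "What is Task Decomposition?"
retrieved_docs = vectorstore.similarity_search(question, k=4)  # top 4 most similar chunks
context = "\n\n".join(doc.page_content for doc in retrieved_docs)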
Prompt engineering
Only after all of these steps are done is the LLM called. For the example from our tutorial, the LLM prompt template was:
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
The actual prompt (after the vector database lookup) is as follows; the question is taken verbatim from the user, while the context is retrieved from the vector database:
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is Task Decomposition?
Context: Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.
Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.
(3) Task execution: Expert models execute on the specific tasks and log results.
Instruction:
With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path.
Answer:
With this prompt in place, GPT comes up with a sensible continuation.
Task Decomposition is the process of breaking down a complex task into smaller, more manageable steps. This can be achieved through techniques like Chain of Thought (CoT), which encourages step-by-step reasoning, or by using task-specific instructions and human inputs. It allows for better planning and execution of tasks by transforming a large problem into simpler sub-tasks.
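Putting the pieces together outside of any framework boils down to string formatting plus a single model call. Below is a minimal sketch, assuming ChatOpenAI from langchain_openai and the question/context variables from the retrieval sketch above; the tutorial itself wires these steps together with a LangChain chain.
from langchain_openai import ChatOpenAI

# Fill the template with the user question and the retrieved context
prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    f"Question: {question}\n"
    f"Context: {context}\n"
    "Answer:"
)

llm = ChatOpenAI()
answer = llm.invoke(prompt)  # the model predicts the continuation of the prompt
print(answer.content)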
Common culprits
Funnily enough, even though RAG is considered an LLM-centric solution, most things that can go wrong happen well before the LLM is even called.
If your knowledge base is incomplete, LLM predictions will naturally be limited by the lack of information injected into the prompt.
If your knowledge base indexing is incorrect, LLM predictions will become tainted:
if the embedding model is too simple for your data, you’ll end up injecting context that is not relevant to the user question, which in turn will lead to nonsensical answers from the LLM
if splits are too coarse, the output vectors will fail to capture the relevant context to be injected; if splits are too fine, the relevant context will be too fragmented to be useful for the LLM
if data scraping/chunking is incorrect, you’ll similarly end up injecting context that fails to deliver knowledge that is actually present in the knowledge base
If your knowledge base is large, you’ll encounter at-scale issues with the vector database, just like with any other database solution
I’d argue that strong Software, Data, and DevOps Engineering principles are still critical for such solutions, especially when you move from a small-scale proof of concept to an actual production environment.
Hopefully, after reading this article, you know more about what exactly happens under the hood.
Benefits
Enough about culprits. If done right, the benefits of the RAG approach are fairly obvious:
the additional data provided to the LLM can contain proprietary information that was not available during model training
the context provided to the LLM can be newer than the data the model was trained on
the LLM response can be traced back to the source of information, which improves the transparency and trustworthiness of the solution
the same general-purpose model can be used to answer questions based on different knowledge bases, without fine-tuning or re-training
Are you already operating RAG-like solutions in production? Or are you preparing to launch one soon? Let me know in the comments below!