RAG System

What is RAG?


Retrieval-Augmented Generation (RAG) is a technique for optimizing the output of Large Language Models (LLMs) by letting them refer to external knowledge outside of their training data before generating a response. RAG extends the capabilities of LLMs to specific domains or an organization's internal knowledge base, without the need to retrain the model.


RAG addresses a key limitation of LLMs: they are trained on enormous amounts of data, but they are not trained on your data. User data typically sits behind a firewall, which keeps providers such as OpenAI from accessing it, and an LLM needs both permission and a mechanism to reach that data before it can make use of it.


Even with access, an LLM cannot handle all of your data at once. If ChatGPT were given all of a user's documents in a single prompt, it would be limited by its context window. And even in a hypothetical scenario where LLMs had an infinite context window, it would still not be practical to hand the model all of the data on every request. The system has to be selective about what it retrieves, which is exactly the "Retrieval" in Retrieval-Augmented Generation.


In addition, LLM generation often suffers from a common problem called `hallucination`, where the model produces sentences that sound fluent and logical but are in fact incorrect. RAG was created in part as a way to mitigate this problem.

Use cases of RAG Systems


How to perform RAG


The core of RAG is `Retrieval`, and at the highest level there are a few ways to do it:



1. Vector embeddings

Converting text into numbers is known as embedding the text into a vector space, and the resulting vectors are just called `embeddings` for short.

2. Search by meaning

At query time, the question is embedded in the same way, and the chunks whose embeddings sit closest to the query embedding are retrieved.

3. Feed to LLMs

The retrieved chunks are placed into the prompt as context, and the LLM generates an answer grounded in them.

That, in a nutshell, is how we do RAG. The sketch below walks through the three steps with a toy example.
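As a minimal sketch, the snippet below uses a toy bag-of-words "embedding" over a tiny hand-picked vocabulary purely for illustration; in a real system the `embed` function would call an actual embedding model.

```python
import re
import numpy as np

# Toy bag-of-words "embedding" over a tiny fixed vocabulary, standing in for a
# real embedding model (e.g. an OpenAI or sentence-transformers model).
VOCAB = ["refund", "policy", "days", "support", "available", "shipping", "free"]

def embed(text: str) -> np.ndarray:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return np.array([float(words.count(w)) for w in VOCAB])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return float(a @ b) / denom

# 1. Vector embeddings: embed every document chunk once, up front.
chunks = [
    "Our refund policy lasts 30 days.",
    "Support is available around the clock.",
    "Shipping is free for orders over $50.",
]
chunk_vectors = [embed(c) for c in chunks]

# 2. Search by meaning: embed the query and pick the closest chunk.
query = "How long is the refund policy?"
query_vector = embed(query)
best = max(range(len(chunks)), key=lambda i: cosine(query_vector, chunk_vectors[i]))

# 3. Feed to LLMs: place the retrieved chunk into the prompt as context.
prompt = f"Answer using only this context:\n{chunks[best]}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM
```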


Vector search, keyword search, and structured queries can be used individually or combined at the same time in a hybrid retrieval step.
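Continuing the toy example above, one rough way to combine methods is a weighted blend of a keyword-overlap score and the vector-similarity score. The weighting is an arbitrary illustration, not a standard formula, and the snippet reuses `embed`, `cosine`, `chunks`, and `query` from the previous sketch.

```python
import re

# Continues the previous sketch: reuses embed(), cosine(), chunks, and query.

def keyword_score(query: str, chunk: str) -> float:
    # Fraction of the query's words that literally appear in the chunk.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c) / (len(q) or 1)

def hybrid_score(query: str, chunk: str, alpha: float = 0.5) -> float:
    # Weighted blend of semantic similarity and keyword overlap.
    return alpha * cosine(embed(query), embed(chunk)) + (1 - alpha) * keyword_score(query, chunk)

ranked = sorted(chunks, key=lambda c: hybrid_score(query, c), reverse=True)
print(ranked[0])  # top chunk under the hybrid ranking
```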

RAG pipeline
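In practice, libraries such as LlamaIndex wrap all of these steps. A minimal end-to-end pipeline might look like the sketch below, assuming the `llama_index.core` API of recent releases and an OpenAI API key in the environment; the `./data` folder is just an example path.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a local folder (example path).
documents = SimpleDirectoryReader("./data").load_data()

# Index them: chunking and embedding happen here with default settings.
index = VectorStoreIndex.from_documents(documents)

# Retrieval + generation in one call.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does our refund policy say?")
print(response)
```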

RAG challenges


Known challenges of LLMs include hallucination, knowledge that is frozen at training time, and the lack of access to private or domain-specific data, as discussed above.

How to improve RAG with Advanced Techniques


Chunking

Choosing the right context is critical. When chunks are too large, it is harder for the LLM to find the right context inside them, so picking a sensible `chunk size` matters a lot. Smaller chunks are usually better than large ones, because they make it easier to surface only the most relevant text.

However, finding the right chunk size takes iteration and experimentation.

LlamaIndex and LangChain both provide ways to easily test different chunk sizes. It takes a bit of time, but it genuinely helps improve response times, relevancy, and faithfulness.
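For example, with LlamaIndex one way to experiment is to rebuild the index with different splitter settings and compare the answers. This is a sketch assuming the `llama_index.core` API; the chunk sizes are arbitrary starting points and the data path and question are placeholders.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()
question = "What does our refund policy say?"

# Try a few chunk sizes and compare the answers (by eye or with an evaluator).
for chunk_size in (128, 256, 512, 1024):
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=20)
    index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
    answer = index.as_query_engine().query(question)
    print(f"chunk_size={chunk_size}: {answer}")
```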



Small-to-big retrieval

With the small-to-big technique, smaller chunks of text are retrieved first and then used to pull in the relevant larger chunks around them.

For example, large documents are broken into very small chunks, such as single sentences, and retrieval is performed over those specific sentences for maximum precision. A single sentence, however, even if closely related to the query, might not provide enough context to answer the question. So at the synthesis stage, before the query and the context are handed to the LLM, five or ten sentences of surrounding context are pulled in before and after the retrieved sentence. This gives the LLM more context to work with while preserving the precision of the retrieval.
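LlamaIndex ships sentence-window components that implement this pattern. The sketch below assumes the `llama_index.core` API of recent releases (names may differ between versions): sentences are indexed individually, and a post-processor swaps in the surrounding window before synthesis.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

documents = SimpleDirectoryReader("./data").load_data()

# Index single sentences, but keep a window of surrounding sentences as metadata.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,                      # sentences kept on each side
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
index = VectorStoreIndex.from_documents(documents, transformations=[parser])

# At query time, retrieve by sentence, then replace each hit with its larger
# window before the context is handed to the LLM.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("How long is the refund policy?"))
```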

Conversation memory

Conversation memory is conceptually simple: as queries are run, the conversation history so far is included as context for subsequent queries.

LlamaIndex provides a ChatEngine interface built specifically to support conversation memory.
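A minimal sketch, assuming the `llama_index.core` API and an OpenAI key in the environment; the `condense_question` mode rewrites each follow-up into a standalone query using the chat history.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())

# A chat engine keeps the conversation history and folds it into each query.
chat_engine = index.as_chat_engine(chat_mode="condense_question")

print(chat_engine.chat("What does the refund policy say?"))
print(chat_engine.chat("Does it also apply to sale items?"))  # "it" is resolved from the history
```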

Sub Question Query Engine

Sub Question Query Engine is a specialized component designed to handle complex queries by breaking them down into smaller, more manageable sub-questions.


How it Works:

Query Decomposition: An LLM analyzes the complex query and identifies sub-questions that can be answered independently.

Sub-Question Routing: Each sub-question is then directed to the most appropriate data source or retrieval function. This ensures that each question is answered using the most relevant and accurate information available.

Response Synthesis: The engine collects the answers to all the sub-questions and synthesizes a final response that addresses the original complex query.
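A sketch of how this might be wired up with LlamaIndex, assuming the `llama_index.core` API and an OpenAI key in the environment; the two document folders and tool descriptions are hypothetical examples.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Two hypothetical data sources, each indexed separately.
hr_index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./hr_docs").load_data())
finance_index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./finance_docs").load_data())

tools = [
    QueryEngineTool(
        query_engine=hr_index.as_query_engine(),
        metadata=ToolMetadata(name="hr", description="HR policies and benefits"),
    ),
    QueryEngineTool(
        query_engine=finance_index.as_query_engine(),
        metadata=ToolMetadata(name="finance", description="Budgets and expenses"),
    ),
]

# The engine decomposes the query, routes each sub-question to the right tool,
# and synthesizes a final answer from the partial answers.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("How do our hiring costs compare with the HR budget for this year?"))
```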

Self reflection

Self-Reflection refers to the ability of the RAG system to evaluate its own generated responses and decide whether they are accurate, relevant, and complete.

An LLM can be used to judge whether the system is doing a good job, and a human can also be kept in the loop to tell the agent whether it is getting closer to its goal.

How it Works:

1. Information Retrieval: The system retrieves relevant information from the knowledge base, just as in a standard RAG flow

2. Generate Response: Based on the retrieved information, the system generates an initial response

3. Self-Assessment: The system evaluates the initial response against factors such as accuracy, completeness, and relevance

4. Refinement and Improvement: Based on the self-assessment, the system refines and improves the initial response, or simply regenerates it
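As a minimal, framework-agnostic sketch: `llm` below stands in for any function that sends a prompt to your model and returns text, and retrieval (step 1) is assumed to have already produced `context`.

```python
from typing import Callable

def answer_with_reflection(
    llm: Callable[[str], str], question: str, context: str, max_rounds: int = 3
) -> str:
    # Step 2: generate an initial response from the retrieved context.
    answer = llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    for _ in range(max_rounds):
        # Step 3: self-assessment -- the model grades its own answer.
        feedback = llm(
            "Check the answer for accuracy and completeness against the context. "
            "Reply 'OK' if it is good, otherwise describe what is wrong.\n\n"
            f"Context:\n{context}\nQuestion: {question}\nAnswer: {answer}"
        )
        if feedback.strip().upper().startswith("OK"):
            return answer
        # Step 4: refine (or regenerate) the answer using the critique.
        answer = llm(
            f"Improve the answer using this feedback: {feedback}\n\n"
            f"Context:\n{context}\nQuestion: {question}\nPrevious answer: {answer}"
        )
    return answer
```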

We are a software development company based in Vietnam.

We offer DevOps development remotely to support the growth of your business.

If there is anything we can help with, please feel free to consult us.