RAG System

What is RAG?


Retrieval-Augmented Generation (RAG) is a technique for optimizing the output of Large Language Models (LLMs) by letting them refer to external knowledge outside of their training data before generating a response. RAG extends the capabilities of LLMs to specific domains or an organization's internal knowledge base, without the need to retrain the model.


RAG addresses a key limitation of LLMs: they are trained on enormous amounts of data, but they are not trained on your data. User data typically sits behind a firewall, which keeps providers such as OpenAI from accessing it, and an LLM needs both permission and a mechanism to reach that data before it can make use of it.


Even with access, an LLM cannot handle all of your data at once. If ChatGPT were given all of a user's documents in a single prompt, it would be limited by its context window. And even in a hypothetical scenario where LLMs had an infinite context window, it would still not be practical to hand the model all of the data on every request. The system has to be selective about what it retrieves, which is exactly the "Retrieval" in Retrieval-Augmented Generation.


In addition, LLM generation often suffers from a common problem called `hallucination`, where the model produces sentences that sound fluent and logical but are in fact incorrect. RAG was created in part as a way to mitigate this problem.

Use cases of RAG Systems


How to perform RAG


The core of RAG is `Retrieval`, and at the highest level there are a few ways to do it:



1. Vector embeddings

Converting text into numbers is known as embedding the text into a vector space, and the resulting vectors are just called `embeddings` for short.

2. Search by meaning

At query time, the question is embedded in the same way, and the chunks whose embeddings sit closest to the query embedding are retrieved.

3. Feed to LLMs

The retrieved chunks are placed into the prompt as context, and the LLM generates an answer grounded in them.

That, in a nutshell, is how we do RAG. The sketch below walks through the three steps with a toy example.
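As a minimal sketch, the snippet below uses a toy bag-of-words "embedding" over a tiny hand-picked vocabulary purely for illustration; in a real system the `embed` function would call an actual embedding model.

```python
import re
import numpy as np

# Toy bag-of-words "embedding" over a tiny fixed vocabulary, standing in for a
# real embedding model (e.g. an OpenAI or sentence-transformers model).
VOCAB = ["refund", "policy", "days", "support", "available", "shipping", "free"]

def embed(text: str) -> np.ndarray:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return np.array([float(words.count(w)) for w in VOCAB])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return float(a @ b) / denom

# 1. Vector embeddings: embed every document chunk once, up front.
chunks = [
    "Our refund policy lasts 30 days.",
    "Support is available around the clock.",
    "Shipping is free for orders over $50.",
]
chunk_vectors = [embed(c) for c in chunks]

# 2. Search by meaning: embed the query and pick the closest chunk.
query = "How long is the refund policy?"
query_vector = embed(query)
best = max(range(len(chunks)), key=lambda i: cosine(query_vector, chunk_vectors[i]))

# 3. Feed to LLMs: place the retrieved chunk into the prompt as context.
prompt = f"Answer using only this context:\n{chunks[best]}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM
```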


Vector search, keyword search, and structured queries can be used individually or combined at the same time in a hybrid retrieval step.
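Continuing the toy example above, one rough way to combine methods is a weighted blend of a keyword-overlap score and the vector-similarity score. The weighting is an arbitrary illustration, not a standard formula, and the snippet reuses `embed`, `cosine`, `chunks`, and `query` from the previous sketch.

```python
import re

# Continues the previous sketch: reuses embed(), cosine(), chunks, and query.

def keyword_score(query: str, chunk: str) -> float:
    # Fraction of the query's words that literally appear in the chunk.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c) / (len(q) or 1)

def hybrid_score(query: str, chunk: str, alpha: float = 0.5) -> float:
    # Weighted blend of semantic similarity and keyword overlap.
    return alpha * cosine(embed(query), embed(chunk)) + (1 - alpha) * keyword_score(query, chunk)

ranked = sorted(chunks, key=lambda c: hybrid_score(query, c), reverse=True)
print(ranked[0])  # top chunk under the hybrid ranking
```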

RAG pipeline
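In practice, libraries such as LlamaIndex wrap all of these steps. A minimal end-to-end pipeline might look like the sketch below, assuming the `llama_index.core` API of recent releases and an OpenAI API key in the environment; the `./data` folder is just an example path.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a local folder (example path).
documents = SimpleDirectoryReader("./data").load_data()

# Index them: chunking and embedding happen here with default settings.
index = VectorStoreIndex.from_documents(documents)

# Retrieval + generation in one call.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does our refund policy say?")
print(response)
```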

RAG challenges


Known challenges of LLMs include hallucination, knowledge that is frozen at training time, and the lack of access to private or domain-specific data, as discussed above.

How to improve RAG with Advanced Techniques


Chunking

Choosing the right context is critical. When chunks are too large, it is harder for the LLM to find the right context inside them, so picking a sensible `chunk size` matters a lot. Smaller chunks are usually better than large ones, because they make it easier to surface only the most relevant text.

However, finding the right chunk size takes iteration and experimentation.

LlamaIndex and LangChain both provide ways to easily test different chunk sizes. It takes a bit of time, but it genuinely helps improve response times, relevancy, and faithfulness.
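For example, with LlamaIndex one way to experiment is to rebuild the index with different splitter settings and compare the answers. This is a sketch assuming the `llama_index.core` API; the chunk sizes are arbitrary starting points and the data path and question are placeholders.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()
question = "What does our refund policy say?"

# Try a few chunk sizes and compare the answers (by eye or with an evaluator).
for chunk_size in (128, 256, 512, 1024):
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=20)
    index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
    answer = index.as_query_engine().query(question)
    print(f"chunk_size={chunk_size}: {answer}")
```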



Small-to-big retrieval

With the small-to-big technique, smaller chunks of text are retrieved first and then used to pull in the relevant larger chunks around them.

For example, large documents are broken into very small chunks, such as single sentences, and retrieval is performed over those specific sentences for maximum precision. A single sentence, however, even if closely related to the query, might not provide enough context to answer the question. So at the synthesis stage, before the query and the context are handed to the LLM, five or ten sentences of surrounding context are pulled in before and after the retrieved sentence. This gives the LLM more context to work with while preserving the precision of the retrieval.
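LlamaIndex ships sentence-window components that implement this pattern. The sketch below assumes the `llama_index.core` API of recent releases (names may differ between versions): sentences are indexed individually, and a post-processor swaps in the surrounding window before synthesis.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

documents = SimpleDirectoryReader("./data").load_data()

# Index single sentences, but keep a window of surrounding sentences as metadata.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,                      # sentences kept on each side
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
index = VectorStoreIndex.from_documents(documents, transformations=[parser])

# At query time, retrieve by sentence, then replace each hit with its larger
# window before the context is handed to the LLM.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("How long is the refund policy?"))
```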

Conversation memory

Conversation memory is conceptually simple: as queries are run, the conversation history so far is included as context for subsequent queries.

LlamaIndex provides a ChatEngine interface built specifically to support conversation memory.
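A minimal sketch, assuming the `llama_index.core` API and an OpenAI key in the environment; the `condense_question` mode rewrites each follow-up into a standalone query using the chat history.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())

# A chat engine keeps the conversation history and folds it into each query.
chat_engine = index.as_chat_engine(chat_mode="condense_question")

print(chat_engine.chat("What does the refund policy say?"))
print(chat_engine.chat("Does it also apply to sale items?"))  # "it" is resolved from the history
```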

Sub Question Query Engine

Sub Question Query Engine is a specialized component designed to handle complex queries by breaking them down into smaller, more manageable sub-questions.


How it Works:

Query Decomposition: An LLM analyzes the complex query and identifies sub-questions that can be answered independently.

Sub-Question Routing: Each sub-question is then directed to the most appropriate data source or retrieval function. This ensures that each question is answered using the most relevant and accurate information available.

Response Synthesis: The engine collects the answers to all the sub-questions and synthesizes a final response that addresses the original complex query.
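A sketch of how this might be wired up with LlamaIndex, assuming the `llama_index.core` API and an OpenAI key in the environment; the two document folders and tool descriptions are hypothetical examples.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Two hypothetical data sources, each indexed separately.
hr_index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./hr_docs").load_data())
finance_index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./finance_docs").load_data())

tools = [
    QueryEngineTool(
        query_engine=hr_index.as_query_engine(),
        metadata=ToolMetadata(name="hr", description="HR policies and benefits"),
    ),
    QueryEngineTool(
        query_engine=finance_index.as_query_engine(),
        metadata=ToolMetadata(name="finance", description="Budgets and expenses"),
    ),
]

# The engine decomposes the query, routes each sub-question to the right tool,
# and synthesizes a final answer from the partial answers.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("How do our hiring costs compare with the HR budget for this year?"))
```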

Self reflection

Self-Reflection refers to the ability of the RAG system to evaluate its own generated responses and decide whether they are accurate, relevant, and complete.

An LLM can be used to judge whether the system is doing a good job, and a human can also be kept in the loop to tell the agent whether it is getting closer to its goal.

How it Works:

1. Information Retrieval: The system retrieves relevant information from the knowledge base, just as in a standard RAG flow

2. Generate Response: Based on the retrieved information, the system generates an initial response

3. Self-Assessment: The system evaluates the initial response against factors such as accuracy, completeness, and relevance

4. Refinement and Improvement: Based on the self-assessment, the system refines and improves the initial response, or simply regenerates it
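As a minimal, framework-agnostic sketch: `llm` below stands in for any function that sends a prompt to your model and returns text, and retrieval (step 1) is assumed to have already produced `context`.

```python
from typing import Callable

def answer_with_reflection(
    llm: Callable[[str], str], question: str, context: str, max_rounds: int = 3
) -> str:
    # Step 2: generate an initial response from the retrieved context.
    answer = llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    for _ in range(max_rounds):
        # Step 3: self-assessment -- the model grades its own answer.
        feedback = llm(
            "Check the answer for accuracy and completeness against the context. "
            "Reply 'OK' if it is good, otherwise describe what is wrong.\n\n"
            f"Context:\n{context}\nQuestion: {question}\nAnswer: {answer}"
        )
        if feedback.strip().upper().startswith("OK"):
            return answer
        # Step 4: refine (or regenerate) the answer using the critique.
        answer = llm(
            f"Improve the answer using this feedback: {feedback}\n\n"
            f"Context:\n{context}\nQuestion: {question}\nPrevious answer: {answer}"
        )
    return answer
```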

We are a software development company based in Vietnam.

We offer DevOps development remotely to support the growth of your business.

If there is anything we can help with, please feel free to consult us.