
Retrieval Augmented Generation (RAG)

Learning outcomes
  • Understand the key components of RAG applications by looking at what popular open-source RAG libraries provide
  • Perform an Agentic RAG task

RAG meme

  • LLMs are not trained on your personal data or fairly recent data.

  • RAG can help provide richer and more accurate responses grounded in external knowledge.

  • It incurs significantly lower computational cost compared to long-context LLMs.

  • We will learn RAG through the lens of popular open-source RAG libraries, viz. LangChain and LlamaIndex.

Basics

RAG figures
Naive RAG


Naive Retrieval System


Stages in RAG 🔄

Atomic unit

  • LangChain's atomic unit is a Document.
  • LlamaIndex's atomic unit is a Node. A collection of Nodes constitutes a Document.
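The two object models can be sketched with plain dataclasses. This is a hypothetical, framework-free illustration, not the actual LangChain or LlamaIndex classes:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Smallest retrievable unit: a text chunk plus its metadata
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Document:
    # LlamaIndex view: a Document groups multiple Nodes;
    # in LangChain, the Document itself is the atomic unit
    nodes: list[Node] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

doc = Document(nodes=[Node("chunk one"), Node("chunk two")],
               metadata={"source": "notes.txt"})
print(len(doc.nodes))  # → 2
```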

Loading 📥

Loading/parsing data from a source and creating well-formatted Documents with metadata. (1)
This step includes splitting texts so that they can be embedded into lower-dimensional vectors.

flowchart LR
  A[Text/image + metadata] --> B[Chunking/Splitting] --> C[Document]

Document parsing sequence

  • LangChain creates Documents first and then performs chunking.
  • LlamaIndex performs chunking first, that becomes a Node and then creates Documents of multiple Nodes.
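A minimal character-level splitter illustrates the chunking step. This is a toy sketch; real splitters such as LangChain's RecursiveCharacterTextSplitter are separator- and token-aware:

```python
def split_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 120, chunk_size=50, overlap=10)
print([len(c) for c in chunks])  # → [50, 50, 40]
```

Overlap between consecutive chunks preserves context that would otherwise be cut at chunk boundaries.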

Indexing 📊

Creating data structures and/or reducing the dimensionality of the data so it can be queried easily. (2)

flowchart LR
  A[Document] --> B[Embeddings]
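In practice an embedding model (wrapped by, e.g., LangChain's Embeddings class) maps each chunk to a dense vector. A toy hashed bag-of-words stand-in shows the idea without downloading a model:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding: hash each token into one of
    `dim` buckets, count occurrences, then L2-normalise."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

v = embed("the cat sat on the mat")
print(round(sum(x * x for x in v), 6))  # unit norm → 1.0
```

A real embedding model captures semantics rather than token identity, but the interface — text in, fixed-length vector out — is the same.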

Storing 💾

Storing Documents, metadata and embeddings in a persistent manner (e.g. vector stores). (3)

flowchart LR
  A[Document] --> C[Vector Store/Storage Context]
  B[Embeddings] --> C

Querying ❓

Retrieving relevant Documents for a user query and feeding them to the LLM for added context. (4)

flowchart LR
  A[Vector Store/Storage Context] --> D[LLM + tools]
  B[Query] --> D
  C[Prompt] --> D
  D --> E[Response]
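The query stage reduces to nearest-neighbour search over stored embeddings plus prompt assembly. A framework-free sketch (the prompt template here is made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec, rows, k=2):
    """Return the texts of the k rows most similar to the query embedding."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r["embedding"]),
                    reverse=True)
    return [r["text"] for r in ranked[:k]]

def build_prompt(query, contexts):
    """Stuff the retrieved chunks into the prompt fed to the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context_block}\n\nQuestion: {query}")
```

The LLM then answers from the stuffed prompt; this is what a RetrieverQueryEngine (LlamaIndex) or Retriever plus chain (LangChain) automates.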

Evaluation 📈

Trace inspection, metrics, and comparisons to test whether the full pipeline gives the desired results. (5)

  1. LlamaIndex ex.: SimpleDirectoryReader class
    LangChain ex.: document_loaders module, langchain_text_splitters module

  2. LlamaIndex ex.: VectorStoreIndex class
    LangChain ex.: Embeddings class

  3. LlamaIndex ex.: StorageContext
    LangChain ex.: VectorStore

  4. LlamaIndex ex.: RetrieverQueryEngine class
    LangChain ex.: Retriever class

  5. LlamaIndex ex.: LLM-Evaluator
    LangChain ex.: LangSmith, QAEvalChain

  6. Retrieval techniques

  7. QA/chat
  8. Misc: Reranker model, GraphRAG, RAPTOR, EraRAG, multimodal
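One simple retrieval metric is hit rate: the fraction of queries for which the expected source document appears in the retrieved set. A toy sketch of the kind of check that evaluation tooling such as LangSmith automates (document ids here are made up):

```python
def hit_rate(retrieved: dict[str, list[str]], expected: dict[str, str]) -> float:
    """Fraction of queries whose expected document id appears in the
    list of retrieved document ids."""
    hits = sum(1 for query, docs in retrieved.items() if expected[query] in docs)
    return hits / len(retrieved)

retrieved = {"q1": ["doc2", "doc7"], "q2": ["doc3"]}
expected = {"q1": "doc7", "q2": "doc9"}
print(hit_rate(retrieved, expected))  # → 0.5
```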

Exercise

DIY: vllm with langchain

When and when not to use RAG ⚖️

  • It was found [1] that RAG lags behind long-context LLMs in the following scenarios: (1)

    • Query requiring multi-step reasoning.
    • General queries on which the embedding model does not perform well.
    • Long and complex queries.
    • Implicit queries requiring the reader to connect the dots.
  • Far easier than fine-tuning on personal data.

  • Allows smaller models with shorter context windows to be on par with larger models, thereby saving compute and memory cost on GPUs.

  1. RAG failure reasons
Note on popular Chat UI frameworks

If you only need basic RAG that can read a few of your documents and search the web, check out popular LLM chat frameworks with integrated RAG functionality.

  • LMstudio
  • Open-webui
  • WebUI by llama.cpp
  • Chainlit

Agentic RAG 🤖

  • An LLM-powered agent decides when and how to retrieve during reasoning. This gives the system more flexibility in decision making, but the engineer less control over it.
  • Router
  • Tool calling
  • Multistep reasoning with tools
  • llm.txt
  • Documentation sites have already started building their RAG powered chatbots: vllm, langchain, anthropic etc.
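The router idea can be sketched as a keyword heuristic. In real agentic RAG the LLM itself picks the tool via tool calling; the tool names below are made up for illustration:

```python
def route(query: str) -> str:
    """Toy router: pick which tool should handle a query.
    An agentic system would let the LLM make this choice itself."""
    q = query.lower()
    if any(word in q for word in ("latest", "today", "news")):
        return "web_search"      # fresh information → search the web
    if any(word in q for word in ("docs", "manual", "paper")):
        return "vector_store"    # knowledge-base question → retrieve
    return "answer_directly"     # the LLM can answer from its own weights

print(route("What does the manual say about paged attention?"))  # → vector_store
```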

(More about Agents will be covered in Day 3.)

Exercise

  • Create the ~/portal/jupyter directory if you don't have one already.

  • Copy llm-workshop/containers/rag/rag_env.sh to your ~/portal/jupyter/, i.e. cp /mimer/NOBACKUP/groups/llm-workshop/containers/rag/rag_env.sh ~/portal/jupyter/

  • Start a Jupyter server on a 1x A40 node using the rag_env.sh runtime, with your project folder as the working directory.

  • Run rag.ipynb

RAG on single node

Compute node setup

Some more useful techniques in retrieval pipelines 🛠️

Have you tried out the chatbot for the UPPMAX docs yet? It's a RAG system under the hood! A C3SE chatbot is coming soon too.

Resources 📚

  1. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach arXiv