
Retrieval Augmented Generation (RAG)

Learning outcomes
  • Understand the key components of RAG applications by looking at what popular open-source RAG libraries provide
  • Perform an Agentic RAG task

RAG meme

  • LLMs are not trained on your personal data or fairly recent data.

  • RAG can help provide richer and more accurate responses grounded in external knowledge.

  • It incurs significantly lower computational cost compared to long-context LLMs.

  • We will learn RAG through the lens of popular open-source RAG libraries, viz. LangChain and LlamaIndex.

Basics

RAG figures
Naive RAG


Naive Retrieval System


Stages in RAG 🔄

Atomic unit

  • LangChain's atomic unit is a Document.
  • LlamaIndex's atomic unit is a Node. A collection of Nodes constitutes a Document.
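The two object models can be sketched with plain dataclasses. This is a hypothetical, framework-free illustration, not the actual LangChain or LlamaIndex classes:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Smallest retrievable unit: a text chunk plus its metadata
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Document:
    # LlamaIndex view: a Document groups multiple Nodes;
    # in LangChain, the Document itself is the atomic unit
    nodes: list[Node] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

doc = Document(nodes=[Node("chunk one"), Node("chunk two")],
               metadata={"source": "notes.txt"})
print(len(doc.nodes))  # → 2
```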

Loading 📥

Loading/parsing data from a source and creating well-formatted Documents with metadata. (1)
This step includes splitting texts so that they can be embedded into lower-dimensional vectors.

flowchart LR
  A[Text/image + metadata] --> B[Chunking/Splitting] --> C[Document]

Document parsing sequence

  • LangChain creates Documents first and then performs chunking.
  • LlamaIndex performs chunking first, that becomes a Node and then creates Documents of multiple Nodes.
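A minimal character-level splitter illustrates the chunking step. This is a toy sketch; real splitters such as LangChain's RecursiveCharacterTextSplitter are separator- and token-aware:

```python
def split_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 120, chunk_size=50, overlap=10)
print([len(c) for c in chunks])  # → [50, 50, 40]
```

Overlap between consecutive chunks preserves context that would otherwise be cut at chunk boundaries.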

Indexing 📊

Creating data structures and/or reducing the dimensionality of the data so it can be queried easily. (2)

flowchart LR
  A[Document] --> B[Embeddings]
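In practice an embedding model (wrapped by, e.g., LangChain's Embeddings class) maps each chunk to a dense vector. A toy hashed bag-of-words stand-in shows the idea without downloading a model:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding: hash each token into one of
    `dim` buckets, count occurrences, then L2-normalise."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

v = embed("the cat sat on the mat")
print(round(sum(x * x for x in v), 6))  # unit norm → 1.0
```

A real embedding model captures semantics rather than token identity, but the interface — text in, fixed-length vector out — is the same.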

Storing 💾

Storing Documents, metadata and embeddings in a persistent manner (e.g. vector stores). (3)

flowchart LR
  A[Document] --> C[Vector Store/Storage Context]
  B[Embeddings] --> C

Querying ❓

Retrieving relevant Documents for a user query and feeding them to the LLM for added context. (4)

flowchart LR
  A[Vector Store/Storage Context] --> D[LLM + tools]
  B[Query] --> D
  C[Prompt] --> D
  D --> E[Response]
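The query stage reduces to nearest-neighbour search over stored embeddings plus prompt assembly. A framework-free sketch (the prompt template here is made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec, rows, k=2):
    """Return the texts of the k rows most similar to the query embedding."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r["embedding"]),
                    reverse=True)
    return [r["text"] for r in ranked[:k]]

def build_prompt(query, contexts):
    """Stuff the retrieved chunks into the prompt fed to the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context_block}\n\nQuestion: {query}")
```

The LLM then answers from the stuffed prompt; this is what a RetrieverQueryEngine (LlamaIndex) or Retriever plus chain (LangChain) automates.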

Evaluation 📈

Trace inspection, metrics, and comparisons to test whether the full pipeline gives the desired results. (5)

  1. LlamaIndex ex.: SimpleDirectoryReader class
    LangChain ex.: document_loaders module, langchain_text_splitters module

  2. LlamaIndex ex.: VectorStoreIndex class
    LangChain ex.: Embeddings class

  3. LlamaIndex ex.: StorageContext
    LangChain ex.: VectorStore

  4. LlamaIndex ex.: RetrieverQueryEngine class
    LangChain ex.: Retriever class

  5. LlamaIndex ex.: LLM-Evaluator
    LangChain ex.: LangSmith, QAEvalChain

  6. Retrieval techniques

  7. QA/chat
  8. Misc: Reranker model, GraphRAG, RAPTOR, EraRAG, multimodal
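One simple retrieval metric is hit rate: the fraction of queries for which the expected source document appears in the retrieved set. A toy sketch of the kind of check that evaluation tooling such as LangSmith automates (document ids here are made up):

```python
def hit_rate(retrieved: dict[str, list[str]], expected: dict[str, str]) -> float:
    """Fraction of queries whose expected document id appears in the
    list of retrieved document ids."""
    hits = sum(1 for query, docs in retrieved.items() if expected[query] in docs)
    return hits / len(retrieved)

retrieved = {"q1": ["doc2", "doc7"], "q2": ["doc3"]}
expected = {"q1": "doc7", "q2": "doc9"}
print(hit_rate(retrieved, expected))  # → 0.5
```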

Exercise

DIY: vllm with langchain

When and when not to use RAG ⚖️

  • It was found [1] that RAG lags behind long-context LLMs in the following scenarios: (1)

    • Query requiring multi-step reasoning.
    • General queries on which the embedding model does not perform well.
    • Long and complex queries.
    • Implicit queries requiring the reader to connect the dots.
  • Far easier than fine-tuning on personal data.

  • Allows smaller models with shorter context windows to be on par with larger models, thereby saving compute and memory cost on GPUs.

  1. RAG failure reasons
Note on popular Chat UI frameworks

If you only need basic RAG that can read a few of your documents and search the web, check out popular LLM chat frameworks with integrated RAG functionality.

  • LMstudio
  • Open-webui
  • WebUI by llama.cpp
  • Chainlit

Agentic RAG 🤖

  • An LLM-powered agent decides when and how to retrieve during reasoning. This gives the system more flexibility in decision making, but the engineer less control over it.
  • Router
  • Tool calling
  • Multistep reasoning with tools
  • llm.txt
  • Documentation sites have already started building their RAG powered chatbots: vllm, langchain, anthropic etc.
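The router idea can be sketched as a keyword heuristic. In real agentic RAG the LLM itself picks the tool via tool calling; the tool names below are made up for illustration:

```python
def route(query: str) -> str:
    """Toy router: pick which tool should handle a query.
    An agentic system would let the LLM make this choice itself."""
    q = query.lower()
    if any(word in q for word in ("latest", "today", "news")):
        return "web_search"      # fresh information → search the web
    if any(word in q for word in ("docs", "manual", "paper")):
        return "vector_store"    # knowledge-base question → retrieve
    return "answer_directly"     # the LLM can answer from its own weights

print(route("What does the manual say about paged attention?"))  # → vector_store
```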

(More about Agents will be covered in Day 3.)

Exercise

  • Create the ~/portal/jupyter directory if you don't have one already.

  • Copy llm-workshop/containers/rag/rag_env.sh to your ~/portal/jupyter/, i.e. cp /mimer/NOBACKUP/groups/llm-workshop/containers/rag/rag_env.sh ~/portal/jupyter/

  • Start a Jupyter server on a 1x A40 node using the rag_env.sh runtime, with your project folder as the working directory.

  • Run rag.ipynb

RAG on single node

Compute node setup

Some more useful techniques in retrieval pipelines 🛠️

Have you tried out the chatbot for the UPPMAX docs yet? It's a RAG system under the hood! A C3SE chatbot is coming soon too.

Resources 📚

  1. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach arXiv