RAG Systems Essentials
Enhance your AI agents with external knowledge through Retrieval Augmented Generation
Understanding RAG Systems
Retrieval Augmented Generation (RAG) is a powerful approach that combines the strengths of retrieval-based systems with generative AI models. RAG systems enhance LLMs by providing them with relevant external knowledge at inference time.
Key Insight
RAG systems solve one of the most critical limitations of LLMs: their inability to access information beyond their training data. By retrieving and incorporating external knowledge, RAG enables more accurate, up-to-date, and verifiable responses.
Why RAG Matters for AI Agents
RAG addresses several fundamental challenges in building effective AI agents:
- Knowledge Limitations: LLMs have fixed knowledge cutoffs and can't access new information
- Hallucinations: LLMs sometimes generate plausible but incorrect information
- Domain Specificity: General-purpose LLMs lack deep expertise in specialised domains
- Verifiability: LLM outputs often lack clear sources or citations
- Customisation: Organisations need agents that reflect their specific knowledge and policies
The RAG Architecture
A typical RAG system consists of these key components (a minimal end-to-end sketch follows the list):
- Document Processing Pipeline: Ingests, processes, and chunks documents
- Embedding Model: Converts text chunks into vector representations
- Vector Database: Stores and enables semantic search of embeddings
- Retriever: Finds relevant information based on user queries
- Generator: Uses retrieved information to create accurate responses
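To make the flow concrete, here is a minimal sketch of how these five components connect in LangChain. The sample document, model names, and parameters are illustrative placeholders (an OpenAI API key is assumed for the generator); the sections below build each stage out in detail.
# Minimal end-to-end sketch of the five RAG components (illustrative only;
# assumes an OpenAI API key and the packages installed in the sections below).
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

docs = [Document(page_content="Retrieval Augmented Generation (RAG) grounds LLM answers in retrieved documents.")]
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)  # 1. document processing
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")          # 2. embedding model
vectorstore = FAISS.from_documents(chunks, embeddings)                                           # 3. vector database
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})                                     # 4. retriever
qa_chain = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)       # 5. generator
print(qa_chain.invoke({"query": "What does RAG do?"})["result"])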
When to Use RAG
| Use Case | RAG Benefit | Implementation Complexity |
|---|---|---|
| Knowledge-intensive Q&A | Provides factual, up-to-date information | Medium |
| Domain-specific assistants | Incorporates specialised knowledge | Medium-High |
| Enterprise search | Enables natural language search with contextual answers | High |
| Document summarisation | Ensures summaries are grounded in source material | Medium |
| Content generation | Creates content based on accurate, relevant information | Medium |
Building RAG Systems: Step-by-Step
Let's walk through the process of building a RAG system from scratch, focusing on practical implementation:
1. Document Processing Pipeline
The first step is to ingest and process your documents into a format suitable for retrieval.
Document Processing Steps:
- Document Loading: Import documents from various sources
- Text Extraction: Extract plain text from different file formats
- Text Chunking: Split text into manageable, semantically meaningful chunks
- Metadata Enrichment: Add useful metadata to each chunk
# Ensure required libraries are installed
# pip install langchain langchain-community pypdf python-dotenv unstructured[local-inference] tiktoken faiss-cpu
from langchain_community.document_loaders import PyPDFLoader, CSVLoader, DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document # Import Document class
import os
from dotenv import load_dotenv
load_dotenv()
# Step 1: Load documents from different sources
def load_documents_from_directory(directory_path):
"""Loads documents from a directory using various loaders."""
# Use DirectoryLoader for simplicity, configuring loaders for different types
loader = DirectoryLoader(
directory_path,
glob="**/*.*", # Load all files
loader_cls=TextLoader, # Loader applied to every matched file (swap in format-specific loaders for PDFs etc.)
loader_kwargs={"encoding": "utf-8"}, # Example argument for TextLoader
use_multithreading=True,
show_progress=True,
recursive=True # Load from subdirectories too
# Note: More specific loaders can be added or configured if needed,
# e.g., using UnstructuredFileLoader for broader format support
# or PyPDFLoader specifically for PDFs.
)
try:
documents = loader.load()
print(f"Loaded {len(documents)} documents from {directory_path}.")
return documents
except Exception as e:
print(f"Error loading documents: {e}")
return []
# Step 2: Process and chunk documents
def process_documents(documents, chunk_size=1000, chunk_overlap=200):
"""Splits documents into chunks and enriches metadata."""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", " ", ""], # Try splitting by paragraphs first
is_separator_regex=False,
)
chunks = text_splitter.split_documents(documents)
# Step 3: Enrich with metadata
for i, chunk in enumerate(chunks):
chunk.metadata["chunk_id"] = i
chunk.metadata["chunk_length"] = len(chunk.page_content)
# Ensure source exists before trying to get basename
if "source" in chunk.metadata:
try:
file_name = os.path.basename(chunk.metadata["source"])
chunk.metadata["title"] = os.path.splitext(file_name)[0]
except Exception:
chunk.metadata["title"] = "Unknown"
else:
chunk.metadata["source"] = "Unknown"
chunk.metadata["title"] = "Unknown"
print(f"Split into {len(chunks)} chunks.")
return chunks
# Example usage (assuming a directory named 'rag_data' exists)
# Create dummy data directory and file if it doesn't exist
data_dir = "./rag_data"
if not os.path.exists(data_dir):
os.makedirs(data_dir)
with open(os.path.join(data_dir, "sample.txt"), "w") as f:
f.write("This is sample text for the RAG system demonstration.")
documents = load_documents_from_directory(data_dir)
if documents:
chunks = process_documents(documents)
# print(chunks[0].metadata) # Example: print metadata of first chunk
else:
print("No documents loaded, skipping chunk processing.")
# Clean up dummy data directory (optional)
# import shutil
# if os.path.exists(data_dir):
# shutil.rmtree(data_dir)
Chunking Best Practices
Effective chunking is critical for RAG performance; a token-based splitting sketch follows this list:
- Semantic Boundaries: Try to chunk at paragraph or section boundaries (RecursiveCharacterTextSplitter helps)
- Chunk Size: Aim for 300-1000 tokens per chunk (balances context and specificity)
- Chunk Overlap: Use 10-20% overlap to avoid losing context at boundaries
- Metadata: Include source, page numbers, section titles, and timestamps
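For chunk sizes expressed in tokens rather than characters, a minimal sketch using the splitter's tiktoken-based constructor is shown below; the 800-token size and 100-token overlap (roughly 12%) are illustrative values within the ranges above.
# Token-based chunking sketch (illustrative values; uses tiktoken from the pip list above).
from langchain_text_splitters import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokeniser used by recent OpenAI models
    chunk_size=800,               # within the suggested 300-1000 token range
    chunk_overlap=100,            # roughly 12% overlap
    separators=["\n\n", "\n", " ", ""],
)
token_chunks = token_splitter.split_documents(documents) if documents else []
print(f"Token-based splitting produced {len(token_chunks)} chunks.")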
2. Embedding Generation
Next, convert your text chunks into vector embeddings that capture their semantic meaning.
Embedding Considerations:
- Model Selection: Choose embedding models based on performance, cost, and dimensions
- Batch Processing: Generate embeddings in batches to improve efficiency
- Caching: Store embeddings to avoid regenerating them (a caching sketch follows the code below)
# Ensure necessary libraries are installed
# pip install langchain-openai langchain-huggingface sentence-transformers
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
import numpy as np
import time
# Assume 'chunks' is a list of Document objects from the previous step
# Define dummy chunks if previous step was skipped
if 'chunks' not in locals():
from langchain_core.documents import Document
chunks = [Document(page_content="This is chunk 1.", metadata={}),
Document(page_content="This is chunk 2.", metadata={})]
# Option 1: OpenAI Embeddings (hosted)
def generate_openai_embeddings(docs):
"""Generates embeddings using OpenAI API with batching."""
embeddings_model = OpenAIEmbeddings()
texts = [doc.page_content for doc in docs]
# Process in batches (OpenAI API handles batching internally, but good practice for large lists)
batch_size = 100 # Adjust based on API limits and performance
all_embeddings = []
start_time = time.time()
for i in range(0, len(texts), batch_size):
batch_texts = texts[i:i+batch_size]
try:
batch_embeddings = embeddings_model.embed_documents(batch_texts)
all_embeddings.extend(batch_embeddings)
elapsed_time = time.time() - start_time
print(f"Processed batch {i//batch_size + 1}/{len(texts)//batch_size + 1} ({len(batch_texts)} docs) in {elapsed_time:.2f}s")
start_time = time.time() # Reset timer for next batch
except Exception as e:
print(f"Error processing batch {i//batch_size + 1}: {e}")
# Optionally add None or empty lists for failed embeddings
all_embeddings.extend([None] * len(batch_texts))
return all_embeddings
# Option 2: Local Embeddings with Hugging Face
def generate_local_embeddings(docs):
"""Generates embeddings locally using Hugging Face sentence-transformers."""
# Use a common, effective sentence transformer model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
# Specify device (cpu, cuda, mps) if needed, defaults usually work
# model_kwargs = {'device': 'cpu'}
# encode_kwargs = {'normalize_embeddings': False}
try:
embeddings_model = HuggingFaceEmbeddings(
model_name=model_name,
# model_kwargs=model_kwargs,
# encode_kwargs=encode_kwargs
)
except Exception as e:
print(f"Error initializing HuggingFaceEmbeddings: {e}")
return []
texts = [doc.page_content for doc in docs]
try:
print(f"Generating local embeddings for {len(texts)} documents using {model_name}...")
start_time = time.time()
all_embeddings = embeddings_model.embed_documents(texts)
elapsed_time = time.time() - start_time
print(f"Generated {len(all_embeddings)} embeddings in {elapsed_time:.2f}s.")
return all_embeddings
except Exception as e:
print(f"Error generating local embeddings: {e}")
return []
# Example usage (Choose one option)
print("\n--- Generating Embeddings (Example) ---")
if chunks:
# Option 1: Using OpenAI (requires API key in environment)
# document_embeddings_openai = generate_openai_embeddings(chunks)
# if document_embeddings_openai:
# print(f"Generated {len(document_embeddings_openai)} OpenAI embeddings.")
# Option 2: Using local Hugging Face model (ensure sentence-transformers is installed)
document_embeddings_local = generate_local_embeddings(chunks)
if document_embeddings_local:
print(f"Generated {len(document_embeddings_local)} local embeddings.")
else:
print("Skipping embedding generation as no chunks were processed.")
Embedding Model Comparison
| Model | Dimensions | Performance | Cost | Deployment |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Excellent | $0.02/1M tokens | API (hosted) |
| OpenAI text-embedding-3-large | 3072 | State-of-the-art | $0.13/1M tokens | API (hosted) |
| Cohere embed-english-v3.0 | 1024 | Very good | $0.10/1M tokens | API (hosted) |
| BAAI/bge-large-en-v1.5 | 1024 | Very good | Free | Local / Self-hosted |
| all-MiniLM-L6-v2 | 384 | Good | Free | Local / Self-hosted |
Choose a model based on the balance of performance, cost, and infrastructure constraints.
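Whatever the model, retrieval ultimately compares the query vector against stored chunk vectors, typically with cosine similarity. A small sketch using the local model from above; the example sentences are illustrative.
# Cosine similarity sketch: measures how close a query embedding is to chunk
# embeddings (example sentences are illustrative).
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

def cosine_similarity(vec_a, vec_b):
    vec_a, vec_b = np.array(vec_a), np.array(vec_b)
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

example_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query_vector = example_model.embed_query("How does retrieval augmented generation work?")
for text in [
    "RAG retrieves relevant documents and passes them to the LLM.",
    "The weather in Paris is mild in spring.",
]:
    score = cosine_similarity(query_vector, example_model.embed_query(text))
    print(f"{score:.3f}  {text}")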
3. Vector Database Storage
Store the generated embeddings in a vector database for efficient similarity search.
Vector Database Options:
- In-Memory: FAISS (good for small datasets, prototyping)
- Open Source: ChromaDB, Qdrant, Weaviate (self-hostable)
- Cloud Managed: Pinecone, Zilliz Cloud, Vertex AI Vector Search
# Ensure necessary libraries are installed
# For FAISS: pip install faiss-cpu (or faiss-gpu)
# For Chroma: pip install chromadb
# For Pinecone: pip install pinecone-client
from langchain_community.vectorstores import FAISS, Chroma
# from langchain_pinecone import Pinecone # Requires pinecone-client >= 3.0.0
# Assume 'chunks' and 'document_embeddings' are available from previous steps
# Define dummy data if previous steps were skipped
if 'chunks' not in locals() or 'document_embeddings_local' not in locals() or not document_embeddings_local:
from langchain_core.documents import Document
chunks = [Document(page_content="Chunk A", metadata={"id": "a"}),
Document(page_content="Chunk B", metadata={"id": "b"})]
# Generate dummy embeddings matching the dimension of a chosen model (e.g., 384 for all-MiniLM-L6-v2)
dummy_embedding_dim = 384
document_embeddings_local = [list(np.random.rand(dummy_embedding_dim)) for _ in chunks]
print("Using dummy chunks and embeddings for Vector DB examples.")
# Option 1: FAISS (In-Memory)
try:
# FAISS requires texts and their corresponding embeddings separately
texts_for_faiss = [chunk.page_content for chunk in chunks]
# Check if embeddings list matches text list length and is not empty
if document_embeddings_local and len(texts_for_faiss) == len(document_embeddings_local):
text_embedding_pairs = list(zip(texts_for_faiss, document_embeddings_local))
faiss_vectorstore = FAISS.from_embeddings(
text_embeddings=text_embedding_pairs,
embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Use the same embedding model that produced the stored vectors; otherwise query
# embeddings will have a different dimension to the stored ones at search time.
)
print("Created FAISS in-memory vector store.")
# Example search
# query = "What is RAG?"
# results = faiss_vectorstore.similarity_search(query, k=1)
# print(f"FAISS Search Results: {results}")
else:
print("Skipping FAISS creation due to missing or mismatched embeddings.")
except ImportError:
print("FAISS library not found. pip install faiss-cpu")
except Exception as e:
print(f"Error creating FAISS vector store: {e}")
# Option 2: ChromaDB (Local Persistent or In-Memory)
persist_directory = "./chroma_db"
try:
# Ensure embeddings_model is initialized (e.g., embeddings_model = OpenAIEmbeddings() or HuggingFaceEmbeddings(...))
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2") # Example
chroma_vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings_model,
persist_directory=persist_directory # Saves to disk
# collection_name="my_rag_collection" # Optional: specify collection name
)
print(f"Created/Loaded Chroma vector store at {persist_directory}")
# Example search
# query = "What is RAG?"
# results = chroma_vectorstore.similarity_search(query, k=1)
# print(f"Chroma Search Results: {results}")
except ImportError:
print("ChromaDB library not found. pip install chromadb")
except Exception as e:
print(f"Error creating/loading Chroma vector store: {e}")
# Option 3: Pinecone (Cloud Managed)
# Requires PINECONE_API_KEY and PINECONE_ENVIRONMENT environment variables
# try:
# import pinecone
# from langchain_pinecone import Pinecone
# # Initialize the Pinecone client (pinecone-client >= 3.0 uses the Pinecone class instead of pinecone.init)
# # pc = pinecone.Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
# index_name = "my-rag-index" # Make sure this index exists in your Pinecone environment
#
# # Add documents to Pinecone index
# # Ensure embeddings_model is initialized (e.g., embeddings_model = OpenAIEmbeddings())
# pinecone_vectorstore = Pinecone.from_documents(
# documents=chunks,
# embedding=embeddings_model,
# index_name=index_name
# )
# print(f"Added documents to Pinecone index '{index_name}'.")
# # Example search
# # query = "What is RAG?"
# # results = pinecone_vectorstore.similarity_search(query, k=1)
# # print(f"Pinecone Search Results: {results}")
# except ImportError:
# print("Pinecone client not found. pip install pinecone-client>=3.0.0")
# except Exception as e:
# print(f"Error interacting with Pinecone: {e}")
print("Skipping Pinecone example as it requires setup and credentials.")
# Clean up ChromaDB directory (optional)
# import shutil
# if os.path.exists(persist_directory):
# shutil.rmtree(persist_directory)
4. Retrieval
Implement a retriever to query the vector database and fetch relevant document chunks based on the user's query.
Retrieval Strategies:
- Similarity Search: Basic retrieval based on vector similarity (e.g., cosine)
- Maximum Marginal Relevance (MMR): Optimises for relevance and diversity
- Self-Querying Retriever: Uses an LLM to generate structured queries from natural language
- Contextual Compression: Re-ranks or filters retrieved documents based on context
# Assume 'vectorstore' is an initialized vector store object (e.g., from FAISS or Chroma)
# Define a dummy vectorstore if previous steps were skipped
if 'chroma_vectorstore' not in locals() and 'faiss_vectorstore' not in locals():
class DummyVectorStore:
def as_retriever(self, **kwargs):
print("Using Dummy Retriever.")
return DummyRetriever()
class DummyRetriever:
def invoke(self, query):
return [Document(page_content=f"Dummy result for '{query}'", metadata={"source":"dummy"})]
vectorstore = DummyVectorStore()
print("Using dummy vector store for Retriever examples.")
else:
# Prioritize Chroma if it exists, otherwise use FAISS if it exists
vectorstore = locals().get('chroma_vectorstore') or locals().get('faiss_vectorstore')
# 1. Basic Similarity Search Retriever
simple_retriever = vectorstore.as_retriever(search_kwargs={'k': 3}) # Retrieve top 3 docs
# 2. MMR Retriever (for diversity)
mmr_retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={'k': 5, 'fetch_k': 20} # Retrieve 5 docs, considering 20 initially
)
# 3. Self-Querying Retriever (requires an LLM; pip install lark for query parsing)
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI # Ensure model is imported
# Define metadata fields the retriever can query
metadata_field_info = [
AttributeInfo(name="source", description="The file the chunk came from", type="string"),
AttributeInfo(name="title", description="The title of the document", type="string"),
AttributeInfo(name="chunk_length", description="Length of the chunk text", type="integer"),
]
document_content_description = "Content of text chunks from documents"
# Ensure llm is initialized (e.g., llm = ChatOpenAI(temperature=0))
llm = ChatOpenAI(temperature=0)
try:
self_query_retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
verbose=True
)
print("Initialized Self-Query Retriever.")
# Example self-query
# result = self_query_retriever.invoke("Find chunks about RAG from the document titled 'rag_systems'")
# print(f"Self-Query Results: {result}")
except Exception as e:
print(f"Error initializing Self-Query Retriever: {e}")
self_query_retriever = None # Assign None if failed
# 4. Contextual Compression Retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Ensure llm is initialized
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=simple_retriever # Use the basic retriever as base
)
print("Initialized Contextual Compression Retriever.")
# --- Example Usage ---
query = "What is the core idea of RAG?"
print(f"\n--- Retrieving for query: '{query}' ---")
# Use the simple retriever
retrieved_docs_simple = simple_retriever.invoke(query)
print(f"\nSimple Retriever Results ({len(retrieved_docs_simple)} docs):")
# for doc in retrieved_docs_simple:
# print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
# Use MMR retriever
retrieved_docs_mmr = mmr_retriever.invoke(query)
print(f"\nMMR Retriever Results ({len(retrieved_docs_mmr)} docs):")
# for doc in retrieved_docs_mmr:
# print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
# Use Self-Query (if initialized)
# if self_query_retriever:
# try:
# retrieved_docs_self_query = self_query_retriever.invoke(query)
# print(f"\nSelf-Query Retriever Results ({len(retrieved_docs_self_query)} docs):")
# # for doc in retrieved_docs_self_query:
# # print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
# except Exception as e:
# print(f"Error running self-query retriever: {e}")
print("Skipping Self-Query execution example.")
# Use Contextual Compression
try:
compressed_docs = compression_retriever.invoke(query)
print(f"\nContextual Compression Results ({len(compressed_docs)} docs):")
# for doc in compressed_docs:
# print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
except Exception as e:
print(f"Error running contextual compression: {e}")
5. Generation
Finally, use an LLM to generate a response based on the user's query and the retrieved context.
Generation Strategies:
- Stuffing: Concatenate all retrieved documents into the prompt (simplest, but limited by context window)
- Map-Reduce: Process each document individually, then combine results
- Refine: Process documents sequentially, refining the answer at each step
- Map-Rerank: Process each document and score relevance, using the best one
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI # Ensure model imported
# Assume 'retriever' is an initialized retriever object (e.g., simple_retriever)
# Assume 'llm' is an initialized LLM or ChatModel object
if 'simple_retriever' in locals():
retriever = simple_retriever
llm = ChatOpenAI(temperature=0, model="gpt-4o") # Example initialization
# 1. Basic RetrievalQA Chain (uses "stuff" method by default)
try:
qa_chain_stuff = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True # Optionally return sources
)
query = "What is RAG?"
result_stuff = qa_chain_stuff.invoke({"query": query})
print(f"\n--- RetrievalQA (Stuff) Result for '{query}' ---")
print(f"Answer: {result_stuff['result']}")
# print(f"Source Documents: {len(result_stuff['source_documents'])} found")
except Exception as e:
print(f"Error with RetrievalQA (stuff): {e}")
# 2. RetrievalQA with Map-Reduce
try:
qa_chain_map_reduce = RetrievalQA.from_chain_type(
llm=llm,
chain_type="map_reduce", # Suitable for many documents
retriever=retriever,
return_source_documents=True,
# chain_type_kwargs can be added here if needed for map/combine prompts
)
query = "Summarise the key aspects of RAG."
result_map_reduce = qa_chain_map_reduce.invoke({"query": query})
print(f"\n--- RetrievalQA (Map-Reduce) Result for '{query}' ---")
print(f"Answer: {result_map_reduce['result']}")
except Exception as e:
print(f"Error with RetrievalQA (map_reduce): {e}")
# 3. RetrievalQA with Refine
try:
qa_chain_refine = RetrievalQA.from_chain_type(
llm=llm,
chain_type="refine", # Suitable for building response iteratively
retriever=retriever,
return_source_documents=True,
# chain_type_kwargs can be added here if needed for refine prompts
)
query = "Provide a detailed explanation of RAG benefits."
result_refine = qa_chain_refine.invoke({"query": query})
print(f"\n--- RetrievalQA (Refine) Result for '{query}' ---")
print(f"Answer: {result_refine['result']}")
except Exception as e:
print(f"Error with RetrievalQA (refine): {e}")
# 4. Custom Prompt for Generation
custom_prompt_template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Helpful Answer:
"""
CUSTOM_PROMPT = PromptTemplate(
template=custom_prompt_template, input_variables=["context", "question"]
)
try:
qa_chain_custom = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": CUSTOM_PROMPT}
)
query = "How does RAG help with hallucinations?"
result_custom = qa_chain_custom.invoke({"query": query})
print(f"\n--- RetrievalQA (Custom Prompt) Result for '{query}' ---")
print(f"Answer: {result_custom['result']}")
except Exception as e:
print(f"Error with RetrievalQA (custom prompt): {e}")
else:
print("Skipping Generation examples as retriever is not defined.")
Advanced RAG Techniques
Beyond the basics, several advanced techniques can significantly improve the performance and reliability of your RAG systems:
1. Hybrid Search
Combine semantic search (vector search) with traditional keyword search (e.g., BM25) to leverage the strengths of both approaches.
Hybrid Search Benefits:
- Improves retrieval for queries with specific keywords or jargon
- Catches relevant documents missed by purely semantic search
- More robust to variations in query phrasing
# Ensure necessary libraries are installed
# pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# Assume 'chunks' is a list of Document objects
# Assume 'faiss_vectorstore' is an initialized FAISS vector store
if 'chunks' in locals() and 'faiss_vectorstore' in locals():
try:
# 1. Initialize BM25 Retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2 # Retrieve top 2 keyword matches
print("Initialized BM25 Retriever.")
# 2. Initialize Semantic Retriever (using FAISS example)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})
print("Using FAISS as semantic retriever.")
# 3. Initialize Ensemble Retriever
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, faiss_retriever],
weights=[0.5, 0.5] # Assign equal weight to both retrievers
)
print("Initialized Ensemble Retriever for Hybrid Search.")
# Example Hybrid Search
query = "RAG architecture components"
hybrid_results = ensemble_retriever.invoke(query)
print(f"\n--- Hybrid Search Results for '{query}' ({len(hybrid_results)} docs) ---")
# for doc in hybrid_results:
# print(f"- {doc.page_content[:100]}...")
except ImportError:
print("BM25 library not found (pip install rank_bm25). Skipping Hybrid Search example.")
except Exception as e:
print(f"Error setting up Hybrid Search: {e}")
else:
print("Skipping Hybrid Search example due to missing chunks or vector store.")
2. Re-ranking Retrieved Documents
Use a more sophisticated model (like an LLM or a specialised cross-encoder) to re-rank the initially retrieved documents for better relevance.
Re-ranking Benefits:
- Improves precision by pushing the most relevant documents to the top
- Can consider interactions between the query and document content more deeply
- Reduces noise passed to the final generation step
# Ensure necessary libraries are installed
# pip install langchain-community sentence-transformers
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Assume 'simple_retriever' is initialized (e.g., from FAISS or Chroma)
if 'simple_retriever' in locals():
try:
# Initialize a CrossEncoder model for re-ranking
# Common models: 'cross-encoder/ms-marco-MiniLM-L-6-v2', 'BAAI/bge-reranker-large'
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
# Create the compressor using the reranker
compressor = CrossEncoderReranker(model=model, top_n=3) # Keep top 3 after re-ranking
# Create the compression retriever
reranking_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=simple_retriever
)
print("Initialized Re-ranking Retriever.")
# Example usage
query = "Explain the RAG pipeline step-by-step"
reranked_results = reranking_retriever.invoke(query)
print(f"\n--- Re-ranked Results for '{query}' ({len(reranked_results)} docs) ---")
# for doc in reranked_results:
# print(f"- {doc.page_content[:100]}...")
except ImportError:
print("Required libraries for reranking not found. pip install langchain-community sentence-transformers")
except Exception as e:
print(f"Error setting up Re-ranking Retriever: {e}")
else:
print("Skipping Re-ranking example due to missing base retriever.")
3. Query Transformations
Modify the user's query before retrieval to improve relevance. Techniques include:
- Hypothetical Document Embeddings (HyDE): Generate a hypothetical answer first, embed that answer, and retrieve documents similar to the hypothetical answer.
- Multi-Query Retriever: Use an LLM to generate multiple related queries from the original query and retrieve documents for all of them.
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI # Ensure model imported
# Assume 'llm' and 'vectorstore' are initialized
# Define dummy data if needed
if 'llm' not in locals() or 'vectorstore' not in locals():
llm = ChatOpenAI(temperature=0) # Dummy LLM
class DummyVectorStore:
def as_retriever(self): return DummyRetriever()
class DummyRetriever:
def invoke(self, q): return [Document(page_content=f"Dummy for '{q}'")]
vectorstore = DummyVectorStore()
print("Using dummy LLM/VectorStore for Query Transformation examples.")
# 1. HyDE (Conceptual Example - LangChain implementation varies)
# Typically involves an LLMChain to generate the hypothetical doc first
hypothetical_doc_prompt = PromptTemplate.from_template(
"Generate a hypothetical answer to the question: {question}"
)
hypothetical_doc_chain = LLMChain(llm=llm, prompt=hypothetical_doc_prompt)
# query = "What is the future of RAG systems?"
# hypothetical_answer = hypothetical_doc_chain.invoke({"question": query})['text']
# Embed hypothetical_answer and use it for retrieval from vectorstore
# (Manual implementation or specific LangChain HyDE components needed)
print("Skipping HyDE execution example (requires manual embedding/search or specific components).")
# 2. Multi-Query Retriever
try:
mq_retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=llm
)
print("Initialized Multi-Query Retriever.")
# Example usage
query = "Tell me about RAG limitations and solutions."
mq_results = mq_retriever.invoke(query)
print(f"\n--- Multi-Query Results for '{query}' ({len(mq_results)} docs) ---")
# Note: This retriever often returns duplicates due to retrieving for multiple queries.
# Deduplication might be needed depending on the use case.
# unique_contents = {doc.page_content for doc in mq_results}
# print(f"Unique results: {len(unique_contents)}")
# for content in list(unique_contents)[:3]: # Print first 3 unique results
# print(f"- {content[:100]}...")
except Exception as e:
print(f"Error setting up Multi-Query Retriever: {e}")
4. Fine-tuning Embedding Models
For highly specialised domains, fine-tuning an embedding model on your specific data can significantly improve retrieval performance.
Fine-tuning Considerations:
Fine-tuning requires a labelled dataset of (query, relevant passage) pairs and significant computational resources. It is generally considered an advanced optimisation step, best attempted after the other techniques above have been exhausted.
Open-source embedding models can be fine-tuned using libraries like sentence-transformers; hosted embedding APIs (such as OpenAI's) generally do not expose embedding fine-tuning, so open-source models are the usual route for this step.
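As a rough illustration of the sentence-transformers route, here is a minimal fine-tuning sketch using (query, relevant passage) pairs with MultipleNegativesRankingLoss; the training pairs, hyperparameters, and output path are placeholders.
# Fine-tuning sketch with sentence-transformers (pip install sentence-transformers).
# Training pairs, hyperparameters, and output path are illustrative placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

base_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_examples = [
    InputExample(texts=["What is RAG?",
                        "Retrieval Augmented Generation grounds LLM answers in retrieved documents."]),
    InputExample(texts=["How are document chunks stored?",
                        "Chunks are embedded and stored in a vector database for similarity search."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(base_model)  # in-batch passages act as negatives

base_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="./fine_tuned_embedding_model",  # load later via HuggingFaceEmbeddings(model_name=...)
)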
Evaluating RAG Systems
Evaluating RAG systems involves assessing both the retrieval and generation components; a toy Hit Rate/MRR computation follows the retrieval metrics below.
Retrieval Evaluation Metrics
- Hit Rate: Percentage of queries for which at least one relevant document is retrieved.
- Mean Reciprocal Rank (MRR): Average of the reciprocal ranks of the first relevant document.
- Normalized Discounted Cumulative Gain (NDCG): Measures ranking quality, considering the position of relevant documents.
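A toy sketch of Hit Rate and MRR, assuming you have per-query relevance judgements for the ranked documents each retriever returns (the labels below are made up).
# Toy Hit Rate / MRR computation (relevance labels are made-up examples).
# Each inner list records, rank by rank, whether the retrieved document was relevant.
relevance_per_query = [
    [False, True, False],   # first relevant document at rank 2
    [True, False, False],   # first relevant document at rank 1
    [False, False, False],  # no relevant document retrieved
]

hit_rate = sum(any(ranks) for ranks in relevance_per_query) / len(relevance_per_query)
reciprocal_ranks = [
    next((1.0 / (rank + 1) for rank, relevant in enumerate(ranks) if relevant), 0.0)
    for ranks in relevance_per_query
]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"Hit Rate: {hit_rate:.2f}, MRR: {mrr:.2f}")  # Hit Rate: 0.67, MRR: 0.50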
Generation Evaluation Metrics
- Faithfulness / Groundedness: How well the generated answer is supported by the retrieved context.
- Answer Relevance: How well the generated answer addresses the user's query.
- Answer Correctness: Factual accuracy of the generated answer (often requires human evaluation).
RAG Evaluation Frameworks
Frameworks like Ragas and LangChain's evaluation modules provide tools for automated RAG evaluation:
# Ensure ragas is installed: pip install ragas
# Ragas Example (Conceptual - requires dataset setup)
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
# # Assume you have a dataset with questions, answers, contexts, ground_truths
# data_samples = {
# 'question': ['What is RAG?'],
# 'answer': ['RAG combines retrieval with generation...'],
# 'contexts' : [[
# 'Retrieval Augmented Generation (RAG) enhances LLMs...',
# 'RAG systems use external knowledge...'
# ]],
# 'ground_truth': ['RAG is a technique to improve LLMs by retrieving external documents...']
# }
# dataset = Dataset.from_dict(data_samples)
# try:
# score = evaluate(
# dataset,
# metrics=[
# faithfulness, # How factual is the answer based on context?
# answer_relevancy, # How relevant is the answer to the question?
# context_precision, # Signal-to-noise ratio in retrieved context
# context_recall, # Ability to retrieve all necessary context
# ]
# )
# print("Ragas Evaluation Score:")
# print(score)
# except Exception as e:
# print(f"Ragas Evaluation Error: {e}")
print("Skipping Ragas example as it requires dataset setup.")
# LangChain Evaluation Example (Conceptual)
# from langchain.evaluation import load_evaluator
# # Assume llm is initialized
# # Assume retriever returns relevant docs for a query
# try:
# # Evaluator for checking if the answer is grounded in the documents
# faithfulness_evaluator = load_evaluator("labeled_score_string", criteria="faithfulness", llm=llm)
# query = "What is the capital of France?"
# context = "Paris is the capital and most populous city of France."
# prediction = "The capital of France is Paris."
# eval_result = faithfulness_evaluator.evaluate_strings(
# prediction=prediction,
# input=query,
# reference=context # Context acts as the reference for faithfulness
# )
# print("LangChain Faithfulness Evaluation Result:")
# print(eval_result)
# except Exception as e:
# print(f"LangChain Evaluation Error: {e}")
print("Skipping LangChain evaluation example.")
Next Steps: Introduction to AI Agents
Understanding RAG is fundamental to building knowledgeable AI agents. With RAG providing access to information, the next step is to understand how agents use this information alongside tools and reasoning capabilities to perform complex tasks.
Key Takeaways from This Section:
- RAG enhances LLMs by grounding them in external knowledge
- Building RAG involves document processing, embedding, vector storage, retrieval, and generation
- Effective chunking, embedding model selection, and vector databases are crucial
- Advanced techniques like hybrid search, re-ranking, and query transformations improve performance
- Evaluating RAG requires assessing both retrieval quality and generation faithfulness
In the next section, we formally introduce AI Agents, exploring their core concepts, architectures, and how they differ from traditional AI systems.
Continue to Introduction to AI Agents →