RAG Systems Essentials
Enhance your AI agents with external knowledge through Retrieval Augmented Generation
Understanding RAG Systems
Retrieval Augmented Generation (RAG) is a powerful approach that combines the strengths of retrieval-based systems with generative AI models. RAG systems enhance LLMs by providing them with relevant external knowledge at inference time.
Key Insight
RAG systems solve one of the most critical limitations of LLMs: their inability to access information beyond their training data. By retrieving and incorporating external knowledge, RAG enables more accurate, up-to-date, and verifiable responses.
Why RAG Matters for AI Agents
RAG addresses several fundamental challenges in building effective AI agents:
- Knowledge Limitations: LLMs have fixed knowledge cutoffs and can't access new information
- Hallucinations: LLMs sometimes generate plausible but incorrect information
- Domain Specificity: General-purpose LLMs lack deep expertise in specialised domains
- Verifiability: LLM outputs often lack clear sources or citations
- Customisation: Organisations need agents that reflect their specific knowledge and policies
The RAG Architecture
A typical RAG system consists of these key components (a minimal end-to-end sketch follows the list):
- Document Processing Pipeline: Ingests, processes, and chunks documents
- Embedding Model: Converts text chunks into vector representations
- Vector Database: Stores and enables semantic search of embeddings
- Retriever: Finds relevant information based on user queries
- Generator: Uses retrieved information to create accurate responses
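To make the flow concrete, here is a minimal sketch of how these five components connect in LangChain. The sample document, model names, and parameters are illustrative placeholders (an OpenAI API key is assumed for the generator); the sections below build each stage out in detail.
# Minimal end-to-end sketch of the five RAG components (illustrative only;
# assumes an OpenAI API key and the packages installed in the sections below).
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

docs = [Document(page_content="Retrieval Augmented Generation (RAG) grounds LLM answers in retrieved documents.")]
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)  # 1. document processing
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")          # 2. embedding model
vectorstore = FAISS.from_documents(chunks, embeddings)                                           # 3. vector database
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})                                     # 4. retriever
qa_chain = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)       # 5. generator
print(qa_chain.invoke({"query": "What does RAG do?"})["result"])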
When to Use RAG
| Use Case | RAG Benefit | Implementation Complexity |
|---|---|---|
| Knowledge-intensive Q&A | Provides factual, up-to-date information | Medium |
| Domain-specific assistants | Incorporates specialised knowledge | Medium-High |
| Enterprise search | Enables natural language search with contextual answers | High |
| Document summarisation | Ensures summaries are grounded in source material | Medium |
| Content generation | Creates content based on accurate, relevant information | Medium |
Building RAG Systems: Step-by-Step
Let's walk through the process of building a RAG system from scratch, focusing on practical implementation:
1. Document Processing Pipeline
The first step is to ingest and process your documents into a format suitable for retrieval.
Document Processing Steps:
- Document Loading: Import documents from various sources
- Text Extraction: Extract plain text from different file formats
- Text Chunking: Split text into manageable, semantically meaningful chunks
- Metadata Enrichment: Add useful metadata to each chunk
# Ensure required libraries are installed
# pip install langchain langchain-community pypdf python-dotenv unstructured[local-inference] tiktoken faiss-cpu
from langchain_community.document_loaders import PyPDFLoader, CSVLoader, DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document # Import Document class
import os
from dotenv import load_dotenv
load_dotenv()
# Step 1: Load documents from different sources
def load_documents_from_directory(directory_path):
"""Loads documents from a directory using various loaders."""
# Use DirectoryLoader for simplicity, configuring loaders for different types
loader = DirectoryLoader(
directory_path,
glob="**/*.*", # Load all files
loader_cls=TextLoader, # Loader applied to every matched file (swap in format-specific loaders for PDFs etc.)
loader_kwargs={"encoding": "utf-8"}, # Example argument for TextLoader
use_multithreading=True,
show_progress=True,
recursive=True # Load from subdirectories too
# Note: More specific loaders can be added or configured if needed,
# e.g., using UnstructuredFileLoader for broader format support
# or PyPDFLoader specifically for PDFs.
)
try:
documents = loader.load()
print(f"Loaded {len(documents)} documents from {directory_path}.")
return documents
except Exception as e:
print(f"Error loading documents: {e}")
return []
# Step 2: Process and chunk documents
def process_documents(documents, chunk_size=1000, chunk_overlap=200):
"""Splits documents into chunks and enriches metadata."""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", " ", ""], # Try splitting by paragraphs first
is_separator_regex=False,
)
chunks = text_splitter.split_documents(documents)
# Step 3: Enrich with metadata
for i, chunk in enumerate(chunks):
chunk.metadata["chunk_id"] = i
chunk.metadata["chunk_length"] = len(chunk.page_content)
# Ensure source exists before trying to get basename
if "source" in chunk.metadata:
try:
file_name = os.path.basename(chunk.metadata["source"])
chunk.metadata["title"] = os.path.splitext(file_name)[0]
except Exception:
chunk.metadata["title"] = "Unknown"
else:
chunk.metadata["source"] = "Unknown"
chunk.metadata["title"] = "Unknown"
print(f"Split into {len(chunks)} chunks.")
return chunks
# Example usage (assuming a directory named 'rag_data' exists)
# Create dummy data directory and file if it doesn't exist
data_dir = "./rag_data"
if not os.path.exists(data_dir):
os.makedirs(data_dir)
with open(os.path.join(data_dir, "sample.txt"), "w") as f:
f.write("This is sample text for the RAG system demonstration.")
documents = load_documents_from_directory(data_dir)
if documents:
chunks = process_documents(documents)
# print(chunks[0].metadata) # Example: print metadata of first chunk
else:
print("No documents loaded, skipping chunk processing.")
# Clean up dummy data directory (optional)
# import shutil
# if os.path.exists(data_dir):
# shutil.rmtree(data_dir)
Chunking Best Practices
Effective chunking is critical for RAG performance; a token-based splitting sketch follows this list:
- Semantic Boundaries: Try to chunk at paragraph or section boundaries (RecursiveCharacterTextSplitter helps)
- Chunk Size: Aim for 300-1000 tokens per chunk (balances context and specificity)
- Chunk Overlap: Use 10-20% overlap to avoid losing context at boundaries
- Metadata: Include source, page numbers, section titles, and timestamps
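For chunk sizes expressed in tokens rather than characters, a minimal sketch using the splitter's tiktoken-based constructor is shown below; the 800-token size and 100-token overlap (roughly 12%) are illustrative values within the ranges above.
# Token-based chunking sketch (illustrative values; uses tiktoken from the pip list above).
from langchain_text_splitters import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokeniser used by recent OpenAI models
    chunk_size=800,               # within the suggested 300-1000 token range
    chunk_overlap=100,            # roughly 12% overlap
    separators=["\n\n", "\n", " ", ""],
)
token_chunks = token_splitter.split_documents(documents) if documents else []
print(f"Token-based splitting produced {len(token_chunks)} chunks.")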
2. Embedding Generation
Next, convert your text chunks into vector embeddings that capture their semantic meaning.
Embedding Considerations:
- Model Selection: Choose embedding models based on performance, cost, and dimensions
- Batch Processing: Generate embeddings in batches to improve efficiency
- Caching: Store embeddings to avoid regenerating them (a caching sketch follows the code below)
# Ensure necessary libraries are installed
# pip install langchain-openai langchain-huggingface sentence-transformers
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
import numpy as np
import time
# Assume 'chunks' is a list of Document objects from the previous step
# Define dummy chunks if previous step was skipped
if 'chunks' not in locals():
from langchain_core.documents import Document
chunks = [Document(page_content="This is chunk 1.", metadata={}),
Document(page_content="This is chunk 2.", metadata={})]
# Option 1: OpenAI Embeddings (hosted)
def generate_openai_embeddings(docs):
"""Generates embeddings using OpenAI API with batching."""
embeddings_model = OpenAIEmbeddings()
texts = [doc.page_content for doc in docs]
# Process in batches (OpenAI API handles batching internally, but good practice for large lists)
batch_size = 100 # Adjust based on API limits and performance
all_embeddings = []
start_time = time.time()
for i in range(0, len(texts), batch_size):
batch_texts = texts[i:i+batch_size]
try:
batch_embeddings = embeddings_model.embed_documents(batch_texts)
all_embeddings.extend(batch_embeddings)
elapsed_time = time.time() - start_time
print(f"Processed batch {i//batch_size + 1}/{len(texts)//batch_size + 1} ({len(batch_texts)} docs) in {elapsed_time:.2f}s")
start_time = time.time() # Reset timer for next batch
except Exception as e:
print(f"Error processing batch {i//batch_size + 1}: {e}")
# Optionally add None or empty lists for failed embeddings
all_embeddings.extend([None] * len(batch_texts))
return all_embeddings
# Option 2: Local Embeddings with Hugging Face
def generate_local_embeddings(docs):
"""Generates embeddings locally using Hugging Face sentence-transformers."""
# Use a common, effective sentence transformer model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
# Specify device (cpu, cuda, mps) if needed, defaults usually work
# model_kwargs = {'device': 'cpu'}
# encode_kwargs = {'normalize_embeddings': False}
try:
embeddings_model = HuggingFaceEmbeddings(
model_name=model_name,
# model_kwargs=model_kwargs,
# encode_kwargs=encode_kwargs
)
except Exception as e:
print(f"Error initializing HuggingFaceEmbeddings: {e}")
return []
texts = [doc.page_content for doc in docs]
try:
print(f"Generating local embeddings for {len(texts)} documents using {model_name}...")
start_time = time.time()
all_embeddings = embeddings_model.embed_documents(texts)
elapsed_time = time.time() - start_time
print(f"Generated {len(all_embeddings)} embeddings in {elapsed_time:.2f}s.")
return all_embeddings
except Exception as e:
print(f"Error generating local embeddings: {e}")
return []
# Example usage (Choose one option)
print("\n--- Generating Embeddings (Example) ---")
if chunks:
# Option 1: Using OpenAI (requires API key in environment)
# document_embeddings_openai = generate_openai_embeddings(chunks)
# if document_embeddings_openai:
# print(f"Generated {len(document_embeddings_openai)} OpenAI embeddings.")
# Option 2: Using local Hugging Face model (ensure sentence-transformers is installed)
document_embeddings_local = generate_local_embeddings(chunks)
if document_embeddings_local:
print(f"Generated {len(document_embeddings_local)} local embeddings.")
else:
print("Skipping embedding generation as no chunks were processed.")
Embedding Model Comparison
| Model | Dimensions | Performance | Cost | Deployment |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Excellent | $0.02/1M tokens | API (hosted) |
| OpenAI text-embedding-3-large | 3072 | State-of-the-art | $0.13/1M tokens | API (hosted) |
| Cohere embed-english-v3.0 | 1024 | Very good | $0.10/1M tokens | API (hosted) |
| BAAI/bge-large-en-v1.5 | 1024 | Very good | Free | Local / Self-hosted |
| all-MiniLM-L6-v2 | 384 | Good | Free | Local / Self-hosted |
Choose a model based on the balance of performance, cost, and infrastructure constraints.
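Whatever the model, retrieval ultimately compares the query vector against stored chunk vectors, typically with cosine similarity. A small sketch using the local model from above; the example sentences are illustrative.
# Cosine similarity sketch: measures how close a query embedding is to chunk
# embeddings (example sentences are illustrative).
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

def cosine_similarity(vec_a, vec_b):
    vec_a, vec_b = np.array(vec_a), np.array(vec_b)
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

example_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query_vector = example_model.embed_query("How does retrieval augmented generation work?")
for text in [
    "RAG retrieves relevant documents and passes them to the LLM.",
    "The weather in Paris is mild in spring.",
]:
    score = cosine_similarity(query_vector, example_model.embed_query(text))
    print(f"{score:.3f}  {text}")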
3. Vector Database Storage
Store the generated embeddings in a vector database for efficient similarity search.
Vector Database Options:
- In-Memory: FAISS (good for small datasets, prototyping)
- Open Source: ChromaDB, Qdrant, Weaviate (self-hostable)
- Cloud Managed: Pinecone, Zilliz Cloud, Vertex AI Vector Search
# Ensure necessary libraries are installed
# For FAISS: pip install faiss-cpu (or faiss-gpu)
# For Chroma: pip install chromadb
# For Pinecone: pip install pinecone-client
from langchain_community.vectorstores import FAISS, Chroma
# from langchain_pinecone import Pinecone # Requires pinecone-client >= 3.0.0
# Assume 'chunks' and 'document_embeddings' are available from previous steps
# Define dummy data if previous steps were skipped
if 'chunks' not in locals() or 'document_embeddings_local' not in locals() or not document_embeddings_local:
from langchain_core.documents import Document
chunks = [Document(page_content="Chunk A", metadata={"id": "a"}),
Document(page_content="Chunk B", metadata={"id": "b"})]
# Generate dummy embeddings matching the dimension of a chosen model (e.g., 384 for all-MiniLM-L6-v2)
dummy_embedding_dim = 384
document_embeddings_local = [list(np.random.rand(dummy_embedding_dim)) for _ in chunks]
print("Using dummy chunks and embeddings for Vector DB examples.")
# Option 1: FAISS (In-Memory)
try:
# FAISS requires texts and their corresponding embeddings separately
texts_for_faiss = [chunk.page_content for chunk in chunks]
# Check if embeddings list matches text list length and is not empty
if document_embeddings_local and len(texts_for_faiss) == len(document_embeddings_local):
text_embedding_pairs = list(zip(texts_for_faiss, document_embeddings_local))
faiss_vectorstore = FAISS.from_embeddings(
text_embeddings=text_embedding_pairs,
embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Use the same embedding model that produced the stored vectors; otherwise query
# embeddings will have a different dimension to the stored ones at search time.
)
print("Created FAISS in-memory vector store.")
# Example search
# query = "What is RAG?"
# results = faiss_vectorstore.similarity_search(query, k=1)
# print(f"FAISS Search Results: {results}")
else:
print("Skipping FAISS creation due to missing or mismatched embeddings.")
except ImportError:
print("FAISS library not found. pip install faiss-cpu")
except Exception as e:
print(f"Error creating FAISS vector store: {e}")
# Option 2: ChromaDB (Local Persistent or In-Memory)
persist_directory = "./chroma_db"
try:
# Ensure embeddings_model is initialized (e.g., embeddings_model = OpenAIEmbeddings() or HuggingFaceEmbeddings(...))
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2") # Example
chroma_vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings_model,
persist_directory=persist_directory # Saves to disk
# collection_name="my_rag_collection" # Optional: specify collection name
)
print(f"Created/Loaded Chroma vector store at {persist_directory}")
# Example search
# query = "What is RAG?"
# results = chroma_vectorstore.similarity_search(query, k=1)
# print(f"Chroma Search Results: {results}")
except ImportError:
print("ChromaDB library not found. pip install chromadb")
except Exception as e:
print(f"Error creating/loading Chroma vector store: {e}")
# Option 3: Pinecone (Cloud Managed)
# Requires PINECONE_API_KEY and PINECONE_ENVIRONMENT environment variables
# try:
# import pinecone
# from langchain_pinecone import Pinecone
# # Initialize the Pinecone client (pinecone-client >= 3.0 uses the Pinecone class instead of pinecone.init)
# # pc = pinecone.Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
# index_name = "my-rag-index" # Make sure this index exists in your Pinecone environment
#
# # Add documents to Pinecone index
# # Ensure embeddings_model is initialized (e.g., embeddings_model = OpenAIEmbeddings())
# pinecone_vectorstore = Pinecone.from_documents(
# documents=chunks,
# embedding=embeddings_model,
# index_name=index_name
# )
# print(f"Added documents to Pinecone index '{index_name}'.")
# # Example search
# # query = "What is RAG?"
# # results = pinecone_vectorstore.similarity_search(query, k=1)
# # print(f"Pinecone Search Results: {results}")
# except ImportError:
# print("Pinecone client not found. pip install pinecone-client>=3.0.0")
# except Exception as e:
# print(f"Error interacting with Pinecone: {e}")
print("Skipping Pinecone example as it requires setup and credentials.")
# Clean up ChromaDB directory (optional)
# import shutil
# if os.path.exists(persist_directory):
# shutil.rmtree(persist_directory)
4. Retrieval
Implement a retriever to query the vector database and fetch relevant document chunks based on the user's query.
Retrieval Strategies:
- Similarity Search: Basic retrieval based on vector similarity (e.g., cosine)
- Maximum Marginal Relevance (MMR): Optimises for relevance and diversity
- Self-Querying Retriever: Uses an LLM to generate structured queries from natural language
- Contextual Compression: Re-ranks or filters retrieved documents based on context
# Assume 'vectorstore' is an initialized vector store object (e.g., from FAISS or Chroma)
# Define a dummy vectorstore if previous steps were skipped
if 'chroma_vectorstore' not in locals() and 'faiss_vectorstore' not in locals():
class DummyVectorStore:
def as_retriever(self, **kwargs):
print("Using Dummy Retriever.")
return DummyRetriever()
class DummyRetriever:
def invoke(self, query):
return [Document(page_content=f"Dummy result for '{query}'", metadata={"source":"dummy"})]
vectorstore = DummyVectorStore()
print("Using dummy vector store for Retriever examples.")
else:
# Prioritize Chroma if it exists, otherwise use FAISS if it exists
vectorstore = locals().get('chroma_vectorstore') or locals().get('faiss_vectorstore')
# 1. Basic Similarity Search Retriever
simple_retriever = vectorstore.as_retriever(search_kwargs={'k': 3}) # Retrieve top 3 docs
# 2. MMR Retriever (for diversity)
mmr_retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={'k': 5, 'fetch_k': 20} # Retrieve 5 docs, considering 20 initially
)
# 3. Self-Querying Retriever (requires an LLM; pip install lark for query parsing)
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI # Ensure model is imported
# Define metadata fields the retriever can query
metadata_field_info = [
AttributeInfo(name="source", description="The file the chunk came from", type="string"),
AttributeInfo(name="title", description="The title of the document", type="string"),
AttributeInfo(name="chunk_length", description="Length of the chunk text", type="integer"),
]
document_content_description = "Content of text chunks from documents"
# Ensure llm is initialized (e.g., llm = ChatOpenAI(temperature=0))
llm = ChatOpenAI(temperature=0)
try:
self_query_retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
verbose=True
)
print("Initialized Self-Query Retriever.")
# Example self-query
# result = self_query_retriever.invoke("Find chunks about RAG from the document titled 'rag_systems'")
# print(f"Self-Query Results: {result}")
except Exception as e:
print(f"Error initializing Self-Query Retriever: {e}")
self_query_retriever = None # Assign None if failed
# 4. Contextual Compression Retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Ensure llm is initialized
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=simple_retriever # Use the basic retriever as base
)
print("Initialized Contextual Compression Retriever.")
# --- Example Usage ---
query = "What is the core idea of RAG?"
print(f"\n--- Retrieving for query: '{query}' ---")
# Use the simple retriever
retrieved_docs_simple = simple_retriever.invoke(query)
print(f"\nSimple Retriever Results ({len(retrieved_docs_simple)} docs):")
# for doc in retrieved_docs_simple:
# print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
# Use MMR retriever
retrieved_docs_mmr = mmr_retriever.invoke(query)
print(f"\nMMR Retriever Results ({len(retrieved_docs_mmr)} docs):")
# for doc in retrieved_docs_mmr:
# print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
# Use Self-Query (if initialized)
# if self_query_retriever:
# try:
# retrieved_docs_self_query = self_query_retriever.invoke(query)
# print(f"\nSelf-Query Retriever Results ({len(retrieved_docs_self_query)} docs):")
# # for doc in retrieved_docs_self_query:
# # print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
# except Exception as e:
# print(f"Error running self-query retriever: {e}")
print("Skipping Self-Query execution example.")
# Use Contextual Compression
try:
compressed_docs = compression_retriever.invoke(query)
print(f"\nContextual Compression Results ({len(compressed_docs)} docs):")
# for doc in compressed_docs:
# print(f"- {doc.page_content[:100]}... (Source: {doc.metadata.get('source', 'N/A')})")
except Exception as e:
print(f"Error running contextual compression: {e}")
5. Generation
Finally, use an LLM to generate a response based on the user's query and the retrieved context.
Generation Strategies:
- Stuffing: Concatenate all retrieved documents into the prompt (simplest, but limited by context window)
- Map-Reduce: Process each document individually, then combine results
- Refine: Process documents sequentially, refining the answer at each step
- Map-Rerank: Process each document and score relevance, using the best one
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI # Ensure model imported
# Assume 'retriever' is an initialized retriever object (e.g., simple_retriever)
# Assume 'llm' is an initialized LLM or ChatModel object
if 'simple_retriever' in locals():
retriever = simple_retriever
llm = ChatOpenAI(temperature=0, model="gpt-4o") # Example initialization
# 1. Basic RetrievalQA Chain (uses "stuff" method by default)
try:
qa_chain_stuff = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True # Optionally return sources
)
query = "What is RAG?"
result_stuff = qa_chain_stuff.invoke({"query": query})
print(f"\n--- RetrievalQA (Stuff) Result for '{query}' ---")
print(f"Answer: {result_stuff['result']}")
# print(f"Source Documents: {len(result_stuff['source_documents'])} found")
except Exception as e:
print(f"Error with RetrievalQA (stuff): {e}")
# 2. RetrievalQA with Map-Reduce
try:
qa_chain_map_reduce = RetrievalQA.from_chain_type(
llm=llm,
chain_type="map_reduce", # Suitable for many documents
retriever=retriever,
return_source_documents=True,
# chain_type_kwargs can be added here if needed for map/combine prompts
)
query = "Summarise the key aspects of RAG."
result_map_reduce = qa_chain_map_reduce.invoke({"query": query})
print(f"\n--- RetrievalQA (Map-Reduce) Result for '{query}' ---")
print(f"Answer: {result_map_reduce['result']}")
except Exception as e:
print(f"Error with RetrievalQA (map_reduce): {e}")
# 3. RetrievalQA with Refine
try:
qa_chain_refine = RetrievalQA.from_chain_type(
llm=llm,
chain_type="refine", # Suitable for building response iteratively
retriever=retriever,
return_source_documents=True,
# chain_type_kwargs can be added here if needed for refine prompts
)
query = "Provide a detailed explanation of RAG benefits."
result_refine = qa_chain_refine.invoke({"query": query})
print(f"\n--- RetrievalQA (Refine) Result for '{query}' ---")
print(f"Answer: {result_refine['result']}")
except Exception as e:
print(f"Error with RetrievalQA (refine): {e}")
# 4. Custom Prompt for Generation
custom_prompt_template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Helpful Answer:
"""
CUSTOM_PROMPT = PromptTemplate(
template=custom_prompt_template, input_variables=["context", "question"]
)
try:
qa_chain_custom = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": CUSTOM_PROMPT}
)
query = "How does RAG help with hallucinations?"
result_custom = qa_chain_custom.invoke({"query": query})
print(f"\n--- RetrievalQA (Custom Prompt) Result for '{query}' ---")
print(f"Answer: {result_custom['result']}")
except Exception as e:
print(f"Error with RetrievalQA (custom prompt): {e}")
else:
print("Skipping Generation examples as retriever is not defined.")
Advanced RAG Techniques
Beyond the basics, several advanced techniques can significantly improve the performance and reliability of your RAG systems:
1. Hybrid Search
Combine semantic search (vector search) with traditional keyword search (e.g., BM25) to leverage the strengths of both approaches.
Hybrid Search Benefits:
- Improves retrieval for queries with specific keywords or jargon
- Catches relevant documents missed by purely semantic search
- More robust to variations in query phrasing
# Ensure necessary libraries are installed
# pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# Assume 'chunks' is a list of Document objects
# Assume 'faiss_vectorstore' is an initialized FAISS vector store
if 'chunks' in locals() and 'faiss_vectorstore' in locals():
try:
# 1. Initialize BM25 Retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2 # Retrieve top 2 keyword matches
print("Initialized BM25 Retriever.")
# 2. Initialize Semantic Retriever (using FAISS example)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})
print("Using FAISS as semantic retriever.")
# 3. Initialize Ensemble Retriever
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, faiss_retriever],
weights=[0.5, 0.5] # Assign equal weight to both retrievers
)
print("Initialized Ensemble Retriever for Hybrid Search.")
# Example Hybrid Search
query = "RAG architecture components"
hybrid_results = ensemble_retriever.invoke(query)
print(f"\n--- Hybrid Search Results for '{query}' ({len(hybrid_results)} docs) ---")
# for doc in hybrid_results:
# print(f"- {doc.page_content[:100]}...")
except ImportError:
print("BM25 library not found (pip install rank_bm25). Skipping Hybrid Search example.")
except Exception as e:
print(f"Error setting up Hybrid Search: {e}")
else:
print("Skipping Hybrid Search example due to missing chunks or vector store.")
2. Re-ranking Retrieved Documents
Use a more sophisticated model (like an LLM or a specialised cross-encoder) to re-rank the initially retrieved documents for better relevance.
Re-ranking Benefits:
- Improves precision by pushing the most relevant documents to the top
- Can consider interactions between the query and document content more deeply
- Reduces noise passed to the final generation step
# Ensure necessary libraries are installed
# pip install langchain-community sentence-transformers
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Assume 'simple_retriever' is initialized (e.g., from FAISS or Chroma)
if 'simple_retriever' in locals():
try:
# Initialize a CrossEncoder model for re-ranking
# Common models: 'cross-encoder/ms-marco-MiniLM-L-6-v2', 'BAAI/bge-reranker-large'
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
# Create the compressor using the reranker
compressor = CrossEncoderReranker(model=model, top_n=3) # Keep top 3 after re-ranking
# Create the compression retriever
reranking_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=simple_retriever
)
print("Initialized Re-ranking Retriever.")
# Example usage
query = "Explain the RAG pipeline step-by-step"
reranked_results = reranking_retriever.invoke(query)
print(f"\n--- Re-ranked Results for '{query}' ({len(reranked_results)} docs) ---")
# for doc in reranked_results:
# print(f"- {doc.page_content[:100]}...")
except ImportError:
print("Required libraries for reranking not found. pip install langchain-community sentence-transformers")
except Exception as e:
print(f"Error setting up Re-ranking Retriever: {e}")
else:
print("Skipping Re-ranking example due to missing base retriever.")
3. Query Transformations
Modify the user's query before retrieval to improve relevance. Techniques include:
- Hypothetical Document Embeddings (HyDE): Generate a hypothetical answer first, embed that answer, and retrieve documents similar to the hypothetical answer.
- Multi-Query Retriever: Use an LLM to generate multiple related queries from the original query and retrieve documents for all of them.
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI # Ensure model imported
# Assume 'llm' and 'vectorstore' are initialized
# Define dummy data if needed
if 'llm' not in locals() or 'vectorstore' not in locals():
llm = ChatOpenAI(temperature=0) # Dummy LLM
class DummyVectorStore:
def as_retriever(self): return DummyRetriever()
class DummyRetriever:
def invoke(self, q): return [Document(page_content=f"Dummy for '{q}'")]
vectorstore = DummyVectorStore()
print("Using dummy LLM/VectorStore for Query Transformation examples.")
# 1. HyDE (Conceptual Example - LangChain implementation varies)
# Typically involves an LLMChain to generate the hypothetical doc first
hypothetical_doc_prompt = PromptTemplate.from_template(
"Generate a hypothetical answer to the question: {question}"
)
hypothetical_doc_chain = LLMChain(llm=llm, prompt=hypothetical_doc_prompt)
# query = "What is the future of RAG systems?"
# hypothetical_answer = hypothetical_doc_chain.invoke({"question": query})['text']
# Embed hypothetical_answer and use it for retrieval from vectorstore
# (Manual implementation or specific LangChain HyDE components needed)
print("Skipping HyDE execution example (requires manual embedding/search or specific components).")
# 2. Multi-Query Retriever
try:
mq_retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=llm
)
print("Initialized Multi-Query Retriever.")
# Example usage
query = "Tell me about RAG limitations and solutions."
mq_results = mq_retriever.invoke(query)
print(f"\n--- Multi-Query Results for '{query}' ({len(mq_results)} docs) ---")
# Note: This retriever often returns duplicates due to retrieving for multiple queries.
# Deduplication might be needed depending on the use case.
# unique_contents = {doc.page_content for doc in mq_results}
# print(f"Unique results: {len(unique_contents)}")
# for content in list(unique_contents)[:3]: # Print first 3 unique results
# print(f"- {content[:100]}...")
except Exception as e:
print(f"Error setting up Multi-Query Retriever: {e}")
4. Fine-tuning Embedding Models
For highly specialised domains, fine-tuning an embedding model on your specific data can significantly improve retrieval performance.
Fine-tuning Considerations:
Fine-tuning requires a labelled dataset of (query, relevant passage) pairs and significant computational resources. It is generally considered an advanced optimisation step, best attempted after the other techniques above have been exhausted.
Open-source embedding models can be fine-tuned using libraries like sentence-transformers; hosted embedding APIs (such as OpenAI's) generally do not expose embedding fine-tuning, so open-source models are the usual route for this step.
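As a rough illustration of the sentence-transformers route, here is a minimal fine-tuning sketch using (query, relevant passage) pairs with MultipleNegativesRankingLoss; the training pairs, hyperparameters, and output path are placeholders.
# Fine-tuning sketch with sentence-transformers (pip install sentence-transformers).
# Training pairs, hyperparameters, and output path are illustrative placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

base_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_examples = [
    InputExample(texts=["What is RAG?",
                        "Retrieval Augmented Generation grounds LLM answers in retrieved documents."]),
    InputExample(texts=["How are document chunks stored?",
                        "Chunks are embedded and stored in a vector database for similarity search."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(base_model)  # in-batch passages act as negatives

base_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="./fine_tuned_embedding_model",  # load later via HuggingFaceEmbeddings(model_name=...)
)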
Evaluating RAG Systems
Evaluating RAG systems involves assessing both the retrieval and generation components; a toy Hit Rate/MRR computation follows the retrieval metrics below.
Retrieval Evaluation Metrics
- Hit Rate: Percentage of queries for which at least one relevant document is retrieved.
- Mean Reciprocal Rank (MRR): Average of the reciprocal ranks of the first relevant document.
- Normalized Discounted Cumulative Gain (NDCG): Measures ranking quality, considering the position of relevant documents.
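A toy sketch of Hit Rate and MRR, assuming you have per-query relevance judgements for the ranked documents each retriever returns (the labels below are made up).
# Toy Hit Rate / MRR computation (relevance labels are made-up examples).
# Each inner list records, rank by rank, whether the retrieved document was relevant.
relevance_per_query = [
    [False, True, False],   # first relevant document at rank 2
    [True, False, False],   # first relevant document at rank 1
    [False, False, False],  # no relevant document retrieved
]

hit_rate = sum(any(ranks) for ranks in relevance_per_query) / len(relevance_per_query)
reciprocal_ranks = [
    next((1.0 / (rank + 1) for rank, relevant in enumerate(ranks) if relevant), 0.0)
    for ranks in relevance_per_query
]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"Hit Rate: {hit_rate:.2f}, MRR: {mrr:.2f}")  # Hit Rate: 0.67, MRR: 0.50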
Generation Evaluation Metrics
- Faithfulness / Groundedness: How well the generated answer is supported by the retrieved context.
- Answer Relevance: How well the generated answer addresses the user's query.
- Answer Correctness: Factual accuracy of the generated answer (often requires human evaluation).
RAG Evaluation Frameworks
Frameworks like Ragas and LangChain's evaluation modules provide tools for automated RAG evaluation:
# Ensure ragas is installed: pip install ragas
# Ragas Example (Conceptual - requires dataset setup)
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
# # Assume you have a dataset with questions, answers, contexts, ground_truths
# data_samples = {
# 'question': ['What is RAG?'],
# 'answer': ['RAG combines retrieval with generation...'],
# 'contexts' : [[
# 'Retrieval Augmented Generation (RAG) enhances LLMs...',
# 'RAG systems use external knowledge...'
# ]],
# 'ground_truth': ['RAG is a technique to improve LLMs by retrieving external documents...']
# }
# dataset = Dataset.from_dict(data_samples)
# try:
# score = evaluate(
# dataset,
# metrics=[
# faithfulness, # How factual is the answer based on context?
# answer_relevancy, # How relevant is the answer to the question?
# context_precision, # Signal-to-noise ratio in retrieved context
# context_recall, # Ability to retrieve all necessary context
# ]
# )
# print("Ragas Evaluation Score:")
# print(score)
# except Exception as e:
# print(f"Ragas Evaluation Error: {e}")
print("Skipping Ragas example as it requires dataset setup.")
# LangChain Evaluation Example (Conceptual)
# from langchain.evaluation import load_evaluator
# # Assume llm is initialized
# # Assume retriever returns relevant docs for a query
# try:
# # Evaluator for checking if the answer is grounded in the documents
# faithfulness_evaluator = load_evaluator("labeled_score_string", criteria="faithfulness", llm=llm)
# query = "What is the capital of France?"
# context = "Paris is the capital and most populous city of France."
# prediction = "The capital of France is Paris."
# eval_result = faithfulness_evaluator.evaluate_strings(
# prediction=prediction,
# input=query,
# reference=context # Context acts as the reference for faithfulness
# )
# print("LangChain Faithfulness Evaluation Result:")
# print(eval_result)
# except Exception as e:
# print(f"LangChain Evaluation Error: {e}")
print("Skipping LangChain evaluation example.")
Next Steps: Introduction to AI Agents
Understanding RAG is fundamental to building knowledgeable AI agents. With RAG providing access to information, the next step is to understand how agents use this information alongside tools and reasoning capabilities to perform complex tasks.
Key Takeaways from This Section:
- RAG enhances LLMs by grounding them in external knowledge
- Building RAG involves document processing, embedding, vector storage, retrieval, and generation
- Effective chunking, embedding model selection, and vector databases are crucial
- Advanced techniques like hybrid search, re-ranking, and query transformations improve performance
- Evaluating RAG requires assessing both retrieval quality and generation faithfulness
In the next section, we formally introduce AI Agents, exploring their core concepts, architectures, and how they differ from traditional AI systems.
Continue to Introduction to AI Agents →