
Building Production-Ready RAG Systems: From Basics to Advanced Techniques

Master Retrieval-Augmented Generation (RAG) systems. Learn how to build, optimize, and deploy RAG applications that combine the power of LLMs with your own data.

Sani Mridha, Senior Mobile Developer

2024-01-10 · 15 min read

Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications. Let's explore how to create production-ready RAG systems that actually work.

What is RAG?

RAG combines the power of Large Language Models (LLMs) with external knowledge retrieval. Instead of relying solely on the model's training data, RAG systems:

1. Retrieve relevant information from a knowledge base

2. Augment the prompt with retrieved context

3. Generate responses using both the LLM and retrieved data
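
To make the loop concrete before adding any libraries, here is a deliberately tiny, dependency-free sketch: naive word-overlap retrieval stands in for real embeddings, and the final prompt is what you would hand to an LLM.

# Toy RAG flow -- word overlap instead of embeddings, prompt printed instead of sent to an LLM
knowledge_base = [
    "RAG retrieves documents that are relevant to the user's query.",
    "The retrieved text is added to the prompt as context.",
    "The LLM generates an answer grounded in that context.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # 1. Retrieve: rank documents by word overlap with the query
    words = set(query.lower().split())
    ranked = sorted(knowledge_base, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def build_prompt(query: str) -> str:
    # 2. Augment: stuff the retrieved chunks into the prompt
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# 3. Generate: in a real system this prompt goes to an LLM
print(build_prompt("How does RAG reduce hallucinations?"))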

Why RAG Matters

Traditional LLM Limitations

  • Knowledge Cutoff: Models only know information up to their training date
  • Hallucinations: Models can confidently generate false information
  • No Domain Specificity: Generic models lack specialized knowledge
  • No Real-Time Data: Can't access current information

RAG Solutions

  • Up-to-Date Information: Query live databases and documents
  • Reduced Hallucinations: Ground responses in actual data
  • Domain Expertise: Inject specialized knowledge
  • Source Attribution: Cite specific documents

Architecture Overview

    Basic RAG Pipeline

    User Query → Embedding → Vector Search → Context Retrieval → LLM → Response

    Components Breakdown

    1. Document Ingestion: Load and process documents

    2. Chunking: Split documents into manageable pieces

    3. Embedding: Convert text to vector representations

    4. Vector Store: Store and index embeddings

    5. Retrieval: Find relevant chunks for queries

    6. Generation: Create responses with LLM

    Building Your First RAG System

    Step 1: Document Processing

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    def process_documents(documents):
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        
        chunks = text_splitter.split_documents(documents)
        return chunks
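
    The function above assumes documents are already loaded. One simple way to get them, assuming a local ./docs folder of markdown files (adjust the path and glob to your corpus), is LangChain's directory loader:

    from langchain.document_loaders import DirectoryLoader, TextLoader
    
    # Load every markdown file under ./docs as a LangChain Document
    loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
    documents = loader.load()
    
    chunks = process_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")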

    Step 2: Create Embeddings

    from langchain.embeddings import OpenAIEmbeddings
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    
    # Convert chunks to vectors
    vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])

    Step 3: Vector Store Setup

    from langchain.vectorstores import Pinecone
    import pinecone
    
    pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
    
    vectorstore = Pinecone.from_documents(
        chunks,
        embeddings,
        index_name="my-rag-index"
    )
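
    If you want to prototype without a hosted index, a local FAISS store (pip install faiss-cpu) is a common drop-in alternative during development; the rest of the pipeline stays the same.

    from langchain.vectorstores import FAISS
    
    # Local, in-process vector store -- useful before committing to Pinecone
    local_vectorstore = FAISS.from_documents(chunks, embeddings)
    local_vectorstore.save_local("faiss_index")  # persist to disk so you can reload it later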

    Step 4: Retrieval Chain

    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
    )
    
    # Query the system
    response = qa_chain.run("What is the new architecture in React Native?")
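
    Since source attribution is one of RAG's main selling points, it is worth enabling return_source_documents so every answer can cite where it came from. A small variation on the chain above:

    qa_chain_with_sources = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True
    )
    
    result = qa_chain_with_sources({"query": "What is the new architecture in React Native?"})
    print(result["result"])
    for doc in result["source_documents"]:
        print("Source:", doc.metadata.get("source"))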

    Advanced Techniques

    1. Hybrid Search

    Combine vector search with keyword search for better results:

    from langchain.retrievers import EnsembleRetriever
    from langchain.retrievers import BM25Retriever
    
    # Vector retriever
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
    
    # Keyword retriever
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = 5
    
    # Ensemble retriever
    ensemble_retriever = EnsembleRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        weights=[0.5, 0.5]
    )
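
    The ensemble retriever behaves like any other retriever, so it can be queried directly or dropped into the RetrievalQA chain unchanged:

    # Merge vector and keyword results for a single query
    docs = ensemble_retriever.get_relevant_documents("React Native performance tips")
    
    # Or use it as the retriever behind the QA chain
    hybrid_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),
        chain_type="stuff",
        retriever=ensemble_retriever
    )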

    2. Re-ranking

    Improve retrieval quality with re-ranking:

    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import CohereRerank
    
    compressor = CohereRerank(model="rerank-english-v2.0")
    
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vector_retriever
    )

    3. Query Transformation

    Enhance queries before retrieval:

    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    
    query_transform_prompt = PromptTemplate(
        input_variables=["question"],
        template="""Given the user question, generate 3 different versions 
        of the question to retrieve relevant documents:
        
        Original: {question}
        
        Variations:"""
    )
    
    llm = OpenAI(temperature=0)  # reuse the LLM imported in Step 4
    query_transformer = LLMChain(llm=llm, prompt=query_transform_prompt)
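
    One way to use the variations, sketched below, is to retrieve for each rewrite and deduplicate the merged results (LangChain's MultiQueryRetriever automates a similar idea); the example question is hypothetical:

    original = "How do I speed up FlatList rendering?"
    
    # Generate the rewrites, one per line
    variations = query_transformer.run(question=original)
    queries = [original] + [v.strip() for v in variations.split("\n") if v.strip()]
    
    # Retrieve for every query and deduplicate by content
    seen, merged_docs = set(), []
    for q in queries:
        for doc in vector_retriever.get_relevant_documents(q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged_docs.append(doc)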

    4. Metadata Filtering

    Add metadata for precise filtering:

    # Add metadata during ingestion (attach it to each Document before indexing;
    # the loader already sets metadata["source"] for every chunk)
    for chunk in chunks:
        chunk.metadata.update({
            "date": "2024-01-15",
            "category": "technical",
            "author": "Sani Mridha"
        })
    
    # Filter during retrieval (filter syntax varies by vector store; this is Pinecone-style)
    results = vectorstore.similarity_search(
        "React Native performance",
        k=5,
        filter={"category": "technical", "date": {"$gte": "2024-01-01"}}
    )

    Optimization Strategies

    Chunking Strategies

    Different strategies for different content:

    # For code documentation
    code_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\nclass ", "\ndef ", "\n\n", "\n", " "]
    )
    
    # For narrative content
    narrative_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=300,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]
    )

    Embedding Model Selection

    | Model | Dimensions | Use Case |
    |-------|------------|----------|
    | text-embedding-3-small | 1536 | Fast, cost-effective |
    | text-embedding-3-large | 3072 | High accuracy |
    | all-MiniLM-L6-v2 | 384 | Local deployment |
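
    For the local-deployment row in the table, a Hugging Face embedding model can be swapped in without touching the rest of the pipeline (requires the sentence-transformers package):

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS
    
    # 384-dimensional local embeddings -- no API calls, runs on CPU
    local_embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    
    local_vectorstore = FAISS.from_documents(chunks, local_embeddings)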

    Prompt Engineering

    Optimize your RAG prompts:

    rag_prompt = """Use the following context to answer the question. 
    If you cannot answer based on the context, say so clearly.
    
    Context:
    {context}
    
    Question: {question}
    
    Instructions:
    1. Answer based only on the provided context
    2. Cite specific sources when possible
    3. If uncertain, express your level of confidence
    4. Keep answers concise but complete
    
    Answer:"""
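
    To actually wire this prompt into the chain, wrap it in a PromptTemplate and pass it through chain_type_kwargs, which is how the stuff chain accepts a custom prompt:

    from langchain.prompts import PromptTemplate
    
    prompt = PromptTemplate(
        template=rag_prompt,
        input_variables=["context", "question"]
    )
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        chain_type_kwargs={"prompt": prompt}
    )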

    Production Considerations

    1. Caching

    Implement caching for frequent queries:

    from functools import lru_cache
    
    @lru_cache(maxsize=1000)
    def cached_retrieval(query: str):
        return vectorstore.similarity_search(query, k=4)

    2. Error Handling

    Robust error handling is crucial:

    import time
    
    def safe_rag_query(query: str, max_retries: int = 3):
        for attempt in range(max_retries):
            try:
                results = qa_chain.run(query)
                return results
            except Exception as e:
                if attempt == max_retries - 1:
                    return "I'm having trouble processing your request. Please try again."
                time.sleep(2 ** attempt)

    3. Monitoring

    Track key metrics:

    import time
    
    def monitored_rag_query(query: str):
        start_time = time.time()
        
        # Retrieval metrics
        retrieval_start = time.time()
        docs = vectorstore.similarity_search(query, k=4)
        retrieval_time = time.time() - retrieval_start
        
        # Generation metrics: build the prompt from the retrieved docs and call the LLM
        generation_start = time.time()
        context = "\n\n".join(doc.page_content for doc in docs)
        response = llm(rag_prompt.format(context=context, question=query))
        generation_time = time.time() - generation_start
        
        # Log metrics (log_metrics stands in for whatever metrics sink you use)
        log_metrics({
            "total_time": time.time() - start_time,
            "retrieval_time": retrieval_time,
            "generation_time": generation_time,
            "num_docs_retrieved": len(docs)
        })
        
        return response

    Real-World Use Cases

    Customer Support Bot

    from langchain.chat_models import ChatOpenAI
    
    # Ingest support documentation (load_support_documentation and create_vectorstore
    # are your own loading/indexing helpers)
    support_docs = load_support_documentation()
    support_vectorstore = create_vectorstore(support_docs)
    
    # Create specialized chain
    support_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(temperature=0.3),
        retriever=support_vectorstore.as_retriever(),
        return_source_documents=True
    )
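
    Calling the chain returns both the answer and the documents it was grounded in (the sample question is hypothetical):

    result = support_chain({"query": "How do I reset my password?"})
    print(result["result"])
    print("Sources:", [doc.metadata.get("source") for doc in result["source_documents"]])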

    Code Documentation Assistant

    # Ingest codebase
    code_docs = load_code_documentation()
    code_vectorstore = create_vectorstore(code_docs)
    
    # Query with code context
    code_assistant = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-4"),
        retriever=code_vectorstore.as_retriever(search_kwargs={"k": 6})
    )

    Common Pitfalls

    1. Poor Chunking

    Bad: Fixed 500-character chunks without context

    Good: Semantic chunking with overlap

    2. Ignoring Metadata

    Bad: Storing only text content

    Good: Rich metadata for filtering and ranking

    3. No Evaluation

    Bad: Deploy without testing

    Good: Comprehensive evaluation pipeline

    Evaluation Framework

    def evaluate_rag_system(test_cases):
        metrics = {
            "retrieval_precision": [],
            "answer_relevance": [],
            "faithfulness": []
        }
        
        for case in test_cases:
            # Retrieve documents
            docs = retriever.get_relevant_documents(case["query"])
            
            # Check if correct docs retrieved
            precision = calculate_precision(docs, case["expected_docs"])
            metrics["retrieval_precision"].append(precision)
            
            # Generate answer
            answer = qa_chain.run(case["query"])
            
            # Evaluate answer quality
            relevance = evaluate_relevance(answer, case["expected_answer"])
            faithfulness = evaluate_faithfulness(answer, docs)
            
            metrics["answer_relevance"].append(relevance)
            metrics["faithfulness"].append(faithfulness)
        
        return metrics
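
    A test case here is just a dictionary; even a handful of hand-written cases (the one below is hypothetical) will catch most regressions:

    test_cases = [
        {
            "query": "How do I enable the new React Native architecture?",
            "expected_docs": ["docs/new-architecture.md"],
            "expected_answer": "Enable the new architecture flag in your build configuration.",
        },
    ]
    
    metrics = evaluate_rag_system(test_cases)
    
    # Report the average of each metric
    print({name: sum(values) / len(values) for name, values in metrics.items()})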

    Conclusion

    RAG systems are powerful but require careful design and optimization. Focus on:

    1. Quality Data: Clean, well-structured documents

    2. Smart Chunking: Context-aware splitting

    3. Effective Retrieval: Hybrid search + re-ranking

    4. Robust Generation: Well-crafted prompts

    5. Continuous Evaluation: Monitor and improve

    Start simple, measure everything, and iterate based on real usage patterns.

    ---

    *Building a RAG system? I'd love to hear about your use case!*
