Building Production-Ready RAG Systems: A Comprehensive Guide

Building production-ready RAG (Retrieval-Augmented Generation) systems requires more than just connecting an LLM to a vector database. This guide draws on my experience building multilingual enterprise chatbots with Claude 3 and Cohere v3, and covers what it takes to build RAG systems that actually work at scale.

What is RAG and When to Use It

RAG combines the power of large language models with external knowledge retrieval. Instead of relying solely on the LLM's training data, RAG systems retrieve relevant information from your documents and provide it as context to the LLM.

Use RAG when:

  • You need to answer questions about private/proprietary data
  • Your domain knowledge changes frequently
  • You want to cite sources and verify responses
  • Fine-tuning is too expensive or time-consuming

Skip RAG when:

  • You need the model to perform a specific task consistently (use fine-tuning)
  • Your use case doesn't require external knowledge
  • Latency is critical and you can't afford retrieval overhead

Core Architecture Components

1. Document Processing Pipeline

The foundation of any RAG system is how you process and chunk your documents:

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

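# Smart chunking strategy: split on paragraph breaks first, then lines,
# then sentences, then words, and only fall back to raw characters last
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# documents is a list of LangChain Document objects loaded beforehand
chunks = text_splitter.split_documents(documents)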

Key lessons:

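  • Chunk size matters: 500-1000 tokens works well for most cases (RecursiveCharacterTextSplitter counts characters by default, so chunk_size=1000 means 1000 characters unless you pass a token-based length function)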
  • Always include overlap (10-20%) to preserve context
  • Respect document structure (paragraphs, sections)
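
To make the overlap idea concrete, here is a minimal, dependency-free sketch of paragraph-aware chunking. It is not how RecursiveCharacterTextSplitter works internally; chunk_paragraphs and its parameters are names I made up for illustration, and sizes are in characters to keep things simple.

```python
def chunk_paragraphs(text, chunk_size=1000, overlap=200):
    """Greedily pack whole paragraphs into chunks, carrying a tail of the
    previous chunk forward so context survives the boundary.

    Simplifications: a single paragraph longer than chunk_size is kept
    whole, and the carried tail can push a chunk slightly over chunk_size.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size or not current:
            current = candidate
            continue
        # The next paragraph does not fit: close the current chunk
        chunks.append(current)
        tail = current[-overlap:] if overlap else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries first is what "respect document structure" means in practice: a chunk never starts halfway through a paragraph, apart from the overlap tail.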

2. Embedding Strategy

For the enterprise project, we used Cohere v3 embeddings which excel at multilingual scenarios:

python
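from cohere import Client

cohere_client = Client(api_key="your-key")

# embed() expects plain strings, so pull the text out of each chunk
response = cohere_client.embed(
    texts=[chunk.page_content for chunk in chunks],
    model="embed-multilingual-v3.0",
    input_type="search_document"
)
embeddings = response.embeddings  # one vector per chunk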

Best practices:

  • Use different input types: "search_document" for indexing, "search_query" for queries
  • Batch embed for efficiency (up to 96 docs per request with Cohere)
  • Store metadata with embeddings (source, page number, timestamp)
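
The batching and metadata points can be sketched roughly like this. The embed_batch argument is a stand-in for whatever function actually calls your embedding API (it is not a Cohere SDK function), and the record shape with a "text" key plus metadata keys is an assumption for the example.

```python
from typing import Callable, Iterator

MAX_BATCH = 96  # per-request limit quoted above for Cohere


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_with_metadata(
    records: list[dict],
    embed_batch: Callable[[list[str]], list[list[float]]],
) -> list[dict]:
    """Embed records in batches and keep each vector next to its metadata.

    records: dicts with a 'text' key plus metadata (source, page, ...).
    embed_batch: hypothetical callable wrapping the real embedding API.
    """
    indexed = []
    for batch in batched(records, MAX_BATCH):
        vectors = embed_batch([record["text"] for record in batch])
        for record, vector in zip(batch, vectors):
            indexed.append({**record, "embedding": vector})
    return indexed
```

Keeping the metadata attached at this stage makes citations much easier later, since every retrieved vector already knows its source and page.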

3. Vector Database Selection

We used FAISS for development and AWS SageMaker for production:

python
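from langchain.embeddings import CohereEmbeddings
from langchain.vectorstores import FAISS

# FAISS.from_documents expects an embedding model object, not precomputed vectors
embedding_model = CohereEmbeddings(
    model="embed-multilingual-v3.0",
    cohere_api_key="your-key"
)

# Create FAISS index (embeds each chunk and keeps its metadata)
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embedding_model
)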

# Similarity search
results = vectorstore.similarity_search_with_score(
    query="your question",
    k=5
)

Production considerations:

  • FAISS: Great for prototyping, limited scalability
  • Pinecone/Weaviate: Managed, scales well
  • PostgreSQL with pgvector: Good for < 1M vectors

4. Advanced Prompt Engineering

This is where RAG systems succeed or fail:

python
prompt_template = """You are an expert assistant for the enterprise.
Use the following context to answer the question. If the answer
is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer in {language}:"""

Key techniques:

  • Explicitly instruct what to do when information is missing
  • Include language instruction for multilingual systems
  • Add role/persona for consistent tone
  • Request citations: "Quote the relevant section"
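
Here is one way these techniques might come together when filling the template. The build_prompt helper, the numbered [n] citation format, and the chunk shape (dicts with "text" and "source") are my assumptions for this sketch, not a fixed convention.

```python
PROMPT_TEMPLATE = """You are an expert assistant for the enterprise.
Use the following context to answer the question. If the answer
is not in the context, say "I don't have enough information."
Quote the relevant section and cite it by its [number].

Context:
{context}

Question: {question}

Answer in {language}:"""


def build_prompt(chunks, question, language="English"):
    """Number each retrieved chunk so the model can cite it as [1], [2], ...

    chunks: list of dicts with 'text' and 'source' keys (assumed shape).
    """
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(
        context=context, question=question, language=language
    )
```

Numbering the chunks gives you something to check afterwards: if the answer cites [3] and only two chunks were provided, you know the citation was invented.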

Production Challenges and Solutions

Challenge 1: Multilingual Query Understanding

Problem: Arabic queries with English technical terms, or vice versa.

Solution:

  • Use multilingual embeddings (Cohere v3, multilingual-e5)
  • Translate queries to multiple languages before retrieval
  • Create hybrid search (semantic + keyword)

python
# Hybrid search example
# Assumes bm25_retriever is a LangChain BM25Retriever with bm25_retriever.k = 10
semantic_results = vectorstore.similarity_search(query, k=10)
keyword_results = bm25_retriever.get_relevant_documents(query)

# Rerank combined results; rerank() is a placeholder for your reranker (e.g. Cohere Rerank)
final_results = rerank(semantic_results + keyword_results, query)
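
If you don't have a reranking model available yet, reciprocal rank fusion (RRF) is a simple way to merge the two result lists. To be clear, this is a different technique from the model-based reranking above, and reciprocal_rank_fusion is just an illustrative helper working on document IDs.

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=5):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both retrievers rise to the top. k=60 is the constant
    commonly used for RRF; treat it as a tunable assumption.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:top_n]


semantic_ids = ["a", "b", "c"]
keyword_ids = ["b", "d", "a"]
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids], top_n=4)
```

RRF only needs ranks, not scores, which is handy because cosine similarities and BM25 scores are not on comparable scales.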

Challenge 2: Latency Optimization

Problem: Retrieval + LLM inference = slow responses.

Solution:

  • Cache frequent queries
  • Reduce chunk size so less context is sent to the LLM
  • Use streaming responses
  • Run retrieval and reranking steps in parallel where possible

We achieved 2.3s average response time on the enterprise chatbot with these optimizations.
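
As a rough illustration of the parallel retrieval idea, the semantic and keyword searches can run concurrently because neither depends on the other. This is a sketch using the standard library; retrieve_parallel is a hypothetical helper, and each retriever is assumed to be a plain callable that takes the query string.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_parallel(query, retrievers):
    """Run every retriever on the same query at once.

    retrievers: list of callables taking the query and returning results.
    Results come back in the same order as the retrievers were given.
    Threads suit this case because retrieval is mostly network or disk I/O.
    """
    if not retrievers:
        return []
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(retriever, query) for retriever in retrievers]
        return [future.result() for future in futures]
```

With this shape, total retrieval time is roughly that of the slowest retriever instead of the sum of all of them.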

Challenge 3: Answer Quality and Hallucinations

Problem: LLMs sometimes ignore retrieved context or hallucinate.

Solution:

  • Use Claude 3 or GPT-4 (better at following instructions)
  • Implement confidence scoring
  • Add citation requirements
  • Human-in-the-loop for low-confidence answers

python
# Confidence scoring
def guard_answer(answer, confidence_score, threshold=0.7):
    if confidence_score < threshold:
        return ("I'm not confident about this answer. "
                "Please consult documentation.")
    return answer

Performance Metrics

Track these metrics in production:

  1. Retrieval Accuracy: Are relevant docs retrieved? (P@5, MRR)
  2. Answer Quality: Human evaluation, RAGAS scores
  3. Latency: P50, P95, P99 response times
  4. Cost: Embedding + LLM API costs per query
  5. User Satisfaction: Thumbs up/down, CSAT scores
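
For the retrieval metrics, the calculations are small enough to write by hand. The sketch below assumes you have, for each test query, the ranked list of retrieved document IDs and a set of IDs judged relevant; the function names are mine. Note that "relevant doc in top 5" is a hit rate, which is slightly different from precision at 5.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def hit_at_k(retrieved, relevant, k=5):
    """1.0 if any relevant ID appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_id_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Averaging hit_at_k over a labeled query set gives the kind of "relevant doc in top 5" percentage that is easy to track release over release.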

Our enterprise chatbot metrics:

  • 92% retrieval accuracy (relevant doc in top 5)
  • 2.3s average latency
  • 87% user satisfaction rate

Cost Optimization

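RAG can get expensive at scale. The prices below are list prices at the time of writing, so check current pricing before budgeting: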

Embedding costs:

  • Cohere v3: $0.10 per 1M tokens
  • OpenAI ada-002: $0.10 per 1M tokens
  • Open source (free but hosting costs)

LLM costs:

  • Claude 3 Haiku: $0.25/$1.25 per MTok (in/out)
  • GPT-3.5: $0.50/$1.50 per MTok
  • Use smaller models for simple queries
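
To see how these prices translate into a per-query figure, here is a back-of-the-envelope calculation using the Cohere v3 and Claude 3 Haiku prices listed above. The token counts (a 50-token question, 4,000 tokens of retrieved context, a 300-token answer) are illustrative assumptions, not measurements from our system.

```python
# USD per 1M tokens, taken from the lists above
PRICE_PER_MTOK = {
    "embed": 0.10,      # Cohere v3 embeddings
    "haiku_in": 0.25,   # Claude 3 Haiku input
    "haiku_out": 1.25,  # Claude 3 Haiku output
}


def cost_per_query(query_tokens, context_tokens, output_tokens):
    """Estimate the API cost of one RAG query in USD.

    The query is embedded once, then sent to the LLM together with the
    retrieved context; reranking and infrastructure costs are ignored.
    """
    embed = query_tokens * PRICE_PER_MTOK["embed"] / 1_000_000
    llm_in = (query_tokens + context_tokens) * PRICE_PER_MTOK["haiku_in"] / 1_000_000
    llm_out = output_tokens * PRICE_PER_MTOK["haiku_out"] / 1_000_000
    return embed + llm_in + llm_out


# Assumed token counts for a typical query
estimate = cost_per_query(query_tokens=50, context_tokens=4000, output_tokens=300)
print(f"${estimate:.7f} per query")
```

Under these assumptions the estimate comes to about $0.0014 per query, and almost all of it is LLM input and output tokens, which is why trimming context and routing to cheaper models matter more than embedding prices.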

Cost reduction strategies:

  • Cache embeddings (embed once, query many times)
  • Use semantic cache for repeated questions
  • Route simple queries to cheaper models
  • Implement query classification
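
A semantic cache can be sketched in a few lines. This version does a linear scan, which is fine for a small cache but would need a vector index at scale. SemanticCache, the embed callable, and the 0.95 threshold are all assumptions for illustration; tune the threshold on real traffic, since too low a value returns answers to the wrong question.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: str -> list[float]
        self.threshold = threshold  # assumed cutoff for "same question"
        self.entries = []           # (embedding, answer) pairs

    def get(self, query):
        """Return a cached answer for a close-enough earlier query, else None."""
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        """Store an answer under the embedding of its query."""
        self.entries.append((self.embed(query), answer))
```

On a cache hit you skip retrieval, reranking, and the LLM call entirely, so even a modest hit rate cuts both cost and latency.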

Deployment Architecture

Our production setup on AWS:

User Query
  ↓
API Gateway + Lambda
  ↓
Query Classification (simple vs complex)
  ↓
Vector Store (FAISS/SageMaker)
  ↓
Retrieve Top-K Documents
  ↓
Rerank (Cohere Rerank)
  ↓
LLM (Claude 3 via Bedrock)
  ↓
Stream Response

Key infrastructure:

  • Lambda for serverless scaling
  • SageMaker for embeddings and reranking
  • Bedrock for LLM access
  • CloudWatch for monitoring
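
In code, the flow in the diagram boils down to a chain of stages. The sketch below wires them together with injected callables so each stage can be swapped or tested on its own. The classify_query heuristic is a toy assumption (our production classifier is not shown here), and answer, retrieve, rerank, and generate are placeholder names.

```python
COMPLEX_MARKERS = {"why", "how", "compare", "explain"}


def classify_query(query):
    """Toy heuristic: long queries or ones with reasoning words are 'complex'."""
    words = query.lower().split()
    if len(words) > 12 or COMPLEX_MARKERS & set(words):
        return "complex"
    return "simple"


def answer(query, retrieve, rerank, generate, top_k=5):
    """Run the stages in the order shown in the diagram.

    retrieve(query) -> candidate documents
    rerank(query, candidates) -> candidates sorted best first
    generate(query, context, tier) -> final response; tier lets the
    generator pick a cheaper model for simple queries.
    """
    tier = classify_query(query)
    candidates = retrieve(query)
    context = rerank(query, candidates)[:top_k]
    return generate(query, context, tier)
```

Passing the stages in as arguments means the same function runs against FAISS locally and the managed services in production, with only the callables changing.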

Testing and Evaluation

Build a comprehensive test suite:

python
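import time

test_cases = [
    {
        "query": "What is the customer service number?",
        "expected_answer": "04-601-9999",
        "language": "en"
    },
    # Add 50-100 diverse test cases
]

for test in test_cases:
    # Time each query so the latency assertion has something to check
    start = time.perf_counter()
    response = rag_system.query(test["query"])
    response_time = time.perf_counter() - start  # seconds

    assert test["expected_answer"] in response
    assert response_time < 5.0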

Key Takeaways

  1. Chunking is critical: Bad chunks = bad retrieval = bad answers
  2. Prompt engineering matters: In my experience, roughly 80% of quality comes from good prompts
  3. Monitor everything: Latency, cost, quality metrics
  4. Start simple: Get basic RAG working before adding complexity
  5. Multilingual is hard: Use specialized embeddings and extensive testing

Next Steps

  • Implement hybrid search (semantic + keyword)
  • Add query classification for routing
  • Set up A/B testing for prompt variations
  • Build feedback loop for continuous improvement

RAG is powerful but requires careful engineering. Start with the basics, measure everything, and iterate based on real user feedback.


*Have questions about RAG systems? Feel free to reach out!*