What is RAG?
RAG (retrieval-augmented generation) grounds LLM responses in your own data, which reduces hallucinations and lets you give users source-cited answers. The pipeline has two phases: offline indexing, and online retrieval plus generation.
Document Ingestion
LangChain loaders handle PDF, DOCX, HTML, Notion, and Google Drive. Split documents into overlapping chunks (for example, 512 tokens with a 64-token overlap) so that context is preserved across chunk boundaries.
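To make the overlap concrete, here is a minimal sliding-window sketch in plain Python. It is not LangChain's splitter: the function name `chunk_tokens` is hypothetical, and it assumes the text has already been tokenized into a list (real pipelines would count tokens with the model's tokenizer rather than splitting on whitespace).

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, the window advances 448 tokens at a time, so the last 64 tokens of each chunk reappear at the start of the next one.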
Embeddings and Vector Store
Use OpenAI text-embedding-3-small or a local model via Ollama. Store vectors in ChromaDB for development; switch to Pinecone or pgvector for production scale.
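As a rough illustration of what a vector store does at query time, the toy sketch below ranks stored vectors by cosine similarity. It is an in-memory stand-in, not the ChromaDB, Pinecone, or pgvector API; the class `InMemoryVectorStore` and the three-dimensional vectors are made up for the example, and real embeddings would come from the embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical toy store: brute-force search over a Python list."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, doc_id: str, vector: Sequence[float]) -> None:
        self._items.append((doc_id, list(vector)))

    def query(self, vector: Sequence[float], top_k: int = 3) -> List[str]:
        # Score every stored vector, then return the ids of the best matches.
        scored = [(cosine_similarity(vector, v), doc_id) for doc_id, v in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]
```

Production stores replace the brute-force scan with an approximate nearest-neighbor index, which is the main reason to move off a development setup as the corpus grows.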
Hybrid Search
Combine semantic similarity search with BM25 keyword search, then apply a cross-encoder re-ranker to the merged candidates. This typically gives better retrieval accuracy than vector search alone.
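The text does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used here. The function name and the document ids are illustrative, `k=60` is the conventional smoothing constant, and the cross-encoder re-ranking step is left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by both retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example: ids returned by a vector search and by a BM25 search (made-up data).
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores are on different scales. The fused list would then be passed to the re-ranker.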
FastAPI Endpoint
Expose the RAG chain as a streaming FastAPI endpoint that returns source citations alongside each answer. Citations are essential for user trust.
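One possible shape for the streamed payload is sketched below using only the standard library: answer tokens are emitted as server-sent events, followed by a final event that carries the citations. The event schema (`type`, `text`, `sources`) and the names `sse_event`, `stream_answer`, and `collect` are assumptions made for illustration, and the token list stands in for the output of the RAG chain.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, List


def sse_event(payload: dict) -> str:
    """Format one server-sent event carrying a JSON payload."""
    return f"data: {json.dumps(payload)}\n\n"


async def stream_answer(tokens: Iterable[str], sources: List[str]) -> AsyncIterator[str]:
    """Yield answer tokens as they arrive, then a final event listing the sources."""
    for token in tokens:
        yield sse_event({"type": "token", "text": token})
        await asyncio.sleep(0)  # give other tasks a chance to run between tokens
    yield sse_event({"type": "sources", "sources": sources})


async def collect(events: AsyncIterator[str]) -> List[str]:
    """Helper for trying the generator outside a web server."""
    return [event async for event in events]
```

In a FastAPI route, an async generator like this could be wrapped in `StreamingResponse(..., media_type="text/event-stream")` from `fastapi.responses`. Sending the sources as a separate final event lets the client render the answer incrementally and attach citations once generation finishes.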