How to Build a RAG Chatbot with LangChain | And Why Vector Search Is Just the Beginning
Build a working RAG chatbot for business documents with LangChain, OpenAI, and Chroma, plus the production gotchas most tutorials skip.


If you've ever wanted to build a chatbot that actually knows your company's documents instead of politely hallucinating, you're looking for RAG: Retrieval-Augmented Generation.
The good news: you can build a working prototype in an afternoon. The less-good news: the gap between "working prototype" and "production-ready chatbot" is where most projects quietly stall. This tutorial walks you through the full pipeline with LangChain, OpenAI, and Chroma, then points at the depth most beginner tutorials skip.
By the end, you'll have a chatbot that answers questions from a folder of business documents and a clear-eyed view of what to worry about next.
A plain language model only knows what it was trained on. It has never seen your employee handbook, your vendor contracts, or last quarter's compliance memo. Ask it about them and you get one of three outcomes: a confident hallucination, a refusal, or a generic non-answer.
RAG fixes that by doing two things at query time:

1. Retrieve the passages from your documents most relevant to the question (vector search).
2. Generate an answer grounded in those passages, which get stuffed into the model's prompt (the LLM).

That's the whole idea. Vector search handles step 1; the LLM handles step 2. Everything else is engineering around those two steps.
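Those two steps fit in a few lines of plain Python. In this toy sketch (every name here is made up for illustration), word overlap stands in for vector search and a format string stands in for the LLM call:

```python
# Toy illustration of RAG's two steps -- no libraries, all names made up.
# Word overlap stands in for vector search; a format string stands in for the LLM.
import re

def retrieve(query, chunks, k=2):
    """Step 1: score each chunk by word overlap with the query, return the top k."""
    q_words = set(re.findall(r"\w+", query.lower()))
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(re.findall(r"\w+", c.lower()))),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, context_chunks):
    """Step 2's input: the retrieved context stuffed into the LLM prompt."""
    context = "\n".join(context_chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
top = retrieve("What is the refund policy?", chunks, k=1)
print(build_prompt("What is the refund policy?", top))
```

The real pipeline below replaces the overlap scorer with embeddings and the format string with a chat model, but the shape stays exactly this.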
pip install langchain langchain-openai langchain-community langchain-chroma pypdf
export OPENAI_API_KEY="sk-..."

Chroma runs in-process with zero setup, perfect for learning. When you outgrow it, Pinecone and Weaviate are common managed alternatives that handle scaling, replication, and hybrid search out of the box.
LangChain has loaders for almost every format: PDFs, Word docs, Notion pages, Confluence, Google Drive, SharePoint. For business docs, PDFs dominate:
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("./docs")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

Each Document carries two important fields: page_content, the extracted text, and metadata, which records the source file and page number.
Keep that metadata. You'll use it later for citations, access control, and debugging ("why did it pull that chunk?").
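The citation use case is worth sketching now, because it shapes how you store chunks. A minimal sketch using plain dicts shaped like loader output (the "source" and "page" keys mirror what PDF loaders typically attach; the helper name is illustrative):

```python
# Build citation strings from chunk metadata rather than from model output.
# The metadata keys ("source", "page") mirror what PDF loaders typically attach.

def cite(chunks):
    """Return deduplicated 'file, p. N' citations for a list of retrieved chunks."""
    seen = []
    for chunk in chunks:
        meta = chunk["metadata"]
        # PDF loaders commonly 0-index pages; +1 gives the human-facing number.
        ref = f'{meta["source"]}, p. {meta["page"] + 1}'
        if ref not in seen:
            seen.append(ref)
    return seen

retrieved = [
    {"page_content": "Refunds are issued within 30 days.",
     "metadata": {"source": "handbook.pdf", "page": 11}},
    {"page_content": "Refund exceptions require approval.",
     "metadata": {"source": "handbook.pdf", "page": 11}},
]
print(cite(retrieved))  # -> ['handbook.pdf, p. 12']
```

Because the citations come from metadata you control, they can't be hallucinated, a point that matters again in the gotchas section.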
LLMs have context limits, and retrieval works better on small, focused passages. A 50-page PDF as a single chunk is useless: the whole document would be "relevant" to every query, and you'd blow through your context window.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=150,
)
chunks = splitter.split_documents(documents)

RecursiveCharacterTextSplitter tries to split on natural boundaries first: paragraphs, then sentences, then words, which keeps chunks more coherent than a blind character split. For structured content (code, markdown, HTML), LangChain ships format-aware splitters that respect that structure.
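The recursive idea is simple enough to sketch: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This simplified sketch omits the merging and overlap logic the real LangChain splitter adds:

```python
# Simplified recursive splitting: coarse separators first, finer ones as fallback.
# The real splitter also merges small pieces back up to chunk_size with overlap.

def recursive_split(text, max_len, seps=("\n\n", "\n", " ")):
    """Split on the coarsest separator; recurse with finer ones for long pieces."""
    if len(text) <= max_len or not seps:
        return [text]
    out = []
    for part in text.split(seps[0]):
        if len(part) <= max_len:
            out.append(part)
        else:
            out.extend(recursive_split(part, max_len, seps[1:]))
    return out

doc = "First paragraph.\n\nA much longer second paragraph that needs finer splitting."
print(recursive_split(doc, max_len=30))
```

Notice how the short first paragraph survives intact while only the overlong one gets broken further; that's the property that keeps chunks coherent.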
Embeddings turn text into high-dimensional vectors. The key property: texts with similar meanings end up as similar vectors, measured by cosine similarity. That's how semantic search works: you embed the user's question, then find the chunks whose embeddings are nearest.
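Cosine similarity itself is a one-liner. A pure-Python sketch with tiny made-up 3-d "embeddings" (real ones have hundreds to thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d vectors: the refund question should land near the refund doc.
refund_q    = [0.9, 0.1, 0.0]
refund_doc  = [0.8, 0.2, 0.1]
holiday_doc = [0.0, 0.1, 0.9]

print(cosine(refund_q, refund_doc) > cosine(refund_q, holiday_doc))  # -> True
```

The vector store does exactly this comparison at scale, with approximate-nearest-neighbor indexing so it doesn't have to score every chunk.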
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
)

text-embedding-3-small is cheap, fast, and plenty accurate for most business docs. text-embedding-3-large is the upgrade path when retrieval quality matters more than cost.
Chroma persists to disk via persist_directory, so you only pay for embeddings once. Be careful, though: naively re-running Chroma.from_documents will re-embed and duplicate everything, so on subsequent runs load the existing store instead of rebuilding it.
Now the LCEL part. You're composing a pipeline: retrieve -> format prompt -> call LLM -> parse output.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_messages([
("system", "Answer the question using ONLY the context below. "
"If the answer isn't there, say you don't know.\n\n{context}"),
("human", "{input}"),
])
doc_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)
response = rag_chain.invoke({"input": "What is our refund policy?"})
print(response["answer"])

That's a working RAG chatbot: four retrieved chunks, stuffed into the prompt, answered by the model.
The k parameter (how many chunks to retrieve) is one of the most consequential knobs you have: too low and the passage that actually contains the answer may never reach the model; too high and you add noise, cost, and the risk of the model anchoring on irrelevant text.
Start at k=4 and adjust based on answer quality. For questions that span multiple documents (e.g., "compare our 2023 and 2024 policies"), bump it higher.
You can also swap the default similarity search for MMR (Maximum Marginal Relevance), which balances relevance with diversity, useful when your top-k results are all near duplicates from the same section:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 4, "fetch_k": 20},
)

The system prompt above isn't just polite; each clause does work: "ONLY the context below" keeps the model from reaching back into its training data, and "say you don't know" gives it a sanctioned exit instead of forcing a guess.
For business use, consider layering on extras such as citation formatting, tone and style guidelines, and instructions for escalating questions the documents don't cover.
Small prompt changes produce large quality changes. Treat your system prompt as a living piece of the codebase, not a one-time setup.
Real chatbots handle follow-ups like "and what about refunds over $500?" That question is meaningless without the previous turn; embed it as-is and the vector search returns garbage, because "refunds over $500" isn't what any chunk is specifically about.
The fix is a history-aware retriever that rewrites follow-ups into standalone queries before searching:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
rewrite_prompt = ChatPromptTemplate.from_messages([
MessagesPlaceholder("chat_history"),
("human", "{input}"),
("human", "Rewrite the above as a standalone question."),
])
history_retriever = create_history_aware_retriever(llm, retriever, rewrite_prompt)

Internally, this calls the LLM once to rewrite ("What is the refund policy for transactions over $500?"), then uses that rewritten query for retrieval. Wire it into create_retrieval_chain in place of the plain retriever, and your bot can now follow a conversation.
Here's the question that separates weekend projects from shipped products: how do you know it's any good?
"I asked it five things and they seemed right" is not evaluation. Build a small test set (20 to 50 real questions with known correct answers from your documents) and measure two things separately: retrieval quality (did the right chunks come back?) and generation quality (given the right chunks, did the model produce the right answer?).
Separating these matters because the fixes are different. Bad retrieval means chunking, embeddings, or k are off. Bad generation with good retrieval means the prompt or the model is the problem.
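The retrieval half is easy to hand-roll: for each test question, check whether a chunk from the expected source document shows up in the top-k results. In this sketch the toy retriever and test set are stand-ins for your real ones:

```python
# Hand-rolled retrieval evaluation: does the expected source show up in top-k?
# toy_retrieve and the test set below are stand-ins -- plug in your real retriever.

def hit_rate(test_set, retrieve, k=4):
    """Fraction of questions whose expected source appears among the top-k chunks."""
    hits = 0
    for question, expected_source in test_set:
        sources = {c["metadata"]["source"] for c in retrieve(question)[:k]}
        hits += expected_source in sources
    return hits / len(test_set)

chunks = [
    {"page_content": "refund policy refunds are issued within 30 days",
     "metadata": {"source": "policy.pdf"}},
    {"page_content": "office hours are 9 to 5",
     "metadata": {"source": "hr.pdf"}},
]

def toy_retrieve(query):
    """Stand-in retriever: rank the hard-coded chunks by word overlap."""
    q_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: -len(q_words & set(c["page_content"].split())))

tests = [("what is the refund policy", "policy.pdf"),
         ("when are office hours", "hr.pdf")]
print(hit_rate(tests, toy_retrieve, k=1))  # -> 1.0
```

Run this against your real retriever every time you change chunking, embeddings, or k, and you'll see immediately whether the change helped or hurt.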
LangChain's LangSmith and open-source tools like Ragas give you scaffolding for this, but even a hand-rolled spreadsheet is better than vibes.
Where It Gets Hard: The Gotchas Nobody Mentions
This is where most "afternoon RAG projects" run into production reality.
PDF parsing is lossy. Tables, multi-column layouts, and scanned documents routinely come out as garbled text. PyPDFLoader is a starting point, not a finish line. Tools like unstructured or pymupdf help with complex layouts. For scanned PDFs, you'll need OCR before any of this pipeline is useful.
Chunk boundaries cut through meaning. A policy table split across two chunks means neither chunk contains the full answer. Consider semantic splitters or structure-aware chunking (e.g., never split a table row) for high-stakes content.
Citations can be hallucinated. If you ask the model to cite sources, it may invent page numbers or filenames that don't exist. Always pull citations from the retrieved chunks' metadata, not from the model's generated output.
There's no access control by default. Every user querying the chatbot can retrieve every document. For business data, this is a non-negotiable problem. The fix is metadata filtering at retrieval time (e.g., filter={"department": user.department}) combined with real authentication before the chatbot is ever queried.
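The filtering half of that fix looks like this in plain Python (the exact filter syntax varies by vector store, so the dict shape and helper name here are illustrative; in production, push the filter into the store's query so unauthorized chunks never leave the index):

```python
# Metadata-based access control: only return chunks the user is allowed to see.
# Illustrative sketch -- real stores accept an equivalent filter in the query itself,
# which is safer than filtering after retrieval.

def filter_chunks(chunks, user_department):
    """Keep only chunks whose metadata department matches the user's."""
    return [c for c in chunks
            if c["metadata"].get("department") == user_department]

chunks = [
    {"page_content": "Engineering salaries...", "metadata": {"department": "hr"}},
    {"page_content": "Deploy runbook...", "metadata": {"department": "engineering"}},
]
visible = filter_chunks(chunks, "engineering")
print([c["page_content"] for c in visible])  # -> ['Deploy runbook...']
```

This is only as good as the metadata you attached at ingestion time, another reason to keep loader metadata intact and enrich it early.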
Updates are messier than you expect. When a document changes, you need to re-chunk and re-embed just that document not the whole corpus. Build that update pipeline early; retrofitting it later is painful.
Evaluation is the hardest part. "Does it work?" isn't a yes/no question. Without a test set and a scoring method, every change you make is a guess.
RAG is genuinely approachable: the code above is under 80 lines, and it works. But the vector search piece is maybe 20% of what makes a business chatbot actually useful. The other 80% is document quality, chunking strategy, prompt design, access control, and evaluation.
Start with the simple version. Ship it to a small, friendly audience. Then let the real failure modes, not your imagination, tell you what to build next.