How to Build a RAG Chatbot with LangChain | And Why Vector Search Is Just the Beginning
Build a working RAG chatbot for business documents with LangChain, OpenAI, and Chroma, plus the production gotchas most tutorials skip.


If you've ever wanted to build a chatbot that actually knows your company's documents instead of politely hallucinating, you're looking for RAG: Retrieval-Augmented Generation.
The good news: you can build a working prototype in an afternoon. The less-good news: the gap between "working prototype" and "production-ready chatbot" is where most projects quietly stall. This tutorial walks you through the full pipeline with LangChain, OpenAI, and Chroma, then points at the depth most beginner tutorials skip.
By the end, you'll have a chatbot that answers questions from a folder of business documents and a clear-eyed view of what to worry about next.
A plain language model only knows what it was trained on. It has never seen your employee handbook, your vendor contracts, or last quarter's compliance memo. Ask it about them and you get one of three outcomes: a confident hallucination, a refusal, or a generic non-answer.
RAG fixes that by doing two things at query time:

1. Retrieve the passages from your documents most relevant to the question (vector search).
2. Generate an answer grounded in those passages, which get stuffed into the model's prompt (the LLM).

That's the whole idea. Vector search handles step 1; the LLM handles step 2. Everything else is engineering around those two steps.
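Those two steps fit in a few lines of plain Python. In this toy sketch (every name here is made up for illustration), word overlap stands in for vector search and a format string stands in for the LLM call:

```python
# Toy illustration of RAG's two steps -- no libraries, all names made up.
# Word overlap stands in for vector search; a format string stands in for the LLM.
import re

def retrieve(query, chunks, k=2):
    """Step 1: score each chunk by word overlap with the query, return the top k."""
    q_words = set(re.findall(r"\w+", query.lower()))
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(re.findall(r"\w+", c.lower()))),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, context_chunks):
    """Step 2's input: the retrieved context stuffed into the LLM prompt."""
    context = "\n".join(context_chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
top = retrieve("What is the refund policy?", chunks, k=1)
print(build_prompt("What is the refund policy?", top))
```

The real pipeline below replaces the overlap scorer with embeddings and the format string with a chat model, but the shape stays exactly this.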
pip install langchain langchain-openai langchain-community langchain-chroma pypdf
export OPENAI_API_KEY="sk-..."

Chroma runs in-process with zero setup, perfect for learning. When you outgrow it, Pinecone and Weaviate are common managed alternatives that handle scaling, replication, and hybrid search out of the box.
LangChain has loaders for almost every format: PDFs, Word docs, Notion pages, Confluence, Google Drive, SharePoint. For business docs, PDFs dominate:
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("./docs")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

Each Document carries two important fields: page_content, the extracted text, and metadata, which records the source file and page number.
Keep that metadata. You'll use it later for citations, access control, and debugging ("why did it pull that chunk?").
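The citation use case is worth sketching now, because it shapes how you store chunks. A minimal sketch using plain dicts shaped like loader output (the "source" and "page" keys mirror what PDF loaders typically attach; the helper name is illustrative):

```python
# Build citation strings from chunk metadata rather than from model output.
# The metadata keys ("source", "page") mirror what PDF loaders typically attach.

def cite(chunks):
    """Return deduplicated 'file, p. N' citations for a list of retrieved chunks."""
    seen = []
    for chunk in chunks:
        meta = chunk["metadata"]
        # PDF loaders commonly 0-index pages; +1 gives the human-facing number.
        ref = f'{meta["source"]}, p. {meta["page"] + 1}'
        if ref not in seen:
            seen.append(ref)
    return seen

retrieved = [
    {"page_content": "Refunds are issued within 30 days.",
     "metadata": {"source": "handbook.pdf", "page": 11}},
    {"page_content": "Refund exceptions require approval.",
     "metadata": {"source": "handbook.pdf", "page": 11}},
]
print(cite(retrieved))  # -> ['handbook.pdf, p. 12']
```

Because the citations come from metadata you control, they can't be hallucinated, a point that matters again in the gotchas section.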
LLMs have context limits, and retrieval works better on small, focused passages. A 50-page PDF as a single chunk is useless: the whole document would be "relevant" to every query, and you'd blow through your context window.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=150,
)
chunks = splitter.split_documents(documents)

RecursiveCharacterTextSplitter tries to split on natural boundaries first: paragraphs, then sentences, then words, which keeps chunks more coherent than a blind character split. For structured content (code, markdown, HTML), LangChain ships format-aware splitters that respect that structure.
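The recursive idea is simple enough to sketch: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This simplified sketch omits the merging and overlap logic the real LangChain splitter adds:

```python
# Simplified recursive splitting: coarse separators first, finer ones as fallback.
# The real splitter also merges small pieces back up to chunk_size with overlap.

def recursive_split(text, max_len, seps=("\n\n", "\n", " ")):
    """Split on the coarsest separator; recurse with finer ones for long pieces."""
    if len(text) <= max_len or not seps:
        return [text]
    out = []
    for part in text.split(seps[0]):
        if len(part) <= max_len:
            out.append(part)
        else:
            out.extend(recursive_split(part, max_len, seps[1:]))
    return out

doc = "First paragraph.\n\nA much longer second paragraph that needs finer splitting."
print(recursive_split(doc, max_len=30))
```

Notice how the short first paragraph survives intact while only the overlong one gets broken further; that's the property that keeps chunks coherent.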
Embeddings turn text into high-dimensional vectors. The key property: texts with similar meanings end up as similar vectors, measured by cosine similarity. That's how semantic search works: you embed the user's question, then find the chunks whose embeddings are nearest.
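Cosine similarity itself is a one-liner. A pure-Python sketch with tiny made-up 3-d "embeddings" (real ones have hundreds to thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d vectors: the refund question should land near the refund doc.
refund_q    = [0.9, 0.1, 0.0]
refund_doc  = [0.8, 0.2, 0.1]
holiday_doc = [0.0, 0.1, 0.9]

print(cosine(refund_q, refund_doc) > cosine(refund_q, holiday_doc))  # -> True
```

The vector store does exactly this comparison at scale, with approximate-nearest-neighbor indexing so it doesn't have to score every chunk.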
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
)

text-embedding-3-small is cheap, fast, and plenty accurate for most business docs. text-embedding-3-large is the upgrade path when retrieval quality matters more than cost.
Chroma persists to disk via persist_directory, so you only pay for embeddings once. Be careful, though: naively re-running Chroma.from_documents will re-embed and duplicate everything, so on subsequent runs load the existing store instead of rebuilding it.
Now the LCEL part. You're composing a pipeline: retrieve -> format prompt -> call LLM -> parse output.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_messages([
("system", "Answer the question using ONLY the context below. "
"If the answer isn't there, say you don't know.\n\n{context}"),
("human", "{input}"),
])
doc_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)
response = rag_chain.invoke({"input": "What is our refund policy?"})
print(response["answer"])

That's a working RAG chatbot: four retrieved chunks, stuffed into the prompt, answered by the model.
The k parameter (how many chunks to retrieve) is one of the most consequential knobs you have: too low and the passage that actually contains the answer may never reach the model; too high and you add noise, cost, and the risk of the model anchoring on irrelevant text.
Start at k=4 and adjust based on answer quality. For questions that span multiple documents (e.g., "compare our 2023 and 2024 policies"), bump it higher.
You can also swap the default similarity search for MMR (Maximum Marginal Relevance), which balances relevance with diversity, useful when your top-k results are all near duplicates from the same section:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 4, "fetch_k": 20},
)

The system prompt above isn't just polite; each clause does work: "ONLY the context below" keeps the model from reaching back into its training data, and "say you don't know" gives it a sanctioned exit instead of forcing a guess.
For business use, consider layering on extras such as citation formatting, tone and style guidelines, and instructions for escalating questions the documents don't cover.
Small prompt changes produce large quality changes. Treat your system prompt as a living piece of the codebase, not a one-time setup.
Real chatbots handle follow-ups like "and what about refunds over $500?" That question is meaningless without the previous turn; embed it as-is and the vector search returns garbage, because "refunds over $500" isn't what any chunk is specifically about.
The fix is a history-aware retriever that rewrites follow-ups into standalone queries before searching:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
rewrite_prompt = ChatPromptTemplate.from_messages([
MessagesPlaceholder("chat_history"),
("human", "{input}"),
("human", "Rewrite the above as a standalone question."),
])
history_retriever = create_history_aware_retriever(llm, retriever, rewrite_prompt)

Internally, this calls the LLM once to rewrite ("What is the refund policy for transactions over $500?"), then uses that rewritten query for retrieval. Wire it into create_retrieval_chain in place of the plain retriever, and your bot can now follow a conversation.
Here's the question that separates weekend projects from shipped products: how do you know it's any good?
"I asked it five things and they seemed right" is not evaluation. Build a small test set (20 to 50 real questions with known correct answers from your documents) and measure two things separately: retrieval quality (did the right chunks come back?) and generation quality (given the right chunks, did the model produce the right answer?).
Separating these matters because the fixes are different. Bad retrieval means chunking, embeddings, or k are off. Bad generation with good retrieval means the prompt or the model is the problem.
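The retrieval half is easy to hand-roll: for each test question, check whether a chunk from the expected source document shows up in the top-k results. In this sketch the toy retriever and test set are stand-ins for your real ones:

```python
# Hand-rolled retrieval evaluation: does the expected source show up in top-k?
# toy_retrieve and the test set below are stand-ins -- plug in your real retriever.

def hit_rate(test_set, retrieve, k=4):
    """Fraction of questions whose expected source appears among the top-k chunks."""
    hits = 0
    for question, expected_source in test_set:
        sources = {c["metadata"]["source"] for c in retrieve(question)[:k]}
        hits += expected_source in sources
    return hits / len(test_set)

chunks = [
    {"page_content": "refund policy refunds are issued within 30 days",
     "metadata": {"source": "policy.pdf"}},
    {"page_content": "office hours are 9 to 5",
     "metadata": {"source": "hr.pdf"}},
]

def toy_retrieve(query):
    """Stand-in retriever: rank the hard-coded chunks by word overlap."""
    q_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: -len(q_words & set(c["page_content"].split())))

tests = [("what is the refund policy", "policy.pdf"),
         ("when are office hours", "hr.pdf")]
print(hit_rate(tests, toy_retrieve, k=1))  # -> 1.0
```

Run this against your real retriever every time you change chunking, embeddings, or k, and you'll see immediately whether the change helped or hurt.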
LangChain's LangSmith and open-source tools like Ragas give you scaffolding for this, but even a hand-rolled spreadsheet is better than vibes.
Where It Gets Hard: The Gotchas Nobody Mentions
This is where most "afternoon RAG projects" run into production reality.
PDF parsing is lossy. Tables, multi-column layouts, and scanned documents routinely come out as garbled text. PyPDFLoader is a starting point, not a finish line. Tools like unstructured or pymupdf help with complex layouts. For scanned PDFs, you'll need OCR before any of this pipeline is useful.
Chunk boundaries cut through meaning. A policy table split across two chunks means neither chunk contains the full answer. Consider semantic splitters or structure-aware chunking (e.g., never split a table row) for high-stakes content.
Citations can be hallucinated. If you ask the model to cite sources, it may invent page numbers or filenames that don't exist. Always pull citations from the retrieved chunks' metadata, not from the model's generated output.
There's no access control by default. Every user querying the chatbot can retrieve every document. For business data, this is a non-negotiable problem. The fix is metadata filtering at retrieval time (e.g., filter={"department": user.department}) combined with real authentication before the chatbot is ever queried.
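The filtering half of that fix looks like this in plain Python (the exact filter syntax varies by vector store, so the dict shape and helper name here are illustrative; in production, push the filter into the store's query so unauthorized chunks never leave the index):

```python
# Metadata-based access control: only return chunks the user is allowed to see.
# Illustrative sketch -- real stores accept an equivalent filter in the query itself,
# which is safer than filtering after retrieval.

def filter_chunks(chunks, user_department):
    """Keep only chunks whose metadata department matches the user's."""
    return [c for c in chunks
            if c["metadata"].get("department") == user_department]

chunks = [
    {"page_content": "Engineering salaries...", "metadata": {"department": "hr"}},
    {"page_content": "Deploy runbook...", "metadata": {"department": "engineering"}},
]
visible = filter_chunks(chunks, "engineering")
print([c["page_content"] for c in visible])  # -> ['Deploy runbook...']
```

This is only as good as the metadata you attached at ingestion time, another reason to keep loader metadata intact and enrich it early.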
Updates are messier than you expect. When a document changes, you need to re-chunk and re-embed just that document not the whole corpus. Build that update pipeline early; retrofitting it later is painful.
Evaluation is the hardest part. "Does it work?" isn't a yes/no question. Without a test set and a scoring method, every change you make is a guess.
RAG is genuinely approachable: the code above is under 80 lines, and it works. But the vector search piece is maybe 20% of what makes a business chatbot actually useful. The other 80% is document quality, chunking strategy, prompt design, access control, and evaluation.
Start with the simple version. Ship it to a small, friendly audience. Then let the real failure modes, not your imagination, tell you what to build next.