AI agents with RAG for a knowledge base in 2026: implementation guide
RAG stands for Retrieval-Augmented Generation. It sounds complex, but the idea is simple: the AI reads your documents before answering. That means far fewer «made-up from thin air» hallucinations, plus direct source citations. In 2026 it's no longer an «experiment» but a working tool for support, internal search, training and sales. This article covers what you need to know to launch or order such an agent.
What RAG is and why you need it
Imagine ChatGPT that reads your corporate wiki, product docs, policies, resolved tickets before every answer – and gives a response with links to specific documents it sourced from. That's RAG.
Why regular ChatGPT doesn't fit most business tasks:
- Doesn't know your product. ChatGPT hasn't read your docs, doesn't know specifics, will invent «plausible» but inaccurate answers.
- Cut-off date. Even Claude or GPT-4 are trained on data up to a certain date. Product changes after that – unknown.
- No source citations. User can't verify where the answer came from – less trust.
- Expensive to dump everything into context. Even a 200K-token context window won't fit 1000 PDFs, and the per-query cost would be huge.
RAG solves all four. Before responding, the system finds the 3-7 most relevant chunks in your base and passes them to the LLM as context. The LLM forms an answer based on those chunks, with citations.
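The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not a production setup: `embed` is a toy bag-of-words stand-in for a real embeddings model, and the chunk texts, file names and function names are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embeddings model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[dict], k: int = 5) -> list[dict]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


# Hypothetical knowledge base: each chunk keeps a pointer to its source document.
chunks = [
    {"source": "travel-policy.md", "text": "Business travel must be approved by the team lead in advance."},
    {"source": "refunds.md", "text": "Refunds are issued within 14 days of the request."},
    {"source": "onboarding.md", "text": "New hires receive a laptop on the first day."},
]

top = retrieve("How do we handle business travel?", chunks, k=1)
print(top[0]["source"])
```

In a real agent the `embed` call would go to an embeddings API and the sort would be replaced by a vector-DB query, but the shape of the step stays the same: embed the question, rank chunks, keep the top few with their sources.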
5 typical use cases
1. Support over product documentation
Client asks in the site chat or a Telegram bot. RAG searches docs, FAQ, resolved tickets. Closes 60-80% of typical questions without a human. Complex cases handed to an operator with a pre-collected summary.
2. Internal search across corporate wiki
A new hire doesn't remember «how do we handle business travel» or «where is the client communication policy». Asks the RAG agent. It finds the doc in Notion/Confluence/Google Docs, quotes the relevant part, links to the full doc.
3. Sales assistant on products
A salesperson is in a client meeting and the client asks a complex technical question. The salesperson opens the RAG chat, asks, and gets an answer with links to tech specs, contracts, cases. Responses to the client come 3-5× faster.
4. Legal document analysis
A lawyer uploads a contract. RAG checks it against corporate standards («mandatory clauses», «red flags»), highlights differences and risks, cites precedents from previous deals.
5. Educational assistant
A student in an online course asks a question. RAG finds the answer in course materials, slides, lecture transcripts. Removes 70% of typical questions from the tutor, improves course completion rate.
This isn't an exhaustive list. RAG applies anywhere there's a large base of semi-structured text and recurring questions about it.
Components of a RAG system
RAG isn't «one service», it's a pipeline of 8 components. Each must be matched to the task.
- Data source – PDF, web pages, Notion, Confluence, Google Docs, SharePoint, CSV/Excel, databases. The source determines how to extract text.
- Parser and chunker – splits documents into 300-800 token chunks. Chunk size strongly affects search quality.
- Embeddings model – converts text to vectors. OpenAI text-embedding-3-small (universal), Voyage-large (more accurate on code), Cohere embed-multilingual (multilingual).
- Vector database – stores vectors and quickly finds «similar». Pinecone, Qdrant, Weaviate, pgvector in Postgres.
- Retrieval logic – finds top-K chunks, optional re-ranking (Cohere Rerank, Voyage Rerank) for higher accuracy.
- LLM – forms the final answer based on found chunks. Claude Sonnet (long contexts), GPT-4 (universal), Mistral/Llama (local).
- Prompt template – model instruction: «answer only based on documents below, always cite sources, if you don't know – say so directly».
- UI or API – chat on the site, Telegram bot, Notion widget, or REST API for integrating into your product.
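To make the prompt-template component concrete, here is a minimal sketch of how found chunks and the instruction could be assembled into one prompt. The function name, the chunk dictionary layout and the numbering scheme are assumptions for illustration; the instruction wording follows the template quoted in the list above.

```python
SYSTEM_PROMPT = (
    "Answer only based on the documents below. "
    "Always cite sources in brackets after each fact. "
    "If the documents do not contain the answer, say so directly."
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble instruction, numbered source chunks and the question into one prompt."""
    docs = "\n\n".join(
        f"[{i}] ({c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nDocuments:\n{docs}\n\nQuestion: {question}\nAnswer:"


# Hypothetical chunk returned by the retrieval step.
found = [{"source": "refunds.md", "text": "Refunds are issued within 14 days of the request."}]
prompt = build_prompt("How long do refunds take?", found)
print(prompt)
```

Numbering the chunks gives the model something short to cite («[1]»), and the UI can later map that number back to a link to the full document.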
Launch in 2-4 weeks
Realistic schedule for launching a basic RAG agent from scratch:
- Data – days 1-3
- Chunking – days 4-6
- Embeddings – days 7-10
- Retrieval – days 11-16
- UI + tests – days 17-21
Days 1-3 – data. Gather all sources: what we want the agent to know. Clean: remove outdated, duplicates, garbage. At this point we usually discover «our docs were last updated 3 years ago» – part of the work is on the client side.
Days 4-6 – chunking. Parse documents, split into chunks. Lots of nuances here: PDF with tables needs special handling, code blocks can't be cut mid-block, headings need to be kept with context. Not «one universal chunker for all», but adapted.
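As a starting point for this step, a sliding-window chunker with overlap might look like the sketch below. It is only a baseline under loud assumptions: whitespace-separated words stand in for tokens (a real pipeline would count with the tokenizer of the chosen model), and tables, code blocks and headings get no special handling, so it still has to be adapted per document type.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny example with numbers instead of words, so the overlap is easy to see.
sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some text twice.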
Days 7-10 – embeddings. Connect OpenAI or Voyage API, run all chunks through the embeddings model. Save to vector-DB. For RU+EN content I use multilingual models; for pure English – text-embedding-3-small is enough.
Days 11-16 – retrieval. Set up search: top-K (usually 5-10), similarity threshold, optional re-ranking. Test on 20-30 typical questions: does it return relevant chunks? If not – tune: change chunk size, embeddings, prompt.
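The tuning loop for this step can be kept honest with a small check like the one below. It is a sketch with assumed names: `select_top_k` applies the top-K and similarity-threshold settings to already scored chunks, and `hit_rate` measures on how many test questions the expected chunk was actually returned. The threshold value of 0.3 is a placeholder, not a recommendation.

```python
def select_top_k(scored: list[tuple[str, float]], k: int = 5, threshold: float = 0.3) -> list[str]:
    """Keep at most k chunk ids whose similarity score is at or above the threshold."""
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk_id for chunk_id, _ in kept[:k]]


def hit_rate(results: dict[str, list[str]], expected: dict[str, str]) -> float:
    """Share of test questions whose expected chunk id appears in the retrieved list."""
    if not expected:
        return 0.0
    hits = sum(1 for question, chunk_id in expected.items() if chunk_id in results.get(question, []))
    return hits / len(expected)


# Hypothetical scores from a vector search and a two-question test set.
scored = [("c", 0.12), ("a", 0.82), ("b", 0.41)]
selected = select_top_k(scored, k=2, threshold=0.3)

results = {"q1": ["a", "b"], "q2": ["c"]}
expected = {"q1": "a", "q2": "d"}
rate = hit_rate(results, expected)
print(selected, rate)
```

Re-running the same 20-30 questions after every change to chunk size or embeddings shows whether the change helped or just moved the errors around.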
Days 17-21 – UI and tests. Build the interface: chat on the site via iframe, Telegram bot, Notion widget, or REST API. Connect monitoring: query logs, user ratings (👍/👎), accuracy metrics. Final testing with real users.
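For the monitoring part, even a very small log structure is enough to start with. The sketch below assumes a hypothetical `QueryLog` record per question; the field names and the two metrics are illustrative choices, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryLog:
    question: str
    sources: list[str]             # documents cited in the answer
    rating: Optional[bool] = None  # True = 👍, False = 👎, None = not rated


def satisfaction(logs: list[QueryLog]) -> Optional[float]:
    """Share of 👍 among rated answers, or None if nothing was rated."""
    rated = [log.rating for log in logs if log.rating is not None]
    if not rated:
        return None
    return sum(rated) / len(rated)


def without_sources(logs: list[QueryLog]) -> list[str]:
    """Questions answered with no cited source: candidates for gaps in the knowledge base."""
    return [log.question for log in logs if not log.sources]


logs = [
    QueryLog("How do refunds work?", ["refunds.md"], True),
    QueryLog("Do you ship to Mars?", [], False),
    QueryLog("Where is the travel policy?", ["travel-policy.md"]),
]
print(satisfaction(logs), without_sources(logs))
```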
Vector database comparison
The choice of vector-DB affects cost, speed and ease of scaling. Four popular options:
| Solution | Pros | Cons |
|---|---|---|
| Pinecone | Managed, fast start, zero DevOps | More expensive at scale ($70+/month), vendor lock-in, no self-host |
| Qdrant | Open source, self-host for free, or Qdrant Cloud. High speed | Needs basic infra (Docker) for self-host |
| PostgreSQL + pgvector | If you already have Postgres – install the extension, no new infra | Slightly slower on huge sets (10M+ vectors), needs indices |
| Weaviate | Hybrid search (vectors + keywords), modules for different embeddings | More setup complexity than Pinecone/Qdrant |
My recommendations:
- Prototype / MVP – Pinecone free tier or Qdrant locally. Fast start, no infra concerns.
- Production up to 1M vectors – Qdrant self-hosted on a VPS ($10-30/month) or Pinecone starter ($70/month).
- Already have Postgres – pgvector, no new infra, convenient for the team.
- Enterprise 10M+ vectors – Pinecone enterprise or Qdrant Cloud with replicas.
7 common mistakes
After 2 years of active RAG work, these are the most common causes of problems.
- Loading everything indiscriminately. Garbage in = garbage out. 80% of the work goes to data prep: cleaning, normalisation, removing outdated. Not a «technical detail», but half the success.
- Chunks too big. If a chunk is a whole page, the model finds «this page» instead of «that specific paragraph». Accuracy drops. Optimal – 300-800 tokens with 50-100 tokens of overlap.
- Chunks too small. If a chunk is 1 sentence, context is lost: «it» refers to nothing.
- An English-only embedding model for multilingual content. If you have RU+EN content – you need multilingual embeddings (Cohere multilingual, OpenAI text-embedding-3-large). Otherwise cross-language search breaks.
- No re-ranking. Top-K from vector search often contains «similar but not exact» chunks. A re-ranking model reorders them by relevance – 15-30% accuracy lift.
- Ignoring citations. Answer without source links = no trust. Prompt must require citations explicitly: «put source in brackets after each fact». User sees link → clicks → verifies → trusts.
- Not refreshing the base. After 3-6 months docs become outdated, answers turn false. Need a regular re-indexing process: auto on Notion/Confluence change, or weekly cron.
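For the last point, re-indexing does not have to mean re-embedding the whole base every week. A minimal sketch, assuming the pipeline stores a content hash per indexed document: compare current hashes with stored ones and touch only what changed. The function names and the dictionary layout are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents to re-embed and which to drop from the index."""
    upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {"upsert": upsert, "delete": delete}


# State of the index after the previous run (hypothetical document ids).
indexed = {
    "pricing": content_hash("old pricing"),
    "faq": content_hash("same faq"),
    "legacy": content_hash("removed page"),
}
# What the sources contain now.
current = {"pricing": "new pricing", "faq": "same faq", "roadmap": "fresh page"}

plan = plan_reindex(current, indexed)
print(plan)
```

The same function works for both triggers mentioned above: call it from a webhook on document change, or from a weekly cron over the full source list.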
Frequently asked questions
How is a RAG agent different from regular ChatGPT?
Regular ChatGPT answers from data it was trained on (up to the cut-off date). It doesn't know your product, documents or internal processes. A RAG agent searches relevant chunks in your knowledge base (docs, articles, policies) before answering – with source citations. Essentially «ChatGPT that reads your documents before answering».
How much does it cost to deploy a RAG agent?
Basic agent on 50-500 documents: 2-4 weeks of development + $20-100/month on vector-DB and LLM API. For a business with average traffic (100-500 queries/day) – ~$100-300/month. Large enterprise installations with 10,000+ documents and thousands of daily queries – from $1000/month. Exact estimate – after a short brief.
How many documents can RAG realistically handle?
From 10 to millions. No technical upper limit. 10-100 documents – any vector-DB works. 1,000-10,000 – managed (Pinecone) or self-hosted Qdrant. Over 100,000 – needs chunking optimisation, hierarchical retrieval, sometimes separate indices per content type. In my practice 500-5,000 documents is the most common size.
Is my data safe with OpenAI or Anthropic?
API calls (paid plans) – your data is not used for training, this is in the Terms of Service of both providers. For sensitive data there are options: enterprise plans with signed DPA, local models (Llama, Mistral) on your server, or hybrid approach. For most b2b tasks API plans are enough. For PII or medical – local model or enterprise.
Pinecone, Qdrant or PostgreSQL+pgvector – which is better?
Pinecone – fast start, managed, more expensive at scale ($70+/month). Qdrant – open source, self-host for free, or Qdrant Cloud. PostgreSQL+pgvector – if you already have Postgres, add the extension, no new infra. For 10-100K vectors all three work. For millions – Pinecone or Qdrant Cloud. For teams without DevOps – Pinecone is simplest.
Want to deploy a RAG agent in your business?
I'll help you assemble the pipeline for your task: model selection, vector-DB, prompt, UI. Free technical briefing – within 24 hours.