RAG vs Prompts: When You Need Retrieval, When You Do Not
Half the production AI features built in 2025 used retrieval-augmented generation when a well-written prompt would have done the job. The other half used prompts where retrieval was the only thing that could prevent hallucination. Getting this wrong is costly: an unnecessary RAG pipeline adds latency, infrastructure, and maintenance burden; a missing one produces confident wrong answers in production. The decision is not actually subtle once you ask the right question. This guide is a working reference for when retrieval is needed, when it is overkill, and which hybrid approaches earn their complexity.
Table of contents
- The actual difference
- Five tasks where prompts alone work
- Five tasks where you need RAG
- Hybrid approaches
- Cost and latency tradeoffs
- Implementation cheatsheet
- Frequently asked questions
- The bottom line
The actual difference
A prompt-only system has one input: the user query, plus an optional system prompt. The model answers from its parametric knowledge -- what it learned during training. That knowledge is frozen at the training cutoff, identical for every user, and cannot include private data.
A RAG system has two inputs: the user query, plus a retrieved set of documents that are placed in the prompt at query time. The model answers using both. The retrieved documents come from a vector database (or keyword index, or hybrid) that holds your specific corpus -- internal docs, customer records, recent news, a 200-page contract.
The flow is roughly:
Prompt-only: user query --> LLM --> answer

RAG: user query --> embedding --> vector search --> top-K documents --> prompt with documents --> LLM --> answer

The defining test: does the correct answer depend on facts the model could not have learned during training? If yes, RAG. If no, prompts alone are usually enough. "Could not have learned" includes private data, post-training-cutoff news, niche technical material that was sparsely represented in training, and specific instances of common patterns (your specific contract, not contracts in general).
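In code, the difference between the two flows is small. Here is a minimal Python sketch; `embed`, `vector_search`, and `llm` are placeholders for whatever embedding model, vector store, and chat model you use, not any specific library's API:

```python
# Minimal sketch of both flows. embed(), vector_search(), and llm() are
# placeholders for your embedding model, vector store, and chat model.

def answer_prompt_only(query: str) -> str:
    # One round trip: the model answers from parametric knowledge alone.
    return llm(query)

def answer_with_rag(query: str, k: int = 5) -> str:
    # Retrieve first, then ground the model in the retrieved text.
    query_vector = embed(query)
    chunks = vector_search(query_vector, top_k=k)  # top-K similar chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the documents below.\n"
        f"<documents>\n{context}\n</documents>\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```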
For the broader pattern context, see our complete prompt engineering guide.
Five tasks where prompts alone work
1. Writing tasks where the model's training is enough. Drafting an email, generating headlines, rewriting a paragraph for tone -- these depend on linguistic skill, not specific facts. The model's training has covered millions of examples of each. Adding retrieval here adds nothing.
2. Reasoning over user-provided text. If the user pastes the document in the prompt itself ("summarise this for me"), there is no retrieval problem -- the document is already in context. RAG starts to matter when the document is too large to paste, or when many documents need to be considered.
3. Code generation in mainstream languages. Python, TypeScript, Go, common frameworks -- the model's training data includes massive code corpora. Asking GPT-5 or Claude Opus 4.7 to write a React component does not require retrieval. RAG starts to earn its keep for proprietary internal libraries the model has never seen.
4. Translation between common languages. Spanish to English, French to Japanese, German to Chinese -- modern models handle these well from training. RAG on translation only matters when terminology must match a specific glossary (legal, medical, internal-jargon-heavy).
5. General knowledge questions on stable facts. "Why is the sky blue," "what is the capital of France," "explain the second law of thermodynamics." The model has learned these many times. Adding retrieval here is overkill and adds noise.
The unifying property: the answer depends on common knowledge that was well-represented in training, not on specific facts that change rapidly or are private.
Five tasks where you need RAG
1. Q&A over your company's internal documentation. The model has not read your wiki, your runbooks, your engineering RFCs. Retrieval is the only way to ground answers in your specific environment. The single most common production RAG application is internal-docs Q&A.
2. Recent news, prices, or events. Models have a training cutoff. Asking about events from the last few months without retrieval produces guesses or refusals. RAG with a search index over recent material gives the model the freshness it lacks.
3. Legal, medical, or financial document analysis. A 200-page contract, a patient's medical history, a 10-K filing -- these are too long to paste, too specific to memorise during training, and have outputs (clauses, diagnoses, line items) that must be grounded in the source. RAG with citation requirements is the standard pattern.
4. Customer or account-specific responses. "What is the status of order 47823" or "summarise this customer's last three support tickets" -- the model cannot answer without access to that customer's data. Retrieval over your CRM/order-management system is required.
5. Niche technical domains under-represented in training. Internal proprietary frameworks, obscure libraries, government regulation specific to one country, equipment manuals. The model's knowledge is shallow here; retrieval over the canonical sources fills the gap.
The unifying property: the answer depends on specific, often-private, often-recent facts that the model could not have learned during training. The single most reliable test: if the user expects the answer to reference exact text from a known source, RAG is non-optional.
Hybrid approaches
Many systems benefit from a layered approach -- prompts for the bulk of the work, retrieval triggered conditionally.
Conditional retrieval. The model classifies the query first ("does this question require facts about our company?"), and only triggers retrieval when needed. This avoids latency on simple queries while still grounding when grounding is required. The trade-off: the classifier needs to be accurate; misrouting a question that needed retrieval produces hallucination.
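A sketch of the routing pattern, reusing the placeholder helpers from the earlier flow sketch; the router prompt and the yes/no protocol are illustrative assumptions:

```python
# The router prompt and YES/NO protocol are illustrative assumptions.
ROUTER_PROMPT = (
    "Does answering this question require facts about our company, "
    "our customers, or events after your training cutoff? Reply YES or NO.\n\n"
)

def answer_conditionally(query: str) -> str:
    # A cheap classification call first decides whether to retrieve at all.
    route = llm(ROUTER_PROMPT + query).strip().upper()
    if route.startswith("YES"):
        return answer_with_rag(query)  # grounded path from the earlier sketch
    return llm(query)                  # simple queries skip the extra hops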
Retrieval as fallback. The model attempts to answer from parametric knowledge first; if confidence is low ("I am not sure -- let me look this up"), it triggers retrieval. This pattern works well in agent loops where the model has access to tools.
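The same helpers express the fallback shape. The UNSURE sentinel below is an assumed convention for illustration; in a real agent loop the model would issue a retrieval tool call instead:

```python
def answer_with_fallback(query: str) -> str:
    # Try parametric knowledge first; retrieve only on low confidence.
    draft = llm(f"{query}\n\nIf you are not confident, reply exactly: UNSURE")
    if draft.strip() == "UNSURE":
        return answer_with_rag(query)  # grounded path from the earlier sketch
    return draft
```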
Static system context plus dynamic retrieval. The system prompt holds stable context (the company's style guide, common terminology, organisational principles); RAG provides dynamic context (the specific document the user is asking about). This separation keeps the system prompt small and reliable while still allowing per-query retrieval.
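In chat-message terms, the separation looks like this sketch, where STYLE_GUIDE, TERMINOLOGY, retrieved_text, and user_query are stand-ins for your own content:

```python
messages = [
    # Stable context: identical for every query, cacheable, rarely edited.
    {"role": "system", "content": STYLE_GUIDE + "\n\n" + TERMINOLOGY},
    # Dynamic context: retrieved per query and attached to the user turn.
    {"role": "user", "content": (
        f"<documents>\n{retrieved_text}\n</documents>\n\n{user_query}"
    )},
]
```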
Long-context plus retrieval. Modern models with 1-2M token context windows (Gemini 2.5 Pro, Claude Opus 4.7) can sometimes hold an entire corpus in context, removing the need for retrieval. In practice, hybrid approaches still win on cost: retrieving the relevant 5,000 tokens is cheaper than processing 1.5M every call. Long-context is best for tasks where retrieval would miss relevant material -- multi-hop reasoning across multiple documents, for instance.
Cost and latency tradeoffs
RAG is not free. The trade-offs matter at scale.
Latency. A prompt-only call has one round trip: prompt to model, response back. A RAG call adds embedding generation, vector search, document fetch, and prompt construction before the model call. Total latency typically goes from 1-2 seconds to 3-6 seconds for a single-document retrieval, and more for multi-document or re-ranked retrieval.
Token cost. Retrieved documents add tokens to every call. Five 1,000-token documents add 5,000 tokens to the prompt. At GPT-5 pricing (US$1.25 per million input tokens), that is a rounding error per call but adds up at scale.
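The back-of-envelope arithmetic, using the price quoted above:

```python
extra_tokens = 5 * 1_000                  # five 1,000-token documents per call
price_per_million = 1.25                  # USD per million input tokens
cost_per_call = extra_tokens / 1_000_000 * price_per_million
print(f"${cost_per_call:.5f} per call")                   # $0.00625
print(f"${cost_per_call * 1_000_000:,.0f} per 1M calls")  # $6,250
```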
Infrastructure cost. A vector database (Pinecone, Weaviate, pgvector, Chroma), embedding model API, document chunking pipeline, refresh schedule, monitoring. The infrastructure adds engineering and operational burden that prompt-only systems do not have.
Quality cost from bad retrieval. If retrieval surfaces irrelevant documents, the model can be misled by them. A bad RAG system performs worse than a prompt-only system on the same task. Retrieval quality is itself an engineering problem requiring evals.
The cost calculus: RAG earns its complexity when (a) the answer depends on specific facts unavailable to the model, or (b) the user expects citations. For tasks where prompt-only produces acceptable answers, the additional infrastructure is overhead.
Implementation cheatsheet
If you have decided you need RAG, the minimum viable setup looks like this:
Chunking. Split documents into 500-1,500 token chunks with 100-token overlaps. Smaller chunks improve retrieval precision; larger chunks preserve context. The right size is task-dependent and worth testing.
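A minimal fixed-size chunker over a pre-tokenised document, using the sizes above; real pipelines usually also respect sentence and section boundaries:

```python
def chunk(tokens: list[str], size: int = 1_000, overlap: int = 100) -> list[list[str]]:
    # Overlapping windows so text cut at a chunk boundary still appears
    # whole in at least one chunk.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```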
Embedding. Use a recent embedding model -- OpenAI's text-embedding-3-large, Cohere's embed-english-v3, or Voyage's voyage-3 are common choices in 2026. Cost matters at scale; quality matters at low scale.
Storage. pgvector (Postgres extension) is fine for under 10M vectors; Pinecone or Weaviate scale further. Avoid premature optimisation -- pgvector handles most production workloads.
Retrieval. Top-K cosine similarity (K=3-7) is the baseline. Add a reranker (Cohere Rerank or a cross-encoder) when retrieval quality plateaus.
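Once chunks are embedded, the baseline is a few lines of numpy; a vector database does the same thing with an approximate index:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    # Cosine similarity is a dot product on L2-normalised vectors.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(scores)[::-1][:k]  # indices of the K best chunks
    return top, scores[top]
```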
Prompt construction. Wrap retrieved documents in delimiters (<documents>...</documents>). Instruct the model to answer using only the documents. Require citations. Allow refusal when the answer is not in the documents.
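One template covering all four instructions; the exact wording is an assumption, worth tuning against your own evals:

```python
GROUNDED_PROMPT = """\
Answer the question using only the documents below.
Cite the document id for every factual claim, like [doc-3].
If the documents do not contain the answer, reply exactly: not found in source.

<documents>
{documents}
</documents>

Question: {question}"""
```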
Evaluation. Build a 50-question test set with known correct answers. Measure retrieval recall (does the right document show up in top-K?) and answer correctness separately. Iterate.
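Measuring the first half, retrieval recall, is a short loop over the test set, again using the placeholder helpers from the earlier sketches:

```python
def recall_at_k(test_set, k: int = 5) -> float:
    # test_set: list of (question, id of the chunk that answers it)
    hits = 0
    for question, expected_id in test_set:
        retrieved = vector_search(embed(question), top_k=k)
        hits += any(c.id == expected_id for c in retrieved)
    return hits / len(test_set)
```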
| Property | Prompt-only | RAG |
|---|---|---|
| Latency (typical) | 1-2 sec | 3-6 sec |
| Cost per call | Lower | ~3-5x prompt-only |
| Infrastructure | None | Vector DB, embedding pipeline |
| Hallucination risk | High on private data | Lower with citations + refusal |
| Knowledge freshness | Training cutoff | As current as your index |
| Best for | Writing, code, general knowledge | Internal docs, recent events, citations |
Frequently asked questions
Should I start with RAG or with prompts?
Start with prompts. Add RAG only when prompt-only output fails on the specific tasks RAG is suited for. Most teams over-engineer in the other direction, building RAG infrastructure for tasks where a 200-word system prompt would have sufficed.
Can long-context models replace RAG entirely?
Sometimes. For tasks where the corpus fits in context (under ~500K tokens) and cost is not a concern, long-context can replace simple RAG. For larger corpora or cost-sensitive applications, RAG remains cheaper. The two approaches will increasingly co-exist, with RAG handling the bulk and long-context handling cases where retrieval would miss relevant cross-document signal.
What about fine-tuning -- when is it the right answer?
Fine-tuning is rarely the answer for knowledge problems. It is occasionally the answer for behaviour problems -- specific output formats, brand voice, multi-step reasoning patterns the model handles inconsistently. For knowledge, RAG is almost always faster, cheaper, and more current. The order of escalation is: prompts, then RAG, then fine-tuning.
How do I prevent the RAG system from making things up?
Three rules: instruct the model to answer using only the retrieved documents, require citations, and allow explicit refusal ("if the documents do not contain the answer, say 'not found in source'"). The single most reliable hallucination-prevention move in 2026 is that third rule -- a one-sentence permission to refuse.
How do I evaluate a RAG system?
Two-stage evaluation. First, retrieval recall: for each test question, does the correct document appear in the top-K results? Second, answer quality: given the retrieved documents, does the model produce a correct, well-cited answer? Many RAG systems fail because retrieval is bad, not because the model is bad; measuring them separately makes the bottleneck visible.
Is RAG the same as a knowledge graph?
No. RAG retrieves text passages and gives them to the model. Knowledge graphs are structured representations of entities and relationships. Some advanced RAG systems combine both -- retrieving relevant passages from a graph-aware index -- but the simple form of RAG is unstructured text retrieval. For most teams, plain RAG is the right starting point.
Where can I learn the RAG implementation in detail?
The LangChain and LlamaIndex documentation cover end-to-end implementation. For deeper architectural patterns see our AI agents hub, and for prompt-side patterns specific to RAG, our advanced prompting section in the main guide.
The bottom line
The decision rule is one question: does the correct answer depend on specific facts the model could not have learned during training? If yes, RAG; if no, prompts alone. Most teams over-engineer toward RAG before they need it, and the resulting infrastructure burden never produces commensurate value. Start with a well-written prompt. Run a 20-question evaluation. Where the prompt fails on knowledge-grounded questions, add RAG -- and only there. Continue with our system prompts guide for the prompt-side techniques that pair with RAG, or browse the full hub for the rest of the cluster.
Last updated: May 2026
