Chain-of-Thought Prompting: When and How to Use It
Adding eight words to a prompt -- "think through this step by step before answering" -- can move a model from 18% accuracy to 57% on the same problem set. That number comes from Wei et al.'s 2022 paper that introduced chain-of-thought, on the GSM8K math benchmark, and similar gains have been documented across reasoning, planning, and analysis tasks. Five years later the technique has evolved: reasoning models do CoT internally, the prompt patterns have shifted, and a growing list of tasks where CoT actually hurts has emerged. This guide is the working version of when chain-of-thought helps, when it does not, and what changes in 2026.
Table of contents
- What CoT actually changes
- Worked example: math reasoning
- Worked example: business analysis
- Zero-shot vs few-shot CoT
- When CoT hurts performance
- CoT in 2026 reasoning models
- Tool-call CoT patterns
- Frequently asked questions
- The bottom line
What CoT actually changes
A model without CoT is asked to produce an answer directly. The model's next-token prediction is conditioned on producing a final answer immediately, which biases it toward the most likely answer given the question -- often a confidently wrong one for any non-trivial reasoning.
A model with CoT is asked to produce reasoning before the answer. The reasoning becomes part of the model's context for the answer token. The model now answers conditioned on its own intermediate steps, which means it has self-generated a more useful prompt for the final prediction.
The mechanism is not that the model "thinks more" in any human sense. It is that the model produces tokens that act as scaffolding for later tokens. A wrong intermediate step can produce a wrong final answer; a right intermediate step often produces a right final answer. The probabilities shift in favour of correctness because correct reasoning correlates with correct answers in the training distribution.
Two practical consequences. First, CoT is asymmetric in cost and benefit: it adds tokens (cost), but on hard problems the accuracy gain dwarfs the token cost. Second, CoT's benefit is largest on problems where the answer is a function of multiple steps -- math, logic, planning, multi-criterion analysis. On problems where the answer is a single retrieval -- "what year did X happen" -- CoT produces no measurable gain and slows things down.
For the broader theory, see our complete prompt engineering guide.
Worked example: math reasoning
The canonical CoT demonstration is a math word problem. Consider:
"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"
Direct prompting (model asked to answer immediately) on a 2022-era model produced answers like "11" with no work shown. The frequency of correct answers on harder versions of this problem hovered around 18% on Wei et al.'s test set.
CoT prompting -- adding "let's think step by step" -- produces something like: "Roger starts with 5 balls. He buys 2 cans, each with 3 balls. So 2 cans is 2 x 3 = 6 balls. Total is 5 + 6 = 11. Answer: 11." Same answer in this case, but the same prompt structure on harder problems lifted accuracy to 57%.
The reason: when the model produces "2 x 3 = 6" as an intermediate step, the final answer prediction is conditioned on a correctly-computed 6, not on direct estimation of the total. The intermediate step does the work; the final answer becomes routine.
The math example is the cleanest demonstration of the technique because correctness is unambiguous. Most real-world tasks reward CoT for the same reason -- explicit intermediate steps anchor the final answer in something the model has to reason about, not just guess at.
Worked example: business analysis
CoT is most useful on tasks where the answer depends on weighing several factors. A direct prompt like "should we acquire Company X" produces a confident recommendation grounded in nothing visible.
A CoT prompt restructures the same question:
"I am evaluating whether to acquire Company X. Before recommending, work through the following: (1) the strongest argument for the acquisition; (2) the strongest argument against; (3) the single piece of evidence that, if false, would change the recommendation; (4) the recommendation in one sentence."
The output is markedly different. The model surfaces the actual weight it is putting on each factor. The argument-against forces the model to engage with reasons it might otherwise gloss over. The "evidence that would change the recommendation" produces a falsifiable hinge -- the user can now check whether that evidence holds.
For business decisions, the format above performs better than free-form CoT because it specifies the dimensions of the reasoning. "Think step by step" gives the model freedom to reason about anything; structured CoT specifies what to reason about. The structured version is more reliable, more reviewable, and more useful as a starting point for human decision-making.
This pattern -- specifying reasoning dimensions rather than just asking for reasoning -- is the modern form of CoT and the one that scales best.
Zero-shot vs few-shot CoT
Two versions of CoT exist in the literature.
Zero-shot CoT means adding "think step by step" (or a similar instruction) without showing examples. The model produces reasoning unprompted by examples. This is the most common form because it is cheap -- one extra sentence in the prompt -- and works well on most tasks where reasoning helps.
Kojima et al. 2022 demonstrated that "let's think step by step" added to math problems lifted GPT-3 from 17.7% to 78.7% accuracy on a popular math benchmark. The phrase has since become a standard tool, and Anthropic, OpenAI, and Google all reference variants of it in their official guides.
Few-shot CoT means showing the model one or two examples of well-formed reasoning before asking it to reason about a new problem. The example acts as a demonstration of the reasoning shape you want.
Few-shot CoT outperforms zero-shot CoT when the reasoning structure matters. For a custom analysis format -- "for each candidate, list pros, cons, and a final score" -- showing one fully-worked example produces more consistent output than describing the format. The model imitates the example shape.
The trade-off is token cost. Few-shot examples add hundreds of tokens to every call. For high-volume systems, the cost adds up. The rule of thumb: zero-shot when "think step by step" is enough; few-shot when the reasoning needs a specific shape that is hard to describe.
When CoT hurts performance
CoT is not free, and not always beneficial. Three cases where it actively hurts.
Simple retrieval tasks. "What is the capital of France?" does not benefit from CoT; it adds latency and produces verbose output that a user has to skim through. Reserve CoT for problems with multiple steps; for single-fact retrieval, skip it.
Tasks where verbosity matters. If your output target is "respond in one short sentence," CoT directly conflicts with the format. The model will either produce reasoning followed by a short answer (over budget) or skip the reasoning to comply with length (defeating CoT). For terse outputs, use hidden CoT instead -- ask the model to reason in <thinking> tags and produce only the conclusion outside, then strip the tags before showing the user.
Reasoning models on simple inputs. Adding "think step by step" to a reasoning model (o3, Claude Opus 4.7 with thinking, Gemini 2.5 Pro with thinking) often produces no improvement and sometimes degrades performance. The model is already doing extended reasoning internally; the explicit instruction can cause it to over-think simple queries, producing bloated answers.
A 2024 paper from Anthropic found that on a subset of tasks, CoT prompting with reasoning models reduced accuracy by 2-5%. The signal: when using a reasoning model, only invoke explicit CoT when you specifically want to steer the reasoning, not when you want to trigger it.
CoT in 2026 reasoning models
The rise of reasoning models has changed how to use CoT in production.
Reasoning models -- OpenAI's o3 and o4-mini, Claude Opus 4.7 with extended thinking, Gemini 2.5 Pro with thinking -- run extended internal reasoning before producing the visible response. The reasoning is invisible to the user, often takes 5-30 seconds, and consumes "thinking tokens" that are billed but hidden.
The prompt change: instead of telling the model to think, you specify what to think about. "Spend the thinking budget weighing cost vs accuracy specifically; do not weigh lock-in risk." This is steerable thinking, and it is the new advanced layer of prompt engineering.
Three patterns work well with reasoning models:
Dimension specification. Name the criteria the reasoning should consider, in order of priority. The model uses the thinking budget to weigh those specifically.
Stop conditions. Tell the model when to stop thinking. "If you reach a confident answer in 30 seconds of thinking, stop -- do not over-elaborate." This addresses the over-thinking failure mode.
Output structure. Specify the visible response format separately from the thinking. The model thinks freely, then produces output in the structure you specified. "Think for as long as needed; respond as a 100-word summary."
For non-reasoning models (GPT-4o, Claude Sonnet, Gemini Flash), classical CoT still applies: explicit step-by-step instructions remain the highest-impact edit on hard reasoning problems.
Tool-call CoT patterns
When models can call tools, CoT extends to the tool-use loop. The pattern is sometimes called ReAct (Yao et al. 2022): Reason, then Act, then Observe, then repeat.
The prompt: "For each step, produce: (1) Thought: what you are trying to do next; (2) Action: the tool call or final answer; (3) Observation: the result. Continue the loop until you can answer."
The benefit is that the reasoning is visible at each step. When a tool call fails or returns unexpected data, you can see what the model was trying to do and why. This makes debugging tool-using agents dramatically easier.
In 2026 the major frameworks (OpenAI Assistants, LangGraph, the Anthropic Agents API) abstract this loop, but the underlying prompt pattern is unchanged. For agent-specific patterns, see our AI agents hub.
| Scenario | Best CoT pattern | Why |
|---|---|---|
| Math word problem (non-reasoning model) | Zero-shot "think step by step" | Cheap, large gain |
| Custom analysis format | Few-shot CoT with one example | Encodes the reasoning shape |
| Multi-criterion business decision | Structured CoT (named dimensions) | More reviewable, less drift |
| Reasoning model + complex task | Steerable thinking (specify dimensions) | Avoids redundant CoT instructions |
| Tool-using agent | ReAct (Thought/Action/Observation) | Debuggable execution trace |
| Simple retrieval | No CoT | Adds latency without benefit |
| Terse output required | Hidden CoT in <thinking> tags | Reasoning quality without verbosity |
Frequently asked questions
Should I always use chain-of-thought?
No. Use it on multi-step tasks (math, planning, multi-criterion analysis) and skip it on simple retrieval or terse-output tasks. Adding CoT indiscriminately wastes tokens and produces verbose output where you wanted a sharp answer.
Does CoT work on all models?
It works on most current models, with the largest gains on capable non-reasoning models. Smaller models (under 7B parameters) sometimes show smaller or inconsistent gains, partly because they have less reliable reasoning capability to begin with. Frontier models (GPT-5, Claude Opus 4.7, Gemini 2.5 Pro) all benefit when CoT is applied to suitably hard problems.
Why is "let's think step by step" so effective?
It is a phrase that appears in many training examples where reasoning is shown explicitly. The model has learned to produce reasoning when those words appear. The exact phrasing is less important than the prompt structure -- "explain your reasoning" or "work through this carefully" produce similar gains.
How much extra cost does CoT add?
Typically 200-1500 extra tokens per response, depending on problem complexity. At GPT-5 pricing ($USD 30/M output tokens) that is roughly $USD 0.006 to $USD 0.045 per call. For high-volume applications, the cost is non-trivial; for high-stakes single calls, it is rounding error.
Should I use few-shot CoT or zero-shot CoT?
Default to zero-shot. Move to few-shot when the reasoning shape is hard to describe, when consistency across runs matters, or when zero-shot has plateaued at unacceptable accuracy. Few-shot is more powerful but more expensive.
Do reasoning models obsolete chain-of-thought prompting?
They obsolete the basic form ("think step by step") but elevate the advanced form (steerable thinking). The skill of specifying which dimensions to reason about is the new high-impact move on reasoning models. The 2022 paper's technique is a stepping stone; the destination is structured reasoning prompts.
Where can I see CoT in production?
OpenAI's o-series response payloads include a "reasoning" field that exposes a sanitised version of the model's internal reasoning. Claude with extended thinking shows the thinking block in API responses. Both are useful for debugging prompts that depend on multi-step reasoning. For broader patterns, see our prompt engineering guide.
The bottom line
For non-reasoning models on hard problems: add "think step by step" or, better, structured reasoning instructions that specify the dimensions to consider. For reasoning models: skip the basic CoT instructions and steer the reasoning instead -- name the criteria, set stop conditions, separate the thinking from the output format. For everything: skip CoT on simple retrieval tasks, where it only adds latency. The technique earns its keep on multi-step problems; on single-step ones, it is dead weight. Browse the full prompt engineering hub for the cluster guides on every other technique mentioned here.
Last updated: May 2026
