How to Build an AI Agent: A Practical Walk-Through

Most "how to build an AI agent" tutorials show you a hello-world that wouldn''t survive five minutes in front of a real user. This one builds an agent that does survive — a research assistant that takes a question, searches the web, reads the results, and produces a brief — and then walks through the parts most tutorials skip: error handling, evaluation, and what to leave out so the project actually ships. The whole thing is about 120 lines of Python and runs against the Anthropic API, but the pattern transfers to OpenAI, Gemini, or any frontier model with tool use.

Table of contents

What you'll build (clear scope)
The minimum viable architecture
Step 1: define the goal and tools
Step 2: write the agent loop
Step 3: add error handling
Step 4: add evaluation
Step 5: deploy
What this skips and why
Frequently asked questions
The bottom line

What you'll build (clear scope)

A research agent. Input: a question. Output: a 500-word brief with three to five inline citations. The agent must be able to handle questions where the answer isn't in any single source — it has to search, decide which sources are worth reading, read them, and synthesise. By the end you'll have something you'd trust to draft an internal memo.

What it's not: a chatbot, a customer-facing product, an autonomous "do my job" agent. Keeping the scope narrow is the single biggest predictor of success on a first agent build. Resist the urge to add features until the core loop is rock-solid.

The minimum viable architecture

Three components and nothing else.

Tools (roughly 30 lines): two functions, web_search and fetch_page. Each returns a string the model can read.
Loop controller (roughly 40 lines): a while-loop that calls the model, parses tool calls, executes them, and feeds results back.
System prompt (roughly 20 lines of prose): tells the model its role, its tools, and when to stop.

That's the agent. The model itself does the heavy lifting — planning, reading, reasoning, writing. Your code's job is to set up the loop and stay out of the way.

Step 1: define the goal and tools

Start by writing the system prompt. The system prompt is your contract with the model — what it's for, what tools it has, and what done looks like.

SYSTEM_PROMPT = """You are a research assistant.

Tools available:
- web_search(query: str) -> list of {title, url, snippet}
- fetch_page(url: str) -> cleaned text content of the page

Process:
1. Search for the question.
2. Read 3-5 of the most relevant results.
3. Write a 500-word brief that answers the question, with inline citations [1], [2], etc.
4. End with a numbered list of the URLs you cited.

Stop calling tools and write the final brief when you have enough material.
Do not make up facts you cannot cite from your sources."""

Two things matter here. First, an explicit stop condition — without it, agents browse forever. Second, an explicit "don't hallucinate" instruction. It doesn't prevent hallucinations completely, but it cuts them noticeably.

Now the tools. In Anthropic's tool-use format:

TOOLS = [{"name": "web_search","description": "Search the web. Returns up to 10 results.","input_schema": {"type": "object","properties": {"query": {"type": "string"}},"required": ["query"]}},{"name": "fetch_page","description": "Fetch and clean a web page. Returns text content.","input_schema": {"type": "object","properties": {"url": {"type": "string"}},"required": ["url"]}}]

The implementations: web_search wraps Brave Search's API (the free tier handles 2,000 queries/month). fetch_page wraps Jina Reader (jina.ai/reader, which has a generous free tier) and returns clean Markdown of any URL, stripping ads, navigation, the lot.
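Here is a minimal sketch of those two wrappers. The function names match what execute_tool expects later; the endpoint, header, and response-field names follow Brave's and Jina's public docs at the time of writing, and the environment variable and truncation limit are arbitrary choices, so verify the details against the current docs.

import os
import requests

def brave_search(query: str) -> list[dict]:
    # Brave web search endpoint; check parameter and field names against current docs
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        headers={"X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
        params={"q": query, "count": 10},
        timeout=15,
    )
    resp.raise_for_status()
    results = resp.json().get("web", {}).get("results", [])
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("description")}
        for r in results
    ]

def jina_fetch(url: str) -> str:
    # Jina Reader returns cleaned Markdown when you prefix the target URL with r.jina.ai
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=30)
    resp.raise_for_status()
    return resp.text[:20_000]  # arbitrary cap so one huge page doesn't flood the context window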

Step 2: write the agent loop

The loop is shorter than you'd think.

import anthropic

client = anthropic.Anthropic()

def run_agent(question: str, max_iterations: int = 15) -> str:
    messages = [{"role": "user", "content": question}]

    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )

        # If the model returned text and no tool calls, we're done
        if response.stop_reason == "end_turn":
            return response.content[0].text

        # Otherwise, execute tool calls and append results
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
        messages.append({"role": "user", "content": tool_results})

    return "Agent hit iteration cap without finishing."

That's it. The loop calls the model, checks if it's done, and if not, executes tool calls and feeds the results back. The execute_tool function is a router that maps tool names to your Python functions:

import json  # needed to serialise search results for the model

def execute_tool(name: str, input: dict) -> str:
    if name == "web_search":
        return json.dumps(brave_search(input["query"]))
    elif name == "fetch_page":
        return jina_fetch(input["url"])
    return f"Unknown tool: {name}"

Run this on a question — "What were the major findings from Anthropic's 2024 constitutional AI paper?" — and you'll watch it search, fetch a paper, fetch a blog post, and write a brief. About 30 seconds, ~$0.02 in API charges.

Step 3: add error handling

The version above will work most of the time and break in three predictable ways. Each gets a small fix.

Tool execution fails. The search API times out, the URL returns 404, the page is paywalled. Without handling, this raises an exception and kills the run. The fix is to catch the exception, return the error message as the tool result, and let the model recover:

def execute_tool(name: str, input: dict) -> str:
    try:
        if name == "web_search":
            return json.dumps(brave_search(input["query"]))
        elif name == "fetch_page":
            return jina_fetch(input["url"])
        return f"Unknown tool: {name}"
    except Exception as e:
        return f"Tool failed: {type(e).__name__}: {e}"

The model will read the error and adapt — usually by trying a different URL or rephrasing the search. This single change moves your agent from "fails on any flaky network call" to "robust to most transient failures."

The model invents tools or arguments. Rare with Sonnet, but it happens. The fix is the unknown-tool fallback in execute_tool above, plus schema validation: check the model's arguments against the tool's input_schema before running anything, and if they don't validate, return the validation error as the tool result so the model can correct itself on the next turn.
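One way to do that check, sketched with the jsonschema package (an extra dependency, not part of the 120-line count); it validates the model's arguments against the same input_schema you declared in TOOLS:

from jsonschema import ValidationError, validate

def validated_execute(block) -> str:
    # Find the declared schema for the tool the model asked for
    schema = next((t["input_schema"] for t in TOOLS if t["name"] == block.name), None)
    if schema is None:
        return f"Unknown tool: {block.name}"
    try:
        validate(instance=block.input, schema=schema)  # raises if the arguments don't match
    except ValidationError as e:
        return f"Invalid arguments for {block.name}: {e.message}"
    return execute_tool(block.name, block.input)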

Cost runaway. A loop that hits 15 iterations with a 100K-token context per call is ~1.5M tokens, ~$5 per run at 2026 Sonnet pricing. Multiply by a thousand runs and the bill stings. Fix: a token-budget check inside the loop. After each iteration, sum the input + output tokens; if the total exceeds your budget, break and return what you have.
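As a sketch, that check is a running total plus a comparison; response.usage is part of every Messages API response, and the budget figure here is an arbitrary example:

TOKEN_BUDGET = 500_000  # arbitrary cap; tune to your own cost tolerance

def update_usage(response, total_so_far: int) -> tuple[int, bool]:
    # response.usage reports tokens for this single call; keep a running total across iterations
    total = total_so_far + response.usage.input_tokens + response.usage.output_tokens
    return total, total > TOKEN_BUDGET

# Inside run_agent, right after each client.messages.create(...) call:
#     total_tokens, over_budget = update_usage(response, total_tokens)
#     if over_budget:
#         return "Stopped early: token budget exceeded."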

Step 4: add evaluation

This is the step most agent builds skip and most agent builds regret skipping. The minimum: 20 representative questions with manually written ideal answers, plus a script that runs the agent on each one and scores the output.

The grading itself can be done with another LLM call (LLM-as-judge). The judge gets the question, the agent's answer, and the ideal answer, and scores on a 1-5 rubric covering accuracy, citation quality, and structure. This isn't perfect — judge models have biases — but it's 90% as good as human grading at 1% of the cost.
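A minimal judge sketch, reusing the client and json imports from earlier; the rubric wording and the strict-JSON instruction are illustrative choices, and a cheaper model usually works fine in the judge role:

JUDGE_PROMPT = """You are grading a research brief.
Score the agent's answer against the ideal answer on a 1-5 scale for accuracy,
citation quality, and structure. Reply with a single JSON object containing the
integer fields accuracy, citations, and structure, and nothing else.

Question: {question}

Ideal answer: {ideal}

Agent's answer: {answer}"""

def judge(question: str, ideal: str, answer: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # a cheaper model is usually good enough for grading
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, ideal=ideal, answer=answer),
        }],
    )
    return json.loads(response.content[0].text)

# Hypothetical runner: suite is a list of {"question": ..., "ideal": ...} dicts you maintain by hand
# scores = [judge(q["question"], q["ideal"], run_agent(q["question"])) for q in suite]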

The point of the eval suite is not the score itself. It's having a way to know, when you change the system prompt or swap the model, whether the change made things better or worse. Without that, prompt iteration is gambling.

Tools worth knowing: LangSmith, Braintrust, and Helicone all have free tiers and integrate in three lines of code. Pick one. For the broader observability story, our pillar guide covers it in more depth.

Step 5: deploy

"Deploy" for a research agent might mean three things, depending on your audience.

For yourself. A CLI script is enough. python research.py "your question here", output to stdout. No deployment needed. Most internal-tool agents live their entire useful life in this form.
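For completeness, the CLI wrapper is only a few lines; research.py here matches the command above, and it assumes run_agent is defined in (or imported into) the same file:

# research.py
import sys

if __name__ == "__main__":
    question = " ".join(sys.argv[1:])
    if not question:
        sys.exit('usage: python research.py "your question here"')
    print(run_agent(question))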

For your team. Wrap it in a small Flask or FastAPI app, deploy to Railway/Fly/Render. Add basic auth. Total deployment time: 30 minutes. Add a Slack slash command in front of it and your team will start using it that afternoon.
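A sketch of the team version with FastAPI; the route path, the shared-secret check, and the TEAM_TOKEN variable are illustrative choices, and run_agent is assumed importable from the agent module above:

import os
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Ask(BaseModel):
    question: str

@app.post("/research")
def research(body: Ask, authorization: str = Header(default="")):
    # Basic shared-secret auth; swap in whatever your team already uses
    if authorization != f"Bearer {os.environ.get('TEAM_TOKEN', '')}":
        raise HTTPException(status_code=401, detail="unauthorised")
    return {"brief": run_agent(body.question)}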

For external users. This is where the work multiplies. You need rate limiting, abuse prevention, billing, support for users who break things in unexpected ways, and a content policy for what the agent will and won't answer. A reasonable estimate: 10x the engineering work of the internal version. Don't do this until you're sure people want it. The right path is almost always: ship internal, watch usage for a month, then decide.

What this skips and why

Three things this walk-through deliberately leaves out.

Multi-agent orchestration. Most production agents in 2026 are single-agent. Multi-agent setups are useful but add latency, cost, and debugging complexity. Don't reach for them unless your task genuinely splits into specialist roles.

Memory across sessions. The research agent above is stateless — each question is independent. Adding memory (Mem0, Letta, or a simple key-value store) is straightforward, but it's a separate concern. Get the stateless version working first.

RAG. Retrieval-augmented generation matters when the agent needs to query your own document store rather than the open web. The pattern is the same — RAG just becomes another tool. For the broader RAG context, our prompt engineering hub covers the technique.

Frequently asked questions

Do I need to use LangChain or another framework?

No. The 120-line agent above uses only the Anthropic SDK and two HTTP libraries. Frameworks become useful when your agent grows past 500 lines, has multiple sub-agents, or needs sophisticated state management. Starting framework-first usually slows you down.

How long should this take to build?

A working version: an afternoon. A version with error handling, evaluation, and one round of prompt iteration: about a week, working evenings. Add a deployment and a Slack integration and call it two weeks for a usable internal tool.

Which model is right for a first agent?

Claude Sonnet 4.6 in 2026, or GPT-4o / GPT-5 if you're already on OpenAI. Both have solid tool use, reliable JSON, and strong reasoning. Don't start with smaller or open-weight models — debugging an agent and debugging the model's reliability at the same time is too many variables.

How do I prevent the agent from doing something dangerous?

For a research agent that only reads, the blast radius is small — it can waste time and money but can't damage anything. For agents that write to systems, follow the principle of least privilege: each tool gets only the permissions it strictly needs. A "send email" tool shouldn't have permission to delete emails. Add a human-approval step for any irreversible action.
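One cheap way to add that gate, sketched as a wrapper around execute_tool; the tool names in the set are hypothetical examples for a write-capable agent, and a console prompt stands in for whatever approval flow you actually use:

IRREVERSIBLE = {"send_email", "delete_record"}  # hypothetical write-capable tools

def execute_tool_with_approval(name: str, tool_input: dict) -> str:
    if name in IRREVERSIBLE:
        # Pause and ask a human before anything that can't be undone
        print(f"Agent wants to call {name} with {tool_input}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return "Action rejected by a human reviewer."
    return execute_tool(name, tool_input)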

What's the cheapest way to run this in production?

Three optimisations stack: prompt caching (cuts repeated input cost ~90%), model routing (use Haiku for cheap subtasks, Sonnet only for hard reasoning), and summarising old conversation turns to keep the context window small. Together these typically cut cost 60-80% versus the naive implementation.
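Prompt caching, for instance, is close to a one-line change on the Anthropic API: mark the static system prompt as cacheable and repeated calls reuse it (see Anthropic's prompt-caching docs for current details such as minimum cacheable sizes):

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache the unchanging system prompt across calls
    }],
    tools=TOOLS,
    messages=messages,
)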

Should I open-source it?

If it's a generic tool other people would find useful, sure — there's a lot of agent-template code on GitHub and the good ones get used. If it's wired to your business, no. The agent code is rarely the moat; the prompts, evaluations, and tool integrations specific to your domain are.

The bottom line

Build the smallest agent that does one real thing, ship it to yourself first, instrument it well enough to know when it breaks, and resist every temptation to add features until the core loop is reliable. The 120 lines above are not a toy — variants of the same pattern run in production at companies you've heard of. The discipline that separates a working agent from a demo is not framework choice or model choice; it's the willingness to add evaluation before adding features. Do that and you'll have an agent in two weeks. Skip it and you'll still be debugging in two months. For the architectural background, see our AI agents guide; for framework choice when you outgrow the basics, see our frameworks comparison.

Last updated: May 2026