How Machine Learning and Deep Learning Actually Work

Machine learning has a small number of moving parts and a lot of jargon wrapped around them. Once you understand the moving parts, the jargon stops getting in the way. The fundamental idea is straightforward: rather than writing rules for a computer to follow, you give it examples and let it derive its own rules from patterns in the data. The math that makes this work has been understood since the 1980s. The data and compute that made it work in practice arrived in the 2010s. The architecture that made it generalise across domains (the transformer) arrived in 2017. This guide walks through how the modern machine learning pipeline actually works -- the training loop, the three main learning paradigms, the architectures that have dominated each modality, and the reason deep learning specifically displaced everything else.

The core idea: learning from data

Classical software is a programmer writing rules for inputs they have anticipated. Machine learning is a programmer writing code that, given enough examples, derives its own rules from data. The difference is profound when the rules you would otherwise have to write are too numerous or too complex to express by hand.

Consider an email spam filter. The rules-based version is easy at first: "if the email contains 'free Viagra' or 'Nigerian prince', mark spam". It is also unmaintainable, because spammers change the words and the rules become a Whac-A-Mole game. The machine-learning version trains on a dataset of emails labelled spam or not-spam. The trained model has learned which combinations of words, sender patterns, header anomalies and structural features distinguish the two classes. It generalises to new spam patterns it never saw in training because the underlying patterns are similar enough to the ones it saw.

That is the essential trade. You give up direct control of the rules; you gain a system that handles cases you could not have anticipated. Machine learning is the right tool when the cases outnumber what you could write rules for, and the wrong tool when you actually do know the rules and care about precision more than coverage.

For where this fits in the broader AI picture, see our AI vs ML vs deep learning comparison.

The training loop

The training loop is the heart of every machine learning system. It has the same basic structure whether the model is a linear regression with twelve parameters or a frontier transformer with a trillion. The structure is:

  1. Initialise the model with random parameters
  2. Show it an example (or a batch of examples)
  3. Get the model's prediction
  4. Compute the loss -- how wrong the prediction was
  5. Adjust the parameters in the direction that would reduce the loss
  6. Repeat steps 2-5 until the loss stops improving

The "adjust the parameters in the direction that would reduce the loss" step is gradient descent, which we cover in its own section below. The "loss" is a number that quantifies how wrong the model is on the current example -- different tasks use different loss functions, but the role is the same.

The training loop runs for thousands to millions of iterations. The model's parameters drift, slowly, in directions that reduce loss across the training data. When training is done, you stop adjusting the parameters and start using the trained model to make predictions on new data. That is the inference phase. Training is expensive and rare; inference is cheap and frequent.
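
To make the loop concrete, here is a minimal sketch in PyTorch on a made-up regression task. The tiny network, the toy data and the hyperparameters are placeholders for illustration, not a recipe:

  import torch
  from torch import nn

  # Toy dataset: 1,000 examples, 10 input features, one numeric target.
  X = torch.randn(1000, 10)
  y = torch.randn(1000, 1)

  # Step 1: a small model, initialised with random parameters.
  model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
  loss_fn = nn.MSELoss()
  optimiser = torch.optim.SGD(model.parameters(), lr=0.01)

  for epoch in range(100):
      for i in range(0, len(X), 64):
          xb, yb = X[i:i + 64], y[i:i + 64]   # step 2: a batch of examples
          pred = model(xb)                    # step 3: the model's prediction
          loss = loss_fn(pred, yb)            # step 4: how wrong the prediction was
          optimiser.zero_grad()
          loss.backward()                     # gradients via backpropagation
          optimiser.step()                    # step 5: adjust parameters downhill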

One critical subtlety: the goal is not to minimise loss on the training data. The goal is to minimise loss on data the model has not seen, which is what "generalisation" means. Models that memorise the training data perfectly but fail on new data are said to be overfitting. Most of the engineering tricks in modern ML (regularisation, dropout, data augmentation, early stopping, validation splits) exist to prevent overfitting.
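
The simplest guard against overfitting is to hold some data out and stop training when performance on it stops improving. Continuing the sketch above (same toy data, model and optimiser), with an arbitrary split and patience value:

  # Hold out 20% of the examples; the model never trains on them.
  n_val = len(X) // 5
  X_train, y_train = X[:-n_val], y[:-n_val]
  X_val, y_val = X[-n_val:], y[-n_val:]

  best_val, stale_epochs = float("inf"), 0
  for epoch in range(1000):
      for i in range(0, len(X_train), 64):             # one pass of the training loop above
          loss = loss_fn(model(X_train[i:i + 64]), y_train[i:i + 64])
          optimiser.zero_grad()
          loss.backward()
          optimiser.step()
      with torch.no_grad():
          val_loss = loss_fn(model(X_val), y_val).item()   # loss on data the model has never seen
      if val_loss < best_val:
          best_val, stale_epochs = val_loss, 0
      else:
          stale_epochs += 1
          if stale_epochs >= 5:                        # early stopping: five epochs with no improvement
              break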

Supervised learning

The most common ML setup. You have a dataset where each example is labelled with the correct answer; the model learns to predict the answer given the input.

The label can be a category (classification) -- spam or not, dog or cat, malignant or benign. It can be a number (regression) -- the predicted house price given features, the predicted temperature tomorrow. It can be a structured output (sequence-to-sequence) -- the English translation of a French sentence, the SQL query that answers a natural-language question.
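
A concrete sketch of the classification case with scikit-learn. The synthetic feature matrix below stands in for real, hand-extracted email features, which is where the actual work lives:

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  # Stand-in for a real dataset: one row of features per email, y = 1 for spam, 0 for not-spam.
  X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  clf = LogisticRegression(max_iter=1000)
  clf.fit(X_train, y_train)               # learn the decision rule from labelled examples
  print(clf.score(X_test, y_test))        # accuracy on emails it never saw during training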

Supervised learning has a fundamental dependency: you need labels. Producing labels is expensive when the task is non-trivial. A medical-imaging classifier might need radiologists to label tens of thousands of scans. A self-driving system needs human labellers to mark every car, pedestrian, lane line and traffic sign in a video. The labelling cost is often larger than the training cost; companies that have built large labelling pipelines (Tesla for driving, the early ImageNet team for vision) have built moats out of label data alone.

Supervised learning is well-understood and works reliably when you have enough labelled data. The challenge is almost always the data, not the modelling.

Unsupervised learning

You have data without labels; the model learns to find structure on its own. Two big subcategories:

Clustering. Group similar examples together without being told which groups exist. Customer segmentation, anomaly detection, document grouping. The classic algorithms (k-means, DBSCAN, hierarchical clustering) are old and still work for many tabular cases.

Dimensionality reduction. Take data with many features and find a lower-dimensional representation that preserves the important structure. Useful for visualisation, for compression, and as a preprocessing step before other methods. Principal component analysis (PCA) is the classical method; t-SNE and UMAP are the modern visualisation favourites.
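
A sketch of both flavours with scikit-learn on synthetic data. In practice the number of clusters and components are choices you make by inspecting the data, not constants you know up front:

  from sklearn.datasets import make_blobs
  from sklearn.cluster import KMeans
  from sklearn.decomposition import PCA

  # Unlabelled data: 1,000 examples with 50 features each, secretly drawn from 4 groups.
  X, _ = make_blobs(n_samples=1000, n_features=50, centers=4, random_state=0)

  labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)   # clustering
  X_2d = PCA(n_components=2).fit_transform(X)                               # 50 features down to 2
  print(labels[:10], X_2d.shape)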

The most consequential modern variant is self-supervised learning, where you create labels from the data itself. The classic example is masked language modelling: take a sentence, hide one word, train the model to predict the hidden word. The labels were always there; you just had to define a task that uses them. Self-supervised learning is the engine behind every large language model -- it lets you train on essentially unlimited unlabelled text by treating "what comes next" as the implicit label.
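
A sketch of how the labels fall out of raw text for free, using next-token prediction and a crude word-level split purely for illustration (real systems use learned subword tokenisers):

  text = "the cat sat on the mat"
  tokens = text.split()

  # Every position yields a free (context, label) training pair -- no human labelling required.
  pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
  for context, label in pairs:
      print(context, "->", label)
  # ['the'] -> cat
  # ['the', 'cat'] -> sat   ... and so on, for every position in every document.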

Self-supervised learning is technically supervised by the data itself rather than by humans, but it is conventionally classed with unsupervised because no human labelling is required. The semantic confusion is harmless once you understand the technique.

Reinforcement learning

An agent acts in an environment; it receives rewards or penalties; it learns to maximise long-term reward. Reinforcement learning is the right framing for any problem where you have a sequence of decisions and you can score the outcome.

The canonical examples are games. AlphaGo, DeepMind's 2016 system that beat Lee Sedol at Go, was first trained on human expert games and then sharpened through reinforcement learning -- self-play games where winning was rewarded. AlphaZero, its 2017 successor, learned chess, shogi and Go from scratch with no human game data, using only the rules of each game and self-play. The 2017 result was a watershed in showing that RL could reach superhuman performance on hard sequential decision problems.

Outside games, RL is used in robotics (where actions in physical environments produce rewards over time), in recommender systems (where the long-term engagement signal matters), and in operations research (resource allocation, scheduling). It is also the core of the post-training step that turned base language models into useful assistants. Reinforcement learning from human feedback (RLHF) is the technique that made ChatGPT helpful in 2022 and remains a central part of every major LLM's pipeline.

RL is conceptually clean and practically hard. The challenges are sample efficiency (RL typically needs millions of episodes to learn anything non-trivial), reward shaping (designing reward functions that produce desired behaviour without exploits), and stability (RL training can collapse in ways supervised training does not). Most production RL systems combine RL with supervised pretraining, which is why pure RL deployments are still relatively rare.
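
For a feel of the mechanics, here is a sketch of tabular Q-learning on a made-up corridor environment: the agent starts at one end, is rewarded only for reaching the other, and learns which way to move. Everything here -- the environment, the reward and the hyperparameters -- is invented for illustration:

  import random

  N_STATES, GOAL = 6, 5                      # a corridor of six cells; the reward sits at the far end
  q = [[0.0, 0.0] for _ in range(N_STATES)]  # value estimate per (state, action); 0 = left, 1 = right
  alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount factor, exploration rate

  for episode in range(500):
      state = 0
      while state != GOAL:
          if random.random() < epsilon:               # explore occasionally...
              action = random.randint(0, 1)
          else:                                       # ...otherwise act on current estimates
              action = 0 if q[state][0] > q[state][1] else 1
          next_state = max(0, state - 1) if action == 0 else state + 1
          reward = 1.0 if next_state == GOAL else 0.0
          # Q-learning update: move the estimate toward reward + discounted future value.
          q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
          state = next_state

  print(["right" if q[s][1] > q[s][0] else "left" for s in range(GOAL)])  # the learned policy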

Gradient descent: how the model actually learns

The reason any of this works is gradient descent. It is the algorithm by which the training loop adjusts the model's parameters in the direction that reduces loss.

The intuition: imagine a hilly landscape where the height represents the loss and your position represents the model's parameters. You want to find the lowest point in the landscape. Gradient descent works by feeling the slope where you are standing and taking a small step downhill. Repeat enough times and you end up at a low point. The "feeling the slope" step is computing the gradient -- the partial derivatives of the loss with respect to each parameter -- and the "taking a step downhill" step is updating each parameter by a small fraction of its gradient (the fraction is the learning rate).

Two complications matter in practice. First, the gradient is computed on a small batch of examples rather than the whole dataset, which is why this is called stochastic gradient descent. The stochasticity adds noise but lets training proceed without holding the entire dataset in memory. Second, the loss landscape of a deep neural network is not a single bowl; it is a high-dimensional surface with many local minima, plateaus and saddle points. Gradient descent does not always find the global minimum. The remarkable thing is that for deep networks the local minima it finds are usually good enough, and modern variants (Adam, AdamW, the various learning-rate schedules) make the process robust enough to scale to trillion-parameter models.
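
The whole idea fits in a few lines. A sketch of plain gradient descent fitting a one-parameter model y = w * x, with the gradient worked out by hand; the data is invented so that the right answer is w = 3:

  # Data generated from y = 3x, so the parameter we hope to recover is w = 3.
  xs = [1.0, 2.0, 3.0, 4.0]
  ys = [3.0, 6.0, 9.0, 12.0]

  w = 0.0                      # initial guess
  learning_rate = 0.01
  for step in range(200):
      # Loss is the mean squared error; its gradient with respect to w, by the chain rule,
      # is mean(2 * (w*x - y) * x). The negative gradient points downhill.
      grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
      w -= learning_rate * grad          # take a small step downhill
  print(w)                               # approaches 3.0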

The algorithm by which gradients are efficiently computed in deep networks is backpropagation, popularised by Rumelhart, Hinton and Williams in 1986. Backprop is the chain rule of calculus applied through every layer of a network in a single backward pass. Without backprop, training deep networks would be computationally intractable. With it, training a transformer on trillions of tokens becomes a matter of capital expenditure rather than algorithmic insight.
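
In practice nobody derives gradients by hand: you write the forward computation and the framework runs the chain rule backwards through it. A minimal sketch with PyTorch's autograd, on an arbitrary one-parameter function:

  import torch

  w = torch.tensor(2.0, requires_grad=True)
  x = torch.tensor(3.0)

  loss = (w * x - 5.0) ** 2       # forward pass: build the computation
  loss.backward()                 # backward pass: chain rule through every operation
  print(w.grad)                   # d(loss)/dw = 2 * (w*x - 5) * x = 6.0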

Neural network architectures

The architecture is the structure of the model -- which kinds of layers, in what order, with what connections. Three architectures dominated the deep learning era; one of them now dominates everything.

Convolutional neural networks (CNNs). Designed for grid-structured data like images. The convolutional layer applies a small filter across the image, producing a feature map; stacking many convolutional layers produces a hierarchy of features, from edges in early layers to objects in late layers. CNNs were the workhorses of computer vision from AlexNet (2012) until vision transformers (2020). They are still used widely in production for vision tasks where transformers would be overkill.
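
A sketch of a toy CNN in PyTorch. Each convolutional layer slides small learned filters across its input to build feature maps; the layer sizes here are arbitrary:

  import torch
  from torch import nn

  cnn = nn.Sequential(
      nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # early layers: edges, textures
      nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # later layers: parts, shapes
      nn.Flatten(),
      nn.Linear(32 * 8 * 8, 10),                                                # 10-way classification head
  )
  print(cnn(torch.randn(1, 3, 32, 32)).shape)   # one 32x32 RGB image in, 10 class scores out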

Recurrent neural networks (RNNs and their LSTM/GRU variants). Designed for sequence data like text or audio. RNNs process the sequence one element at a time, maintaining a hidden state that summarises everything seen so far. Long short-term memory networks (LSTMs), introduced by Hochreiter and Schmidhuber in 1997, solved the vanishing-gradient problem that made plain RNNs hard to train on long sequences. RNNs were the dominant NLP architecture from the 2000s until transformers in 2017. They are largely deprecated in 2026 but appear in older production systems.

Transformers. Introduced in the 2017 "Attention Is All You Need" paper. The transformer's central mechanism is self-attention, which lets every position in a sequence attend to every other position, weighted by learned relevance. Self-attention is parallelisable in a way recurrence is not, which makes transformers well-suited to GPUs and lets them scale to enormous sizes. Transformers dominate text, code, increasingly images (vision transformers), increasingly audio, and increasingly video.

A handful of other architectures play supporting roles in 2026: diffusion models for image and video generation (which use neural networks underneath, often transformer-based); graph neural networks for relational data (drug discovery, social networks); state-space models like Mamba that try to combine the efficiency of recurrence with the parallelism of attention. The transformer is the dominant general-purpose architecture, but it is not the only one.

Transformers in more detail

Because almost everything in 2026 is a transformer, it pays to understand how they work at one level deeper than "the dominant architecture". The transformer's central trick is self-attention, and self-attention is conceptually simple once stripped of its math.

Imagine processing a sentence. For each word, you want to know which other words in the sentence are most relevant for understanding it. "The cat sat on the mat because it was tired" -- to understand "it", you need to know whether it refers to "cat" or "mat", which depends on the rest of the sentence. Self-attention computes, for every word, a weighted relevance score against every other word, and uses those scores to build a richer representation of each word that includes context from the whole sentence.

The math turns out to be elegant: each word produces three vectors -- a query (what am I looking for?), a key (what do I represent?), and a value (what do I contribute if I am attended to?). Attention is the dot product of queries against keys, normalised to weights, used to combine values. The whole operation boils down to a few matrix multiplications and a softmax, and runs efficiently on GPUs. Stacking many self-attention layers (each followed by a small feed-forward network) produces a transformer.
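
Here is single-head self-attention written out in PyTorch, following that description directly. The dimensions are arbitrary, and a real transformer adds multiple heads, masking and learned output projections on top:

  import torch
  import torch.nn.functional as F

  seq_len, d_model = 10, 64               # 10 tokens, each represented by a 64-dimensional vector
  x = torch.randn(seq_len, d_model)

  W_q = torch.randn(d_model, d_model)     # learned projection matrices (random stand-ins here)
  W_k = torch.randn(d_model, d_model)
  W_v = torch.randn(d_model, d_model)

  Q, K, V = x @ W_q, x @ W_k, x @ W_v     # queries, keys, values for every token
  scores = Q @ K.T / d_model ** 0.5       # every token scored against every other token
  weights = F.softmax(scores, dim=-1)     # normalised to attention weights
  out = weights @ V                       # each token's new, context-aware representation
  print(out.shape)                        # torch.Size([10, 64])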

Two variants matter. Encoder-only transformers (BERT, the early ones) read input and produce representations -- good for classification, retrieval, embedding. Decoder-only transformers (GPT, Claude, Llama) generate output token by token -- good for text generation. The encoder-decoder hybrids of the early years (T5, the original transformer for translation) are less common in 2026, mostly because pure decoder-only models with enough scale handle both tasks.

The 2017 paper used six encoder layers and six decoder layers. Frontier models in 2026 use hundreds of layers, with attention mechanisms running in parallel across many "heads" (multi-head attention), trained on context windows that have grown from the original 512 tokens to a million tokens or more in flagship models. The architectural skeleton is the same; the scale and the engineering tricks around it are radically larger.

Why deep learning won

The straightforward answer: deep learning won because, at scale, it works better than the alternatives across an extraordinarily wide range of problems. The interesting question is why.

The traditional ML pipeline involved hand-engineered features. A computer vision researcher would spend years figuring out which mathematical transformations of an image would produce features useful for classification, then train a relatively simple classifier (an SVM, say) on top of those hand-engineered features. The performance of the system was limited by the quality of the features, which were limited by human ingenuity.

Deep learning replaces the feature engineering with feature learning. The early layers of a deep network learn features automatically from data; the later layers compose those features into the final prediction. This means the system can find features that humans would not have thought of, and -- when the data is large enough -- consistently outperforms the hand-engineered version.

The 2012 AlexNet result was the first time this advantage was decisive enough to be undeniable. Within five years, every major perceptual task -- vision, speech, text, audio -- had fallen the same way. Within ten years, the same lesson had been generalised: any task with enough data and compute is best solved by a sufficiently deep neural network, often a transformer.

The pattern is sometimes called the "bitter lesson" after Rich Sutton's 2019 essay: methods that scale with compute consistently outperform methods that rely on human cleverness about the structure of the problem. Deep learning is the most general, scalable method we have. Until something more general comes along, it will continue to dominate.

For a focused look at the breakthroughs that defined the modern era, see our history of AI.

What production ML actually looks like

Almost every guide to machine learning focuses on the modelling, which is the easy part. The hard part is everything else around it: getting the data, validating that the model still works after deployment, and dealing with the failure modes that only appear at scale. The unglamorous discipline is called MLOps, and most of an ML engineer's time goes into it.

A typical production pipeline has four stages beyond the modelling itself.

Data engineering. Production data is messier than the curated datasets in research papers. It comes in from logging systems, customer-facing applications, third-party APIs, and partner integrations, with inconsistent formats, missing fields, and silent schema changes. Most ML projects fail at this layer rather than at the modelling layer; cleaning, validating, and consistently joining the data is genuinely hard work and rarely glamorous.

Training infrastructure. Reliable, reproducible training runs require version-controlled code, version-controlled data, deterministic environments (Docker, Kubernetes), and the ability to track experiments (MLflow, Weights & Biases, etc.). When you cannot reproduce last quarter's training run, you cannot diagnose why this quarter's run is worse.

Evaluation and monitoring. A model that hits 95% accuracy on the held-out test set might hit 80% on real production traffic if the production distribution has shifted. You need monitoring that compares model predictions to ground-truth outcomes as they arrive, with alerts when performance drifts. For systems where ground truth is delayed (a fraud model only knows it was right or wrong after a chargeback), you need proxies that flag distributional changes faster than the ground truth arrives.
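
A crude but common proxy is to compare the distribution of the model's scores on live traffic against the distribution captured at deployment time. A sketch with a two-sample test; the synthetic arrays below stand in for real logged scores:

  import numpy as np
  from scipy.stats import ks_2samp

  # Stand-ins for logged data: scores at deployment time vs scores on this week's traffic.
  reference_scores = np.random.beta(2, 5, size=10_000)
  live_scores = np.random.beta(2, 3, size=10_000)

  stat, p_value = ks_2samp(reference_scores, live_scores)   # has the distribution shifted?
  if p_value < 0.01:
      print("prediction distribution has drifted -- investigate before trusting the model")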

Retraining and model versioning. Models go stale. The world changes; user behaviour shifts; the distribution of inputs evolves. Production ML systems retrain on a schedule, validate the new model against the current one, and roll out the new version with the ability to roll back. Each new model needs an audit trail showing what data it was trained on, what hyperparameters were used, and how it performed on the validation set.

The modelling work is interesting and gets the headlines. The pipeline work is what determines whether the model actually delivers value over time. Engineers who can do both are scarce and expensive; the gap between research-quality ML and production-quality ML remains one of the field's open problems.

Frequently asked questions

Do I need to know math to use machine learning?

To use it as a product, no. To build production ML systems at scale, you need comfort with linear algebra, probability and basic calculus, plus a working knowledge of the practical tools (Python, PyTorch or JAX, the Hugging Face ecosystem). To do research at the frontier, you need substantially more depth. The amount of math required climbs steeply with how deep into the technical work you go.

What is the difference between training and inference?

Training is the process of adjusting the model's parameters so it performs well on the data. Inference is using the trained model to make predictions. Training is rare, expensive, and done once (or periodically). Inference is frequent, cheap per query, and done every time someone uses the model. The economics of running an LLM service in production is mostly inference economics; the economics of building a frontier model is mostly training economics.

What is a parameter?

A parameter is one of the numbers inside the model that gets adjusted during training. A model with a billion parameters has a billion of these numbers, all of which were initialised randomly and gradually adjusted to make the model's predictions match the training data. Frontier models in 2026 have hundreds of billions to over a trillion parameters. The number alone does not tell you how good the model is, but it correlates loosely with capability.
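
For a concrete sense of what that means, counting the parameters of a model in PyTorch is one line; the small network here stands in for any model:

  from torch import nn

  model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
  print(sum(p.numel() for p in model.parameters()))   # 203,530 adjustable numbers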

Why is more data usually better?

More data exposes the model to more of the distribution of cases it might encounter. More data also reduces overfitting -- a model trained on a million examples is less likely to memorise idiosyncrasies of the training set than a model trained on a thousand. There are diminishing returns: doubling the data does not double the performance, and at some point data quality matters more than quantity. But across the range that most projects operate in, more good data beats almost every other intervention.

Can I train a model on my laptop?

Yes, for small models and small datasets. Classical machine learning on tabular data (decision trees, gradient-boosted methods, simple neural networks) runs comfortably on a laptop. Fine-tuning a pre-trained model with a few thousand examples works on a laptop with a decent GPU. Training a frontier model from scratch requires a cluster of thousands of GPUs and is out of reach for individuals; almost everyone in the field starts from a pretrained model and adapts it.

What is the difference between fine-tuning and prompting?

Prompting is showing the model what you want via the input you give it at inference time. The model's weights do not change. Fine-tuning is continuing the training process on a smaller, task-specific dataset to adjust the model's weights. Prompting is fast and cheap; fine-tuning is more expensive but produces better behaviour for narrow specialised tasks. The 2026 default is to start with prompting (often combined with retrieval) and only fine-tune when you have a specific reason to.

How do you evaluate whether an ML model is good?

You measure its performance on data it has never seen during training -- a held-out validation set or test set. The metric depends on the task: accuracy for classification, mean squared error for regression, BLEU for translation, human ratings for open-ended generation, task-specific benchmarks for many others. Good evaluation is much harder than good modelling, especially for generative systems where the right answer is not unique. A surprising fraction of ML problems are really evaluation problems wearing a modelling costume.

What does "fine-tuning a model" actually mean in practice?

You take a pretrained model (a publicly available LLM, say) and continue its training process on a dataset specific to your task or domain. The model's parameters are slightly adjusted to perform better on your data. Fine-tuning is far cheaper than training from scratch and produces models that retain general capabilities while being substantially better at the specific task. Modern variants like LoRA (low-rank adaptation) make fine-tuning cheap enough to be practical for individual developers.
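
A minimal full fine-tuning sketch with the Hugging Face transformers library, using GPT-2 as a stand-in pretrained model and one invented training example. Real pipelines batch the data, run multiple epochs and increasingly swap full fine-tuning for parameter-efficient methods like LoRA:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained("gpt2")
  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  optimiser = torch.optim.AdamW(model.parameters(), lr=5e-5)   # small learning rate: adjust, don't overwrite

  examples = ["Q: What is the refund window?\nA: Thirty days, no questions asked."]  # your task-specific data
  for text in examples:
      batch = tokenizer(text, return_tensors="pt")
      loss = model(**batch, labels=batch["input_ids"]).loss    # same next-token loss as pretraining
      optimiser.zero_grad()
      loss.backward()
      optimiser.step()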

Why are GPUs important for machine learning?

GPUs were originally designed for graphics rendering, which is dominated by parallel matrix operations. Neural network training is also dominated by parallel matrix operations, which makes the same hardware orders of magnitude faster than a CPU for the same work. Training a frontier model on CPUs would take centuries; on a cluster of NVIDIA H100s it takes weeks. The 2020s AI boom is in large part the GPU industry's boom; NVIDIA's market capitalisation went from under $200B in 2020 to several trillion in 2024-2025 on the back of this dynamic.

How do I choose between deep learning and classical ML for a problem?

If your data is tabular and modest in size (thousands to low millions of rows), gradient-boosted trees (XGBoost, LightGBM, CatBoost) almost always beat deep learning. If your data is unstructured -- text, images, audio, video -- and you have at least tens of thousands of examples, deep learning is the right tool. If you have a small unstructured dataset, fine-tune a pretrained model rather than training from scratch. The classical-ML versus deep-learning choice tracks the data type more than the data size.

The bottom line

Machine learning is the discipline of building systems that learn from data. Deep learning is the version of it that has worked so well it has become the default. The mechanics are not magical -- a training loop, a loss function, gradient descent, a neural network architecture -- and once you have them in your head, the field is much easier to follow. The leverage from understanding even the basics is high: it lets you read papers and product announcements with the right context, evaluate vendor claims with the right scepticism, and choose the right method for the problem in front of you. Read the related guides on generative AI and AGI for where the field is going next, and our prompt engineering hub for the practical skill of getting the most out of these systems day to day.

Last updated: May 2026