AI Glossary: 200+ Terms Every Practitioner Should Know in 2026
The vocabulary of AI turns over remarkably fast. The terms that mattered in 2020 are still in the lexicon; many of the ones that matter in 2026 (RAG, MoE, agentic, RLHF, vision transformer) were niche research vocabulary only a few years ago. Reading the field requires fluency with the whole stack of vocabulary at once: classical machine learning concepts that still apply, deep learning concepts that took over, and the newer transformer-and-LLM-specific terms that dominate current product announcements. This glossary is the reference we use across our hub. Every term has a 30-80 word definition, a concrete example where helpful, and a link to a deeper article when one exists. Terms are alphabetised; jump links sit at the top so you can search quickly. The list runs to over 200 entries because the field is genuinely wide; you are not expected to read it cover to cover.
How to use this glossary
Use it as a reference, not a reading. Skim the index. Jump to the letter you need. When you encounter a term in another article that you are not sure about, search this page. Most terms link out to either a deeper article in our hub or to an authoritative source. The definitions are deliberately tight -- they aim for "enough to read the next paragraph confidently" rather than "everything you would ever need to know about this concept".
Index
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
A
Agent
An AI system that takes multi-step actions in an environment to achieve a goal, rather than just answering a single question. Agents in 2026 typically combine an LLM with tool use, memory, and a planning loop. Examples: Cursor's coding agent, Devin, the agent layer in ChatGPT and Claude. See our AI agents hub.
Agentic workflow
A pattern where an LLM is wrapped in a loop that can call tools, check outcomes, and decide on the next action. Agentic workflows extend the model from "single response" to "multi-step task". Reliability depends on careful scoping; agents fail in compounding ways when early steps go wrong.
AGI (Artificial General Intelligence)
A system that matches human-level competence across the full range of intellectual tasks rather than excelling at one. The definition is contested; serious working definitions include OpenAI's "outperforms humans at most economically valuable work" and DeepMind's levels framework. See our AGI explainer.
AlexNet
The convolutional neural network from Hinton's group at Toronto that won the 2012 ImageNet competition by a landslide, kicking off the modern deep learning era. AlexNet ran on two NVIDIA GPUs, used ReLU activations, and beat the runner-up's top-5 error rate by more than 10 percentage points.
Alignment
The research problem of making AI systems pursue the goals their designers intended rather than goals that look similar but produce harmful behaviour at scale. Subfields include RLHF, Constitutional AI, interpretability, and red-teaming. OpenAI's superalignment programme and preparedness framework and Anthropic's Responsible Scaling Policy are among the most-cited operational efforts.
AlphaFold
DeepMind's system for predicting protein 3D structure from amino acid sequence. AlphaFold 2 (2020) effectively solved the long-standing protein-folding problem; AlphaFold 3 (2024) extends predictions to protein-ligand and protein-DNA complexes. Probably the highest-impact application of deep learning to biology so far.
AlphaGo
DeepMind's Go-playing system, which defeated world champion Lee Sedol 4-1 in March 2016. AlphaGo combined deep neural networks with Monte Carlo tree search and reinforcement learning. Its successor AlphaZero (2017) learned chess, shogi and Go from scratch using self-play and the game rules alone.
Algorithm
A finite sequence of well-defined steps to solve a problem. In ML contexts, "algorithm" often refers to the training procedure (gradient descent, k-means, backpropagation) rather than the trained model. The popular usage of "algorithm" to mean any opaque software system blurs the technical meaning.
Anthropic
AI safety lab founded in 2021 by former OpenAI researchers. Anthropic produces the Claude family of models and is the source of Constitutional AI, the Responsible Scaling Policy framework, and significant interpretability research. Claude 4 Opus is its frontier model as of mid-2026.
API
Application Programming Interface. The way most production AI is consumed: developers send requests to an API endpoint, receive responses, and build features on top. OpenAI, Anthropic, Google and the open-weights model hosts (Together, Replicate, OpenRouter) all expose APIs. API design is itself now a competitive axis.
ASI (Artificial Superintelligence)
A hypothetical AI system that exceeds humans at essentially all cognitive tasks, including the task of building better systems. ASI is speculative; the question of whether the path from AGI to ASI is short, long, or impossible is one of the most important unknowns in the field.
Attention
The mechanism that lets a neural network selectively focus on relevant parts of its input. Self-attention, used in transformers, computes pairwise relevance between every position in the input sequence. The 2017 paper "Attention Is All You Need" introduced the transformer architecture built entirely on attention.
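The pairwise-relevance idea can be sketched in a few lines of plain Python. This is a deliberately minimal toy, not a real implementation: the query/key/value projections are the identity, there is a single head, and the vectors are tiny.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Toy self-attention: every position attends to every position.

    X is a list of d-dimensional vectors, one per sequence position.
    Relevance is the scaled dot product between positions; each output
    is a softmax-weighted average of all the input vectors.
    """
    d = len(X[0])
    out = []
    for q in X:                                    # one query per position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]                      # pairwise relevance
        weights = softmax(scores)                  # attention distribution
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])            # weighted sum of values
    return out

Y = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output vector is a convex combination of the inputs, so positions that score highly against the query dominate the result.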
Autoencoder
A neural network trained to reproduce its input through a compressed bottleneck representation. Used for dimensionality reduction, denoising, and as a building block in generative models. Variational autoencoders (VAEs) are the probabilistic version that can generate new samples.
Autoregressive model
A model that generates a sequence one element at a time, conditioning each prediction on the elements generated so far. GPT-style language models are autoregressive: they predict the next token given all previous tokens. Most modern text generation is autoregressive; image and video generation often combine autoregressive and diffusion approaches.
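The one-element-at-a-time loop is easy to see with the smallest possible autoregressive model: a bigram counter that conditions each prediction on just the previous token. This is an illustrative toy, not how LLMs are built, but the generation loop has the same shape.

```python
import random

def train_bigram(tokens):
    """Count next-token frequencies: a minimal autoregressive model
    that conditions on only the single previous token."""
    counts = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts.setdefault(prev, {}).setdefault(nxt, 0)
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, n, seed=0):
    """Sample one token at a time, each conditioned on what came before."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(n):
        dist = counts.get(seq[-1])
        if not dist:        # no known continuation: stop early
            break
        toks, freqs = zip(*dist.items())
        seq.append(rng.choices(toks, weights=freqs)[0])
    return seq

corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
sample = generate(model, "the", 5)
```

A GPT-style model replaces the count table with a neural network conditioned on the entire preceding context, but the sampling loop is the same pattern.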
B
Backpropagation
The algorithm that efficiently computes gradients in a neural network by applying the chain rule of calculus through every layer in a single backward pass. Without backpropagation, training deep networks would be computationally intractable. Popularised by Rumelhart, Hinton and Williams in 1986.
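The chain-rule mechanics can be written out by hand for a two-parameter network, with a finite-difference check confirming the gradient. A sketch with made-up values, not a general implementation:

```python
import math

def forward_backward(x, t, w1, w2):
    """Tiny network y = w2 * tanh(w1 * x), squared-error loss.
    The backward pass applies the chain rule layer by layer."""
    h = math.tanh(w1 * x)          # forward: hidden activation
    y = w2 * h                     # forward: output
    loss = (y - t) ** 2
    dy = 2 * (y - t)               # dL/dy
    dw2 = dy * h                   # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2                   # dL/dh
    dw1 = dh * (1 - h * h) * x     # chain through tanh' = 1 - tanh^2
    return loss, dw1, dw2

loss, dw1, dw2 = forward_backward(x=0.5, t=1.0, w1=0.3, w2=0.8)

# Sanity check: dw1 should match a finite-difference approximation.
eps = 1e-6
lp, _, _ = forward_backward(0.5, 1.0, 0.3 + eps, 0.8)
lm, _, _ = forward_backward(0.5, 1.0, 0.3 - eps, 0.8)
numeric = (lp - lm) / (2 * eps)
```

The efficiency claim is that the backward pass computes every parameter's gradient in one sweep, instead of one finite-difference evaluation per parameter.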
Base model
A pretrained language model before instruction-tuning or RLHF. Base models are competent at language modelling but rarely useful as products; they need post-training to follow instructions reliably. Llama base models, GPT-3 davinci, and the various open base models are starting points for further fine-tuning.
Batch
A subset of training examples processed together in a single forward and backward pass. Batches are used both for memory reasons (whole datasets do not fit in GPU memory) and statistical reasons (batch gradients are smoother than single-example gradients). Batch size is a key hyperparameter.
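Slicing a dataset into batches is a one-liner generator; a minimal sketch:

```python
def minibatches(examples, batch_size):
    """Yield successive batches; the last one may be smaller."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]

batches = list(minibatches(list(range(10)), batch_size=4))
```

In training, each batch feeds one forward/backward pass, and the gradients of its examples are averaged before the parameter update.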
Batch normalization
A technique that normalises the inputs to each layer of a neural network, stabilising training and allowing higher learning rates. Introduced by Ioffe and Szegedy in 2015, it became standard in CNNs. Layer normalisation is the variant typically used in transformers.
Bayesian methods
An approach to ML that explicitly models uncertainty in parameters as probability distributions, updating those distributions as new data arrives. Bayesian methods are mathematically principled but computationally expensive; in 2026 they are used for specialised applications (small data, safety-critical contexts) more than mainstream deep learning.
Benchmark
A standardised evaluation dataset and metric used to compare models. Major 2026 benchmarks include MMLU (knowledge), SWE-bench Verified (coding), GPQA Diamond (graduate science), ARC-AGI-2 (reasoning), and FrontierMath (research-level math). Benchmarks saturate over time and need replacement.
BERT
Bidirectional Encoder Representations from Transformers. Google's 2018 model that became the foundation for most NLP tasks for several years. BERT is an encoder-only transformer trained with masked language modelling. Largely supplanted by decoder-only models in 2026 but still widely used in retrieval and embedding tasks.
Bias (ML)
Systematic difference in model behaviour across groups, in ways that produce unfair outcomes. The technical term is also used neutrally for the additive constant in a linear model; context disambiguates. See our guide to AI ethics, bias and best practices.
Black-box model
A model whose internal reasoning is opaque -- you can see inputs and outputs but not how the conclusion was reached. Most deep learning systems are black-box in practice. Explainability methods (SHAP, LIME, attention visualisation) provide partial post-hoc transparency.
C
Catastrophic forgetting
The tendency of neural networks to lose previously learned skills when trained on new tasks. Catastrophic forgetting is a major obstacle to continual learning and to fine-tuning without regression. Workarounds include rehearsal, elastic weight consolidation, and parameter-efficient fine-tuning techniques like LoRA.
Chain-of-thought (CoT)
A prompting technique where the model is encouraged to produce explicit reasoning steps before answering. Chain-of-thought substantially improves performance on multi-step problems and underlies the "reasoning models" of 2024-2026 (OpenAI's o-series, DeepSeek-R1). Both a prompt pattern and a training target.
Chatbot
A conversational AI interface, typically a wrapper around an LLM. ChatGPT, Claude.ai, Gemini and the various consumer assistants are all chatbots in this sense. The chatbot pattern dominates consumer-facing AI in 2026 even as the underlying capabilities expand.
ChatGPT
OpenAI's chat product, launched in November 2022 and the breakthrough moment for consumer AI. ChatGPT was originally built on GPT-3.5; it has since been updated to GPT-4, GPT-4o, and the GPT-5 generation. The product still drives the majority of consumer AI conversations as of mid-2026.
Chinchilla
DeepMind's 2022 paper showing that earlier scaling laws had under-allocated training tokens relative to model size. Chinchilla's revised scaling law (more data per parameter) has informed every subsequent training run. The "Chinchilla-optimal" terminology refers to compute-efficient training mixes.
Chinese Room
John Searle's 1980 thought experiment arguing that a system manipulating symbols according to rules does not "understand" the symbols, regardless of how convincing its outputs are. Cited in arguments about whether LLMs genuinely understand language. The argument is philosophically contested and increasingly considered orthogonal to the engineering question.
Classification
The supervised learning task of predicting which category an input belongs to. Binary classification has two classes (spam/not-spam); multi-class has many (image labelling, sentiment categories); multi-label allows multiple categories per input. Classification is the most common ML task in production.
Claude
Anthropic's family of LLMs. Claude 3 (2024), Claude 3.5 Sonnet (2024), and Claude 4 Opus (2025) are the major generations. Claude is known for long context windows, strong coding ability, and Constitutional AI training. Available via Anthropic's API and Amazon Bedrock.
Clustering
The unsupervised learning task of grouping similar examples without predefined categories. Classical algorithms include k-means, DBSCAN, and hierarchical clustering. Modern uses include customer segmentation, document organisation, and exploratory analysis of large unlabelled datasets.
CNN (Convolutional Neural Network)
A neural network architecture using convolutional layers, designed for grid-structured data like images. CNNs dominated computer vision from AlexNet (2012) until vision transformers (2020). Still widely used in production for vision tasks where transformer overhead is unjustified.
COMPAS
A US recidivism risk-score system investigated by ProPublica in 2016. The investigation found racially disparate error rates; the response from the vendor highlighted that no algorithm can satisfy both group calibration and equal error rates when base rates differ. The case is the canonical AI fairness example.
Compute
The amount of computational work used to train or run a model, typically measured in FLOPs or in GPU-hours. Compute is one of the three scaling axes (with data and parameters) for modern AI. Frontier 2025-2026 training runs use compute measured in 10^25-10^26 FLOPs.
Constitutional AI
Anthropic's training approach where the model is fine-tuned using a written set of principles rather than only human feedback. Constitutional AI reduces reliance on human labelling for safety training and produces more transparent training objectives. The approach has been adopted in modified form by other labs.
Context window
The maximum number of tokens an LLM can attend to in a single inference. Context windows have grown from 2048 tokens in GPT-3 to over a million in flagship 2025-2026 models. Larger windows enable longer documents and richer agentic state but cost more and can degrade attention quality.
Continual learning
The research area of training models that can learn new tasks without forgetting old ones. Catastrophic forgetting is the main obstacle. Production deployments typically work around the problem by retraining from scratch or by carefully restricting fine-tuning rather than truly continual updates.
Convolution
A mathematical operation that slides a small filter across an input, producing a feature map of activations. The core operation of CNNs. Convolutions exploit spatial locality (nearby pixels are related) and translation invariance (a feature looks the same regardless of where it appears in the image).
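The sliding-filter operation fits in a short pure-Python function. One detail worth knowing: deep-learning "convolution" is technically cross-correlation (the kernel is not flipped), which is what this toy implements.

```python
def conv2d(image, kernel):
    """Slide a small kernel over a 2-D grid, summing elementwise products
    at each position ('valid' padding: output shrinks by kernel size - 1)."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A vertical-edge detector responds where values change left to right.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = conv2d(img, [[-1, 1],
                    [-1, 1]])
```

The output lights up (value 2) exactly at the column where the image jumps from 0 to 1, illustrating translation invariance: the same filter finds the edge wherever it sits.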
D
Data augmentation
Techniques for synthetically expanding a training dataset by transforming existing examples (rotating images, paraphrasing text, adding noise to audio). Data augmentation is one of the most reliable interventions for improving model performance when collecting more real data is infeasible.
Data labelling
The process of annotating raw data with target labels for supervised learning. Labelling is often the largest cost in an ML project. Specialised labelling companies (Scale AI, Labelbox) and crowdsourcing platforms (Amazon Mechanical Turk) provide infrastructure; LLM-assisted labelling is increasingly common in 2026.
Dataset
A collection of examples used to train, validate or test a model. Modern frontier model pretraining datasets contain trillions of tokens of text, hundreds of millions of images, or terabytes of video. The composition of the dataset is often the most important factor in model behaviour.
Decoder
The part of a sequence-to-sequence model that generates outputs. In transformer terminology, decoder-only models (GPT, Claude, Llama) generate text autoregressively; encoder-decoder models (T5, the original transformer) read input with the encoder and generate output with the decoder.
Deep learning
The subset of machine learning that uses neural networks with many layers. The "deep" refers to depth of the network. Almost every AI breakthrough since 2012 is deep learning. See our guide to how ML and deep learning work.
Deepfake
Synthetic media that convincingly impersonates a real person, typically a video where someone appears to say or do something they did not. The 2024-2026 wave of generative video has made deepfakes substantially easier to produce. Detection methods exist but trail offensive techniques by a meaningful gap.
DeepMind
Google's AI research lab, originally founded in London in 2010 and acquired by Google in 2014. DeepMind produced AlphaGo, AlphaFold, AlphaZero, and the Gemini family of models. Merged with Google Brain in 2023 into Google DeepMind.
Diffusion model
A generative model that learns to reverse a noise process: given a fully noisy input, gradually denoise it to produce a sample. Diffusion models dominate image generation (Stable Diffusion, DALL-E, Midjourney) and video generation. Latent diffusion runs the process in a compressed latent space for efficiency.
Discriminative model
A model that learns the conditional probability of a label given an input -- "given this email, is it spam?". Contrasted with generative models, which learn the joint distribution. Most classification systems are discriminative.
Distillation
Training a small model to mimic the outputs of a larger one. Knowledge distillation produces faster, cheaper deployed models that retain most of the larger model's capability. Many production-tier models in 2026 are distilled from frontier models.
Distribution shift
When the distribution of data the model sees in production differs from the distribution it was trained on. Distribution shift is the most common cause of production model degradation. Monitoring for distribution shift is part of standard MLOps practice.
Dropout
A regularisation technique that randomly zeroes out a fraction of neurons during training to prevent overfitting. Introduced by Hinton's group (2012 preprint; Srivastava et al., JMLR 2014). Dropout is less essential in modern transformers but still used in many architectures.
E
Embedding
A dense vector representation of an input (a word, sentence, document, image) in a learned space where similar inputs are close together. Embeddings power semantic search, recommendation, retrieval-augmented generation, and most cross-modal applications. Modern embedding models (OpenAI's text-embedding-3, Cohere's Embed, Voyage AI's models) are themselves billion-parameter networks.
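"Similar inputs are close together" is usually measured with cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen so "cat" and "dog" point similarly.
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}
query = emb["cat"]
ranked = sorted(emb, key=lambda w: cosine_similarity(query, emb[w]),
                reverse=True)
```

Semantic search is essentially this ranking at scale, with approximate nearest-neighbour indexes replacing the brute-force sort.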
Emergent capability
A capability that appears in larger models but not smaller ones, even when both are trained the same way. Few-shot learning, chain-of-thought reasoning, and instruction-following are commonly cited examples. Whether emergence is genuinely discontinuous or an artefact of metric choice is contested.
Encoder
The part of a sequence-to-sequence model that reads input and produces a representation. Encoder-only models (BERT) produce rich representations for downstream tasks like classification or retrieval. Encoder-decoder models pair the encoder with a decoder for generation tasks.
Ensemble
Combining predictions from multiple models to improve performance. Common in classical ML (random forests, gradient boosting) and competition contexts. Less common in deep learning because individual models are already large; mixture-of-experts is the modern equivalent.
Epoch
One complete pass through the training dataset. Models are typically trained for many epochs, with the loss decreasing over time. Frontier-scale LLM pretraining often runs for less than one epoch on the full corpus because the corpus is so large.
Evaluation
Measuring a model's performance on data not used for training. Evaluation is harder than training for many modern systems, particularly generative ones where the right answer is not unique. Benchmark saturation, judge-model bias, and prompt sensitivity are active evaluation problems in 2026.
Explainability (XAI)
Techniques for making AI predictions more interpretable to humans. Methods include SHAP, LIME, attention visualisation, and chain-of-thought. Explainability provides post-hoc explanations rather than fundamental transparency; it helps in audit and debugging contexts but does not solve interpretability in any deep sense.
Expert system
A 1980s-era AI approach that encoded human expert knowledge as if-then rules. Expert systems were the dominant commercial AI of the decade until the second AI winter. The approach is still used in narrow domains (medical decision support, certain compliance systems) where rules are well-specified.
F
Feature
An input variable used by a model. Classical ML required hand-engineered features (extract these properties from each example). Deep learning learns features automatically from raw inputs. The shift from feature engineering to feature learning is one of the central reasons deep learning won.
Feature engineering
The classical-ML practice of designing input features manually before training a model. Still important in tabular ML and when domain knowledge is rich. Largely obviated by deep learning for unstructured data (text, images, audio).
Few-shot learning
The ability to learn a new task from a small number of examples shown in the prompt. GPT-3 (2020) was the first model that demonstrated few-shot learning convincingly. Modern LLMs handle few-shot tasks routinely; this is the technique behind most prompt engineering patterns.
Fine-tuning
Continuing the training process on a task-specific dataset to adapt a pretrained model. Fine-tuning is far cheaper than training from scratch and produces models with retained general capabilities plus specific competence. Modern variants (LoRA, QLoRA) reduce the cost dramatically.
Foundation model
A model trained on broad data that serves as the basis for many downstream applications. The term, popularised by Stanford's HAI in 2021, describes the GPT/Claude/Gemini class of systems. Foundation models are characterised by scale, generality, and the ability to be adapted to many tasks.
Forward pass
Running input through a neural network to produce an output. Contrasted with the backward pass, which computes gradients. Inference is forward-pass-only; training involves both forward and backward passes per training step.
FrontierMath
A 2024 benchmark of research-grade mathematics problems specifically designed to resist saturation by current LLMs. As of mid-2026, frontier models solve a small fraction of the problems. FrontierMath is one of the most-watched benchmarks for measuring progress on hard reasoning.
F-score (F1)
A classification metric that combines precision and recall into a single number. F1 is the harmonic mean of precision and recall. Useful when classes are imbalanced and accuracy alone is misleading.
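The computation, with illustrative confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of everything flagged, how much was right
    recall = tp / (tp + fn)      # of everything real, how much was found
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
score = f1_score(tp=8, fp=2, fn=4)   # precision 0.8, recall ~0.67
```

The harmonic mean punishes imbalance: a model with 1.0 precision but 0.1 recall scores far below their arithmetic average.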
G
GAN (Generative Adversarial Network)
A generative model architecture from 2014 in which a generator network and a discriminator network train against each other. GANs produced the first photoreal AI-generated faces. Largely displaced by diffusion models for image generation in 2022-2023 but still used in specialised contexts.
Gemini
Google DeepMind's family of multi-modal LLMs. Gemini 1 (December 2023), Gemini 1.5 Pro (February 2024), Gemini 2.0 (December 2024), and Gemini 2.5 Pro (2025) are the major releases. Strong on long context (multi-million tokens) and integrated tool use.
Generative AI
The class of AI systems that produce content (text, images, audio, video, code) rather than classifying inputs. Generative AI is the dominant 2022-onwards wave, built on transformers and diffusion models. See our generative AI guide.
Generalisation
The ability of a model to perform well on data it did not see during training. Generalisation is the actual goal of ML; minimising training loss is just an instrumental step. Models that memorise training data without generalising are said to be overfitting.
GPT (Generative Pretrained Transformer)
OpenAI's family of LLMs. GPT-1 (2018), GPT-2 (2019), GPT-3 (2020), GPT-3.5 (2022, ChatGPT), GPT-4 (2023), GPT-4o (2024), GPT-5 (2025). Each generation has used roughly an order of magnitude more training compute than the last; parameter counts stopped being disclosed after GPT-3.
GPU (Graphics Processing Unit)
The parallel-computing hardware that makes modern AI economical. GPUs were designed for graphics rendering but turn out to excel at the matrix operations underlying neural network training and inference. NVIDIA dominates the AI GPU market; the H100 and Blackwell B200 generations are current frontier hardware.
Gradient descent
The algorithm by which model parameters are adjusted to reduce loss. The model computes the gradient of the loss with respect to each parameter, then takes a small step in the opposite direction. Stochastic gradient descent (SGD) and its variants (Adam, AdamW) are the workhorses of deep learning.
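The update rule in miniature, minimising a function whose gradient we can write by hand. An illustrative sketch (real training uses stochastic gradients and adaptive optimisers like Adam):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step a small amount opposite the gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimise f(x) = (x - 3)^2; its gradient is 2(x - 3); the minimum is x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Neural network training is this same loop with millions to trillions of parameters, where backpropagation supplies the gradient.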
Grok
xAI's family of LLMs, integrated into the X (Twitter) platform and available via xAI's consumer app. Grok-3 and the subsequent generations compete with the OpenAI/Anthropic/Google frontier; the lab's positioning emphasises real-time information access through X integration.
Ground truth
The actual correct answer for an example, against which model predictions are compared. Producing reliable ground truth is one of the harder parts of any ML project; in many real applications "ground truth" is itself a label produced by humans with their own biases.
Guardrails
Constraints applied to AI system inputs or outputs to prevent harmful behaviour. Common guardrails include content filters, output validators, structured-output schemas, and refusal patterns trained into the model. Guardrails complement rather than replace alignment work.
H
Hallucination
An LLM's confident production of false information. Hallucination remains the most serious reliability problem for LLMs in 2026, despite substantial reduction through retrieval-augmented generation, tool use, and reasoning techniques. Particularly dangerous in domains where wrong answers carry real cost (legal, medical, financial).
Hidden layer
A layer in a neural network that is neither input nor output. Hidden layers are where representation learning happens; the depth of the network is the number of hidden layers. Frontier LLMs have hundreds of hidden layers.
Hugging Face
A company hosting the largest open repository of ML models, datasets and demos. Hugging Face has become infrastructure for the open-source AI ecosystem; the Transformers library and the model hub are used by essentially every researcher and many production teams.
Hyperparameter
A configuration value set before training rather than learned during it. Examples: learning rate, batch size, number of layers, dropout rate. Hyperparameter tuning is the discipline of finding the values that produce best performance, typically via grid search, random search, or Bayesian optimisation.
I
ImageNet
The 14-million-image labelled dataset whose 2012 competition winner (AlexNet) launched the modern deep learning era. ImageNet still serves as a benchmark for image classification, though it has been largely saturated by transformer-based vision models.
In-context learning
An LLM's ability to adapt to a new task purely from examples shown in the prompt, without any weight updates. In-context learning is the technical underpinning of few-shot prompting and is a major reason LLMs are so flexible.
Inference
Using a trained model to make predictions on new data. Contrasted with training. Inference is the phase that drives most production cost; reducing inference cost (through distillation, quantisation, speculative decoding) is the main competitive axis among model providers.
Instruction tuning
Fine-tuning a base language model on examples of instruction-response pairs to make it follow instructions reliably. Instruction tuning is the first step of post-training, typically followed by RLHF. Without instruction tuning, base models are technically capable but practically unusable.
J
Jailbreak
An adversarial input that bypasses a model's safety training to elicit otherwise-refused outputs. Jailbreaks are an active research and security topic; new ones are discovered regularly, defended against, and replaced. The cat-and-mouse pattern is similar to traditional security exploit research.
K
K-means
A classic clustering algorithm that partitions data into K groups by iteratively assigning each point to the nearest cluster centroid and updating the centroids. Simple, fast, often the first thing tried for clustering tasks. Limitations include requiring K to be set in advance and assuming spherical clusters.
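The assign-then-update loop (Lloyd's algorithm) in minimal form, here on 1-D data with naive initialisation (production code would use k-means++ and multiple restarts):

```python
def kmeans(points, k, iters=20):
    """1-D k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = points[:k]                      # naive init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans(data, k=2)   # converges to roughly [1.0, 9.0]
```

With higher-dimensional data the distance becomes Euclidean and the means are computed per coordinate, but the loop is identical.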
Kernel
In classical ML, a function that computes similarity between data points in a (possibly implicit) higher-dimensional space, used in kernel methods like SVMs. In deep learning, the small filter applied in a convolution. Context disambiguates.
Knowledge graph
A structured representation of facts as a graph of entities and relationships. Used to ground LLMs with verifiable knowledge, complement retrieval, and provide reasoning structure. Wikidata, Google's Knowledge Graph, and various enterprise knowledge graphs are widely used.
L
Language model
A model that predicts the probability of sequences of words or tokens. Modern language models are large neural networks that estimate next-token probability given prior context. The term covers everything from n-gram models to GPT-5.
Latent space
The internal representation space of a generative model, where each point corresponds to a possible output. Latent diffusion models run the diffusion process in this compressed space rather than in pixel space, dramatically reducing compute requirements.
Layer
A computational block in a neural network. A typical transformer layer contains a self-attention sub-layer and a feed-forward sub-layer plus normalisation. Frontier LLMs stack hundreds of layers; the depth is part of what enables their representation capacity.
Learning rate
The size of the step gradient descent takes at each update. Too high and training diverges; too low and it crawls. Learning rate is the single most important hyperparameter to tune; modern training schedules (warmup, cosine decay) shape it dynamically over a run.
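A common warmup-plus-cosine-decay schedule can be written as a small function. The constants here (`max_lr`, `warmup`, `total`) are illustrative, not recommendations:

```python
import math

def lr_schedule(step, max_lr=3e-4, warmup=100, total=1000):
    """Linear warmup to max_lr, then cosine decay towards zero."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)   # 0 -> 1 after warmup
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

peak = lr_schedule(99)    # end of warmup: at max_lr
late = lr_schedule(999)   # near the end: close to zero
```

Warmup avoids divergence from large early steps while the optimiser's statistics are still noisy; the decay lets the model settle into a minimum.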
LIME
Local Interpretable Model-agnostic Explanations. A 2016 explainability method that fits a simple local model around an individual prediction to explain it. LIME is widely used in audit contexts and remains a standard baseline for explainability in classical ML.
Llama
Meta's family of open-weights LLMs. Llama (2023), Llama 2 (2023), Llama 3 (2024), Llama 4 (2025) are the major generations. Llama models are the dominant base for the open-source AI ecosystem, fine-tuned and deployed extensively.
LLM (Large Language Model)
A neural network with billions to trillions of parameters trained on internet-scale text. LLMs are the foundation of modern generative AI. Frontier examples include GPT-5, Claude 4 Opus, Gemini 2.5 Pro, Llama 4. The term sometimes blurs with "foundation model".
Logit
The raw, pre-softmax output of a classification model -- an unnormalised score that gets converted to a probability distribution by the softmax function. Logits are useful when you want to compute custom temperature, do beam search, or analyse model uncertainty.
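The logit-to-probability conversion, with temperature as the reshaping knob:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; low temperature sharpens
    the distribution, high temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
p_sharp = softmax(logits, temperature=0.5)   # peaked: favours the top logit
p_flat = softmax(logits, temperature=2.0)    # flatter: more exploratory
```

This is why sampling temperature changes an LLM's output style: it rescales the logits before they become the next-token distribution.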
LoRA
Low-Rank Adaptation. A fine-tuning technique that updates only a small low-rank decomposition of the model's weights, dramatically reducing the compute and storage required for fine-tuning. LoRA is the dominant practical fine-tuning approach for open-weights models in 2026.
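The core arithmetic is just a low-rank matrix product added to a frozen weight. A toy sketch with a 2x2 weight and a rank-1 update (real LoRA typically scales the update by alpha/r and applies it inside attention projections):

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, A, B, alpha=1.0):
    """Effective weight W + alpha * (A @ B); only A and B are trained.
    If W is d_out x d_in, A is d_out x r and B is r x d_in with r small,
    so trainable parameters drop from d_out*d_in to r*(d_out + d_in)."""
    delta = matmul(A, B)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0],          # frozen pretrained weight
     [0.0, 1.0]]
A = [[1.0], [2.0]]        # 2 x 1 trained factor
B = [[0.1, 0.3]]          # 1 x 2 trained factor
W_eff = lora_effective_weight(W, A, B)
```

For a 4096x4096 layer, a rank-16 adapter trains about 131k parameters instead of 16.8M, which is where the dramatic cost reduction comes from.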
Loss function
A function that measures how wrong a model's predictions are, used as the optimisation target during training. Cross-entropy for classification, mean-squared-error for regression, KL divergence for distributional alignment. Choosing the right loss is foundational to producing the right behaviour.
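Cross-entropy for a single example makes the "how wrong" intuition concrete: the loss is the negative log of the probability the model put on the correct class.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class.
    Confident-and-right is near zero; confident-and-wrong explodes."""
    return -math.log(probs[target_index])

confident = cross_entropy([0.9, 0.05, 0.05], 0)   # ~0.105
wrong = cross_entropy([0.1, 0.8, 0.1], 0)          # ~2.303
```

LLM pretraining minimises exactly this quantity averaged over next-token predictions, which is why "loss" and "perplexity" (its exponential) track model quality.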
LSTM
Long Short-Term Memory. A recurrent neural network architecture from 1997 that solved the vanishing-gradient problem in plain RNNs. LSTMs were the dominant NLP architecture from the 2000s until the 2017 transformer. Largely superseded by 2026 but still present in legacy production systems.
M
Machine learning
The subset of AI methods that learn patterns from data rather than from hand-coded rules. Includes classical ML (decision trees, SVMs, linear models) and deep learning. See our AI vs ML vs deep learning comparison.
Machine translation
The task of automatically translating text between languages. Pre-2016: statistical phrase-based methods. 2016-2017: neural machine translation with RNN encoder-decoder. 2017-onwards: transformer-based. Modern LLMs handle translation as a side effect of general training, often outperforming dedicated translation systems.
Memorisation
When a model reproduces training data verbatim rather than generalising. Memorisation is the failure mode behind training-data-leak vulnerabilities and copyright concerns. Some memorisation is unavoidable in large models; controlling it is an active research area.
Meta-learning
Learning to learn -- training models that can adapt to new tasks quickly. The dominant practical version in 2026 is in-context learning, where the LLM "learns" a new task from prompt examples. Earlier model-agnostic meta-learning (MAML) and similar techniques are still used in research contexts.
Mistral
French AI lab founded in 2023 producing open-weights LLMs (Mistral 7B, Mixtral 8x7B, Mistral Large) competitive with the closed frontier on many benchmarks. Mistral is the leading European AI lab as of 2026.
Mixture of Experts (MoE)
A neural network architecture that routes each input token to a small subset of specialised "expert" subnetworks rather than activating the entire network. MoE allows models with very large total parameter counts to run efficiently at inference, since only a fraction of the parameters are active per token. Mixtral, GPT-4 (rumoured), and several other frontier models use MoE.
MLOps
The discipline of running ML systems in production -- data pipelines, training infrastructure, deployment, monitoring, retraining. MLOps is to ML what DevOps is to software. The unglamorous but essential layer between research-quality models and production-grade services.
Model card
A standardised document describing an AI model's intended use, training data, performance metrics across subgroups, known limitations, and ethical considerations. Introduced by Mitchell et al. at Google in 2019. Standard for major model releases in 2026.
Multi-head attention
The variant of self-attention used in transformers, where multiple attention computations run in parallel with different learned projections, then combine. Multi-head attention lets the model attend to different aspects of the input simultaneously.
Multi-modal
A model that handles multiple input or output types -- text, images, audio, video. Frontier LLMs in 2026 are multi-modal by default; specialised single-modality models are now the exception. Multi-modal does not mean general; it means broader input handling.
N
Narrow AI
An AI system that performs a specific task at human level or above, but only that task. Almost all AI in production is narrow in this sense, even systems built on general-purpose foundation models. Contrasted with AGI.
Neural network
A computational model inspired by biological neurons, consisting of layers of nodes connected by weighted edges. The fundamental building block of deep learning. Modern neural networks bear little resemblance to actual brain biology beyond the original metaphor.
NIST AI RMF
National Institute of Standards and Technology AI Risk Management Framework, released January 2023. The de facto US template for AI governance, organising risk management into four functions: Govern, Map, Measure, Manage. See our ethics guide.
NLP (Natural Language Processing)
The branch of AI dealing with human language -- understanding, generating, translating, summarising. NLP shifted from rule-based and statistical methods to neural approaches in the 2010s and is now dominated by transformer-based LLMs.
O
One-shot learning
Learning a new task from a single example. The extreme version of few-shot learning. Modern LLMs handle one-shot learning routinely for tasks similar to their training distribution; it remains hard for genuinely novel tasks.
OpenAI
The AI lab that produced GPT, ChatGPT and DALL-E, founded in 2015. OpenAI's transition from non-profit to capped-profit and its product velocity have made it the most-watched lab in the field. Microsoft is its largest investor and primary infrastructure partner.
Open-weights model
A model whose trained parameters are publicly available, even if the training data and code are not. Llama, Mistral, Qwen, DeepSeek and Yi are major open-weights families. Open-weights models are sometimes called open-source, though strict open-source-software definitions are contested in the AI context.
Optimisation
The process of adjusting model parameters to minimise the loss function. Gradient descent and its variants (Adam, AdamW, Lion) are the dominant optimisers. The optimisation landscape of deep networks is high-dimensional and non-convex; remarkably, simple methods work well in practice.
Overfitting
When a model memorises training data well but generalises poorly to new examples. Caused by excessive model capacity, insufficient training data, or training too long. Mitigations include regularisation, early stopping, dropout, and data augmentation.
P
Parameter
A learnable value inside a model. Frontier LLMs in 2026 have hundreds of billions to over a trillion parameters. Parameter count correlates loosely with capability but is not a complete measure; data quality, training procedure and architecture all matter.
Perceptron
The simplest neural network, proposed by Rosenblatt in 1958. A single-layer perceptron can only learn linearly separable functions, a limitation highlighted in Minsky and Papert's 1969 book that contributed to the first AI winter. Multi-layer perceptrons solve the limitation but require backpropagation to train.
Positional encoding
A technique transformers use to give the model information about the position of tokens in the sequence, since self-attention itself is order-invariant. Modern variants include rotary position embedding (RoPE) and ALiBi, which generalise better to longer sequences than the original sinusoidal scheme.
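As an illustration of the original sinusoidal scheme, here is a minimal pure-Python sketch (the function name is ours, not from any library):

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Build the original transformer positional-encoding table.

    Each position gets a vector of sines and cosines whose frequencies
    fall off geometrically, so nearby positions get similar vectors.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Pairs of dimensions share a frequency: sin on even, cos on odd.
            freq = 1.0 / (10000 ** (2 * (i // 2) / d_model))
            angle = pos * freq
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

The table is simply added to the token embeddings before the first transformer layer; RoPE and ALiBi instead inject position information inside the attention computation itself.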
Precision and recall
Two complementary classification metrics. Precision: of predicted positives, how many were actually positive? Recall: of actual positives, how many were correctly predicted? The trade-off between them is fundamental; F1 combines them. Disaggregating them across subgroups is a basic bias-audit step.
Pretraining
The first and most expensive stage of training a foundation model: training on broad data with a self-supervised objective like next-token prediction. Pretraining produces a base model with broad competence; post-training adapts it to be useful.
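The arithmetic behind precision, recall and F1 fits in a few lines of pure Python (the helper name is ours, for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Three real positives; the model found two of them plus one false alarm.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
# p = r = f1 = 2/3 here
```

Running the same computation per demographic subgroup, rather than once over the whole test set, is the disaggregation step mentioned above.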
Prompt
The input given to an LLM, including instructions, context, examples and the actual query. Prompts are the primary interface to modern AI systems for non-developers. The discipline of writing them well is prompt engineering.
Prompt engineering
The discipline of writing prompts that elicit the desired behaviour from an LLM. Patterns include role context, few-shot examples, chain-of-thought, structured output schemas, and explicit task specification. See our prompt engineering hub.
Prompt injection
An attack where adversarial content in user input or retrieved documents hijacks an LLM's instructions. Prompt injection is one of the most serious AI security vulnerabilities and remains an open problem. Defences include input filtering, instruction hierarchies, and output validation.
Pruning
Removing parameters or connections from a trained model to make it smaller and faster, typically with minimal loss in accuracy. Pruning is part of the inference-optimisation toolkit alongside quantisation and distillation.
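The simplest variant, magnitude pruning, can be sketched in pure Python (the function name is ours; real pipelines prune per-layer and usually retrain afterwards to recover accuracy):

```python
def magnitude_prune(weights, fraction):
    """Zero out roughly the smallest-magnitude `fraction` of weights."""
    n_prune = int(len(weights) * fraction)
    if n_prune == 0:
        return list(weights)
    # Threshold at the n-th smallest magnitude; ties at the threshold also go.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.5, -0.01, 0.3, 0.02], 0.5)  # [0.5, 0.0, 0.3, 0.0]
```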
Q
Quantisation
Reducing the numerical precision of model weights and activations (from 32-bit floats to 8-bit integers, for example) to reduce memory and speed up inference. Quantisation has become essential for deploying frontier-tier models on consumer hardware. 4-bit quantisation is now standard for many open-weights deployments.
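The core idea fits in a few lines -- symmetric int8 quantisation with a single scale factor (a sketch with names of our choosing; production schemes add per-channel scales, zero points, and calibration data):

```python
def quantise_int8(weights):
    """Map floats to int8 with a single symmetric scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0                 # largest weight maps to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats; error is at most scale / 2 per weight."""
    return [x * scale for x in q]

q, scale = quantise_int8([0.42, -1.27, 0.003, 0.9])
restored = dequantise(q, scale)             # close to the originals
```

The memory saving is the point: each weight drops from 4 bytes (float32) to 1 byte, at the cost of rounding error bounded by half the scale.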
Query (in attention)
One of the three vectors per token in the self-attention mechanism (alongside key and value). The query represents "what am I looking for?" and is matched against keys from other positions to determine attention weights.
R
RAG (Retrieval-Augmented Generation)
A pattern where an LLM is given access to an external document store at query time, retrieving relevant context and using it to ground generation. RAG dramatically reduces hallucination on factual queries and lets models work with information not in their training data. The dominant deployment pattern for enterprise AI in 2026.
Reasoning model
An LLM trained or prompted to produce explicit reasoning steps before final answers. OpenAI's o-series, DeepSeek-R1 and similar models are reasoning models. They trade more inference compute for higher accuracy on hard problems, especially math and code.
Recurrent network (RNN)
A neural network architecture that processes sequences one element at a time, maintaining a hidden state that summarises previous elements. RNNs were the dominant NLP architecture pre-2017; transformers largely displaced them due to better parallelisation. LSTMs are the most common RNN variant.
Red-teaming
A systematic process of adversarial testing where individuals try to make a system fail in ways that would matter -- producing harmful content, leaking training data, exhibiting bias. Red-teaming has become a standard part of responsible model release, with Anthropic, OpenAI and DeepMind publishing methodology details.
Regression
The supervised learning task of predicting a continuous numerical value (house price, temperature, customer lifetime value). Distinguished from classification, which predicts categories. Linear regression, gradient-boosted regression, and neural network regressors are common variants.
Regularisation
Techniques that prevent overfitting by penalising model complexity or adding noise during training. Common methods include weight decay (L2 regularisation), dropout, early stopping, and data augmentation. Regularisation is one of the most reliable interventions for improving generalisation.
Reinforcement learning (RL)
An ML paradigm where an agent learns by taking actions in an environment and receiving rewards. RL is the natural framing for sequential decision problems; it powers AlphaGo, AlphaZero, and the post-training of every modern LLM (via RLHF). See how ML and DL work.
ReLU
Rectified Linear Unit. A non-linear activation function that outputs zero for negative inputs and the input value for positive ones. ReLU is computationally cheap, mitigates the vanishing-gradient problem of saturating activations like tanh, and was a key enabler of training very deep networks.
RLHF
Reinforcement Learning from Human Feedback. A post-training technique where a reward model is trained on human preferences between model outputs, then used to fine-tune the model via reinforcement learning. RLHF is what turned base GPT-3 into ChatGPT and remains central to every major LLM's training.
Robustness
A model's resistance to adversarial inputs, distribution shift, and noise. Robust models maintain performance when conditions change; non-robust models fail unpredictably. Robustness is a measurable property, distinct from accuracy on clean data, and increasingly required by frameworks like the EU AI Act.
S
Sampling (top-k, top-p, temperature)
The methods by which an LLM picks the next token from its predicted distribution. Temperature controls randomness (low = deterministic, high = creative). Top-k restricts to the k most probable tokens. Top-p (nucleus sampling) restricts to the smallest set whose probabilities sum above p. Combined, they shape the model's output style.
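The two truncation rules are easy to sketch on a toy distribution (pure Python, function names ours):

```python
def filter_top_k(probs, k):
    """Keep only the k most probable tokens, then renormalise."""
    top = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in top else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def filter_top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(out)
    return [x / total for x in out]

dist = [0.5, 0.3, 0.1, 0.1]   # toy next-token distribution over four tokens
```

Temperature acts earlier in the pipeline, when logits are converted to probabilities; see Softmax.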
Self-attention
The attention mechanism applied within a single sequence: each token attends to every other token in the same sequence. Self-attention is the central operation of transformers. It enables models to capture long-range dependencies that recurrent networks struggled with.
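For the mechanics, here is a minimal single-head sketch in pure Python (no batching, no learned projections -- those are what multi-head attention adds):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Softmax over the scores gives the attention weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]
mixed = attention(x, x, x)  # self-attention: queries, keys, values all from x
```

Passing the same sequence as queries, keys and values is exactly what makes it *self*-attention; cross-attention passes a different sequence as keys and values.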
Self-supervised learning
A training paradigm where labels are derived from the data itself rather than human-annotated. Masked language modelling (predict the missing word) and next-token prediction are the dominant examples. Self-supervised learning is what allows LLMs to be trained on essentially unlimited text.
Semantic search
Search based on meaning rather than keywords, typically by computing embeddings of the query and documents and finding nearest neighbours in vector space. Semantic search is the retrieval technology behind most modern RAG systems and many enterprise search products.
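Stripped to its essence, retrieval is nearest-neighbour search under cosine similarity. A sketch with toy 2-D "embeddings" (pure Python, names ours; real systems use learned embeddings and approximate indexes):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_search(query_vec, doc_vecs, top_n=2):
    """Return indices of the documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_n]

docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
hits = semantic_search([1.0, 0.0], docs)    # documents 0 and 2 are closest
```

In a RAG system, the texts of the top-ranked documents are what gets pasted into the LLM's prompt as grounding context.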
Semi-supervised learning
Training that uses a small labelled dataset alongside a much larger unlabelled one. Common in domains where labels are expensive but raw data is abundant. The line between semi-supervised and self-supervised has blurred in modern practice.
Sentiment analysis
The NLP task of classifying text by emotional polarity (positive, negative, neutral). One of the oldest and most-deployed text classification tasks. Modern LLMs handle it as a side effect of general training, often outperforming specialised sentiment models.
SGD (Stochastic Gradient Descent)
The variant of gradient descent that uses a small random batch of examples per update rather than the entire dataset. The stochasticity adds noise but makes training feasible on datasets far too large to process in one pass. In practice SGD is augmented with momentum and adaptive learning rates (Adam, AdamW); plain SGD is rare in modern deep learning.
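A toy update loop makes the mechanism concrete: fitting y = 2x by squared error, one randomly chosen example per step (pure Python, names ours):

```python
import random

def sgd_fit(xs, ys, lr=0.05, steps=200, seed=0):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))              # one random example per update
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]  # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

w = sgd_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # converges to w close to 2
```

Each update uses the gradient from a single example rather than the full dataset; that per-example noise is the "stochastic" part.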
SHAP
SHapley Additive exPlanations. A method that assigns each feature a value representing its contribution to a specific prediction, based on cooperative game theory. SHAP is the most widely used explainability technique for tabular ML in 2026.
Singularity
A speculative concept in which AI capability increases recursively, with each system building a better successor, leading to qualitative change in human civilisation. Popularised by Ray Kurzweil. Not a working technical term; mostly a topic in futurology and AI risk discussions.
Softmax
The function that converts a vector of arbitrary real numbers (logits) into a probability distribution. Softmax is what produces the next-token probabilities in a language model and the class probabilities in a classifier. The temperature parameter modulates how peaky the resulting distribution is.
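In code, with the temperature knob included (a pure-Python sketch):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn arbitrary real scores into a probability distribution.

    Lower temperature sharpens the distribution; higher flattens it.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])         # sums to 1; the largest logit dominates
```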
Speech recognition
The task of converting audio of speech into text. Modern speech recognition is dominated by transformer-based models (Whisper from OpenAI, conformer-based models from Google) trained on massive multilingual audio datasets. Production accuracy on clean audio in major languages now matches or exceeds human accuracy for most uses.
Stable Diffusion
The open-weights image generation model first released by Stability AI in August 2022. Stable Diffusion's open release democratised image generation and spawned an enormous ecosystem of fine-tunes, ControlNets, and workflow tools. Stable Diffusion 4 (2025) is the current generation.
State-space model
A class of sequence models including Mamba and S4 that combine the parallelism of attention with the linear-time inference of recurrence. State-space models are an active research alternative to transformers, promising better efficiency on very long sequences. Production deployments are still limited as of 2026.
Supervised learning
Training a model on labelled examples (input + correct output). The most common ML setup in production. Supervised learning's biggest practical bottleneck is label cost; large labelled datasets are expensive to produce.
Support Vector Machine (SVM)
A classical ML algorithm that finds the hyperplane separating classes with the largest margin. SVMs were the dominant classifier in the 2000s, displaced by deep learning for unstructured data but still used for tabular tasks where training data is small.
Synthetic data
Data generated by an AI model, used to augment or replace real training data. Synthetic data is increasingly used for fine-tuning, RLHF, and domains where real data is scarce or sensitive. Quality control is the central challenge -- generated data can amplify biases of the source model.
T
Temperature
A sampling parameter that controls randomness in LLM output. Temperature 0 produces deterministic, repetitive output; temperature 1 produces typical sampling; higher values produce more creative but less coherent output. Tuning temperature for the use case is a basic prompt engineering skill.
Tensor
A multi-dimensional array of numbers. The data type of essentially all neural network computation. Frameworks (PyTorch, JAX, TensorFlow) are tensor operation libraries with autodiff and GPU support. The term is borrowed from physics but used loosely in ML.
Test set
Data held out from training and validation, used only for the final evaluation of a model. The test set is the unbiased estimate of generalisation performance. Reusing the test set during model development causes overfitting to the test set itself, a classic ML mistake.
Token
A unit of input or output for an LLM, typically a fragment of a word. Modern LLMs process text as sequences of tokens rather than characters or whole words; an English word averages about 1.3 tokens. Token counts drive both context limits and pricing.
Tokenisation
The process of splitting text into tokens. Modern LLMs use subword tokenisers like Byte-Pair Encoding (BPE) or SentencePiece. The choice of tokeniser affects model behaviour, particularly for non-English languages and code, and is a model-specific commitment that is hard to change post hoc.
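The core of BPE training is easy to sketch: repeatedly find the most frequent adjacent pair of tokens and merge it into a new token (pure Python, names ours; real tokenisers work on bytes and handle many corner cases):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs -- the statistic BPE uses to pick the next merge."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

chars = list("low lower lowest")
chars = merge_pair(chars, ("l", "o"))  # "lo" becomes one token everywhere
```

Iterating this merge step thousands of times over a large corpus is, in outline, how a BPE vocabulary is learned.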
Tool use
An LLM's ability to call external functions (web search, calculator, code execution, database queries) during inference. Tool use is the technical foundation of agentic AI in 2026 and dramatically extends model capability into areas where pure pattern-matching is unreliable.
Training
The process of adjusting model parameters to perform well on the task. Training is rare and expensive (frontier runs cost hundreds of millions); inference is frequent and cheap. The training run is the moment a foundation model is created; everything afterwards is reuse.
Transfer learning
The technique of taking a model trained on one task and adapting it to another. Modern AI is built on transfer learning -- pretrained foundation models are adapted to thousands of downstream tasks via prompting, fine-tuning or RAG. The 2010s shift to transfer learning is what makes practical AI economical.
Transformer
The neural network architecture introduced in the 2017 "Attention Is All You Need" paper. Transformers process sequences using self-attention rather than recurrence, allowing parallel training and efficient scaling. Now dominant across text, code, image, video and audio. The single most important architecture in modern AI.
t-SNE
t-distributed Stochastic Neighbour Embedding. A 2008 dimensionality-reduction technique used primarily for visualising high-dimensional data in 2D or 3D. UMAP is the modern alternative that produces similar visualisations faster and with better global structure.
Turing Test
Alan Turing's 1950 thought experiment proposing that a machine should be considered intelligent if a human evaluator cannot reliably distinguish its conversation from a human's. Modern LLMs effectively pass casual versions of the Turing Test, which has prompted reconsideration of what the test was meant to demonstrate.
U
Underfitting
When a model is too simple to capture the patterns in the training data, performing poorly even on the data it was trained on. The opposite failure mode from overfitting. Increasing model capacity, training longer, or improving feature engineering are typical fixes.
Unsupervised learning
Training without labelled examples; the model learns structure from the data alone. Includes clustering, dimensionality reduction, and (broadly) self-supervised learning. The line between unsupervised and self-supervised has blurred since labels are sometimes derived from the data itself.
V
Validation set
Data held out from training, used to tune hyperparameters and decide when to stop training. Distinct from the test set, which is reserved for the final evaluation. Tuning on the validation set and reporting test-set numbers is the standard ML evaluation discipline.
Variational autoencoder (VAE)
A generative model that learns a probabilistic latent-space representation of data. VAEs were a major generative-model family alongside GANs, before diffusion models took over. They are still used as the encoder-decoder that defines the latent space in latent diffusion architectures.
Vector database
A database optimised for similarity search over high-dimensional vector embeddings. Vector databases (Pinecone, Weaviate, Qdrant, the major cloud providers' offerings) are the storage layer of most production RAG systems. The market grew dramatically with the gen-AI wave.
Vision Transformer (ViT)
The architecture that adapted transformers to image classification by treating an image as a sequence of patches. Introduced in 2020, ViTs match or beat CNNs at scale and are now standard in computer vision research. Production vision systems increasingly use ViT or hybrid architectures.
W
Weight
A learnable parameter in a neural network, typically the multiplier on a connection between two nodes. The weights of a trained model collectively encode everything it has learned. "Weights" and "parameters" are nearly synonymous in casual usage.
Word embedding
A dense vector representation of a word, learned such that semantically similar words have similar vectors. Word2vec (2013) and GloVe (2014) were the breakthrough word embedding methods. In 2026, contextual embeddings from LLMs largely replace static word embeddings for most tasks.
X
XAI (Explainable AI)
The umbrella term for techniques making AI predictions interpretable to humans. See Explainability.
XGBoost
A gradient-boosted decision-tree library that has dominated tabular ML competitions since 2014. XGBoost remains the first thing to try for structured data problems and consistently outperforms deep learning on tabular tasks of moderate size. LightGBM and CatBoost are competing implementations.
Y
YOLO
You Only Look Once. A family of real-time object-detection models, first published in 2016 by Redmon et al. YOLO's speed made it the default for production object detection in robotics, autonomous vehicles, and surveillance. Multiple successor versions exist; YOLOv12 is current as of 2026.
Z
Zero-shot learning
Performing a task without any labelled examples, relying purely on the model's general training. Zero-shot is the limit case of few-shot learning. Modern LLMs handle zero-shot prompting well for tasks similar to their training distribution; performance degrades as tasks become more novel.
Frequently asked questions
Which AI terms should I learn first?
If you read AI news, the high-leverage starter set is: machine learning, deep learning, neural network, transformer, LLM, training vs inference, fine-tuning, prompt, RAG, hallucination, RLHF. Knowing these ten terms cold lets you read most product announcements and AI journalism without losing the thread.
Is "AI" the same as "machine learning"?
In casual 2026 usage, mostly yes. Technically, machine learning is one approach to AI; AI also includes symbolic methods that are not machine learning. The distinction matters in academic and policy contexts but rarely in product discussions. See our AI vs ML vs deep learning guide.
Why are some of the same concepts called different things?
Because the field is large, fast-moving, and built across overlapping subcommunities (academic ML, deep learning, NLP, computer vision, AI safety, applied AI) each with their own conventions. "Hyperparameter" in research is "config knob" in production; "embedding" in NLP is "feature vector" in classical ML; "fine-tuning" can mean five different things depending on context. The vocabulary will keep proliferating.
How do I keep up with new AI terms?
Follow a small number of sources that explain new terms when they introduce them: the Hugging Face blog, Andrej Karpathy's writing, the Anthropic and OpenAI research pages, and a handful of newsletters (Import AI, The Batch, Deep Learning Weekly). When you encounter a term you do not know, look it up immediately rather than scrolling past; the cost of one minute is much lower than the cost of compounding fog.
Are there terms I should ignore?
"Cognitive computing" (marketing). "AI brain" (cargo cult). "Self-learning AI" (almost always meaningless in context). "Quantum AI" (almost always not actually quantum). When you encounter a term that does not have a precise technical definition, treat the source as marketing rather than information until proven otherwise.
Where can I learn the math behind these terms?
Andrew Ng's machine learning courses on Coursera remain the best on-ramp. Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube series is the best video introduction to the underlying math at depth. The "Deep Learning" textbook by Goodfellow, Bengio and Courville is the reference. Our Learn AI hub covers more learning paths.
The bottom line
The AI vocabulary in 2026 is three or four times wider than it was in 2020, and it will keep widening. Use this glossary as a working reference rather than a study text -- skim the index, jump to terms when you encounter them in the wild, and let the most useful definitions accumulate as you meet them in context. The vocabulary gap is the bottleneck for a lot of people trying to engage with the field; closing it gets you most of the way to reading the actual material with confidence. Pair this glossary with the rest of our What is AI hub and you have the conceptual foundation for the rest of the field.
Last updated: May 2026
