History of AI: From Dartmouth 1956 to GPT-5

The history of artificial intelligence is mostly a history of optimism colliding with reality. Almost every decade since 1956 has produced a confident claim that human-level machines were imminent, and almost every one of those claims aged badly -- until, suddenly, in the 2020s, several of them stopped looking ridiculous. Reading the history is the best way to develop intuitions about which of today's confident claims will age well and which will not. The pattern is clear once you see it: AI advances in long, quiet build-ups punctuated by short, spectacular bursts; each burst is followed by a correction; and the cumulative trajectory keeps moving even when individual decades disappoint. This guide walks through seven decades in eight stages, each with the technical breakthrough, the hype, and the correction that followed.

The Dartmouth Conference (1956)

The field's founding moment is the Dartmouth Summer Research Project on Artificial Intelligence, held over eight weeks in the summer of 1956 at Dartmouth College in New Hampshire. The 1955 proposal, signed by John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon, opens with a sentence that frames the next seventy years: "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." The phrase "artificial intelligence" appears in the proposal's title, generally credited as the first use of the term in print.

The conference produced fewer technical results than its organisers had hoped. Allen Newell and Herbert Simon arrived with the Logic Theorist, a program that proved theorems in propositional logic and is generally considered the first AI program. Most attendees left without a finished system but with a shared research agenda and a name for the field.

The optimism of the era is easy to mock with hindsight. Simon predicted in 1957 that "within ten years a digital computer will be the world's chess champion" and "within ten years a digital computer will discover and prove an important new mathematical theorem". Both took roughly forty years rather than the predicted ten: Deep Blue beat Garry Kasparov in 1997, and machine-proved theorems of genuine significance arrived on a similar schedule. Marvin Minsky told Life magazine in 1970 that "in three to eight years we will have a machine with the general intelligence of an average human being". The pattern of overpromising starts here.

The first AI winter and what caused it

Through the 1960s and into the early 1970s, AI research absorbed substantial military and government funding, particularly from the US Defense Advanced Research Projects Agency (DARPA). The early successes -- Arthur Samuel's checkers player, Terry Winograd's SHRDLU natural-language program, the Newell-Simon General Problem Solver -- generated headlines that went well beyond the systems' actual capabilities.

The reckoning arrived in 1973 with the Lighthill Report, commissioned by the British Science Research Council and authored by the mathematician James Lighthill. It concluded that AI research had failed to achieve its grand objectives and that the field's "combinatorial explosion" problem made it unlikely to succeed at scale. The UK cut AI funding sharply, and DARPA made similar cuts in the US within a few years. The resulting funding drought is the first AI winter, conventionally dated 1974-1980.

The winter taught the field its first hard lesson: AI is judged by what it delivers, not by what it promises, and over-promising creates the conditions for the funding cuts that follow. Researchers who stayed through the first winter, including Minsky, became more cautious in their claims; many of those who left moved to adjacent fields such as robotics and operations research.

Expert systems and the second winter

The 1980s revival was built on expert systems -- programs that encoded the knowledge of human specialists as if-then rules. MYCIN at Stanford diagnosed bacterial infections; XCON at Digital Equipment Corporation configured VAX computer orders. By 1985, two-thirds of the Fortune 500 had piloted some kind of expert-system project. The Japanese government launched the Fifth Generation Computer Systems project in 1982, a decade-long bet of roughly $400 million on parallel logic-programming machines.

The expert-systems boom collapsed in the late 1980s for three intertwined reasons. First, the systems were expensive to build and update: the knowledge engineering they required was far more labour-intensive than vendors had claimed. Second, they were brittle: they handled cases inside their rules well and failed completely outside them. Third, the specialised hardware they were sold on -- Lisp machines -- was undercut by general-purpose workstations that ran the same code without the premium.

The second AI winter ran from roughly 1987 through the mid-1990s. The Japanese Fifth Generation project ended without delivering its goals. The phrase "AI" became commercially toxic for years; researchers rebranded their work as "machine learning", "data mining", or "intelligent systems" to avoid association with the failed boom.

ML's quiet rise (1990s-2000s)

The story most popular histories skip is the 1990s and 2000s, because nothing dramatic happened. What did happen, in retrospect, was the slow construction of every piece that would later make deep learning succeed.

Statistical machine learning -- support vector machines, kernel methods, ensemble methods, graphical models -- became the dominant paradigm in academic AI. The 1986 Rumelhart-Hinton-Williams paper on backpropagation had made multi-layer neural networks trainable; LeCun's work on convolutional networks made them practical for image recognition. The internet produced the first truly large datasets. Storage and compute costs fell along Moore's-Law curves. None of this was visible from outside the field.
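To make the backpropagation point concrete, here is a minimal sketch of the algorithm for a two-layer network, in Python with NumPy. The toy data, layer sizes and learning rate are illustrative choices for this article, not anything from the 1986 paper.

```python
# Minimal backpropagation sketch for a two-layer network: forward pass,
# chain-rule backward pass, gradient-descent update. Toy regression data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))            # 64 toy examples, 10 features
y = rng.normal(size=(64, 1))             # toy regression targets

W1 = rng.normal(scale=0.1, size=(10, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
lr = 0.01

for step in range(200):
    # Forward pass.
    h = np.tanh(X @ W1)                  # hidden activations
    pred = h @ W2
    loss = np.mean((pred - y) ** 2)

    # Backward pass: apply the chain rule layer by layer.
    d_pred = 2 * (pred - y) / len(X)
    dW2 = h.T @ d_pred
    d_h = (d_pred @ W2.T) * (1 - h ** 2) # tanh'(a) = 1 - tanh(a)^2
    dW1 = X.T @ d_h

    # Gradient descent update.
    W1 -= lr * dW1
    W2 -= lr * dW2
```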

The successes of the era were modest by 2026 standards but transformative for their time: spam filters, search engines, recommendation systems, and the speech-recognition systems that powered the first generation of smart devices. Google's PageRank (1998) and Netflix's recommendation system are products of this era. So is the deep learning research that, by 2010, had produced strong but not dominant results in image and speech recognition.

Deep learning: 2012 ImageNet onwards

The widely agreed start date for the modern deep learning era is September 30, 2012, when AlexNet -- submitted by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton of the University of Toronto -- won that year's ImageNet Large Scale Visual Recognition Challenge. AlexNet was a convolutional neural network trained on two NVIDIA GTX 580 GPUs. It achieved a top-5 error rate of 15.3%, beating the second-place entry by more than 10 percentage points. By the next year's competition, almost every entry was a deep learning approach.

The AlexNet result was not the first time a neural network had worked. It was the first time a deep network had so decisively beaten the hand-engineered alternatives on a problem the field had been attacking for decades. The combination of GPUs (making the training compute affordable), ImageNet (providing 1.2 million labelled images), and a few critical engineering choices (ReLU activations, dropout, larger networks) suddenly made deep learning the obvious approach.
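As an illustration of those ingredients, here is a toy PyTorch network using ReLU activations and dropout. The architecture and sizes are invented for the example; this is not AlexNet itself, which was far larger.

```python
# A toy convolutional network with the AlexNet-era ingredients mentioned
# above: ReLU activations and dropout. Sizes are illustrative only.
import torch
import torch.nn as nn

toy_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),                       # ReLU trains faster than tanh/sigmoid
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),               # dropout regularises the dense layer
    nn.Linear(16 * 16 * 16, 10),     # assumes 32x32 inputs, 10 classes
)

x = torch.randn(4, 3, 32, 32)        # a batch of 4 random 32x32 RGB images
print(toy_net(x).shape)              # torch.Size([4, 10])
```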

The next five years played the same scene out across one domain after another. Speech recognition fell to deep learning by 2014. Image generation followed with GANs in 2014 and diffusion models a few years later. Machine translation switched from statistical to neural in 2016. Reinforcement learning produced AlphaGo's defeat of Lee Sedol in 2016 and AlphaZero's mastery of chess, shogi and Go in 2017. Each result was a breakthrough; together, they were the field's third golden age and its first that did not collapse.

The transformer paper (2017)

Of all the papers in the deep learning era, none is more consequential than "Attention Is All You Need", published by a team at Google in June 2017. The paper proposed the transformer architecture: a neural network that processes sequences using only attention mechanisms, dropping the recurrent layers that had dominated NLP for years.

The immediate motivation was practical. Recurrent networks process sequences one element at a time, which makes them slow to train. Transformers process whole sequences in parallel, which makes them well-suited to GPUs. The performance gain on machine translation was modest at first, but the training-speed gain was enormous, and the architecture turned out to scale further than anyone had anticipated.
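The mechanism itself is compact. The paper defines scaled dot-product attention as softmax(QK^T / sqrt(d_k))V; a minimal single-head sketch in Python with NumPy follows. The shapes and random inputs are illustrative, and real implementations add learned projections, multiple heads and masking.

```python
# Single-head scaled dot-product attention from "Attention Is All You
# Need": softmax(Q K^T / sqrt(d_k)) V. The whole (seq, seq) score matrix
# is computed in one matrix multiply, which is why every position can be
# processed in parallel.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)                      # (5, 8)
```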

By late 2018, BERT (a bidirectional transformer for language understanding) and GPT-1 (a generative transformer for language modelling) had established the transformer as the dominant architecture in NLP. By 2020, Vision Transformers (ViT) had shown that the same architecture could beat CNNs on image classification at sufficient scale. By 2022, transformers were the foundation of essentially every frontier AI system in every modality.

"Attention Is All You Need" is the paper everyone in the field has read. Its eight authors have collectively founded or led most of the labs producing frontier AI today.

GPT lineage: 2018-2026

The GPT line is the cleanest illustration of how compute, data and architecture interact. Each generation roughly tracks the scaling laws and produces qualitatively different products on top.

| Model | Year | Parameters | What it could do |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Language modelling, fine-tuned for downstream tasks |
| GPT-2 | 2019 | 1.5B | Coherent paragraph generation; deemed "too dangerous to release" initially |
| GPT-3 | 2020 | 175B | Few-shot learning; the first model that felt qualitatively new |
| GPT-3.5 / ChatGPT | 2022 | ~175B (RLHF-tuned) | Conversational instruction following at consumer scale |
| GPT-4 | 2023 | Undisclosed (~1T total) | Multi-modal input, expert-level performance on many benchmarks |
| GPT-4o / 4-turbo | 2024 | Optimised for inference | Real-time voice, image and video input, agentic tool use |
| GPT-5 | 2025 | Undisclosed | Frontier reasoning, long-horizon agentic work, multi-modal generation |

Three patterns are worth noting. First, each generation is roughly one to two orders of magnitude more parameters and more compute than the last, and the capability gain is consistently surprising. Second, the qualitative jumps -- few-shot learning at GPT-3, conversational interaction at ChatGPT, expert-level reasoning at GPT-4 -- were not predicted in advance by any benchmark. They emerged from scale. Third, the cycle from research breakthrough to consumer product has shortened: GPT-3 took two years to reach a mainstream product (ChatGPT); GPT-5 reached one within months.
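For calibration, the scaling laws the lineage tracks are empirical power laws. The approximate fits below are from Kaplan et al. (2020) and Hoffmann et al. (2022), measured on language models generally rather than on the GPT series specifically:

```latex
% Kaplan et al. (2020): test loss falls as a power law in training compute C
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
% Hoffmann et al. (2022, "Chinchilla"): the compute-optimal parameter count
% and training-token count both grow roughly as the square root of compute
N_{\mathrm{opt}} \propto C^{0.5}, \qquad D_{\mathrm{opt}} \propto C^{0.5}
```

Loss falls smoothly along these curves; the qualitative jumps described above are precisely what the smooth curves do not predict.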

Other labs followed similar trajectories: Anthropic's Claude line (Claude 1, 2, 3, 3.5 Sonnet, 4 Opus), Google's PaLM and Gemini lines, Meta's Llama line, and the Chinese labs' Qwen, DeepSeek and Yi families. The gap between frontier and open weights has narrowed from "years" in 2022 to "twelve to eighteen months" by 2026.

Lessons from each cycle

Reading seventy years of AI history at once produces a few patterns worth holding on to.

Promises always run ahead of capabilities. The 1957 Simon prediction, the 1970 Minsky prediction, the 1980s expert-systems claims, the 2018 self-driving claims and the 2024 AGI-by-next-year claims all rhyme. The pattern is not that the optimists are always wrong; it is that they are usually wrong about the timeline by a factor of two to ten.

Funding drives research, but research drives capabilities. Both AI winters were funding events. They slowed the field but did not stop the underlying technical work, which kept moving in academic labs at lower cost. The breakthroughs that ended each winter came from research that had continued through it.

Scale matters more than cleverness. Almost every dramatic capability jump since 2012 has come from scaling a known method (deep networks, transformers, RLHF) rather than from a new method. Richard Sutton's 2019 essay "The Bitter Lesson" is the canonical articulation: the methods that win are the ones that can absorb more compute.

The problems that look hardest are often easier than the ones that look easy. Mastering Go was supposed to be decades away in 2010 and was solved by 2016. Folding a t-shirt reliably remains an open problem in robotics. Predicting which is which in advance is genuinely hard.

For the conceptual map of where the field is in 2026, see our What is AI pillar. For the implications of the current state, see our guide to AI ethics, bias and best practices.

Frequently asked questions

When was AI invented?

The term "artificial intelligence" was coined for the 1956 Dartmouth Conference, organised by John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon. The underlying ideas (Turing's 1950 paper, Wiener's cybernetics, Shannon's information theory) are older. The first program generally called AI is the Logic Theorist by Newell, Shaw and Simon, also 1956.

What caused the AI winters?

The first (1974-1980) was caused by funding cuts after the Lighthill Report and similar evaluations concluded the field had over-promised. The second (late 1980s through the mid-1990s) followed the collapse of the expert-systems boom, when the expense and brittleness of those systems failed to justify the investment. Both were funding events more than research events; the underlying technical work continued.

Who invented deep learning?

No single person, but Geoffrey Hinton, Yann LeCun and Yoshua Bengio are the three most associated with making it work. They shared the 2018 Turing Award for their foundational contributions. Hinton's work on backpropagation and Boltzmann machines, LeCun's convolutional networks, and Bengio's contributions to representation learning are the three legs of the modern field. AlexNet (2012) was the moment when deep learning beat the alternatives in public.

Why did transformers replace recurrent networks?

Transformers can be trained in parallel, which makes them dramatically faster on GPUs than the inherently sequential recurrent networks. They also turned out to scale to larger models without the vanishing- and exploding-gradient instabilities that plagued deep RNNs. The combination of training speed and scaling behaviour made them the dominant architecture within three years of the 2017 paper.

Was ChatGPT really a new technology?

No -- the underlying GPT-3.5 model had existed for some time, and the conversational interface was a careful product wrapper. What was new was the combination: instruction-tuning via RLHF, a chat interface, and free public access. The "moment" was a product moment more than a research one. The research moment was GPT-3, two years earlier.

What is the fastest way to catch up on AI history?

Read three things: Stuart Russell and Peter Norvig's textbook "Artificial Intelligence: A Modern Approach" for the academic history; Cade Metz's "Genius Makers" for the deep learning era from a journalist's view; and the Wikipedia "History of artificial intelligence" article for a structured timeline. Then read the original "Attention Is All You Need" paper -- it is much more readable than its influence suggests, and you cannot understand 2026 without it.

The bottom line

The field is older than the current wave makes it look, and the current wave is more historically continuous than its press makes it look. Every technique used in 2026 has antecedents going back decades; what changed was the scale at which they could be trained and the data they could be trained on. Read the history not for nostalgia but for calibration: the next confident prediction you read will probably look much like one of the ones in this article that aged badly. The trajectory keeps moving anyway. Knowing which predictions to take seriously is mostly a matter of knowing which technical bets are scaling-limited (most of them) and which are still ahead of their funding curve (a few). The history is the best calibration tool available.

Last updated: May 2026