As the popularity of AI continues to soar, it can feel like new discoveries arrive every day. It’s easy to get lost in the current literature, so it’s worth taking a step back and putting the field in context. If you’re new to AI, this is a great opportunity to make connections; if you’ve been around for a while, it’s a chance to review. What follows is a list of the top 30 AI research papers, compiled based on their impact on the overall AGI discourse and ordered chronologically. (AI is a much wider field, spanning areas such as computer vision; this list focuses on the path toward AGI-level models.)
I found it a great exercise to read through the papers. Each work represents a moment when someone thought differently, formalized a new idea, and challenged assumptions about what AI could achieve. By studying these breakthroughs, you not only learn the techniques but also experience the creativity, curiosity, and persistence that drove them. It’s a long list, though, so take your time and enjoy!
Title | Authors | Year | Reason for Influence |
---|---|---|---|
On Formally Undecidable Propositions of Principia Mathematica and Related Systems | Kurt Gödel | 1931 | Proved that any consistent formal system rich enough to express arithmetic contains true statements it cannot prove, revealing inherent limits of formal reasoning. This result frames the theoretical limits of machine reasoning and, by extension, of AI. |
On Computable Numbers, with an Application to the Entscheidungsproblem | Alan M. Turing | 1936 | Defined the Turing machine and formalized the notion of algorithmic computation, proving fundamental limits (undecidability) in computation. Turing’s work founded modern computer science, providing the theoretical basis for all later AI. |
A Mathematical Theory of Communication | Claude E. Shannon | 1948 | Introduced information theory (the bit, entropy, coding theorems). Shannon’s insights created the blueprint for digital communications and storage, which underpin modern computing and data-driven AI. His paper has been called the “Magna Carta of the Information Age”. |
The Organization of Behavior | Donald O. Hebb | 1949 | Proposed what is now known as Hebb’s rule: neurons that fire together wire together. This introduced a model of synaptic plasticity (learning by strengthening connections) and is the conceptual foundation of neural networks and learning in the brain. |
Programming a Computer for Playing Chess | Claude E. Shannon | 1950 | The first technical paper on computer chess. Shannon introduced the idea of minimax search with heuristic evaluation for game playing (a minimal sketch appears after the table). This work launched the study of search algorithms in AI (game trees, alpha-beta pruning), forming an early cornerstone of AI. |
A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence | John McCarthy et al. | 1955 | Launched the first AI workshop (Dartmouth 1956) and coined the term “Artificial Intelligence”. This proposal defined AI as a formal field of study and outlined key goals (e.g. reasoning, learning), effectively founding the AI research community. |
The Logic Theory Machine (program) | Allen Newell, Herbert A. Simon, Cliff Shaw | 1956 | Described in early reports, this program automated theorem-proving in symbolic logic. It was the first AI program deliberately engineered for problem solving. Logic Theorist proved several theorems from Principia Mathematica, demonstrating that machines could perform logical reasoning. |
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain | Frank Rosenblatt | 1958 | Introduced the perceptron, the first neural network learning algorithm. The perceptron convergence theorem (established in follow-up work by Rosenblatt and others) showed that a single-layer perceptron can learn any linearly separable function; a minimal sketch of the learning rule appears after the table. This seminal work kickstarted neural network research (later revived as “connectionism”). |
Some Studies in Machine Learning Using the Game of Checkers | Arthur L. Samuel | 1959 | One of the first self-learning programs. Samuel’s checkers player improved by playing games against itself and learning evaluation weights. He also popularized the term “machine learning” in this paper. It demonstrated that a program could improve from experience, pioneering ML concepts. |
Programs with Common Sense | John McCarthy | 1959 | Introduced the “Advice Taker” concept and the use of first-order logic for knowledge representation. McCarthy proposed using logical inference to give machines common-sense knowledge. This paper was among the first to suggest formal logic as the basis for reasoning in AI. |
Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I (LISP) | John McCarthy | 1960 | Defined LISP, the first symbolic programming language for AI. It introduced S-expressions and list processing, enabling elegant representation of code and data. LISP became the dominant AI programming language for decades and embodies the ideas of symbolic AI. |
A Formal Theory of Inductive Inference, Part I & II | Ray Solomonoff | 1964 | Laid the foundation of algorithmic probability and universal induction, combining Occam’s Razor with probabilistic inference. Solomonoff’s theory formalized how to predict/learn from data using the shortest (highest-probability) programs. His work effectively launched algorithmic information theory (developed independently by Kolmogorov and Chaitin) and the theory of inductive inference. |
ELIZA: A Computer Program for the Study of Natural Language Communication between Man and Machine | Joseph Weizenbaum | 1966 | One of the first chatbots. ELIZA simulated conversation (the famous DOCTOR script) using simple pattern-matching rules, fooling some users into attributing understanding to it. It demonstrated early natural-language interaction and highlighted the “Eliza effect”, inspiring research in dialogue and user perception. |
Perceptrons: An Introduction to Computational Geometry | Marvin Minsky, Seymour Papert | 1969 | Rigorous analysis of perceptrons (single-layer neural nets). This book proved fundamental limitations (e.g. a single-layer perceptron cannot learn XOR). Its pessimistic results contributed to a years-long decline in neural-network research, often associated with the first “AI winter”, and it underscored the need for multi-layer networks. |
Computer Science as Empirical Inquiry: Symbols and Search (the Physical Symbol System Hypothesis) | Allen Newell, Herbert A. Simon | 1976 | Articulated in their Turing Award lecture, this hypothesis states that a physical symbol system (i.e. a symbol-manipulating computer) has the necessary and sufficient means for general intelligent action. This became a foundational assumption of symbolic AI and cognitive architectures, asserting that symbol processing can model intelligence. |
Neocognitron: A Self-Organizing Neural Network Model | Kunihiko Fukushima | 1980 | Proposed one of the first hierarchical, convolutional neural networks for pattern recognition. The Neocognitron had layers of edge detectors and pooling, making it robust to shifts. It directly inspired later CNNs (e.g. LeNet) and laid groundwork for deep visual models. |
Learning Representations by Back-Propagating Errors | David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams | 1986 | Popularized the backpropagation algorithm for training multi-layer neural networks. This paper showed how hidden layers could learn internal representations via gradient descent, reviving interest in neural nets (a minimal sketch appears after the table). Backprop made it practical to train deep networks and remains the workhorse of deep learning. |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference | Judea Pearl | 1988 | Introduced Bayesian networks (directed graphical models) for reasoning under uncertainty. Pearl showed how probability theory could be used for inference in AI. His formalism allowed compact representation and efficient inference of complex probability distributions, revolutionizing AI’s approach to uncertain reasoning. |
Gradient-Based Learning Applied to Document Recognition | Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner | 1998 | Demonstrated convolutional neural networks (LeNet-5) for handwritten digit recognition. This paper showed that CNNs trained by gradient descent outperform traditional methods on real vision tasks. It validated deep learning on large data and helped establish CNNs as the standard for image recognition. |
Efficient Estimation of Word Representations in Vector Space | Tomas Mikolov et al. | 2013 | Introduced Word2Vec (skip-gram and CBOW models) for learning continuous word embeddings from large text corpora. Mikolov showed these vectors capture rich syntactic/semantic relationships (e.g. “king”–“man”+“woman” ≈ “queen”). The work greatly improved NLP performance and popularized vector-space language models. |
Generative Adversarial Nets | Ian J. Goodfellow et al. | 2014 | Proposed GANs: a framework training two neural networks (generator and discriminator) in a minimax game. This novel approach allowed high-quality sample generation (e.g. images, audio) without explicit likelihood models. GANs have since become a central method for generative modeling and simulation in AI. |
Human-level Control through Deep Reinforcement Learning | Volodymyr Mnih et al. | 2015 | Introduced the Deep Q-Network (DQN), combining Q-learning with deep convolutional networks (the underlying Q-learning update is sketched after the table). The DQN learned to play Atari games directly from raw pixels and reward feedback, reaching performance comparable to a professional human games tester across a suite of 49 games. This was the first demonstration of end-to-end deep RL on high-dimensional inputs, inspiring a renaissance in RL research. |
Deep Residual Learning for Image Recognition | Kaiming He et al. | 2016 | Presented ResNets, enabling very deep neural networks (e.g. 152 layers) via identity “skip” connections (a minimal sketch appears after the table). ResNets greatly eased training of deep models and achieved record accuracy on ImageNet (winning ILSVRC 2015 with 3.57% error). This architecture became a building block for modern deep networks. |
Mastering the Game of Go with Deep Neural Networks and Tree Search | David Silver et al. | 2016 | Combined deep neural networks with Monte Carlo tree search to create AlphaGo. The system learned from human games and self-play, achieving a 99.8% win rate against other Go programs and defeating the European champion Fan Hui 5–0 (world champion Lee Sedol fell shortly after publication). AlphaGo was the first program to defeat a professional human player at Go on a full-size board, demonstrating deep RL’s power on a complex task. |
Mastering the Game of Go without Human Knowledge (AlphaGo Zero) | David Silver et al. | 2017 | Showed that starting tabula rasa (no human data), a deep RL system (AlphaGo Zero) could learn Go solely by self-play. AlphaGo Zero surpassed the original AlphaGo in strength, proving that superhuman performance could be achieved without hand-crafted knowledge or human examples, a milestone for self-play reinforcement learning and AI autonomy. |
Attention Is All You Need | Ashish Vaswani et al. | 2017 | Introduced the Transformer architecture, which relies entirely on self-attention mechanisms instead of recurrence or convolution (a minimal sketch of self-attention appears after the table). Transformers enabled efficient modeling of long-range dependencies and led directly to modern language models. This paper is considered foundational in modern AI (it is among the most-cited AI papers), and gave rise to virtually all large-scale NLP models (BERT, GPT, etc.). |
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero) | David Silver et al. | 2018 | Generalized AlphaGo Zero to other games. AlphaZero learned chess, shogi and Go from scratch by self-play, using no game-specific heuristics. Within hours, it achieved superhuman play in all three games, defeating world-champion programs. This demonstrated that a single deep RL algorithm can master diverse complex tasks. |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin et al. | 2018 | Introduced BERT, a bi-directional transformer language model pre-trained on large text corpora. BERT achieved new state-of-the-art results on a wide range of NLP tasks (e.g. GLUE, SQuAD) with minimal fine-tuning. This popularized the “pretrain-then-finetune” paradigm and showed the power of large-scale unsupervised pretraining. |
Language Models are Few-Shot Learners (GPT-3) | Tom B. Brown et al. | 2020 | Presented GPT-3, a 175-billion parameter transformer language model. GPT-3 showed that simply scaling model size greatly improves performance: it can perform many NLP tasks (translation, Q&A, arithmetic, etc.) in a few-shot manner, without task-specific fine-tuning. This result sparked enormous interest in large foundation models and AI capabilities. |
Highly Accurate Protein Structure Prediction with AlphaFold | John Jumper et al. | 2021 | Describes AlphaFold 2, a deep learning system that predicts 3D protein structures from amino acid sequences with near-experimental accuracy. At CASP14, AlphaFold outperformed all previous methods, solving a 50-year-old grand challenge in biology. This breakthrough shows AI’s power in scientific discovery. |
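
To make a few of the ideas above concrete, here are some minimal, illustrative Python sketches. They are toy simplifications written for this post, not reproductions of the original systems. First, the minimax idea from Shannon’s 1950 chess paper: search the game tree, assume the opponent always picks the move that is worst for you, and score leaf positions with a heuristic evaluation. The toy tree below is made up for illustration.

```python
# Minimax over an explicit game tree: internal nodes alternate between the
# maximizing and minimizing player, and leaves hold heuristic evaluations.
def minimax(node, maximizing=True):
    if not isinstance(node, list):       # leaf: heuristic score of the position
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Depth-2 toy tree: the maximizer picks a branch, the minimizer replies.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree))                     # -> 3, the maximizer's best guaranteed outcome
```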
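Next, Rosenblatt’s perceptron. The sketch below is the standard textbook form of the learning rule (weights are nudged toward misclassified examples) on a toy linearly separable problem; the data, learning rate, and stopping criterion are illustrative choices, not Rosenblatt’s original formulation.

```python
import numpy as np

# Toy linearly separable problem: the OR function, with labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])

w, b, lr = np.zeros(2), 0.0, 1.0         # weights, bias, learning rate

for epoch in range(20):
    mistakes = 0
    for xi, target in zip(X, y):
        pred = 1 if (w @ xi + b) > 0 else -1
        if pred != target:               # update only on misclassified examples
            w += lr * target * xi
            b += lr * target
            mistakes += 1
    if mistakes == 0:                    # converged: every example classified correctly
        break

print(w, b)                              # a separating hyperplane for OR
```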
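The backpropagation paper is about pushing error gradients through hidden layers with the chain rule. Here is a minimal numpy sketch that trains a tiny two-layer network on XOR, the very function a single-layer perceptron cannot learn; the architecture, initialization, and learning rate are illustrative, not the paper’s.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)     # XOR targets

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)     # input -> hidden
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)     # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for step in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule from the squared error back through each layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0] for most initializations
```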
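DQN’s contribution was approximating the Q-function with a deep convolutional network and stabilizing training (experience replay, target networks); the update it performs is ordinary Q-learning. The tabular sketch below shows that update on a made-up five-state corridor, with all hyperparameters chosen for illustration.

```python
import numpy as np

n_states, n_actions = 5, 2               # toy corridor; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))      # DQN replaces this table with a deep network
alpha, gamma, epsilon = 0.1, 0.95, 0.3   # step size, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(2000):
    s = 2                                # start in the middle of the corridor
    while s not in (0, n_states - 1):    # both ends are terminal
        # Epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = s - 1 if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the right end
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))                        # right-moving actions should score higher
```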
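The key move in ResNets is the identity shortcut: each block learns a residual correction F(x) and adds it back to its input, so very deep stacks remain trainable. Below is a minimal dense-layer sketch of that idea; real ResNets use convolutions and batch normalization.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Toy residual block: output = ReLU(x + F(x)), with F a small two-layer map."""
    relu = lambda z: np.maximum(z, 0)
    f = relu(x @ W1) @ W2        # the learned residual function F(x)
    return relu(x + f)           # identity "skip" connection around F

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 16))
W1, W2 = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
print(residual_block(x, W1, W2).shape)   # (1, 16): same shape, so blocks can be stacked
```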
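Finally, the Transformer’s core operation, scaled dot-product self-attention: every position compares its query against every key, and the resulting softmax weights mix the values. This single-head numpy sketch omits the multi-head projections, masking, and positional encodings of the full architecture.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # scaled pairwise similarities
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over positions
    return weights @ V                             # attention-weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                            # a toy "sentence" of 5 token vectors
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```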