Why LLMs Fail at Chess: AI Hallucinations, Patterns vs Rules

Here's a wild fact:

Chess.com's engine (powered by Stockfish) can beat Gukesh, who has himself beaten Magnus Carlsen, arguably the greatest chess player alive, without breaking a sweat.
GPT-5.5, one of the most powerful language models ever built, can write PhD-level essays, debug production code, and explain quantum physics. Yet it cannot reliably survive 10 moves in a casual chess game without making an illegal move.

Let that sink in.

The same AI that helped you write your resume, summarize 200-page documents, and explain recursion to a 5-year-old… cannot move a rook in a straight line without hallucinating.

And the kicker? It doesn't matter how much detail you give it.

You can hand it the full board state, every piece, every square, explicitly tell it "there is a forced checkmate in 1 move," and it will still respond with something like:

"Great move! I'll counter by moving my bishop from d4 to g6 — a powerful diagonal threat!"

The Setup: Two Very Different Kinds of "Chess AI"

Before we dive into why LLMs fail, we need to separate two completely different things that people call "chess AI":

Type 1: Classical Chess Engines (Stockfish, Leela, AlphaZero)

These are purpose-built systems that:

Maintain an exact board state (an 8×8 grid, piece positions, castling rights, en passant flags — all tracked precisely in memory)
Generate only legal moves (a dedicated move generator validates every option)
Evaluate positions using evaluation functions or neural networks trained specifically to assess chess
Search ahead using algorithms like Minimax, Alpha-Beta pruning, or Monte Carlo Tree Search (MCTS)
Run millions of position evaluations per second
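
To make the "search ahead" step less abstract, here's a minimal sketch of minimax with alpha-beta pruning on a tiny hand-made game tree. The `GameNode` type, the tree, and the leaf scores are all invented for illustration; a real engine generates its tree from actual chess positions and is vastly more sophisticated.

```typescript
// A toy game tree: leaves carry a static evaluation, inner nodes carry children.
// (Hypothetical types for illustration; not how any real engine stores positions.)
type GameNode = { score: number } | { children: GameNode[] };

let leavesVisited = 0; // counts evaluations so the effect of pruning is visible

function alphaBeta(node: GameNode, alpha: number, beta: number, maximizing: boolean): number {
  if ("score" in node) {
    leavesVisited++;
    return node.score;
  }
  let best = maximizing ? -Infinity : Infinity;
  for (const child of node.children) {
    const value = alphaBeta(child, alpha, beta, !maximizing);
    if (maximizing) {
      best = Math.max(best, value);
      alpha = Math.max(alpha, best);
    } else {
      best = Math.min(best, value);
      beta = Math.min(beta, best);
    }
    // The other side already has a better option elsewhere, so stop exploring.
    if (alpha >= beta) break;
  }
  return best;
}

// Three candidate moves for the maximizing side, each with two replies.
const tree: GameNode = {
  children: [
    { children: [{ score: 3 }, { score: 5 }] },
    { children: [{ score: 2 }, { score: 9 }] },
    { children: [{ score: 0 }, { score: 1 }] },
  ],
};

const bestScore = alphaBeta(tree, -Infinity, Infinity, true);
console.log(bestScore, leavesVisited); // prints: 3 4
```

Only 4 of the 6 leaves get evaluated. Once the first branch guarantees a score of 3, any branch where the opponent can force something lower is abandoned early. That pruning is a big part of how engines search so deep.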

Stockfish searches on the order of tens of millions of positions per second on a modern desktop CPU, and more on server-class hardware.
Gukesh calculates maybe 15 positions per second.
The engine wins. Every time.

Type 2: Large Language Models (ChatGPT, Claude, Gemini)

These are general-purpose text generators that:

Were trained to predict the next word in a sequence
Have seen millions of chess games, books, and forums in their training data
Can describe chess brilliantly
But at runtime, have no board, no rules engine, no move validator
Just… vibes and statistics

This is like asking a chess commentator to play a game.
They've watched 10,000 matches, they sound like they know what they're doing, they'll use all the right terminology — but the moment they sit down at the board, things fall apart.

What Actually Happens Inside an LLM Playing Chess

Let me give you a real example of what happens when you ask an LLM to play chess.

Suppose you say:


White: King on g1, Queen on d1, Rook on f1, Pawns on f2, g2, h2
Black: King on g8, Rook on f8, Pawns on f7, g7, h7

It is White's turn. There is a forced checkmate in 1. What is the best move?

A chess engine would respond in 0.001 seconds:


Qd1-h5#  (or the correct mating square for that position)

An LLM might say:


The best move here is Rf1-f7! This powerful rook lift attacks the f7 pawn,
creates threats along the 7th rank, and puts pressure on Black's position.
After Rxf7, Black faces significant material loss. ♟️

That isn't checkmate.
That isn't even check.
It isn't even legal: White's own pawn on f2 blocks the rook's path up the f-file.
But it sounds like expert commentary, and that's exactly the problem.

The LLM isn't calculating. It's performing. It's generating text that sounds like a grandmaster explaining a move, because that's what appears in its training data when chess positions are discussed.

The Root Cause: Trained on Descriptions, Not on Rules

Here is the central truth of this entire post:

LLMs learned chess by reading about it. Not by playing it.

During training, they ingested:

Millions of chess game transcripts in PGN format
Books like "My System" by Nimzowitsch, "How to Reassess Your Chess"
Chess.com articles, Reddit threads, YouTube transcripts
Forum discussions with phrases like "after Rxf7, White is completely winning"

From all of this, the model learned to:

Associate certain position types with certain move patterns
Produce fluent chess commentary
Recognize when a position "looks tactical"
Use the right chess vocabulary in context

But it never learned the rules as executable constraints. It learned them as described patterns in text.

So when you ask it to play, it doesn't think:


1. Generate all legal moves for this position
2. Evaluate each move
3. Pick the best one

It thinks:


"What token sequence typically follows this kind of description
in chess discussions from my training data?"

That is the exact same process it uses to write a poem or summarize a document. Chess requires something fundamentally different: strict, deterministic rule execution — and that's simply not what the Transformer architecture was built for.
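
Here's a rough idea of what "rules as executable constraints" looks like in code: a minimal sketch of rook move generation, run on the example position from earlier in the post. The `Board` type and the uppercase/lowercase piece convention are my own assumptions for this sketch, and it deliberately ignores checks and pins, so treat it as an illustration rather than a complete move generator.

```typescript
// Squares are algebraic strings like "f1". The board maps occupied squares
// to a piece code: uppercase = White, lowercase = Black (an assumed convention).
type Board = Map<string, string>;

const FILES = "abcdefgh";

function isWhite(piece: string): boolean {
  return piece === piece.toUpperCase();
}

// Generate rook destinations by sliding along each file/rank direction
// until the edge of the board or a blocking piece is reached.
function rookMoves(board: Board, from: string): string[] {
  const rook = board.get(from);
  if (rook === undefined || rook.toUpperCase() !== "R") return [];
  const file = FILES.indexOf(from[0]);
  const rank = Number(from[1]);
  const moves: string[] = [];
  const directions: Array<[number, number]> = [[1, 0], [-1, 0], [0, 1], [0, -1]];
  for (const [df, dr] of directions) {
    let f = file + df;
    let r = rank + dr;
    while (f >= 0 && f < 8 && r >= 1 && r <= 8) {
      const target = FILES[f] + String(r);
      const occupant = board.get(target);
      if (occupant !== undefined) {
        // An enemy piece can be captured; a friendly piece blocks outright.
        if (isWhite(occupant) !== isWhite(rook)) moves.push(target);
        break;
      }
      moves.push(target);
      f += df;
      r += dr;
    }
  }
  return moves;
}

// The example position from earlier in the post.
const examplePosition: Board = new Map([
  ["g1", "K"], ["d1", "Q"], ["f1", "R"],
  ["f2", "P"], ["g2", "P"], ["h2", "P"],
  ["g8", "k"], ["f8", "r"], ["f7", "p"], ["g7", "p"], ["h7", "p"],
]);

const whiteRookMoves = rookMoves(examplePosition, "f1");
console.log(whiteRookMoves);                        // prints: [ 'e1' ]
console.log(whiteRookMoves.indexOf("f7") !== -1);   // prints: false
```

The generator never "considers" Rf1-f7, because the loop stops at the pawn on f2. There is no probability involved. The move simply isn't in the list.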

The State Tracking Problem: The Board That Exists Only in Imagination

Here's where things get genuinely painful.

Unlike Stockfish — which stores the board in actual memory as a precise data structure — an LLM has to reconstruct the board from the conversation history every single time it generates a response.

Think of it this way:

Stockfish has a board object. It's concrete. Think board[4][3] = WHITE_QUEEN. A single source of truth, updated exactly once per move.
An LLM has a long string of text. It reads that text, forms an internal "impression" of where things are, and generates a response based on that impression.

The impression is never perfectly accurate.
And it drifts.

What "drift" looks like in practice

Turn 1: You describe the full board. LLM seems fine.
Turn 4: LLM moves a piece two squares away from where it was.
Turn 7: LLM captures a piece that it already captured on Turn 3.
Turn 9: LLM forgets a pawn was promoted to a queen.
Turn 12: LLM puts you in check with a piece that's on the completely wrong side of the board.

You correct it. It apologizes. It "fixes" one piece — and accidentally moves another one.

Users testing LLMs extensively on chess report that once the state drifts, it almost never fully recovers. Fixing piece A causes piece B's position to become uncertain. The model is trying to maintain a coherent narrative, not an accurate board.
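
For contrast, here's a minimal sketch of explicit state tracking. The `BoardState` class and its method names are hypothetical, and it only tracks where pieces are (it does not check whether a move is legal). The point is just that a data structure cannot "forget" a capture.

```typescript
type Move = { from: string; to: string };

// Minimal explicit state: a map from square to piece code.
// It tracks placement only; it does not check move legality.
class BoardState {
  private squares = new Map<string, string>();

  constructor(initial: Record<string, string>) {
    for (const square of Object.keys(initial)) {
      this.squares.set(square, initial[square]);
    }
  }

  // Apply a move and return the captured piece, if any.
  // Throws when the source square is empty, which is exactly the
  // "moving a piece that is no longer there" failure described above.
  apply(move: Move): string | undefined {
    const piece = this.squares.get(move.from);
    if (piece === undefined) {
      throw new Error(`No piece on ${move.from}`);
    }
    const captured = this.squares.get(move.to);
    this.squares.delete(move.from);
    this.squares.set(move.to, piece);
    return captured;
  }

  pieceAt(square: string): string | undefined {
    return this.squares.get(square);
  }
}

const boardState = new BoardState({ c4: "B", f7: "p", g8: "k" });
const capturedPiece = boardState.apply({ from: "c4", to: "f7" }); // bishop takes pawn
console.log(capturedPiece, boardState.pieceAt("f7"), boardState.pieceAt("c4")); // prints: p B undefined

let rejected = false;
try {
  boardState.apply({ from: "c4", to: "d5" }); // the bishop already left c4
} catch {
  rejected = true;
}
console.log(rejected); // prints: true
```

After the capture, the pawn is gone and the bishop is on f7, full stop. Asking the tracker to move "the bishop on c4" fails loudly instead of producing a confident story.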

Research has shown that in 400 consecutive chess games played by LLMs, the vast majority contained at least one illegal move — despite explicit instructions to follow the rules. In some tests, models played invalid moves within the first 5 moves.

The Hallucination Layer: When AI Starts Lying About Chess

Here's the most fascinating (and frustrating) part.

After making an illegal move, an LLM will often:

Justify it confidently — "This bishop move to f5 creates a powerful pin on the knight!"
(There is no knight to pin. It was captured 6 moves ago.)
Double down when challenged — "You're right to question this, but the bishop on f5 is still very much in the game and controlling key squares."
(It wasn't captured in the model's version of reality.)
Partially correct itself — "I apologize, the bishop was indeed captured. Let me recalculate."
Then it makes a different illegal move.

This is hallucination in its purest form.

The model is generating text that sounds like it describes a real board state — but that "board state" is a fiction assembled from probabilistic inference, not from tracking actual moves. When it fills in the gaps, it does so by generating whatever seems most plausible given the context, not by consulting any ground truth.

A chess engine cannot hallucinate. There is no concept of hallucination in Stockfish.
The board is a data structure. It is what it is. The rules are hard-coded constraints. A move is legal or it isn't.

For an LLM, everything is probabilistic. Including whether that bishop is still alive.

A Personal Example: When I Handed It a Forced Mate

I want to walk you through what actually happened when I tested this.

I gave the model a position with checkmate in 1 move — the most basic tactical scenario in chess. The winning move was Qh7#. It was the only legal move that ended the game. The queen was on h5, the black king was on g8 with no escape squares, and there was no piece defending h7.

I explicitly told the model:

"There is a forced checkmate in 1 move in this position. What is it?"

The model responded:

"The winning move here is Rg1-g7! By doubling the rooks on the 7th rank, White creates overwhelming pressure and Black will be unable to defend."

That is not checkmate. That's not even check.

I said: "That's not checkmate. Look at the queen on h5."

It replied: "You're absolutely right! The decisive move is Qh5-f7+! A powerful check that drives the king to the edge!"

Also not checkmate.

I said: "The answer is Qh7#. The queen moves from h5 to h7. That's checkmate. Why didn't you see that?"

It replied: "What a brilliant move! Qh7# is a textbook smothered mate concept — the queen sweeps in with devastating effect! I should have seen that immediately."

(Smothered mate involves a knight. This was not smothered mate. But it sounded right, so the model went with it.)

This interaction sums up the problem perfectly:

Can't calculate. Missed the only winning move.
Sounds authoritative. Explains wrong moves fluently.
Hallucinated terminology. Applied "smothered mate" incorrectly.
Agreed immediately when corrected. Because it's pattern-matching approval signals, not verifying the actual answer.

But Wait — ChatGPT Can Do Math Sometimes, Right? Why Not Chess?

Fair question. LLMs have gotten better at multi-step reasoning. They can solve some algebra, some logic puzzles, some coding challenges. Why not chess?

A few key differences:

1. Math has tokens that map to truth

When you write 2 + 2 =, the token 4 has an overwhelmingly strong statistical signal from training. The model essentially interpolates from patterns that reliably correspond to correct answers.

Chess is combinatorially explosive. Shannon's classic estimate puts the number of possible games at around 10^120, far more than the roughly 10^80 atoms in the observable universe. Once a game leaves well-known opening theory, the model almost certainly hasn't seen your exact position before, so it has no reliable pattern to fall back on.

2. Chess requires persistent exact state

Even if the model gets each individual move right, it needs to track the cumulative effect of every single move that came before it — who took what, what squares are controlled, what special moves are available.

A single error in that tracking cascades through the rest of the game. Short math problems are far less exposed to this, because each calculation is relatively self-contained.

3. Legality vs. probability

In math, there's a smooth gradient: answers can be "close but off." In chess, a move is either legal or it isn't. There is no partial credit. Moving a bishop like a rook is not "almost right." It's wrong in a way that corrupts the entire game.

What the Numbers Say

Research on LLMs playing chess is consistent and damning:

"One good game in 400" — studies find that nearly all games played by LLMs include at least one illegal move, with the model unable to complete a fully legal game reliably.
Even chess-tuned models (fine-tuned specifically on chess data) still make illegal moves in complex positions involving pins and discovered attacks.
Position sensitivity — LLMs play better when the position "looks like" something commonly discussed in chess literature (e.g., famous openings) and dramatically worse in random or unusual positions. This is the tell-tale sign of pattern matching.
Model scaling doesn't fix it — bigger models (more parameters, more training data) make fewer illegal moves but do not solve the fundamental problem. The improvement plateaus well below "reliably legal play."

The Deeper Point: This Is Not a Chess Problem

Here's what makes this worth a full blog post:

Chess is the ideal test case for evaluating reasoning ability in AI:

Finite, well-defined state space
Fully observable (no hidden information)
Fixed, simple rules
Objective evaluation (legal/illegal, win/lose)
Scalable complexity (you can go from "mate in 1" to "best move in a GM game")
Ground truth provided by engines (you can verify every answer)

If an AI fails in this controlled, predictable domain, that's meaningful.

It tells us:

The model does not have a true world model — an internal representation of reality that it actually uses for decision-making.
It cannot reliably execute deterministic, rule-bound tasks even with complete information.
It is vulnerable to confident hallucinations in any domain where it learned from descriptions rather than from actually doing.

This has real implications beyond chess:

Should you trust an LLM to debug concurrent code that requires careful state tracking? (Think twice.)
Should you trust it for legal compliance checks where one wrong step matters? (Absolutely not alone.)
Should you trust it for financial calculations with cascading dependencies? (Pair it with a verifier.)

Chess just happens to be the cleanest way to see the cracks.

The Fix: How to Actually Make AI Play Legal Chess

Here's the good news. The solution is elegant and generalizable:

Solution 1: Tool Calling (The Right Way to Do This)

Don't ask the LLM to play chess. Ask it to talk about chess while a chess engine plays chess.


┌───────────────────────┐
│   User's chess app    │
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│          LLM          │  ← Natural language, explanations,
│  (ChatGPT / Claude)   │    commentary, response to user
└───────────┬───────────┘
            │ calls tool
┌───────────▼───────────┐
│     Chess Engine      │  ← Board state, legal move
│   (Stockfish / API)   │    generation, evaluation
└───────────────────────┘

The LLM handles:

Translating your casual input ("I want to play aggressively")
Explaining why a move is good
Adjusting difficulty based on your level
Making the experience conversational

The engine handles:

Board state (100% accurate)
Legal move generation
Evaluation (no hallucinations possible)

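Here's how the backend half might look in a MERN stack (Node.js and Express):

// Node.js backend: chess endpoint where chess.js owns board state and legality.
// The engine reply is a random-move placeholder; swap in Stockfish for real play.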
import express from "express";
import { Chess } from "chess.js"; // handles board state and legal moves
import OpenAI from "openai";

const app = express();
app.use(express.json());

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// In-memory game state (use Redis/DB for production)
const games: Record<string, Chess> = {};

app.post("/move", async (req, res) => {
  const { gameId, move, userMessage } = req.body;

  // 1. Get or create game instance
  if (!games[gameId]) games[gameId] = new Chess();
  const game = games[gameId];

  // 2. Validate and apply user's move via chess.js (not the LLM).
  // chess.js v1.x throws on an illegal move; older releases returned null.
  try {
    if (!game.move(move)) throw new Error("Illegal move");
  } catch {
    return res.status(400).json({ error: "Illegal move" });
  }

  // 3. Get current board state as FEN (ground truth)
  const fen = game.fen();

  // 4. Ask LLM to explain the position and suggest next moves
  // The LLM never touches board state — chess.js owns that
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `You are a chess coach. The current board position is: ${fen}
Legal moves for this position: ${JSON.stringify(game.moves())}
Never suggest a move outside this list. Explain the position and help the user improve.`
      },
      { role: "user", content: userMessage || "What should I think about?" }
    ]
  });

  // 5. Make engine's response (in production, call Stockfish here)
  // For now, pick a random legal move as placeholder.
  // Skip the reply if the user's move already ended the game.
  let engineMove: string | null = null;
  if (!game.isGameOver()) {
    const legalMoves = game.moves();
    engineMove = legalMoves[Math.floor(Math.random() * legalMoves.length)];
    game.move(engineMove);
  }

  res.json({
    boardFen: game.fen(),
    engineMove,
    explanation: completion.choices[0].message.content,
    gameOver: game.isGameOver(),
    status: game.isCheckmate() ? "checkmate" : game.isStalemate() ? "stalemate" : "ongoing"
  });
});

app.listen(3000, () => console.log("Chess API running on port 3000"));

Notice the key principle: chess.js owns the board. The LLM owns the conversation. They never swap roles.

Solution 2: Verifier in the Loop

For domains beyond chess, the general pattern is:

LLM proposes → External verifier checks → Accept or retry

For chess specifically:

LLM suggests a move string → Feed to chess.js/Stockfish → 
If illegal: send "That move is illegal, try again" back to LLM → 
If legal: apply it

Cap this at 3–5 retries, after which you fall back to a random legal move from the engine. It's messy but produces legal games.
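
Here's a minimal sketch of that propose, verify, retry loop. The `Proposer` type is a hypothetical stand-in for an LLM call (kept synchronous here for brevity; a real API call would be async), and the legal move list is assumed to come from a rules engine such as chess.js.

```typescript
// Stand-in for an LLM call: receives feedback about the last rejected move
// (or null on the first attempt) and returns a move string.
type Proposer = (feedback: string | null) => string;

type Outcome = { move: string; source: "llm" | "fallback"; attempts: number };

function chooseVerifiedMove(propose: Proposer, legalMoves: string[], maxRetries: number = 3): Outcome {
  if (legalMoves.length === 0) throw new Error("No legal moves: the game is over");
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const candidate = propose(feedback).trim();
    // The verifier, not the model, decides whether the move is accepted.
    if (legalMoves.indexOf(candidate) !== -1) {
      return { move: candidate, source: "llm", attempts: attempt };
    }
    feedback = `"${candidate}" is illegal, try again. Legal moves: ${legalMoves.join(", ")}`;
  }
  // Retries exhausted: fall back to a move the verifier knows is legal.
  const fallback = legalMoves[Math.floor(Math.random() * legalMoves.length)];
  return { move: fallback, source: "fallback", attempts: maxRetries };
}

// Scripted fake model: hallucinates once, then proposes a legal move.
const scripted = ["Bg6", "Nf3"];
let calls = 0;
const fakeLlm: Proposer = () => scripted[Math.min(calls++, scripted.length - 1)];

const outcome = chooseVerifiedMove(fakeLlm, ["e4", "d4", "Nf3"]);
console.log(outcome.move, outcome.source, outcome.attempts); // prints: Nf3 llm 2
```

Whatever the model says, only moves that pass the verifier ever reach the board. The retry cap just bounds how long you're willing to argue with it.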

Solution 3: Constrain the Output Space

Instead of asking the LLM to generate a move freely, give it only the legal moves and ask it to choose:

Legal moves in this position: [e4, d4, Nf3, Nc3, g3, b3, ...]
Which of these moves would a strong player prefer and why?

Now the LLM is doing what it's good at: reasoning about options in natural language. It's not generating moves from scratch. It's evaluating pre-validated choices.

This significantly reduces hallucination and keeps the output useful.
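
A minimal sketch of the idea, assuming the legal moves arrive as SAN strings from a rules engine. The function names `buildChoicePrompt` and `parseChoice` are made up for this example, and the parsing is deliberately simple: it accepts the first token in the reply that appears in the legal list.

```typescript
// Build a prompt that only offers pre-validated moves.
function buildChoicePrompt(legalMoves: string[]): string {
  return [
    `Legal moves in this position: [${legalMoves.join(", ")}]`,
    "Reply with exactly one move from the list, then a one-sentence reason.",
  ].join("\n");
}

// Extract the model's choice, accepting it only if it is on the list.
// Returns null when the reply names no listed move, so the caller can retry.
function parseChoice(reply: string, legalMoves: string[]): string | null {
  // Split on anything that cannot be part of a SAN move (keeps +, #, =, -).
  const tokens = reply.split(/[^A-Za-z0-9+#=-]+/).filter((t) => t.length > 0);
  for (const token of tokens) {
    if (legalMoves.indexOf(token) !== -1) return token;
  }
  return null;
}

const legal = ["e4", "d4", "Nf3", "Nc3", "g3", "b3"];
console.log(buildChoicePrompt(legal));
console.log(parseChoice("Nf3, because it develops a piece and controls e5.", legal)); // prints: Nf3
console.log(parseChoice("I would play Bg6 for a diagonal threat!", legal));           // prints: null
```

Even with a constrained prompt, it's worth validating the reply. The model can still name a move that isn't on the list, and the `null` result is your signal to retry or fall back.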

What LLMs Are Actually Good at in Chess

To be fair (and because this post isn't meant to be pure AI-bashing):

LLMs are genuinely excellent at chess-adjacent tasks:

Explaining concepts: "What is a discovered attack?" gets a brilliant explanation.
Annotating famous games: Ask it to explain Kasparov's Immortal Game move by move.
Opening theory discussion: "What are the main ideas behind the Sicilian Najdorf?"
Training material generation: "Give me 5 endgame positions to practice K+R vs K."
Post-game analysis in natural language: "Here's my game in PGN. Where did I go wrong conceptually?"

These tasks are fundamentally descriptive and analytical, not executory. They play to the model's strengths: pattern recognition, synthesis, and language.

It's when you ask the LLM to do chess (rather than talk about chess) that it falls apart.

The Philosophical Punchline

There's something almost poetic about this failure.

We built AI that can:

Pass the bar exam
Write better poetry than most humans
Explain black holes to a 10-year-old
Hold a philosophical debate about free will

But it cannot reliably keep track of 32 pieces on an 8×8 board.

Because fluency and reasoning are not the same thing.

Chess requires:

Precise state tracking
Deterministic rule execution
Forward planning under constraints
A clean separation between "what is legal" and "what looks good"

Current LLMs were designed for associative, pattern-driven language tasks. They're brilliant at those. But when the task requires strict symbolic computation with zero tolerance for error, they're in the wrong paradigm.

The takeaway isn't "AI is dumb." The takeaway is:

LLMs are not universal reasoners. They're very good at specific tasks, and the failure mode is precisely when you mistake fluency for understanding.

When an LLM explains its illegal bishop move with confident grandmaster vocabulary, that's not intelligence. That's a language model doing what it was built to do: generating the most statistically plausible next tokens.

It just happens to be wrong about the bishop.

Final Thought: Don't Trust the Confident Tone

The next time you interact with an AI system — in any domain, not just chess — remember the bishop.

The bishop that was captured 6 moves ago. The bishop the model still swears is on the board, controlling key diagonals, threatening the queenside.

Confidence of tone ≠ accuracy of content.

Ask for verification. Ask for sources. Ask the model to show its work. And for anything where correctness actually matters — add a verifier in the loop.

Because the model is always generating the next plausible token. Whether or not the bishop is still alive.

If this breakdown was useful, follow AI Under The Hood for more real-world dives into what AI actually does under the surface — versus what it claims to do.

Have you caught an LLM making embarrassing mistakes in a domain you know well? Drop it in the comments.
