The AI Could Read the Chessboard. It Still Couldn't Solve the Puzzle.

I recently tried a simple experiment: I took a photo of a daily chess puzzle on my phone and asked OpenAI’s GPT-5.5 Thinking and Google’s Gemini 3.1 Pro to solve it.

The models could describe the position, identify pieces and threats, and discuss candidate moves in the language of a chess coach. But they still failed to solve the puzzle, even after a long, token-expensive attempt at reasoning.

That failure was not just a one-off. It reflects a broader pattern seen in LLM chess benchmarks: models may produce fluent chess commentary while still struggling with the harder parts of the task, such as preserving board state, following legal-move constraints, and making correct decisions over multiple turns. (LLM Chess Leaderboard)

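The result surprises many people because, in chess, we are used to computers being the ultimate authority. But the computers that earned that reputation are dedicated chess engines, not general-purpose language models.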

Perception is not planning. Fluency is not verification. A model that can describe a problem has not necessarily solved it.

Seeing the board is not solving the position

Chess separates AI capabilities very cleanly.

First, there is perception: can the system identify the board, the pieces, the side to move, and the current state of the game?

Then there is calculation: can it generate legal moves, evaluate candidate lines, track consequences several moves ahead, and choose the best continuation?

Those are not the same problem.

A multimodal LLM can be good at the first part because it can map an image into concepts, concepts into language, and language into a fluent explanation. But chess calculation requires something stricter: a formal board state, legal move generation, search, evaluation, and a way to verify that every intermediate step still follows the rules.

If the model imagines a piece on the wrong square, forgets a blocker, proposes an illegal move, or misses a tactical reply, the answer is not slightly off. It is broken.
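To make the "forgotten blocker" failure concrete, here is a minimal sketch of how a formal board state decides what a rook can reach. The dictionary board format and the helper names (`rook_targets`, `sq`) are my own illustrative assumptions, not how any real engine stores positions, and the sketch ignores pins and checks.

```python
# Minimal sketch: squares a rook can reach on a sparse board.
# The board is a dict mapping (file, rank) -> piece code, e.g. "wR" or "bP".
# This representation is an illustrative assumption, not a real engine format.

ROOK_DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def rook_targets(board, square):
    """Return the set of squares the rook on `square` can move to."""
    color = board[square][0]
    targets = set()
    for d_file, d_rank in ROOK_DIRECTIONS:
        f, r = square[0] + d_file, square[1] + d_rank
        while 0 <= f < 8 and 0 <= r < 8:
            occupant = board.get((f, r))
            if occupant is None:
                targets.add((f, r))
            else:
                if occupant[0] != color:
                    targets.add((f, r))  # capture is allowed, then the ray stops
                break  # any piece, friend or foe, blocks the rest of the ray
            f, r = f + d_file, r + d_rank
    return targets


def sq(name):
    """Convert algebraic notation like 'a1' to zero-based (file, rank) indices."""
    return (ord(name[0]) - ord("a"), int(name[1]) - 1)


true_board = {sq("a1"): "wR", sq("a4"): "wP", sq("a8"): "bK"}
drifted_board = {sq("a1"): "wR", sq("a8"): "bK"}  # the pawn on a4 was "forgotten"

print(sq("a8") in rook_targets(true_board, sq("a1")))     # False
print(sq("a8") in rook_targets(drifted_board, sq("a1")))  # True
```

One dropped pawn turns an impossible move into an apparent check on the king. With an explicit board, that error is detectable. In a stream of text, it usually is not.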

Natural language is forgiving. Chess is not.

That is why the example matters. In many business settings, a fluent answer can hide a broken process. Chess does not allow that. A move is legal or illegal. A tactic works or it does not. The board state is preserved or it drifts.

But AI is already superhuman at chess, right?

Yes. But that is exactly the point.

Stockfish, one of the strongest chess engines in the world, is not a chatbot that became good at chess by being generally intelligent. It is a chess engine. It represents the board exactly, generates legal moves, searches future positions, evaluates board states, and verifies the consequences of moves.

Modern Stockfish also uses neural-network evaluation, but that component is designed to evaluate positions quickly inside the engine’s search loop. The AI component is powerful, but it is not trying to have a conversation. It is part of a specialized chess system.

The lesson is this:

A chess engine spends its compute on the structure of chess. A general LLM spends its compute on generating the next token.

That distinction matters.

A language model may have seen millions of chess games. It may know openings, famous sacrifices, tactical motifs, and strategic principles. But unless it is connected to the right tools or trained specifically for the task, it is trying to compress perception, legality, search, evaluation, and explanation into one stream of text.

That is the wrong shape for the job.

The enterprise version of the same mistake

Most enterprise AI problems do not look like chess. But many of them have chess-like parts:

They involve state.
They have constrained actions.
Small errors can invalidate the result.
The system needs to reason across multiple steps.
There is often an external tool that can verify part of the answer.

Regulatory intelligence is a good example. An LLM can summarize a regulation beautifully. But if it applies the wrong jurisdiction, misses the effective date, or invents a clause, the answer becomes dangerous. The right system needs retrieval, source grounding, version tracking, citations, and validation.
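As a rough illustration of what "validation" could mean here, the sketch below checks a drafted citation against a retrieved source record. The record layout, field names, and sample data are all hypothetical; a real system would pull them from an authoritative, versioned source.

```python
# Minimal sketch: validate a drafted answer's citation against a source record.
# The record layout, field names, and sample data are hypothetical.
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceRecord:
    jurisdiction: str
    effective_from: date
    clauses: frozenset  # clause identifiers present in the authoritative text


@dataclass(frozen=True)
class Citation:
    jurisdiction: str
    clause: str
    as_of: date  # the date the answer claims to apply to


def check_citation(citation, source):
    """Return a list of problems; an empty list means the citation checks out."""
    problems = []
    if citation.jurisdiction != source.jurisdiction:
        problems.append("wrong jurisdiction")
    if citation.as_of < source.effective_from:
        problems.append("not yet in effect on the cited date")
    if citation.clause not in source.clauses:
        problems.append("clause not found in source")
    return problems


source = SourceRecord("EU", date(2025, 1, 1), frozenset({"4.1", "4.2", "7.3"}))
good = Citation("EU", "4.2", date(2025, 6, 1))
bad = Citation("US", "9.9", date(2024, 6, 1))

print(check_citation(good, source))  # []
print(check_citation(bad, source))
```

The checks are deliberately boring. That is the point: jurisdiction, effective date, and clause existence are lookups, not judgment calls, so they do not need to be left to the model.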

Clinical trial planning is another example. A model may help draft a protocol or compare precedent studies. But inclusion criteria, dose constraints, endpoints, timelines, and site feasibility are not just writing tasks. They are structured planning problems.

Software engineering is an obvious case. LLMs are already useful for writing code, but reliable systems compile the code, run tests, inspect errors, check dependencies, and iterate. The compiler and test suite become the verifier.
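A minimal sketch of that generate-verify-iterate loop might look like this. `propose_candidates` is a hypothetical stand-in for successive model attempts, returning canned strings so the example runs without any model. The verifier uses Python's built-in `compile` and a few hard-coded tests, and a real system would run candidate code in a sandbox rather than calling `exec` directly.

```python
# Minimal sketch of a generate-verify-iterate loop.
# `propose_candidates` stands in for an LLM call; it is a hypothetical stub
# that yields canned attempts so the example runs without any model or network.


def propose_candidates():
    """Hypothetical stand-in for successive model attempts at `add(a, b)`."""
    yield "def add(a, b) return a + b"        # does not compile
    yield "def add(a, b):\n    return a - b"  # compiles, fails the tests
    yield "def add(a, b):\n    return a + b"  # compiles and passes


def verify(source):
    """Return (ok, message) after compiling the source and running the tests."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return False, f"compile error: {exc.msg}"
    namespace = {}
    exec(code, namespace)  # assumption: trusted demo input; sandbox this in practice
    add = namespace.get("add")
    if add is None:
        return False, "missing function: add"
    for args, expected in [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]:
        got = add(*args)
        if got != expected:
            return False, f"test failed: add{args} returned {got}, expected {expected}"
    return True, "all tests passed"


def first_verified(candidates):
    """Return the first candidate the verifier accepts, plus the attempt log."""
    log = []
    for attempt, source in enumerate(candidates, start=1):
        ok, message = verify(source)
        log.append((attempt, message))
        if ok:
            return source, log
    return None, log


accepted, log = first_verified(propose_candidates())
for attempt, message in log:
    print(attempt, message)
```

The model never gets to declare its own answer correct. The first two attempts are rejected by the compiler and the tests, and only the third is accepted.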

The same pattern appears in finance, logistics, manufacturing, drug discovery, and real-world evidence. The LLM is valuable, but it should not always be the source of truth. Often, its best role is to translate intent, orchestrate tools, explain results, and make specialized systems easier to use.

The wrong lesson is “just wait for a bigger model”

Some of these problems will improve with larger and better models. That is real. Models are getting better at tool use, visual reasoning, coding, planning, and instruction following.

But “wait for a bigger model” is not an enterprise architecture.

Chess engines are strong because their architecture matches the problem. They preserve state. They know the legal action space. They search. They evaluate. They verify.

Enterprise AI needs the same discipline. Fine-tuning can help. Better prompts can help. Larger models can help. But none of those should replace a system of record, a rules engine, a calculator, a simulator, a compiler, a database, or a human review step when the workflow depends on one.

A useful rule is:

If the task has a reliable external verifier, use it.

Do not ask the LLM to pretend to be the verifier.

What the better architecture looks like

For the chess puzzle, the better system is obvious:

Convert the image into a candidate board state.
Validate the position.
Generate legal moves.
Call Stockfish or another chess engine.
Ask the LLM to explain the result clearly.
Check that the explanation matches the actual line.
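The steps above can be sketched as a small pipeline in which the vision model, the chess engine, and the LLM are passed in as plain callables. Every function name and stub here is a hypothetical placeholder. The position is reduced to the piece-placement field of a FEN string for brevity (a real version would carry the full FEN, including side to move), and the final check is a crude substring match rather than a real move parser.

```python
# Minimal sketch of the pipeline above, with every external component
# (vision model, chess engine, LLM) passed in as a plain callable.
# All function names and stubs here are hypothetical placeholders.


def validate_placement(placement):
    """Basic sanity checks on the piece-placement field of a FEN string."""
    ranks = placement.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        width = 0
        for ch in rank:
            if ch in "12345678":
                width += int(ch)  # a digit encodes that many empty squares
            elif ch in "pnbrqkPNBRQK":
                width += 1
            else:
                return False
        if width != 8:
            return False
    # Exactly one king per side.
    return placement.count("K") == 1 and placement.count("k") == 1


def explanation_matches(explanation, line):
    """Check that the explanation mentions every move of the line, in order."""
    position = 0
    for move in line:
        found = explanation.find(move, position)
        if found == -1:
            return False
        position = found + len(move)
    return True


def solve_puzzle(image, read_board, best_line, explain):
    placement = read_board(image)                    # 1. image -> candidate board state
    if not validate_placement(placement):            # 2. validate the position
        raise ValueError("board reading failed validation")
    line = best_line(placement)                      # 3-4. legal moves and search live in the engine
    explanation = explain(placement, line)           # 5. the LLM explains the verified line
    if not explanation_matches(explanation, line):   # 6. check the explanation against the line
        raise ValueError("explanation does not match the engine line")
    return line, explanation


# Canned stubs so the sketch runs without a vision model, engine, or LLM.
def fake_read_board(image):
    return "6k1/5ppp/8/8/8/8/8/R5K1"  # White: Ra1, Kg1. Black: Kg8, pawns f7 g7 h7.


def fake_best_line(placement):
    return ["Ra8#"]  # hard-coded answer standing in for an engine call


def fake_explain(placement, line):
    return f"White plays {line[0]}, a back-rank mate: Black's own pawns box in the king."


line, explanation = solve_puzzle(b"", fake_read_board, fake_best_line, fake_explain)
print(line, explanation)
```

The design choice that matters is where the raised errors sit. A bad board reading or a drifting explanation stops the pipeline instead of reaching the user as a confident answer.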

In that setup, the LLM is still useful. Very useful.

It handles the user’s intent, clarifies ambiguity, chooses which tools to call, adapts the explanation to the user’s level, and turns raw engine output into something understandable. But it is no longer pretending to be Stockfish.

That same pattern generalizes:

For regulatory work, retrieve the authoritative source before drafting the answer.
For clinical operations, connect to structured trial data and feasibility constraints before proposing a plan.
For software, use the repository, compiler, tests, linters, and runtime traces.

The LLM becomes the communication and orchestration layer around specialized capabilities. That is a much stronger architecture than a chatbot answering from memory.

The practical test I now use

When looking at an enterprise AI use case, I like to ask four questions:

What state must the system preserve?
What actions are valid or invalid?
What tool can verify the result?
What should the LLM explain or orchestrate after verification?

Those questions quickly separate promising AI use cases from risky ones. They also prevent a common mistake: treating the LLM as a monolithic expert instead of one component in a designed system.

What chess teaches us

The chess puzzle looked simple: a board, a position, a best move.

But underneath that simple prompt was a full stack of capabilities: vision, state representation, legal move generation, search, evaluation, verification, and explanation.

A general LLM can touch many of those capabilities. That does not mean it can perform all of them reliably as one process.

That is the enterprise lesson.

The opportunity is not to force LLMs to do everything. It is to put them where they create the most value: translating messy human intent into structured workflows, coordinating specialized tools, explaining results, and making expert systems easier to use.

My bet is that the strongest enterprise AI systems will not be one giant model answering every question from inside its weights. They will be systems grounded in state, connected to tools, constrained by validators, and wrapped in interfaces that people can actually work with.

I would be interested to hear where you draw the line.

When should an LLM answer directly, and when should it be connected to a verifier, engine, database, compiler, simulator, or human review step?

If you are building AI systems in regulated, scientific, technical, or operational environments, I would love to compare notes. Comment with examples you have seen, or reach out if this is a design problem you are working through.