Thinking Was Real. The Illusion Was Yours.

TL;DR: Apple's paper claims LLMs lose reasoning ability as tasks get complex. But their methodology has fundamental flaws - they're measuring token limits, not thinking limits. A rebuttal paper shows the same models solve these problems easily when prompted differently.
So Apple just dropped this paper called The Illusion of Thinking. The premise? That LLMs lose their ability to reason as task complexity grows. Sounds scary, right? But here's the thing - their methodology and scoring choices introduce artifacts that arguably undermine the results. The paper tries to draw general conclusions, but its experiments may be measuring something quite different from reasoning capacity.
The Core Issues
1. Token Limits vs. Reasoning
The paper conflates output length with cognitive complexity. A task requiring 30,000 tokens isn't necessarily harder to reason about - it's just more verbose to express.
2. Impossible Tasks
Some benchmark configurations are mathematically unsolvable, yet models are penalized for not solving them. This is like failing a student for not proving 2+2=5.
Tower of Hanoi: Benchmarking or Bottleneck?
Let's talk about the Tower of Hanoi benchmark. It's used to support the paper's central claim, but there's a catch. At 8 disks, the minimum valid trace is 255 moves - roughly 2,500 tokens. At 15 disks? 32,767 moves, a transcript well beyond the output capacity of any deployed LLM.
Key Numbers:
• 8 disks = 255 steps = ~2,500 tokens
• 15 disks = 32,767 steps = well over 300,000 tokens at the same per-move rate
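To see how quickly this blows up, here's a minimal sketch that computes the optimal move count (2^n − 1) and a rough token estimate. The ~10 tokens-per-move figure is an assumption extrapolated from the 255-move / ~2,500-token number above, not a measured constant:

```python
# Sketch: optimal Tower of Hanoi move counts and a rough output-token estimate.
TOKENS_PER_MOVE = 10  # assumption, extrapolated from 255 moves ~ 2,500 tokens

for disks in (8, 10, 12, 15):
    moves = 2 ** disks - 1  # minimum number of moves for an n-disk tower
    print(f"{disks:>2} disks: {moves:>6,} moves ~ {moves * TOKENS_PER_MOVE:>7,} tokens")

# Output includes:
#  8 disks:    255 moves ~   2,550 tokens
# 15 disks: 32,767 moves ~ 327,670 tokens  (far past any model's output budget)
```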
What the paper frames as "failure" might actually be a model recognizing its own context limits. Several model outputs even state this directly: "Stopping early to avoid verbosity." Scoring this as incorrect seems less like testing reasoning and more like testing token persistence.
The model's choice to truncate isn't necessarily evidence of collapse. It could just be adhering to training norms around concise output and resource optimization.
River Crossing: Unworkable Inputs
Now let's look at the river crossing benchmark. It includes configurations with six or more actor/agent pairs and a boat capacity of three - scenarios that are mathematically unsolvable. Yet models are penalized for not producing a solution.
Irony Alert: When a model correctly identifies that a problem is impossible to solve, it's marked as wrong. This is like failing a student for correctly stating that you can't divide by zero.
If a model correctly identifies that no valid solution exists, and still receives a failing score, it's unclear what exactly is being measured. Recognizing the impossibility of a task is arguably a sign of reasoning, not a failure of it.
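The impossibility claim doesn't have to be taken on faith. Below is a small illustrative sketch - my own, not the paper's evaluation code - that brute-forces the puzzle with a breadth-first search, using the classic jealous-husbands structure (actor/agent pairs, a boat that must carry at least one person, and the rule that an actor may never be with another agent unless their own agent is present):

```python
from collections import deque
from itertools import combinations

def safe(group):
    # A bank (or boat load) is safe if every actor in it is either with
    # their own agent or with no agents at all.
    actors = {i for role, i in group if role == "actor"}
    agents = {i for role, i in group if role == "agent"}
    return all(i in agents or not agents for i in actors)

def solvable(n_pairs, boat_capacity):
    # Brute-force BFS over all states: (people on left bank, boat side).
    people = frozenset(("actor", i) for i in range(n_pairs)) | \
             frozenset(("agent", i) for i in range(n_pairs))
    start = (people, "L")
    seen, queue = {start}, deque([start])
    while queue:
        left, side = queue.popleft()
        if not left:                       # everyone has crossed
            return True
        bank = left if side == "L" else people - left
        for k in range(1, boat_capacity + 1):
            for load in combinations(bank, k):
                load = frozenset(load)
                new_left = left - load if side == "L" else left | load
                state = (new_left, "R" if side == "L" else "L")
                if state in seen:
                    continue
                if safe(load) and safe(new_left) and safe(people - new_left):
                    seen.add(state)
                    queue.append(state)
    return False                           # search space exhausted: no solution

print(solvable(3, 2))   # True  -- the classic 3-pair puzzle is solvable
print(solvable(6, 3))   # False -- 6 pairs with a 3-seat boat has no solution
```

solvable(3, 2) comes back True (the classic puzzle), while solvable(6, 3) exhausts every reachable state and returns False - there is simply nothing for a model to find.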
Scoring Flaws: Binary Thinking in a Continuous Task
The evaluation pipeline does not differentiate between:
- A model that fully understands the logic but avoids printing 1,000 redundant steps
- A model that produces plausible but incorrect move sequences
Both can receive zero. Ironically, verbose errors may score higher than concise solutions. This suggests the metric isn't tuned for general reasoning, but for rote output generation. That's a subtle but crucial distinction.
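For contrast, here's what a less binary grader could look like. This is an illustrative sketch, not the paper's pipeline: it replays a proposed Hanoi move list and reports how long the prefix stays legal, so a correct-but-truncated trace is distinguishable from one that breaks the rules on move three:

```python
def grade_hanoi(moves, n_disks):
    # Replay (src, dst) moves on pegs A, B, C; report how long the prefix
    # stays legal and whether the goal state (all disks on C) is reached.
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # top of peg = end of list
    for i, (src, dst) in enumerate(moves):
        bad = not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1])
        if bad:
            return {"valid_prefix": i, "solved": False}  # first illegal move
        pegs[dst].append(pegs[src].pop())
    return {"valid_prefix": len(moves), "solved": len(pegs["C"]) == n_disks}
```

A truncated-but-legal trace scores a long valid_prefix with solved=False; a hallucinated sequence fails on its first bad move. Binary accuracy flattens both to the same zero.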
The Rebuttal: Alternate Prompt, Alternate Result
In response, a follow-up paper - The Illusion of the Illusion of Thinking - reframes the benchmark. Instead of asking models to print every move, it prompts them to define the recursive logic directly.
Key Finding: When asked to write a recursive function instead of listing moves, the same model solved 15-disk Hanoi in under 5k tokens. The "collapse" wasn't in the model - it was in the prompt design.
No collapse, no failure - just a different mode of expression. The reasoning was always within reach; the bottleneck was the required output format and the scoring rules.
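The rebuttal's demonstration used a Lua function; the same idea in Python takes a handful of lines - the complete solution procedure for any disk count, with no move list printed at all (a sketch of the approach, not the rebuttal's exact prompt or output):

```python
def hanoi(n, src="A", dst="C", aux="B"):
    # Move n disks from src to dst, using aux as the spare peg.
    if n == 0:
        return []
    return (hanoi(n - 1, src, aux, dst)     # park the top n-1 disks on aux
            + [(src, dst)]                  # move the largest disk
            + hanoi(n - 1, aux, dst, src))  # bring the n-1 disks onto it

print(len(hanoi(15)))  # 32767 -- the full 15-disk solution from ~6 lines of logic
```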
Output Length ≠ Reasoning Difficulty
The paper conflates output volume with reasoning complexity. But these are orthogonal.
Tower of Hanoi
• Trivial reasoning
• Exponential output
• Formulaic moves
River Crossing
• Complex reasoning
• Short output
• High constraint juggling
Lumping both into the same evaluation axis risks invalidating the conclusions. If output length is treated as a proxy for difficulty, the benchmark ceases to reflect what it claims to measure.
Broader Concerns: Conclusions Beyond the Data
The most worrying aspect isn't the experimental noise - it's the scale of the conclusions drawn from it. The paper positions its results as foundational limits in LLM reasoning. But if the benchmarks inadvertently penalize compression, abstraction, or brevity, then these aren't limits of thinking. They're limits of evaluation design.
Warning: These misalignments could unintentionally steer future benchmarks and model training toward verbosity rather than clarity. The downstream effects may include overfitting to broken metrics instead of genuine reasoning improvements.
Where Actual Limits Might Be
Large language models do have reasoning limits - some quite serious. But identifying them requires benchmarks that isolate logic from verbosity, and precision from token sprawl.
If our experiments can't distinguish between thoughtful omission and thoughtless confusion, we risk interpreting context restraint as cognitive deficiency.
Final Thought:
What was framed as an illusion of thinking may be better understood as a limitation of measurement. The collapse, if it happened, didn't come from the model - it came from the structure of the task.
References
- Apple's original paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity by Shojaee, Mirzadeh, Alizadeh, Horton, Bengio & Farajtabar (June 2025)
- Response paper: The Illusion of the Illusion of Thinking, a comment on Apple's paper, by C. Opus and Alex Lawsen (June 10, 2025)
- Discussion on Hacker News: HN Thread on Apple's LLM Paper - "Pretty serious flaws in the original paper. 1. Scoring unsolvable challenges as incorrect. 2. Not accounting for token span."
- Reddit Discussion: The 'Reasoning Collapse' is Just a Token Limit - "The collapse point happens exactly when the text for the full solution exceeds the model's maximum output token limit."
- Media Coverage:
- The Guardian - Highlights the "complete accuracy collapse" and reduced reasoning effort at high complexity
- 9to5Mac - Reports on Lawsen's rebuttal and the Lua-function demonstration
- Futurism - Emphasizes Apple's framing of "reasoning collapse" and token-budget paradoxes