'The illusion of thinking': Apple research finds AI models collapse and give up with hard puzzles

You and me both.

By Cecily Mauran



The Tower of Hanoi puzzle is too much for reasoning models at a certain point. Credit: CorbalanStudio / iStock / Getty Images

New artificial intelligence research from Apple shows AI reasoning models may not be "thinking" so well after all.

According to a paper published just days before Apple's WWDC event, large reasoning models (LRMs) — like OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking — completely collapse when they're faced with increasingly complex problems. The paper comes from the same researchers who found other reasoning flaws in LLMs last year.

The news was a bucket of cold water for artificial general intelligence (AGI) optimists (and welcome news for AI and AGI skeptics), as Apple's research seemed to show damning evidence of the limits of reasoning model intelligence. While the much-hyped LRMs performed better than standard LLMs on medium-difficulty puzzles, they performed worse on simple puzzles. And according to Apple's research, when they faced hard puzzles, they collapsed completely, giving up on the problem prematurely.


Or, as the Apple researchers put it, while AI models perform extremely well at math and coding, when it comes to more complex problems, they only provide "The Illusion of Thinking."

Apple was slow to develop large language models and implement AI in its devices, largely staying out of the conversation. The company has added Apple Intelligence AI features, though they have generally been considered underwhelming. With that in mind, this research might explain some of Apple's reticence to go all-in on AI, unlike Google and Samsung, which have frontloaded their devices with AI capabilities.

How Apple researchers tested reasoning skills

The problems researchers used to evaluate the reasoning models are classic logic puzzles like the Tower of Hanoi. The puzzle consists of discs stacked largest to smallest on one of three pegs, and the goal is to move the whole stack to the third peg, moving one disc at a time and never placing a larger disc on top of a smaller one. Other puzzles included jumping checker pieces into empty spaces, the river-crossing problem (the one usually involving a fox, a chicken, and a bag of grain), and stacking blocks in a specific configuration.
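For readers who want to see the logic concretely, the classic recursive solution is only a few lines. Here's a minimal Python sketch (our illustration, not the researchers' evaluation code):

```python
def hanoi(n, source, target, spare, moves):
    """Move n discs from source to target, using spare as scratch space."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the discs above
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # restack on top of it

moves = []
hanoi(5, "A", "C", "B", moves)
print(len(moves))  # 31: the five-disc puzzle takes 2**5 - 1 moves
```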


You probably recognize these logic puzzles from math class or online games; they're a simple way of testing humans' ability to reason and problem-solve. Once you figure out the pattern, it's a simple matter of following the logic even as the complexity increases, which in this case means more discs, checkers, animals, or blocks. However, researchers found that LRMs start to fail after a certain point.

"Results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model specific complexity threshold," researchers wrote. In the results shown, Claude 3.7 Sonnet + thinking and DeepSeek R1 start to fail when a fifth disc is added to the Tower of Hanoi problem. Even when more computing power is applied to the LRMs, they still fail at the more complex puzzles.

What's more, researchers found that reasoning models initially spend more thinking tokens as complexity increases, but they actually give up at a certain point. "Upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty," the paper reads. So when the problems get harder, the models spend fewer tokens; in other words, they "think" less.

But what about when the LRMs are given the answers? Nope, accuracy doesn't improve. Even when researchers included the solution algorithm in the prompt, so all the models needed to do was follow the steps, they continued to fail.
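To picture that setup, here's a rough sketch of what handing a model the algorithm looks like; the wording is our illustration, not the paper's actual prompt:

```python
# Illustrative only: the paper's exact prompt text is not reproduced here.
ALGORITHM = """To solve Tower of Hanoi with n discs:
1. Recursively move the top n-1 discs from the source peg to the spare peg.
2. Move the largest disc from the source peg to the target peg.
3. Recursively move those n-1 discs from the spare peg to the target peg."""

prompt = f"{ALGORITHM}\n\nList every move for n = 10, pegs A, B, C."
# Per the paper, models still collapsed at high disc counts even with
# the procedure spelled out like this.
```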

But before you fire up the grill because LLM reasoning is so cooked, season these findings with a grain of salt. The research doesn't mean LRMs don't reason at all; it just means they may not currently be much smarter than humans. As AI expert Gary Marcus pointed out on his blog, "(ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs." As others have pointed out online, the research does not compare results from human attempts at these puzzles.


Essentially, LLMs have their uses for tasks like coding and writing, but they also have weaknesses. "What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms," wrote Marcus, who has been very vocal about the reasoning limitations of AI models.

That is to say, take the findings from Apple researchers for what they are: important data to be considered within the context of other LLM research. It's tempting to dismiss AI's overall advancements as overhyped when new research like this comes out, or, on the flip side, for AGI boosters to claim victory when research uncovers a new advancement. But the reality is usually somewhere in the boring middle.


Cecily is a tech reporter at Mashable who covers AI, Apple, and emerging tech trends. Before getting her master's degree at Columbia Journalism School, she spent several years working with startups and social impact businesses for Unreasonable Group and B Lab. Before that, she co-founded a startup consulting business for emerging entrepreneurial hubs in South America, Europe, and Asia. You can find her on X at @cecily_mauran.


