Tech
Jon Keegan

Anthropic researchers crack open the black box of how LLMs “think”

OK, first off — LLMs don’t think. They are clever systems that use probabilistic methods to parse language by mapping tokens to underlying concepts via weighted connections. Got it?

But exactly how a model goes from a user’s prompt to “reasoning” its way to a solution is the subject of great speculation.

Models are trained, not programmed, and there are definitely weird things happening inside these tools that humans didn’t build. As the industry struggles with AI safety and hallucinations, understanding this process is key to developing trustworthy technology.

Researchers at the AI startup Anthropic have devised a way to perform a “circuit trace” that lets them dissect the pathways a model follows between concepts on its way to an answer. Their paper sheds new light on this mysterious process, much like a real-time fMRI brain scan can show which parts of the human brain “light up” in response to different stimuli.
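To make that concrete, here is a minimal, hypothetical sketch of the core intuition behind this kind of interpretability work: silence one internal feature at a time and measure how much the model’s answer moves. The tiny two-layer network and the ablation loop below are invented for illustration; Anthropic’s actual method builds attribution graphs over a real model’s learned features.

```python
# Toy illustration of the circuit-tracing idea: knock out one internal
# feature at a time and see how much the model's chosen answer shifts.
# The network here is invented for this sketch, not Anthropic's method.
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(16, 8))   # input (8-dim) -> 16 hidden "features"
W2 = rng.normal(size=(4, 16))   # hidden features -> 4 output logits

def forward(x, ablate=None):
    h = np.maximum(W1 @ x, 0.0)  # ReLU feature activations
    if ablate is not None:
        h[ablate] = 0.0          # silence a single feature
    return W2 @ h

x = rng.normal(size=8)
baseline = forward(x)
answer = int(np.argmax(baseline))

# Rank features by how much ablating each one moves the answer's logit.
effects = sorted(
    ((i, baseline[answer] - forward(x, ablate=i)[answer]) for i in range(16)),
    key=lambda t: abs(t[1]),
    reverse=True,
)
print(f"model picks class {answer}; most influential features:")
for i, delta in effects[:3]:
    print(f"  feature {i:2d}: ablation shifts the answer logit by {delta:+.3f}")
```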

Some of the interesting findings:

  • Language appears to be independent of concepts — it’s trivial for the model to parse a query in one language and answer in another. The French “petit” and the English “small” map to the same underlying concept.

  • When “reasoning,” the model is sometimes just bullshitting you. Researchers found that the “chain of thought” an end user sees does not always reflect the processes actually at work inside the model.

  • Models have invented novel ways to solve math problems. Tracing exactly how the model solved simple arithmetic revealed strategies that no human was ever taught in school (a toy sketch follows this list).
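One hedged, toy illustration of what such a strategy could look like: the researchers reported that the model seems to combine a fuzzy estimate of a sum’s overall magnitude with a separate, precise track for the final digit. The two “pathway” functions below are hypothetical stand-ins written for clarity, not the circuits the paper actually found.

```python
# Hypothetical stand-in for the parallel "pathways" described above:
# a rough magnitude estimate plus a precise last-digit track, merged at
# the end, with no grade-school carrying step anywhere.
def rough_magnitude(a, b):
    # Pathway 1: estimate the sum to the nearest ten (rounding half up).
    return ((a + b + 5) // 10) * 10

def ones_digit(a, b):
    # Pathway 2: the last digit depends only on the operands' last digits.
    return (a % 10 + b % 10) % 10

def combine(a, b):
    # Merge: pick the candidate with the known last digit that sits
    # closest to the rough estimate.
    est, d = rough_magnitude(a, b), ones_digit(a, b)
    candidates = (est - 10 + d, est + d)
    return min(candidates, key=lambda c: abs(c - est))

print(combine(36, 59))  # -> 95, with no explicit carrying
```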

Anthropic made a helpful video that describes the research clearly.

Anthropic is working hard to catch up to industry leader OpenAI as it seeks to grow revenues to cover the expensive computing resources needed to offer its services. Amazon has invested $8 billion in the company, and Anthropic’s Claude model will be used to power parts of the AI-enhanced Alexa.


Nvidia backs Reflection AI in $2 billion fundraising round

When DeepSeek R1 was released at the start of this year, it shook the AI world to its core.

The scrappy Chinese startup developed a competitive open-weights reasoning model that bested state-of-the-art offerings from OpenAI and Google on several benchmarks.

The release caused the industry to question its bet on massive AI infrastructure over clever engineering done with constrained resources.

American startup Reflection AI thinks the West needs its own DeepSeek, and it plans to be the company that builds one.

On Thursday, Reflection AI announced it had raised $2 billion at an $8 billion valuation, with Nvidia leading the round via an $800 million investment.

Reflection does not appear to have developed a frontier-scale model yet, but it has built the software needed to train one. A $2 billion cash infusion will certainly help with the company’s training costs, though for comparison, DeepSeek has said R1 was trained for just $294,000.

Nvidia’s Jensen Huang throws shade at OpenAI-AMD deal

In an interview on CNBC yesterday, Nvidia CEO Jensen Huang threw some shade at the recently announced megadeal between competitor Advanced Micro Devices and Nvidia’s own partner, OpenAI.

The unusual deal calls for AMD to sell multiple generations of its GPUs to OpenAI, totaling 6 gigawatts of computing power, in exchange for stock warrants that allow OpenAI to buy about 10% of the company.
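For a rough sense of where the “about 10%” figure comes from, here is a back-of-the-envelope check. Both inputs are reported figures from outside this article (the warrants are said to cover up to 160 million AMD shares, against roughly 1.6 billion shares outstanding), so treat them as approximate.

```python
# Back-of-the-envelope check on the "about 10%" stake. Both inputs are
# reported figures from outside this article; treat them as approximate.
warrant_shares = 160_000_000        # AMD shares covered by the warrants (reported)
shares_outstanding = 1_620_000_000  # approximate AMD shares outstanding
stake = warrant_shares / shares_outstanding
print(f"implied stake: {stake:.1%}")  # -> implied stake: 9.9%
```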

When asked about the deal, Huang said:

“Yeah, I saw the deal. It’s imaginative, it’s unique and surprising. Considering they were so excited about their next-generation product, I’m surprised that they would give away 10% of the company before they even built it.”

The move diversifies part of OpenAI’s GPU supply chain away from Nvidia, which supplies the vast majority of GPUs for hyperscalers today.

0.6%

The Washington Post’s Geoffrey Fowler tracked prices before and during Amazon’s recent “Prime Big Deal Days” and found the savings to be paltry: on a group of nearly 50 products he’d bought on Amazon over the past six months, he would have saved just 0.6% if he’d bought them during Amazon’s high-profile sale. And those savings, Fowler points out, don’t factor in the annual $139 Prime membership fee.

In a number of cases, big-ticket items like TVs were actually more expensive during the e-commerce giant’s much-hyped discount days than they normally are.
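To put that percentage in perspective, here is a quick hypothetical calculation. The article gives the 0.6% savings rate and the $139 fee, but not Fowler’s total spend, so the spend below is an assumed number chosen only to show the scale.

```python
# The 0.6% rate and the $139 fee come from the article; the annual spend
# is an assumed, hypothetical figure chosen only to show the scale.
annual_spend = 2_000                  # assumed dollars spent on Amazon per year
sale_savings = annual_spend * 0.006   # savings at the observed 0.6% rate
prime_fee = 139
print(f"saved ${sale_savings:.2f} vs a ${prime_fee} Prime fee")
# -> saved $12.00 vs a $139 Prime fee
```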

8%
Rani Molla

Some 8% of kids ages 5-12 have interacted with AI chatbots like OpenAI’s ChatGPT or Google’s Gemini, according to a new Pew Research Center survey of their parents. While that’s nowhere near the usage rates of other devices like smartphones or even voice assistants, it’s still notable for a relatively new technology — especially one that’s already had devastating consequences for young people.
