OpenAI experiment finds that sparse models could give AI builders the tools to debug neural networks


OpenAI researchers are experimenting with a new approach to designing neural networks, with the aim of making AI models easier to understand, debug and control. Sparse models could give companies better insight into how these models make decisions.
Understanding how a model arrives at its responses, a key selling point of enterprise reasoning models, can give organizations a degree of confidence when turning to AI models for insights.
The method has OpenAI researchers view and evaluate models not by analyzing post-training performance, but by building in interpretability, or understanding, from the start through sparse circuits.
OpenAI notes that much of the opacity of AI models stems from how most models are designed, so better understanding model behavior requires rethinking that design.
“Neural networks power today’s most capable AI systems, but they remain difficult to understand,” OpenAI wrote in a blog post. “We don’t write these models with explicit step-by-step instructions. Instead, they learn by adjusting billions of internal connections or weights until they master a task. We design the training rules, but not the specific behaviors that emerge, and the result is a dense web of connections that no human can easily decipher.”
To increase interpretability, OpenAI explored an architecture that trains disentangled neural networks, making them easier to understand. The team trained language models with an architecture similar to existing models, such as GPT-2, using the same training scheme.
The result: better interpretability.
The road to interpretability
Understanding how models work and giving us insight into how they make their decisions is important because they have an impact on the real world, says OpenAI.
The company defines interpretability as “methods that help us understand why a model produced a particular output.” There are different ways to achieve it: chain-of-thought interpretability, which reasoning models often rely on, and mechanistic interpretability, which reverse-engineers the mathematical structure of a model.
OpenAI focused on improving mechanistic interpretability, which the company said “has been less directly useful so far, but could in principle provide a more complete explanation for the model’s behavior.”
“By trying to explain model behavior at the most detailed level, mechanistic interpretability can make fewer assumptions and give us more confidence. But the path from low-level details to explanations of complex behavior is much longer and more difficult,” OpenAI said.
Better interpretability ensures better monitoring and provides early warning signals if the model’s behavior no longer matches policy.
OpenAI noted that improving mechanistic interpretability “is a very ambitious gamble,” but research into sparse networks has improved this.
How to untangle a model
To untangle the mess of connections a model forms, OpenAI first cut most of them. In a dense transformer like GPT-2, each neuron connects to thousands of others; the team forced the vast majority of these weights to zero, so that each neuron talks to only a select few, making the connections far more manageable.
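The idea of forcing most weights to zero can be sketched in a few lines. This is an illustrative simplification, not OpenAI's actual training code: the layer size (768) and the number of connections kept per neuron (`k=8`) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense layer: every one of 768 neurons connects to all 768 inputs.
dense = rng.normal(size=(768, 768))

def sparsify(weights, k=8):
    """Keep only the k largest-magnitude weights per neuron, zero the rest."""
    out = np.zeros_like(weights)
    for i, row in enumerate(weights):
        keep = np.argsort(np.abs(row))[-k:]  # indices of the k strongest connections
        out[i, keep] = row[keep]
    return out

sparse = sparsify(dense, k=8)
print(f"nonzero fraction: {np.count_nonzero(sparse) / sparse.size:.4f}")  # 0.0104 (= 8/768)
```

In a real model the zeroed connections would be enforced during training rather than applied after the fact, but the end state is the same: each neuron interacts with only a handful of others, which is what makes the resulting circuits traceable.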
The team then performed circuit tracing on simple tasks to create groups of interpretable circuits. The final step involved pruning the model “to obtain the smallest circuit that achieves a target loss on the target distribution,” according to OpenAI. It targeted a loss of 0.15 to isolate the exact nodes and weights responsible for a behavior.
“We show that pruning our low-weight models yields roughly 16 times smaller circuits for our tasks than pruning dense models with comparable pretraining loss. We are also able to construct arbitrarily precise circuits at the expense of more edges. This shows that circuits for simple behavior are significantly more disentangled and localizable in low-weight models than dense models,” the report said.
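A toy version of that pruning loop, greatly simplified: greedily zero the weakest weights, keeping only those whose removal would push the loss past the 0.15 target. The linear "model", the reference task and the loss function here are stand-ins for illustration, not OpenAI's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "model": a single weight matrix mapping inputs to outputs.
W = rng.normal(size=(16, 16)) * 0.3
X = rng.normal(size=(64, 16))
target = X @ W  # reference behavior the pruned circuit must preserve

def loss(weights):
    """Mean squared error against the reference behavior."""
    return float(np.mean((X @ weights - target) ** 2))

def prune(weights, target_loss=0.15):
    """Zero weights from weakest to strongest while loss stays under target."""
    pruned = weights.copy()
    order = np.argsort(np.abs(weights), axis=None)  # flat indices, weakest first
    for idx in order:
        saved = pruned.flat[idx]
        pruned.flat[idx] = 0.0
        if loss(pruned) > target_loss:
            pruned.flat[idx] = saved  # undo: this edge belongs in the circuit
    return pruned

circuit = prune(W)
print(f"edges kept: {np.count_nonzero(circuit)} of {W.size}")
```

The weights that survive are the "circuit" for the behavior; in a weight-sparse model far fewer edges survive than in a dense one, which is the roughly 16x reduction the report describes.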
Sparse models are easier to understand, but still small
While OpenAI has managed to create sparse models that are easier to understand, they remain significantly smaller than most foundation models used by enterprises. Enterprises are increasingly adopting small models, but frontier models, such as the flagship GPT-5.1, would still benefit from improved interpretability over time.
Other model developers also want to understand how their AI models think. Anthropic, which has been researching interpretability for a while, recently revealed that it had ‘hacked’ Claude’s brain – and Claude noticed. Meta is also examining how reasoning models make their decisions.
As more companies turn to AI models to make consistent decisions for their businesses, and ultimately for customers, research into understanding how models think could provide the clarity many organizations need to rely more on models.