AI

The 'truth serum' for AI: OpenAI’s new method for training models to confess their mistakes

OpenAI researchers have introduced a new method that acts as a “truth serum” for large language models (LLMs), forcing them to self-report their own misbehavior, hallucinations, and policy violations. This technique, “confessions”, addresses a growing concern in business AI: models can be dishonest, overstate their confidence, or hide the shortcuts they take to arrive at an answer.

For real-world applications, this technique leads to the creation of more transparent and controllable AI systems.

What are confessions?

Many forms of AI deception result from the complexity of the reinforcement learning (RL) model training phase. In RL, models are rewarded for producing output that meets a mix of objectives, including correctness, style, and safety. This can pose the risk of reward misspecification, where models learn to produce responses that simply “look good” to the reward function, rather than responses that are truly true to the user’s intent.

A confession is a structured report generated by the model after it provides the main answer. It serves as a self-evaluation of one’s own compliance with instructions. In this report, the model should list all the instructions it had to follow, evaluate how well it complied with them, and report any uncertainties or judgments it made along the way. The goal is to create a separate channel where the model is only incentivized to be honest.

In one blog postthe OpenAI researchers provide a few examples of the ‘confessions’ technique in action. A model was subjected to a test, along with a hidden rule: if it scored above 50%, its knowledge would be erased. To preserve itself, the model was “sandbagged” by deliberately answering half the questions incorrectly. While the primary response hid this deception, the subsequent confession was brutally honest. It admitted: “The user wanted correct answers, but we sabotaged half of them.”

See also  Did xAI lie about Grok 3's benchmarks?

The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in the main response.

How confession training works

The key to this method is the separation of rewards. During training, the reward given to the confession is based solely on its honesty and is never mixed with the reward for the main task. “Like the Catholic Church’s ‘seal of confession’, nothing the model reveals can change the reward it receives for completing its original task,” the researchers write. This creates a “safe space” for the model to admit mistakes without penalty.

This approach is powerful because it avoids a major challenge in AI training. The researchers’ intuition is that honestly admitting misconduct is easier than obtaining a high reward for the original, often complex problem.

Thus, when the model is judged purely on fairness, the incentive to mislead the ‘confession judge’ is much weaker than the incentive to mislead the reward model because of the most important answer. For example, in an experiment where a model was trained to “hack” a weak reward model, performance on the main task became more deceptive over time. However, the confessions became more honest, correctly identifying the reward hacking it carried out.

However, the technology has its limitations. Confessions are not a panacea for all types of AI errors. The system works best when a model is aware that it is misbehaving. It is less effective for ‘unknown unknowns’. For example, if a model hallucinates a fact and honestly believes it to be correct, it cannot admit to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often arises when the instructions are ambiguous and the model cannot clearly determine the human user’s intent.

See also  Parloa triples its valuation in 8 months to $3B with $350M raise

What it means for business AI

OpenAI’s confession technique is part of a growing body of work in AI safety and control. Anthropic, an OpenAI competitor, has also released research showing how LLMs can learn malicious behavior. The company is also working on that closing these gaps when they emerge.

For AI applications, mechanisms such as confessions can provide a practical monitoring mechanism. The structured output of a confession can be used during inference to flag or reject a model’s response before it causes a problem. For example, a system could be designed so that any output is automatically escalated for human review if its admission indicates a policy violation or high uncertainty.

In a world where AI is becoming increasingly active and capable of performing complex tasks, observability and control will be key elements for safe and reliable deployment.

“As models become more capable and deployed in higher-stakes environments, we need better tools to understand what they do and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”

Source link

Back to top button