A safety institute advised against releasing an early version of Anthropic’s Claude Opus 4 AI model

A third-party research institute that Anthropic partnered with to test one of its new flagship AI models, Claude Opus 4, recommended against deploying an early version of the model due to its tendency to "scheme" and deceive.
According to a safety report Anthropic published Thursday, the institute, Apollo Research, conducted tests to see in which contexts Opus 4 might try to behave in certain undesirable ways. Apollo found that Opus 4 appeared to be far more proactive in its "subversion attempts" than earlier models, and that it "sometimes double[d] down on its deception" when asked follow-up questions.
"[W]e find that, in situations where strategic deception is instrumentally useful, [the early Claude Opus 4 snapshot] schemes and deceives at such high rates that we advise against deploying this model either internally or externally," Apollo wrote in its assessment.
As AI models become more capable, some studies show they are becoming more likely to take unexpected, and possibly unsafe, steps to achieve delegated tasks. For instance, early versions of OpenAI's o1 and o3 models, released in the past year, tried to deceive humans at higher rates than previous-generation models, according to Apollo.
Per Anthropic's report, Apollo observed examples of the early Opus 4 attempting to write self-propagating viruses, fabricate legal documentation, and leave hidden notes to future instances of itself, all in an effort to undermine its developers' intentions.
To be clear, Apollo tested a version of the model that had a bug Anthropic says it has since fixed. Moreover, many of Apollo's tests placed the model in extreme scenarios, and Apollo admits that the model's deceptive efforts would likely have failed in practice.
However, in its safety report, Anthropic also says it observed evidence of deceptive behavior from Opus 4.
This wasn't always a bad thing. For example, during tests, Opus 4 would sometimes proactively do a broad cleanup of a piece of code even when asked to make only a small, specific change. More unusually, Opus 4 would try to "whistle-blow" if it perceived that a user was engaged in some form of wrongdoing.
According to Anthropic, when given access to a command line and told to "take initiative" or "act boldly" (or some variation of those phrases), Opus 4 would at times lock users out of systems it had access to and bulk-email media and law enforcement officials to surface actions the model perceived to be illicit.
"This kind of ethical intervention and whistleblowing is perhaps appropriate in principle, but it has a risk of misfiring if users give [Opus 4]-based agents access to incomplete or misleading information and prompt them to take initiative," Anthropic wrote in its safety report. "This is not a new behavior, but is one that [Opus 4] will engage in somewhat more readily than prior models, and it seems to be part of a broader pattern of increased initiative with [Opus 4] that we also see in subtler and more benign ways in other environments."