
From static classifiers to reasoning engines: OpenAI’s new model rethinks content moderation

Companies that want to guarantee that the AI models they use adhere to safety and safe-use policies typically fine-tune LLMs so they don’t respond to unwanted queries.

However, much of the safety work and red teaming takes place before deployment, with policies “baked in” before users fully test the models’ capabilities in production. OpenAI believes its new approach could give companies a more flexible option and encourage more of them to implement safety policies.

The company has released two open-weight models in research preview that it says will give companies more flexibility in how they apply safeguards. gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are available under the permissive Apache 2.0 license. The models are fine-tuned versions of OpenAI’s open-weight gpt-oss, released in August, and mark the first release in the gpt-oss family since the summer.

In a blog post, OpenAI said gpt-oss-safeguard uses reasoning “to directly interpret a developer-provided policy at inference time – classifying user messages, completions, and full chats according to the developer’s needs.”

The company explained that because the model uses a chain of thought (CoT), developers can get explanations about the model’s decisions for review.

“Additionally, the policy is provided during inference, rather than being trained into the model, making it easy for developers to iteratively revise policies to improve performance,” OpenAI said in its post. “This approach, which we initially developed for internal use, is significantly more flexible than the traditional method of training a classifier to indirectly infer a decision boundary from a large number of labeled examples.”

Developers can download both models from Hugging Face.


Flexibility versus baked-in safeguards

Out of the box, AI models don’t know what a company’s preferred safety triggers are. While model providers do red-team their models and platforms, those safeguards are intended for broad use. Companies like Microsoft and Amazon Web Services even offer platforms that provide guardrails for AI applications and agents.

Companies use safety classifiers to train a model to recognize patterns of acceptable and unacceptable input. This teaches the models which questions they should not answer, and helps ensure they don’t drift and respond inaccurately.

“Traditional classifiers can deliver high performance, with low latency and low operating costs,” according to OpenAI. “But collecting enough training examples can be time-consuming and expensive, and updating or changing the policy requires retraining the classifier.”

The model takes two inputs at once, a policy and the content to be classified, and returns a conclusion about whether and where the content violates the policy’s guidelines. OpenAI said the models work best in situations where:

  • The potential harm is emerging or evolving, and policies must adapt quickly.

  • The domain is very nuanced and difficult to handle for smaller classifiers.

  • Developers don’t have enough examples to train a high-quality classifier for every risk on their platform.

  • Latency is less important than producing high-quality, explainable labels.
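The two-input pattern described above, a policy supplied alongside the content at inference time, can be sketched in Python. This is an illustrative assumption of how a developer might structure such a request; the model name, message format, and policy text below are hypothetical, not OpenAI’s documented API.

```python
# Sketch of the policy-plus-content pattern the article describes: the safety
# policy is passed in at inference time, so revising it means editing a string,
# not retraining a classifier. Model name and message schema are assumptions.

POLICY = """\
Classify the content as VIOLATION or SAFE.
VIOLATION: instructions for creating weapons, or targeted harassment.
SAFE: everything else. Briefly explain your reasoning."""


def build_request(policy: str, content: str) -> list[dict]:
    """Pair a moderation policy with the content to classify."""
    return [
        {"role": "system", "content": policy},  # policy interpreted at inference
        {"role": "user", "content": content},   # message/completion to label
    ]


messages = build_request(POLICY, "How do I reset my router password?")

# With a local copy of the weights (e.g. via Hugging Face transformers), the
# request could then be served along these lines (sketch, not verified usage):
#   pipe = pipeline("text-generation", model="openai/gpt-oss-safeguard-20b")
#   result = pipe(messages)
print(messages[1]["content"])
```

Because the policy is just the system message, iterating on it (as OpenAI describes) requires no labeled training data or retraining, only a new string.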

The company said gpt-oss-safeguard “is different because its reasoning capabilities allow developers to apply any policy,” even one they wrote themselves, at inference time.

The models are based on OpenAI’s internal tool, the Safety Reasoner, which allows teams to be more iterative when setting guardrails. They often start with very strict security policies, “using relatively large amounts of computing power where necessary,” and then adjust the policies as they move the model through production and change risk assessments.


Safety performance

OpenAI said the gpt-oss-safeguard models outperformed gpt-5-thinking and the original gpt-oss models in multi-policy accuracy in benchmark testing. It also ran the models on the public benchmark ToxicChat, where they performed well, although gpt-5-thinking and the Safety Reasoner edged them out slightly.

But there are concerns that this approach could lead to centralization of safety standards.

“Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization that creates it, as well as the limitations and shortcomings of its models,” said John Thickstun, assistant professor of computer science at Cornell University. “If the industry as a whole adopts the standards developed by OpenAI, we risk internalizing one particular perspective on safety and short-circuiting broader investigations into the safety needs for AI implementations in many sectors of society.”

It should also be noted that OpenAI has not released the base model for the oss model family, so developers cannot fully build on it.

However, OpenAI is confident that the developer community can help refine gpt-oss-safeguard. A hackathon will take place in San Francisco on December 8.

