OpenAI debuts GPT‑5.1-Codex-Max coding model and it already completed a 24-hour task internally

OpenAI has introduced GPT-5.1-Codex-Max, a new frontier agentic coding model now available in the Codex developer environment. The release marks a significant step forward in AI-assisted software engineering, delivering improved long-horizon reasoning, efficiency, and real-time interactive capabilities. GPT-5.1-Codex-Max will now replace GPT-5.1-Codex as the default model on Codex-integrated surfaces.

The new model is designed to serve as a persistent, high-context software development agent, capable of managing complex refactors, debugging workflows, and project-scale tasks across multiple context windows.

The release follows Google's launch of its powerful new Gemini 3 Pro model yesterday, yet GPT-5.1-Codex-Max still outperforms or matches it on key coding benchmarks:

On SWE-Bench Verified, GPT-5.1-Codex-Max achieved an accuracy of 77.9% at extra-high reasoning effort, ahead of Gemini 3 Pro's 76.2%.

It also led on Terminal-Bench 2.0, with an accuracy of 58.1% versus Gemini's 54.2%, and matched Gemini's score of 2,439 on LiveCodeBench Pro, a competitive Elo benchmark for coding.

Measured against Gemini 3 Pro's most advanced configuration, the Deep Think model, Codex-Max also holds a slight lead on agentic coding benchmarks.

Performance benchmarks: incremental gains on key tasks

GPT-5.1-Codex-Max demonstrates measurable improvements over GPT-5.1-Codex across a range of standard software engineering benchmarks.

On SWE-Lancer IC SWE, it achieved an accuracy of 79.9%, a significant increase over GPT-5.1-Codex's 66.3%. On SWE-Bench Verified (n=500), it achieved 77.9% accuracy at extra-high reasoning effort, outperforming GPT-5.1-Codex's 73.7%.

Performance on Terminal-Bench 2.0 (n=89) showed more modest improvements, with GPT-5.1-Codex-Max achieving an accuracy of 58.1%, compared to 52.8% for GPT-5.1-Codex.

All evaluations were performed with compaction and extra high reasoning effort enabled.

These results indicate that the new model provides a higher ceiling for both benchmarked correctness and real-world usability under extended reasoning loads.
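The gains quoted above can be compared on a common footing by separating percentage-point gains from relative gains. The short Python sketch below does only that arithmetic on the scores given in this article; the function name and structure are illustrative and not part of any OpenAI tooling.

```python
# Accuracy scores in percent, as quoted in this article:
# (GPT-5.1-Codex, GPT-5.1-Codex-Max)
SCORES = {
    "SWE-Lancer IC SWE": (66.3, 79.9),
    "SWE-Bench Verified": (73.7, 77.9),
    "Terminal-Bench 2.0": (52.8, 58.1),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    points = new - old
    relative = 100.0 * points / old
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        points, relative = gains(old, new)
        print(f"{name}: +{points} points ({relative}% relative)")
```

Run as a script, this prints one line per benchmark, which makes it easy to see how the size of each improvement depends on whether it is measured in points or relative to the older model's score.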

Technical architecture: long-horizon reasoning via compaction

A major architectural improvement in GPT-5.1-Codex-Max is the ability to effectively reason over extended input-output sessions using a mechanism called compaction.

Compaction lets the model retain important contextual information and discard irrelevant details as it approaches the limit of its context window, enabling continuous work across millions of tokens without performance loss.
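OpenAI has not published how compaction is implemented, so the following Python sketch only illustrates the general idea described above: when a session nears its context limit, older messages are collapsed into a short summary while the most recent ones are kept verbatim. Everything here is an assumption made for illustration, including the word-based token count, the 80% threshold, and the `summarize` placeholder, which simply truncates text where a real system would presumably have the model write the summary.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(messages: list[str], max_words_each: int = 8) -> str:
    """Hypothetical summarizer: keep only the first few words of each message."""
    heads = [" ".join(m.split()[:max_words_each]) for m in messages]
    return "SUMMARY: " + " | ".join(heads)


@dataclass
class Session:
    window: int                  # context limit, in (approximate) tokens
    threshold: float = 0.8       # compact once usage exceeds this fraction
    keep_recent: int = 2         # newest messages that are never summarized
    messages: list[str] = field(default_factory=list)

    def used(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, text: str) -> None:
        self.messages.append(text)
        if self.used() > self.threshold * self.window:
            self.compact()

    def compact(self) -> None:
        # Split into older messages (to be summarized) and recent ones (kept as-is).
        cut = max(len(self.messages) - self.keep_recent, 0)
        old, recent = self.messages[:cut], self.messages[cut:]
        if old:
            self.messages = [summarize(old)] + recent
```

A single compaction pass is not guaranteed to bring usage back under the threshold in this toy version; a production system would need to handle that case, along with deciding which details are actually safe to drop.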

The model has been observed internally to complete tasks that take more than 24 hours, including multi-step refactoring, test-driven iteration, and autonomous debugging.

Compaction also improves token efficiency. At medium reasoning effort, GPT-5.1-Codex-Max used approximately 30% fewer thinking tokens than GPT-5.1-Codex for comparable or better accuracy, which has implications for both cost and latency.
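Because the roughly 30% reduction applies to thinking tokens only, the effect on a task's total generated tokens depends on how much of the output is thinking. The sketch below makes that relationship explicit; the `thinking_share` values a reader would plug in are assumptions, since the article gives no breakdown.

```python
def output_token_savings(thinking_share: float, thinking_cut: float = 0.30) -> float:
    """Fraction of all generated tokens saved when only thinking tokens shrink.

    thinking_share: assumed fraction of generated tokens that are thinking tokens.
    thinking_cut:   reduction in thinking tokens (about 30% per the article).
    """
    if not 0.0 <= thinking_share <= 1.0:
        raise ValueError("thinking_share must be between 0 and 1")
    return thinking_share * thinking_cut
```

For example, if half of a task's generated tokens were thinking tokens (a hypothetical figure), the overall saving would be about 15% rather than 30%.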

Platform integration and use cases

GPT-5.1-Codex-Max is currently available in multiple Codex-based environments, which refer to OpenAI’s own integrated tools and interfaces built specifically for code-centric AI agents. These include:

Codex CLI: the official OpenAI command-line tool (@openai/codex), where GPT-5.1-Codex-Max is already live.
IDE extensions: likely developed or maintained by OpenAI, although no specific third-party IDE integrations were mentioned.
Interactive coding environments: such as those used to demonstrate frontend simulation apps like CartPole or the Snell's Law Explorer.
Internal code review tools: used by OpenAI engineering teams.

GPT-5.1-Codex-Max is not yet available via the public API, although OpenAI says it will be soon. For now, users who want to work with the model in terminal environments can do so by installing and using the Codex CLI.

It is not yet confirmed whether or how the model will be integrated into third-party IDEs, other than those built on top of the CLI or a future API.

The model can interact with live tools and simulations. Examples from the release include:

An interactive CartPole policy gradient simulator, visualizing reinforcement learning training and activations.
A Snell's Law optics explorer, which supports dynamic ray tracing across refractive indices.
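For readers unfamiliar with the physics behind the second demo, Snell's law relates the angles of a light ray on either side of a boundary: n1 · sin(θ1) = n2 · sin(θ2). The Python sketch below is not taken from OpenAI's demo; it is a minimal illustration of the calculation a ray tracer of this kind would need, with the function name chosen here for clarity.

```python
import math
from typing import Optional


def refraction_angle(n1: float, n2: float, incidence_deg: float) -> Optional[float]:
    """Refracted angle in degrees for a ray passing from index n1 into index n2.

    Uses Snell's law, n1 * sin(theta1) = n2 * sin(theta2). Returns None when
    no refracted ray exists (total internal reflection).
    """
    s = n1 * math.sin(math.radians(incidence_deg)) / n2
    if abs(s) > 1.0:
        return None  # total internal reflection
    return math.degrees(math.asin(s))
```

With commonly cited indices of about 1.0 for air and 1.33 for water, a ray hitting the water at 30 degrees bends to roughly 22 degrees, while a ray travelling from water to air at 60 degrees is totally internally reflected and the function returns None.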

These interfaces illustrate the model’s ability to reason in real time while maintaining an interactive development session, effectively bridging computation, visualization, and implementation within a single loop.

Cybersecurity and safety constraints

While GPT-5.1-Codex-Max does not meet OpenAI’s “high” cybersecurity capability threshold under its Preparedness Framework, it is currently the most capable cybersecurity model OpenAI has deployed. It supports use cases such as automated vulnerability detection and remediation, but with strict sandboxing and disabled network access by default.

OpenAI reports no increase in large-scale malicious use, but has introduced improved monitoring systems, including activity routing and disruption mechanisms for suspicious behavior. Codex remains isolated to a local workspace unless developers opt in to broader access, mitigating risks such as prompt injection from untrusted content.

Implementation context and developer usage

GPT-5.1-Codex-Max is currently available to users on ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. It will also become the new default in Codex-based environments, replacing GPT-5.1-Codex, which was a more general-purpose model.

OpenAI states that 95% of its internal engineers use Codex weekly and that, since adopting it, these engineers have shipped roughly 70% more pull requests on average, underscoring the tool's impact on internal development velocity.

Despite its autonomy and persistence, OpenAI emphasizes that Codex-Max should be treated as a coding assistant and not as a replacement for human review. The model produces terminal logs, test citations, and tool call outputs to support transparency in the generated code.

Outlook

GPT-5.1-Codex-Max represents a significant evolution in OpenAI's strategy toward agentic development tools, offering greater reasoning depth, token efficiency, and interactive capabilities across software engineering tasks. By extending its context management and compaction strategies, the model is positioned to handle tasks at the scale of entire repositories, rather than individual files or snippets.

With its continued emphasis on agentic workflows, secure sandboxes, and real-world evaluation metrics, Codex-Max sets the stage for the next generation of AI-assisted programming environments, while underscoring the importance of oversight in increasingly autonomous systems.
