
Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in container environments.

The dual release aims to address long-standing pain points in testing and optimizing AI agents, especially those built to work autonomously in realistic developer environments.

With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing the capabilities of frontier models.

Harbor, the companion runtime framework, enables developers and researchers to scale assessments across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had when creating Terminal-Bench,” co-creator Alex Shaw wrote on X. “It is intended for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher bar, cleaner data

Terminal-Bench 1.0 was released in May 2025 and quickly became a standard benchmark for evaluating AI-powered agents operating in developer-style terminal environments. These agents communicate with systems via the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope brought with it inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

Version 2.0 addresses these issues directly. The updated suite contains 89 tasks, each subject to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic and clearly specified, increasing the level of difficulty and improving reliability and reproducibility.


A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependency on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB 1.0, despite our claim that TB 2.0 is more difficult,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”

Harbor: Unified Deployments at Scale

Alongside the benchmark update, the team launched Harbor, a novel framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale deployment infrastructure, with compatibility for major providers such as Daytona and Modal.

Harbor is designed to generalize across agent architectures and supports:

  • Evaluation of any container-installable agent

  • Scalable pipelines for supervised fine-tuning (SFT) and reinforcement learning (RL)

  • Custom benchmark creation and implementation

  • Full integration with Terminal-Bench 2.0

Harbor was used internally to perform tens of thousands of deployments during the creation of the new benchmark. It is now publicly available at harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

First results: GPT-5 leads on task success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command-line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate – the highest of any agent tested to date.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 agent results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%


The close clustering between top models indicates active competition between platforms, with no single agent solving more than half of the tasks.
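As a quick sanity check of that clustering claim, the published top-5 scores (agent names as reported on the leaderboard) can be tallied directly; the whole field spans under seven percentage points, and none crosses 50%:

```python
# Top-5 Terminal-Bench 2.0 success rates (percent), as published in the article.
scores = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

# Spread between the best and worst of the top five.
spread = max(scores.values()) - min(scores.values())
print(f"Spread across top 5: {spread:.1f} points")  # prints: Spread across top 5: 6.8 points

# No agent solves more than half of the 89 tasks.
print(all(v < 50 for v in scores.values()))  # prints: True
```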

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Leaderboard entries require five benchmark runs, and the results can be emailed to the developers along with the jobs directory for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path>

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation and tool usage. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint on the verification process and design methodology behind the benchmark is in the works.

Strive for standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward a more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools provide a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
