Terminal-Bench 2.0 & Harbor: A New Standard for Evaluating AI Agents
The landscape of Large Language Model (LLM) agents is evolving rapidly, and keeping pace requires rigorous, reliable evaluation. That’s why the release of Terminal-Bench 2.0 and the accompanying Harbor framework represents a significant leap forward for the AI community. These tools aren’t just incremental updates; they’re building blocks for a more standardized and scalable future of agent assessment.
As someone deeply involved in the growth and evaluation of AI systems, I’ve seen firsthand the challenges of comparing agent performance. Existing benchmarks often lack robustness or are susceptible to gaming. Terminal-Bench 2.0 directly addresses these issues, offering a more challenging and representative test of real-world capabilities.
What’s New in Terminal-Bench 2.0?
Terminal-Bench 2.0 is a benchmark designed to assess how well AI agents can operate in a Linux terminal environment. It focuses on practical tasks, the kind developers and system administrators face daily (an illustrative example follows the list below). Here’s what’s changed:
* Increased Difficulty: The new benchmark is demonstrably harder than its predecessor, Terminal-Bench 1.0. This ensures a more discerning evaluation of agent capabilities.
* Higher-Quality Tasks: Despite the increased difficulty, agent performance remains comparable to TB1.0, thanks to a substantial improvement in task quality and relevance.
* Removed Unstable Dependencies: A component that relied on unstable third-party APIs has been removed and refactored, ensuring long-term stability and reliability.
* Focus on Real-World Skills: The benchmark emphasizes tasks requiring reasoning, code generation, and effective tool use – skills vital for practical applications.
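To make these task styles concrete, here is a hypothetical example in the spirit of the benchmark; the prompt, paths, and commands below are illustrative rather than drawn from the real task set:

```bash
# Hypothetical task prompt: "The test suite under /app fails on a fresh
# checkout. Diagnose the failure and make all tests pass."
#
# A capable agent might work through it roughly like this:
cd /app
python -m pytest -x        # reproduce the failure (say, a ModuleNotFoundError)
pip install requests       # install the dependency it identified as missing
python -m pytest           # re-run to confirm the suite passes
```

Tasks of this shape exercise exactly the skills listed above: reading error output, reasoning about the cause, and using the terminal to fix it.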
Introducing Harbor: Scalable Agent Evaluation
Alongside the benchmark, the team launched Harbor, a framework designed to streamline running and evaluating agents at scale. Think of it as robust infrastructure for testing your AI creations in a controlled, cloud-based environment.
Harbor offers key benefits:
* Universal Agent Support: Evaluate any agent that can be run within a container (see the containerization sketch below).
* Scalable Pipelines: Supports both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) workflows.
* Customization: Create and deploy your own benchmarks tailored to specific needs.
* Seamless Integration: Fully integrates with Terminal-Bench 2.0 for immediate evaluation.
Harbor is already proven, having been used internally to run tens of thousands of rollouts during the development of Terminal-Bench 2.0. You can explore it further at harborframework.com.
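Because Harbor treats the container as the unit of evaluation, preparing an agent mostly means packaging it as an image. A minimal sketch of that workflow, in which the image name, Dockerfile, and smoke-test prompt are all hypothetical:

```bash
# Build an image that bundles your agent and its dependencies
# (the Dockerfile and the my-agent name are illustrative).
docker build -t my-agent:latest .

# Smoke-test the container locally before handing it to Harbor;
# the task prompt here is a made-up example.
docker run --rm my-agent:latest \
  "Find every file larger than 1GB under /var and report its owner"
```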
Early Leaderboard Results: GPT-5 Takes the Lead
Initial results from the Terminal-Bench 2.0 leaderboard are generating excitement. OpenAI’s Codex CLI, powered by GPT-5, currently leads the pack with a 49.6% success rate, the highest score achieved by any agent tested to date.
Here’s a snapshot of the top five performers:
- Codex CLI (GPT-5) – 49.6%
- Codex CLI (GPT-5-Codex) – 44.3%
- OpenHands (GPT-5) – 43.8%
- Terminus 2 (GPT-5-Codex) – 43.4%
- Terminus 2 (Claude Sonnet 4.5) – 42.8%
The tight competition among these top models highlights the rapid progress being made across different platforms. Notably, no agent has yet exceeded a 50% success rate, indicating there’s still significant room for improvement.
How to Get Involved: Testing and Submission
Want to put your agent to the test? Here’s how:
- Install Harbor: Set up the framework on your system (a minimal install sketch follows this list).
- Run the Benchmark: Use the simple CLI commands to execute Terminal-Bench 2.0.
- Submit Results: Submit five benchmark runs, including job directories, to the developers for validation (a loop for this is sketched further below).
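For step 1, assuming Harbor is distributed as a Python package (the package name below is an assumption; check harborframework.com for the authoritative install instructions):

```bash
# Assumed package name; verify against the official docs.
pip install harbor
harbor --help   # confirm the CLI is on your PATH
```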
The command to run the benchmark is:

`harbor run -d terminal-bench@2.0 -m "<model-name>"`

where `<model-name>` is a placeholder for the model you want to evaluate.
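Since submission requires five runs, a small loop keeps them consistent; this sketch simply repeats the command above, with `<model-name>` again standing in for your model:

```bash
# Five independent benchmark runs, as required for leaderboard submission.
# Each run produces a job directory; those are what you submit for validation.
for i in 1 2 3 4 5; do
  harbor run -d terminal-bench@2.0 -m "<model-name>"
done
```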