Terminal-Bench 2.0 & Harbor: Container Agent Testing Framework Released

Terminal-Bench 2.0 & Harbor: A New Standard for Evaluating AI Agents

The landscape of Large Language Model (LLM) agents is rapidly evolving. To keep pace, rigorous and reliable evaluation is crucial. That’s why the release of Terminal-Bench 2.0 and the accompanying Harbor framework represents a significant leap forward for the AI community. These tools aren’t just incremental updates; they’re building blocks for a more standardized and scalable future of agent assessment.

As someone deeply involved in the growth and evaluation of AI systems, I’ve seen firsthand the challenges of comparing agent performance. Existing benchmarks often lack robustness or are susceptible to gaming. Terminal-Bench 2.0 directly addresses these issues, offering a more challenging and representative test of real-world capabilities.

What’s New in Terminal-Bench 2.0?

Terminal-Bench 2.0 is a benchmark designed to assess the ability of AI agents to interact with a Linux terminal environment. It focuses on practical tasks - the kind developers and system administrators face daily. Here’s what’s changed:

* Increased Difficulty: The new benchmark is demonstrably harder than its predecessor, Terminal-Bench 1.0, ensuring a more discerning evaluation of agent capabilities.
* Higher Quality Tasks: Despite the increased difficulty, agent performance is comparable to TB 1.0, thanks to a substantial improvement in task quality and relevance.
* Removed Unstable Dependencies: A component that relied on unstable third-party APIs has been removed and refactored, ensuring long-term stability and reliability.
* Focus on Real-World Skills: The benchmark emphasizes tasks requiring reasoning, code generation, and effective tool use – skills vital for practical applications.

Introducing Harbor: Scalable Agent Evaluation

Alongside the benchmark, the team launched Harbor, a framework designed to streamline the process of running and evaluating agents at scale. Think of it as robust infrastructure for testing your AI creations in a controlled, cloud-based environment.

Harbor offers key benefits:

* Universal Agent Support: Evaluate any agent that can be run inside a container (see the sketch after this list).
* Scalable Pipelines: Supports both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) workflows.
* Customization: Create and deploy your own benchmarks tailored to specific needs.
* Seamless Integration: Fully integrates with Terminal-Bench 2.0 for immediate evaluation.
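
To make the container requirement concrete, here is a minimal sketch of the workflow. The image name, Dockerfile, and model/agent identifiers are illustrative placeholders rather than values from the Harbor documentation; the `harbor run` flags and dataset identifier follow the submission command shown later in this post.

```bash
# Package a hypothetical agent as a container image (image name, tag, and
# Dockerfile contents are placeholders, not Harbor requirements).
docker build -t my-agent:latest .

# Smoke-test the image locally; the arguments your agent accepts depend
# entirely on your own entrypoint.
docker run --rm my-agent:latest --help

# Evaluate against Terminal-Bench 2.0. The -d/-m/-a flags appear in the
# submission command below; the model and agent values are placeholders
# for whatever you register with Harbor.
harbor run -d terminal-bench@2.0 -m "<model-name>" -a "<agent-name>"
```

How a custom agent is actually wired into Harbor is covered in the framework’s documentation at harborframework.com.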

Harbor is already proven, having been used internally to run tens of thousands of rollouts during the development of Terminal-Bench 2.0. You can explore it further at harborframework.com.

Early Leaderboard Results: GPT-5 Takes the Lead

Initial results from the Terminal-Bench 2.0 leaderboard are generating excitement. OpenAI’s Codex CLI, powered by GPT-5, currently leads the pack with a 49.6% success rate. This is the highest score achieved by any agent tested to date.

Here’s a snapshot of the top five performers:

  1. Codex CLI (GPT-5) – 49.6%
  2. Codex CLI (GPT-5-Codex) – 44.3%
  3. OpenHands (GPT-5) – 43.8%
  4. Terminus 2 (GPT-5-Codex) – 43.4%
  5. Terminus 2 (Claude Sonnet 4.5) – 42.8%

The tight competition among these top models highlights the rapid progress being made across different platforms. Notably, no single agent has yet achieved a success rate exceeding 50%, indicating there’s still significant room for improvement.

How to Get Involved: Testing and Submission

Want to put your agent to the test? Here’s how:

  1. Install Harbor: Set up the framework on your system.
  2. Run the Benchmark: Use the simple CLI commands to execute Terminal-Bench 2.0.
  3. Submit Results: Submit five benchmark runs, including job directories, to the developers for validation (a sketch of the full workflow follows the command below).

The command to run the benchmark is:

`harbor run -d terminal-bench@2.0 -m "<model-name>" -a "<agent-name>"`
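
Putting the three steps together, a submission might look roughly like the sketch below. Only the `harbor run` invocation comes from the command above; the installation step and the output-directory location are assumptions, so check harborframework.com for the authoritative instructions.

```bash
# Install the Harbor CLI (assumption: it ships as a Python package; the
# actual installation method is documented at harborframework.com).
pip install harbor

# Leaderboard submissions require five benchmark runs; the model and agent
# names below are placeholders.
for i in 1 2 3 4 5; do
  harbor run -d terminal-bench@2.0 -m "<model-name>" -a "<agent-name>"
done

# Gather the resulting job directories to send to the developers for
# validation (the output path here is an assumption; consult the Harbor
# docs for where runs are actually written).
ls jobs/
```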
