
MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks

Emilia David 2025-08-22 20:50:00



The adoption of interoperability standards, such as the Model Context Protocol (MCP), can provide enterprises with insight into how agents and models function outside their walled confines. However, many benchmarks fail to capture real-life interactions with MCP.

Salesforce AI Research developed a new open-source benchmark it calls MCP-Universe, which aims to track LLMs as they interact with MCP servers in the real world, arguing that it will paint a better picture of real-life and real-time interactions between models and the tools enterprises actually use. In its initial testing, it found that models like OpenAI's recently released GPT-5 are strong, but still do not perform as well in real-life scenarios.

“Existing benchmarks predominantly focus on isolated aspects of LLM performance, such as instruction following, math reasoning, or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in a paper.

MCP-Universe captures model performance through tool usage, multi-turn tool calls, long context windows and large tool spaces. It is grounded in existing MCP servers with access to actual data sources and environments.
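To ground what a multi-turn tool call against a live server looks like, here is a minimal sketch of one turn of such a loop. It assumes the official MCP Python SDK (the `mcp` package on PyPI) and uses the GitHub reference server and its `search_repositories` tool purely for illustration; MCP-Universe's actual harness may differ.

```python
# Minimal sketch: one turn of an agent loop against a live MCP server.
# Assumes the official MCP Python SDK ("pip install mcp"); the server
# command and tool arguments below are illustrative, not from the paper.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch a reference MCP server over stdio (here, the GitHub server).
    params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # The "large tool space" the benchmark stresses: every tool
            # the server exposes is surfaced to the model at once.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            # One turn of a multi-turn loop: the model picks a tool and
            # arguments, the harness executes it and feeds the result back.
            result = await session.call_tool(
                "search_repositories", {"query": "mcp-universe"}
            )
            print(result.content)

asyncio.run(main())
```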




Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise-grade tasks.”


“Two of the biggest are: long context challenges, models can lose track of data or struggle to reason consistently when handling very long or complex inputs,” Li said. “And, unknown tool challenges, models often aren't able to seamlessly use unfamiliar tools or systems in the way humans can adapt on the fly. This is why it's crucial not to take a DIY approach with a single model to power agents alone, but instead to rely on a platform that combines data context, enhanced reasoning and trust guardrails to truly meet the needs of enterprise AI.”

MCP-Universe joins other proposed MCP-based benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi'an Jiaotong University, as well as Beijing University of Posts and Telecommunications' MCPWorld. It also builds on MCPEval, which Salesforce released in July and which focuses mainly on agents. Li said the biggest difference between MCP-Universe and MCPEval is that the latter is evaluated with synthetic tasks.

How it works

MCP-Universe evaluates how well each model performs a series of tasks that mimic those undertaken by enterprises. Salesforce said it designed MCP-Universe to cover six core domains used by enterprises: location navigation, repository management, financial analysis, 3D design, browser automation and web search. It accessed 11 MCP servers for a total of 231 tasks (a hypothetical mapping of these domains to server configs is sketched after the list below).

• Location navigation focuses on geographic reasoning and the execution of spatial tasks. The researchers tapped the Google Maps MCP server for this process.
• The repository management domain looks at codebase operations and connects to the GitHub MCP to expose version control tools like repo search, issue tracking and code editing.
• Financial analysis connects to the Yahoo Finance MCP server to evaluate quantitative reasoning and financial market decision-making.
• 3D design evaluates the use of computer-aided design tools through the Blender MCP.
• Browser automation, connected to Playwright's MCP, tests browser interaction.
• The web search domain employs the Google Search MCP server and the Fetch MCP to check “open-domain information seeking” and is structured as a more open-ended task.
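As referenced above, here is one hypothetical way that domain-to-server mapping could be expressed as configuration. The paper names the servers, but the launch commands below are illustrative; the benchmark's real configuration lives in Salesforce's open-source repo.

```python
# Hypothetical domain-to-server mapping in the spirit of MCP-Universe's
# six domains. Commands marked "illustrative name" are assumptions, not
# packages confirmed by the paper.
DOMAIN_SERVERS: dict[str, list[dict[str, object]]] = {
    "location_navigation": [
        {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-google-maps"]},
    ],
    "repository_management": [
        {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-github"]},
    ],
    "financial_analysis": [
        {"command": "uvx", "args": ["yahoo-finance-mcp"]},  # illustrative name
    ],
    "3d_design": [
        {"command": "uvx", "args": ["blender-mcp"]},  # illustrative name
    ],
    "browser_automation": [
        {"command": "npx", "args": ["@playwright/mcp@latest"]},
    ],
    "web_search": [
        {"command": "npx", "args": ["-y", "google-search-mcp"]},  # illustrative name
        {"command": "uvx", "args": ["mcp-server-fetch"]},
    ],
}
```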

Salesforce said that it had to design new MCP tasks that reflect real use cases. For each domain, the researchers created four to five kinds of tasks they believe LLMs can easily complete. For example, they assigned the models a goal that involved route planning, identifying the optimal stops and then locating the destination.
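To make the task design concrete, a route-planning task might be specified along these lines. The schema and field values here are hypothetical, not taken from MCP-Universe's codebase.

```python
# Hypothetical task specification for a location-navigation task.
# Field names and values are illustrative, not the benchmark's schema.
route_planning_task = {
    "domain": "location_navigation",
    "mcp_servers": ["google-maps"],
    "instruction": (
        "Plan a driving route from San Francisco to Los Angeles with two "
        "charging stops, then return the address of the final destination."
    ),
    "evaluators": [
        {"type": "format", "check": "answer is JSON with 'stops' and 'destination'"},
        {"type": "dynamic", "check": "destination matches live Google Maps geocoding"},
    ],
}
```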

Each model is evaluated on how it completes the tasks. Li and his team opted to follow an execution-based evaluation paradigm rather than the more common LLM-as-a-judge approach. The researchers noted that the LLM-as-a-judge paradigm “is not well-suited for our MCP-Universe scenario, since some tasks are designed to use real-time data, while the knowledge of the LLM judge is static.”

Salesforce researchers used three types of evaluators: format evaluators to see if the agents and models follow format requirements, static evaluators to assess correctness over time and dynamic evaluators for fluctuating answers like flight prices or GitHub issues.
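As a rough illustration, the three evaluator types could be shaped like this in code; the class names and signatures are hypothetical, not MCP-Universe's actual API.

```python
# Hypothetical shapes for the three evaluator types described in the paper.
# Class names and signatures are illustrative, not MCP-Universe's API.
import re
from abc import ABC, abstractmethod
from typing import Callable

class Evaluator(ABC):
    @abstractmethod
    def score(self, answer: str) -> bool:
        """Return True if the agent's final answer passes this check."""

class FormatEvaluator(Evaluator):
    """Checks that the answer obeys a required output format."""
    def __init__(self, pattern: str) -> None:
        self.pattern = re.compile(pattern, re.DOTALL)

    def score(self, answer: str) -> bool:
        return self.pattern.fullmatch(answer.strip()) is not None

class StaticEvaluator(Evaluator):
    """Compares against a ground truth that does not change over time."""
    def __init__(self, expected: str) -> None:
        self.expected = expected

    def score(self, answer: str) -> bool:
        return answer.strip() == self.expected

class DynamicEvaluator(Evaluator):
    """Re-fetches ground truth at grading time (e.g., a live stock price)."""
    def __init__(self, fetch_ground_truth: Callable[[], object]) -> None:
        self.fetch_ground_truth = fetch_ground_truth

    def score(self, answer: str) -> bool:
        return answer.strip() == str(self.fetch_ground_truth())
```

The dynamic evaluator is the piece an LLM judge cannot replicate: it re-queries the live source at grading time instead of relying on the judge's static knowledge.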

“MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can stress-test the agent in complex scenarios. Moreover, MCP-Universe offers an extendable framework/codebase for building and evaluating agents,” Li said.

Even the big models have trouble

To test MCP-Universe, Salesforce evaluated several popular proprietary and open-source models. These include Grok-4 from xAI; Anthropic's Claude 4 Sonnet and Claude 3.7 Sonnet; OpenAI's GPT-5, o4-mini, o3, GPT-4.1, GPT-4o and gpt-oss; Google's Gemini 2.5 Pro and Gemini 2.5 Flash; GLM-4.5 from Z.ai; Moonshot's Kimi-K2; Qwen's Qwen3-Coder and Qwen3-235B-A22B-Instruct-2507; and DeepSeek-V3-0324 from DeepSeek. Each model tested had at least 120B parameters.


In its testing, Salesforce found GPT-5 had the best success rate, especially for financial analysis tasks. Grok-4 followed, beating all the models for browser automation, and Claude 4 Sonnet rounded out the top three, although it did not post any performance numbers higher than either of the models it follows. Among open-source models, GLM-4.5 performed the best.

However, MCP-Universe showed the models had difficulty handling long contexts, especially for location navigation, browser automation and financial analysis, with efficiency falling significantly. Performance also drops the moment the LLMs encounter unknown tools. Overall, the models failed to complete more than half of the tasks that enterprises typically perform.

“These findings highlight that current frontier LLMs still fall short in reliably executing tasks across diverse real-world MCP tasks. Our MCP-Universe benchmark, therefore, provides a challenging and necessary testbed for evaluating LLM performance in areas underserved by existing benchmarks,” the paper said.

Li told VentureBeat that he hopes enterprises will use MCP-Universe to gain a deeper understanding of where agents and models fail on tasks so that they can improve either their frameworks or the implementation of their MCP tools.
