The promise of artificial intelligence often centers on its ability to process vast amounts of data and find patterns invisible to the human eye. A new study suggests, however, that the unpredictable nature of professional sports still defeats even the most advanced models: in a simulated betting environment, several industry-leading AI systems failed to turn a profit on soccer matches.
According to a report released this week by the London-based AI start-up General Reasoning, high-profile systems from Google, OpenAI, and Anthropic all lost money when tasked with betting on soccer matches. The study, titled “KellyBench,” indicates a significant disparity between an AI’s ability to handle structured tasks—such as writing software code—and its capacity to navigate the chaotic, real-world variables inherent in professional athletics over a long period.
To test these capabilities, General Reasoning created a virtual re-creation of the 2023–24 Premier League season. The researchers provided eight of the top AI systems with detailed historical data and statistics regarding each team and their previous game performances. The models were then instructed to build strategies designed to maximize financial returns while simultaneously managing risk.
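The report does not publish its test harness, but the setup it describes, replaying a finished season and letting each model stake a fraction of a running bankroll on every match, can be sketched in a few lines. All names below (`run_season`, the match-record fields, the `strategy` callback) are illustrative assumptions, not the actual KellyBench code:

```python
def run_season(bankroll: float, matches: list, strategy) -> float:
    """Replay a season of matches against a betting strategy.

    `strategy` maps a match's pre-game stats to (pick, fraction), where
    `fraction` is the share of the current bankroll to stake. Decimal
    odds include the returned stake. Illustrative sketch only.
    """
    for match in matches:
        pick, fraction = strategy(match["stats"])
        stake = bankroll * fraction
        if pick is None or stake <= 0:
            continue  # the strategy may sit a match out
        if pick == match["result"]:
            bankroll += stake * (match["odds"][pick] - 1.0)  # net profit
        else:
            bankroll -= stake  # stake is lost
    return bankroll
```

A strategy that merely predicts winners well can still go broke here: the final bankroll depends on how much it stakes per match, which is the risk-management half of the task the models reportedly failed.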
The Gap Between Coding and Complex Prediction
The findings from the KellyBench report underscore a critical limitation in current large language models (LLMs). While these systems have made rapid strides in technical domains, the “human problems” associated with sports betting—which involve fluctuating form, psychological factors, and the inherent randomness of a soccer match—remain elusive. The failure of these models to maintain a positive balance over a full season suggests that they struggle to analyze the real world over extended timelines.
Among the participants, xAI’s Grok was singled out for particularly poor performance. The study suggests that despite the massive datasets these models are trained on, the transition from theoretical data analysis to successful risk management in a dynamic environment like the Premier League remains a hurdle the industry has yet to clear.
How the KellyBench Simulation Worked
The methodology employed by General Reasoning focused on the intersection of data processing and financial risk. By using a virtual re-creation of a previous season, the researchers could provide a controlled environment where the AI had access to the same statistical markers a professional bettor might use. The goal was not simply to predict who would win a single game, but to develop a sustainable model for maximizing returns.
The failure of the models to do so highlights a potential “reasoning gap.” In software development, there is typically a correct or incorrect answer grounded in logic and syntax. In sports betting, the “correct” data-driven choice can still result in a loss due to a single red card or an unexpected injury, variables that AI systems often fail to weight correctly when managing a long-term portfolio.
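The benchmark’s name presumably nods to the Kelly criterion, the classic formula for sizing bets so as to maximize long-run bankroll growth while bounding risk. Assuming that reading is right, a minimal sketch of the staking rule looks like this (the report itself does not specify the models’ strategies):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly criterion: fraction of bankroll to stake on a bet that
    wins with probability p at the given decimal odds (total payout
    per unit staked, stake included)."""
    b = decimal_odds - 1.0               # net profit per unit staked
    f = (b * p - (1.0 - p)) / b          # edge divided by net odds
    return max(f, 0.0)                   # never bet a negative edge

# Example: a model rates the home side a 55% winner at even money
# (decimal odds 2.0), so Kelly stakes 10% of the bankroll.
f = kelly_fraction(0.55, 2.0)
```

The catch, and plausibly where the models lost money, is that the formula is only as good as the probability `p` fed into it: overestimate your edge and Kelly staking compounds the losses rather than the gains.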
What This Means for the Future of AI
For those following the trajectory of AI development, these results provide a sobering reminder that “intelligence” is not monolithic. The ability to synthesize information and generate human-like text does not automatically translate to an ability to predict stochastic real-world events. This suggests that the path toward Artificial General Intelligence (AGI) may require new approaches to how models handle uncertainty and long-term temporal analysis.
The stakeholders affected by these findings include not only the AI labs seeking to improve their models but also the burgeoning industry of AI-driven financial and predictive tools. If the most advanced systems from Google, OpenAI, and Anthropic cannot reliably predict soccer outcomes, it raises questions about the reliability of AI in other high-variance fields, such as stock market forecasting or geopolitical risk assessment.
Key Takeaways from the KellyBench Report
- Financial Loss: Top AI models from Google, OpenAI, and Anthropic lost money during the simulated 2023–24 Premier League season.
- Performance Gap: There is a stark difference between AI’s proficiency in technical tasks (like coding) and its ability to solve complex, real-world human problems.
- Risk Management: AI systems struggled to maximize returns and manage risk effectively over a long-term period.
- Specific Failures: xAI’s Grok was highlighted as being particularly ineffective in this betting scenario.
As the industry continues to evolve, the focus may shift from simply increasing the size of datasets to improving the “common sense” or “world model” capabilities of these systems. Until AI can better account for the unpredictable nature of human performance and physical competition, the “house” will likely continue to win over the algorithms.
There are currently no further scheduled updates or official responses from the AI labs regarding the KellyBench results.