The Experiment: The AI Agent Company
- Henry Eames
Researchers at Carnegie Mellon University designed a virtual company, aptly named "TheAgentCompany", to evaluate how well AI agents could perform typical office tasks. This simulated environment included internal websites, communication tools resembling Slack, and a suite of tasks spanning software engineering, project management, and finance. AI models from leading organisations like OpenAI, Anthropic, Google, Meta, and Amazon were assigned roles such as HR manager, CTO, and financial analyst.

Performance Metrics: A Reality Check for AI Agents
The AI agents were tasked with 175 real-world professional activities, including analysing spreadsheets, writing performance reviews, assigning team members to projects based on budgets and availability, and navigating internal file systems. Sounds easy enough, right? Wrong.
The results highlighted significant limitations for AI agents:
Anthropic's Claude 3.5 Sonnet: Completed 24% of tasks, the highest among all models.
Google's Gemini 2.0 Flash: Achieved an 11.4% success rate.
OpenAI's GPT-4o: Managed to complete 8.6% of tasks.
Amazon's Nova Pro v1: Struggled with a mere 1.7% task completion rate.
These figures underscore that, despite advancements, AI agents are not yet equipped to handle the complexities of real-world corporate tasks autonomously.
So why did it go so wrong?
The experiment highlighted some very human things that today’s AI still can’t quite figure out:
Common sense isn’t so common: AI agents missed obvious steps and made baffling mistakes.
Terrible at office politics: they couldn’t figure out basic social dynamics, like who to ask for help.
Struggled to use the internet: even simple online tasks became major roadblocks.
Made up their own reality: when one agent couldn't find the colleague it needed on the company chat tool, it simply renamed another employee to that person's name and carried on as if it had found them. Not exactly HR-approved behaviour.
These shortcomings highlight the current limitations of AI in replicating human judgment and adaptability in professional settings.
Insights and Implications
While the experiment might seem like a critique of AI capabilities, it's more accurately a diagnostic tool revealing areas for improvement. The goal wasn't to showcase failure but to identify friction points in autonomy, coordination, and task completion.
The study underscores that while AI agents have potential, they currently function best as tools augmenting human work rather than replacing it. It highlights the importance of continued research and development to enhance AI's ability to handle complex, dynamic tasks. By addressing these areas, we can move closer to a future where AI agents are reliable collaborators in the workplace.