#evaluation
Loot, blog posts and adjacent themes connected to this topic.
Related reads
WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks
WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface…
Agents' Last Exam tests AI agents on real professional workflows
Agents' Last Exam is a new Berkeley-led benchmark for computer-use AI agents, with long-horizon professional tasks, verifiable outcomes, pub…
AgingBench asks how long AI agents stay reliable after deployment
AgingBench is a new benchmark for long-lived AI agents, measuring reliability decay across sessions instead of only testing freshly initiali…
The Open Agent Leaderboard compares full AI agent systems, not just models
IBM Research and Hugging Face introduced the Open Agent Leaderboard, an open benchmark stack for comparing complete AI agent systems across …