Topic

#evaluation

Loot, blog posts and adjacent themes connected to this topic. Follow the tag to keep it in your orbit.

Blog

Related reads

Knowledge & Learning

WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks

WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface…

Knowledge & Learning

Agents' Last Exam tests AI agents on real professional workflows

Agents' Last Exam is a new Berkeley-led benchmark for computer-use AI agents, with long-horizon professional tasks, verifiable outcomes, pub…

Knowledge & Learning

AgingBench asks how long AI agents stay reliable after deployment

AgingBench is a new benchmark for long-lived AI agents, measuring reliability decay across sessions instead of only testing freshly initiali…

Knowledge & Learning

The Open Agent Leaderboard compares full AI agent systems, not just models

IBM Research and Hugging Face introduced the Open Agent Leaderboard, an open benchmark stack for comparing complete AI agent systems across …