#benchmarks
Loot, blog posts and adjacent themes connected to this topic.
Related reads
Hugging Face Shows How to Benchmark Whether Tools Are Agent-Friendly
Hugging Face published an agent-evaluation harness that tests whether coding agents can use a library efficiently, not only whether they rea…
CEO-Bench Tests Whether AI Agents Can Run a Startup for 500 Days
WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks
WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface…
CoDA-Bench tests whether coding agents can find the right data before writing code
CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer…
Deep-XPIA tests prompt injection across multi-agent handoffs
Deep-XPIA is an open-source benchmark for cross-prompt injection in multi-agent systems, with live Claude Haiku measurements, a confused-dep…
Evaluation Cards exposes why AI benchmark scores are hard to trust
EvalEval's beta Evaluation Cards project maps AI evaluation results with reproducibility, completeness, provenance, and comparability signal…
Hugging Face benchmark tests voice agents on code-switched customer speech
ServiceNow-AI published a Hugging Face benchmark and dataset for code-switched ASR, testing how voice-agent transcription handles Spanish-En…
AgingBench asks how long AI agents stay reliable after deployment
AgingBench is a new benchmark for long-lived AI agents, measuring reliability decay across sessions instead of only testing freshly initiali…
The Open Agent Leaderboard compares full AI agent systems, not just models
IBM Research and Hugging Face introduced the Open Agent Leaderboard, an open benchmark stack for comparing complete AI agent systems across …
GPT-5.5: OpenAI Wants More Agent, Less Chatbot
According to heise, OpenAI is positioning GPT-5.5 as an agentic work model: more planning, more tool use, and more consistent execution acro…