Topic

#benchmarks

Loot, blog posts, and adjacent themes connected to this topic.

Related reads

Hugging Face published an agent-evaluation harness that tests whether coding agents can use a library efficiently, not only whether they rea…

WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface…

CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer…

Deep-XPIA is an open-source benchmark for cross-prompt injection in multi-agent systems, with live Claude Haiku measurements, a confused-dep…

EvalEval's beta Evaluation Cards project maps AI evaluation results with reproducibility, completeness, provenance, and comparability signal…

ServiceNow-AI published a Hugging Face benchmark and dataset for code-switched ASR, testing how voice-agent transcription handles Spanish-En…

AgingBench is a new benchmark for long-lived AI agents, measuring reliability decay across sessions instead of only testing freshly initiali…

IBM Research and Hugging Face introduced the Open Agent Leaderboard, an open benchmark stack for comparing complete AI agent systems across …

According to heise, OpenAI is positioning GPT-5.5 as an agentic work model: more planning, more tool use, and more consistent execution acro…