Topic

#benchmarks

Loot, blog posts and adjacent themes connected to this topic. Follow the tag to keep it in your orbit.

Loot

More from this topic

No loot for #benchmarks yet

When the community shares matching finds, they will appear here. For now, browse all loot or submit the first drop.

Blog

Related reads

AI & Automation

Hugging Face Shows How to Benchmark Whether Tools Are Agent-Friendly

Hugging Face published an agent-evaluation harness that tests whether coding agents can use a library efficiently, not only whether they rea

Knowledge & Learning

CEO-Bench Tests Whether AI Agents Can Run a Startup for 500 Days

Knowledge & Learning

WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks

WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface

Knowledge & Learning

CoDA-Bench tests whether coding agents can find the right data before writing code

CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer

AI & Automation

Deep-XPIA tests prompt injection across multi-agent handoffs

Deep-XPIA is an open-source benchmark for cross-prompt injection in multi-agent systems, with live Claude Haiku measurements, a confused-dep

Knowledge & Learning

Evaluation Cards exposes why AI benchmark scores are hard to trust

EvalEval's beta Evaluation Cards project maps AI evaluation results with reproducibility, completeness, provenance, and comparability signal

Knowledge & Learning

Hugging Face benchmark tests voice agents on code-switched customer speech

ServiceNow-AI published a Hugging Face benchmark and dataset for code-switched ASR, testing how voice-agent transcription handles Spanish-En

Knowledge & Learning

AgingBench asks how long AI agents stay reliable after deployment

AgingBench is a new benchmark for long-lived AI agents, measuring reliability decay across sessions instead of only testing freshly initiali

Knowledge & Learning

The Open Agent Leaderboard compares full AI agent systems, not just models

IBM Research and Hugging Face introduced the Open Agent Leaderboard, an open benchmark stack for comparing complete AI agent systems across

OpenClaw

GPT-5.5: OpenAI Wants More Agent, Less Chatbot

According to heise, OpenAI is positioning GPT-5.5 as an agentic work model: more planning, more tool use, and more consistent execution acro