WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks
WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface different top-line scores, making it a useful case study in how to verify agent benchmarks before citing them.
WorkBench Revisited is a June 2026 follow-up to the WorkBench benchmark for workplace agents. The arXiv abstract reports that the best 2026 agent, Claude Opus 4.8, completed 89% of tasks and caused unintended harmful actions on 2.5%, compared with GPT-4's 43% completion and 26% harmful-action rate in 2024. The linked GitHub repository currently presents a different, apparently newer top line: Claude Fable 5 at 92% completion and 1.9% harmful actions. Readers should therefore verify which artifact and version they are citing.
Key takeaways
- The arXiv paper is listed as 2606.13715v1, submitted June 10, 2026, under cs.AI with cs.CL and cs.MA cross-listing.
- The arXiv abstract says WorkBench Revisited updates the benchmark with data and code quality improvements, new model scores, and analysis of agent progress since 2024.
- The paper's abstract and the repository agree on the direction of change: workplace-agent task completion improved sharply while harmful side effects dropped.
- The top-line best-model result differs across artifacts: arXiv names Claude Opus 4.8 at 89% and 2.5%, while the repository README names Claude Fable 5 at 92% and 1.9%.
- The repository says per-task results, metadata sidecars, figures, and scoring scripts are committed so evaluation numbers can be reproduced without new inference runs.
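
As a rough illustration of what recomputing top-line numbers from committed per-task results might involve, the Python sketch below derives a completion rate and a harmful-action rate from toy records. The `TaskResult` fields and the `summarize` helper are assumptions made for illustration, as is the choice to count harm per task; the repository's actual result schema and scoring scripts may differ.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TaskResult:
    """One per-task outcome. Field names are hypothetical, not the repository's schema."""
    task_id: str
    completed: bool       # the agent reached the expected end state
    harmful_action: bool  # the agent made at least one unintended state change


def summarize(results: Sequence[TaskResult]) -> Tuple[float, float]:
    """Return (completion %, harmful-action %) over all tasks, rounded to one decimal."""
    if not results:
        raise ValueError("no results to summarize")
    total = len(results)
    completion_pct = 100 * sum(r.completed for r in results) / total
    harmful_pct = 100 * sum(r.harmful_action for r in results) / total
    return round(completion_pct, 1), round(harmful_pct, 1)


# Toy records for illustration only; these are not benchmark data.
toy_results = [
    TaskResult("t1", completed=True, harmful_action=False),
    TaskResult("t2", completed=True, harmful_action=False),
    TaskResult("t3", completed=False, harmful_action=True),
    TaskResult("t4", completed=True, harmful_action=False),
]
completion, harmful = summarize(toy_results)
```

If a recomputed pair does not match the headline you intend to cite, that mismatch is the signal to check which artifact and version the headline came from.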
Practical LinkLoot angle
This is less a "model leaderboard winner" story than a benchmark hygiene story. If you build or buy workplace agents for email, calendar, CRM, documents, or internal ops, the useful lesson is to inspect the benchmark artifact before repeating the headline number. A score can change between an arXiv version, a repository README, committed result files, and later corrections.
| Artifact | Best use | Limitation | Source |
|---|---|---|---|
| arXiv abstract | Stable citation and paper metadata | May lag repository updates or corrections | arXiv |
| GitHub README | Implementation notes, reproducibility path, current project framing | Can change after the paper version | WorkBench repository |
| Committed result files | Reproducing scores and checking side effects | Requires local audit of data, scoring scripts, and metadata | WorkBench repository |
For teams turning agent research into operating procedure, LinkLoot's AI agent tools guide is the better next stop than a raw leaderboard. The deployment question is not only "which model scores highest?" It is which actions need human confirmation, rollback, logging, and policy checks.
What to verify before you act
Match the number to the artifact. If you cite the paper, use the arXiv top line unless you explicitly cite the GitHub README or committed results. If you cite the repository, note the date you checked it because README language and result files can change.
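
One way to make that habit concrete is to store every cited score together with its artifact, version, and check date. The sketch below is a minimal Python example under that assumption; `ScoreCitation`, `conflicts`, and `format_citation` are hypothetical names, and the check date is a placeholder rather than a recorded fact.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScoreCitation:
    """A cited benchmark score plus the artifact it was read from (hypothetical structure)."""
    artifact: str                      # e.g. "arXiv abstract" or "GitHub README"
    version: str                       # arXiv version ID, commit hash, or similar
    model: str
    completion_pct: float
    harmful_pct: float
    checked_on: Optional[date] = None  # matters most for artifacts that can change


def conflicts(a: ScoreCitation, b: ScoreCitation) -> bool:
    """True when two artifacts report a different best model or different numbers."""
    return (a.model, a.completion_pct, a.harmful_pct) != (
        b.model, b.completion_pct, b.harmful_pct)


def format_citation(c: ScoreCitation) -> str:
    """Render a one-line citation that always names the artifact and version."""
    checked = f", checked {c.checked_on.isoformat()}" if c.checked_on else ""
    return (f"{c.model}: {c.completion_pct:g}% completion, "
            f"{c.harmful_pct:g}% harmful actions "
            f"({c.artifact}, {c.version}{checked})")


arxiv = ScoreCitation("arXiv abstract", "2606.13715v1", "Claude Opus 4.8", 89.0, 2.5)
readme = ScoreCitation(
    "GitHub README", "unpinned", "Claude Fable 5", 92.0, 1.9,
    checked_on=date(2026, 6, 15),  # placeholder: replace with the day you actually checked
)
needs_note = conflicts(arxiv, readme)
```

Here `needs_note` is true, so any write-up quoting either score should say which artifact it came from.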
Check whether your task risk resembles WorkBench. The benchmark covers realistic workplace tasks with read/write tools, including actions that can change state. That is closer to daily business automation than many chat-only benchmarks, but it still cannot prove your CRM, inbox, compliance rules, or customer data workflow is safe.
Audit failure modes, not only completion. The arXiv abstract highlights residual basic mistakes that can still cause irreversible harm, such as sending an email to the wrong person. That is the class of error a production workflow should route through confirmation, preview, or reversible staging.
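
A minimal Python sketch of that routing idea follows, assuming a hand-maintained set of irreversible tool names and a caller-supplied confirmation callback. The tool names, `ProposedAction`, and `route` are hypothetical and not part of WorkBench or any agent framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names; a real deployment would derive this set from its own tool registry.
IRREVERSIBLE_TOOLS = {"send_email", "delete_record"}


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    summary: str  # human-readable preview shown to the reviewer


def route(action: ProposedAction, confirm: Callable[[ProposedAction], bool]) -> str:
    """Return "execute" or "blocked"; only irreversible tools ask for confirmation."""
    if action.tool not in IRREVERSIBLE_TOOLS:
        return "execute"
    return "execute" if confirm(action) else "blocked"


def always_decline(action: ProposedAction) -> bool:
    """Stand-in reviewer that rejects everything, used here to show the gate."""
    return False


email_decision = route(
    ProposedAction("send_email", "Send Q3 report to a new recipient"), always_decline)
search_decision = route(
    ProposedAction("search_calendar", "Find free slots next week"), always_decline)
```

With the declining reviewer, the email send is blocked while the read-only calendar search still runs, which is the split between state-changing and read-only actions that the paragraph above describes.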
In short, WorkBench Revisited is a 2026 follow-up to the WorkBench workplace-agent benchmark, with updated model runs, corrected data, and new analysis.
Source check
The arXiv page confirms the paper title, author, June 10, 2026 submission date, subjects, follow-up relationship to arXiv:2405.00823, and the 89%/2.5% top-line result for Claude Opus 4.8. The WorkBench GitHub repository corroborates the benchmark's purpose, the 2026 follow-up framing, committed result artifacts, reproducibility path, and a repository-level 92%/1.9% result for Claude Fable 5.
