WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks

Q: Why do the WorkBench scores differ between arXiv and GitHub?

The arXiv v1 abstract reports Claude Opus 4.8 at 89% completion and 2.5% harmful actions, while the GitHub README currently reports Claude Fable 5 at 92% and 1.9%. Treat the artifact version as part of the citation.

Q: Is WorkBench enough to approve workplace agents?

No. It is useful evidence, but teams still need workflow-specific testing, human approval points, logging, and rollback plans.

Q: What should buyers look at besides task completion?

Harmful side effects, reversibility, tool permissions, data access, audit logs, and whether the agent asks for confirmation before high-impact actions.

[Image: GitHub preview for the WorkBench benchmark repository. Caption: WorkBench GitHub repository]

Knowledge & Learning | Jun 18, 2026

@Zachas (Author, Admin)

WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface different top-line scores, making it a useful case study in how to verify agent benchmarks before citing them.

WorkBench Revisited is a June 2026 follow-up to the WorkBench benchmark for workplace agents. The arXiv abstract reports that the best 2026 agent, Claude Opus 4.8, completed 89% of tasks and caused unintended harmful actions on 2.5%, compared with GPT-4's 43% completion and 26% harmful-action rate in 2024. The linked GitHub repository currently presents a different top line: Claude Fable 5 at 92% completion and 1.9% harmful actions. Readers should verify which artifact and version they are citing.

Key takeaways

The arXiv paper is listed as 2606.13715v1, submitted June 10, 2026, under cs.AI with cs.CL and cs.MA cross-listing.
The arXiv abstract says WorkBench Revisited updates the benchmark with data and code quality improvements, new model scores, and analysis of agent progress since 2024.
The paper's abstract and the repository agree on the direction of change: workplace-agent task completion improved sharply while harmful side effects dropped.
The top-line best-model result differs across artifacts: arXiv names Claude Opus 4.8 at 89% and 2.5%, while the repository README names Claude Fable 5 at 92% and 1.9%.
The repository says per-task results, metadata sidecars, figures, and scoring scripts are committed so evaluation numbers can be reproduced without new inference runs.
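
As a rough illustration of what recomputing top-line numbers from committed per-task results could look like, the sketch below derives a completion rate and a harmful-action rate from a list of per-task records. The record fields (task_id, completed, harmful_side_effect) and the sample rows are assumptions made up for this example. The repository's actual file layout and scoring scripts may differ and remain the authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical per-task record; field names are assumptions,
    # not the repository's real schema.
    task_id: str
    completed: bool
    harmful_side_effect: bool


def top_line(results: list[TaskResult]) -> tuple[float, float]:
    """Return (completion %, harmful-action %) over all tasks."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    completion = 100.0 * sum(r.completed for r in results) / n
    harmful = 100.0 * sum(r.harmful_side_effect for r in results) / n
    return completion, harmful


# Made-up sample rows, for illustration only.
sample = [
    TaskResult("t1", True, False),
    TaskResult("t2", True, False),
    TaskResult("t3", False, True),
    TaskResult("t4", True, False),
]
completion_pct, harmful_pct = top_line(sample)
print(f"completion={completion_pct:.1f}% harmful={harmful_pct:.1f}%")
```

Keeping the two rates as separate outputs mirrors the article's point that completion and harmful side effects should be read together rather than collapsed into one score.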

Practical LinkLoot angle

This is less a "model leaderboard winner" story than a benchmark hygiene story. If you build or buy workplace agents for email, calendar, CRM, documents, or internal ops, the useful lesson is to inspect the benchmark artifact before repeating the headline number. A score can change between an arXiv version, a repository README, committed result files, and later corrections.

Artifact	Best use	Limitation	Source
arXiv abstract	Stable citation and paper metadata	May lag repository updates or corrections	arXiv
GitHub README	Implementation notes, reproducibility path, current project framing	Can change after the paper version	WorkBench repository
Committed result files	Reproducing scores and checking side effects	Requires local audit of data, scoring scripts, and metadata	WorkBench repository

For teams turning agent research into operating procedure, LinkLoot's AI agent tools guide is a better next stop than a raw leaderboard. The deployment question is not only "which model scores highest?" It is also which actions need human confirmation, rollback, logging, and policy checks.

What to verify before you act

Match the number to the artifact. If you cite the paper, use the arXiv top line unless you explicitly cite the GitHub README or committed results. If you cite the repository, note the date you checked it because README language and result files can change.
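
One way to make "match the number to the artifact" concrete is to store each cited score with its artifact, version, and check date, then compare records before repeating a headline. The sketch below is a minimal example under assumptions: the ScoreCitation class, the disagreements helper, the "unpinned" version label, and the check dates are all hypothetical placeholders. Only the model names, percentages, and the arXiv identifier come from the sources described in this article.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScoreCitation:
    artifact: str        # e.g. "arXiv abstract" or "GitHub README"
    version: str         # paper version, or a label for the repository state
    checked_on: date     # the date you personally verified the number
    model: str
    completion_pct: float
    harmful_pct: float

    def label(self) -> str:
        """Render the score together with where and when it was checked."""
        return (
            f"{self.model}: {self.completion_pct}% completion, "
            f"{self.harmful_pct}% harmful actions "
            f"({self.artifact}, {self.version}, "
            f"checked {self.checked_on.isoformat()})"
        )


def disagreements(a: ScoreCitation, b: ScoreCitation) -> list[str]:
    """List the top-line fields on which two citations differ."""
    fields = ("model", "completion_pct", "harmful_pct")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


# The check dates and the "unpinned" label are placeholders; replace them
# with the date you checked and, ideally, a specific commit or tag.
arxiv = ScoreCitation("arXiv abstract", "2606.13715v1", date(2026, 6, 18),
                      "Claude Opus 4.8", 89.0, 2.5)
readme = ScoreCitation("GitHub README", "unpinned", date(2026, 6, 18),
                       "Claude Fable 5", 92.0, 1.9)

print(arxiv.label())
print(readme.label())
print("differs on:", disagreements(arxiv, readme))
```

Recording the check date next to the number means a later README change shows up as a new record instead of silently overwriting the old claim.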

Check whether your task risk resembles WorkBench. The benchmark covers realistic workplace tasks with read/write tools, including actions that can change state. That is closer to daily business automation than many chat-only benchmarks, but it still cannot prove your CRM, inbox, compliance rules, or customer data workflow is safe.

Audit failure modes, not only completion. The arXiv abstract highlights residual basic mistakes that can still cause irreversible harm, such as sending an email to the wrong person. That is the class of error a production workflow should route through confirmation, preview, or reversible staging.
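
The routing idea above can be sketched as a small gate that lets reversible actions run directly and holds irreversible ones until a reviewer approves a preview. This is an illustrative pattern only, not something taken from WorkBench. The Action class, the run_with_gate function, and the example email address are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str          # e.g. "send_email"
    reversible: bool   # can the effect be undone after execution?
    summary: str       # human-readable preview shown to the reviewer


def run_with_gate(
    action: Action,
    execute: Callable[[], None],
    confirm: Callable[[str], bool],
) -> str:
    """Run reversible actions directly; ask for confirmation otherwise."""
    if not action.reversible and not confirm(action.summary):
        return "blocked"
    execute()
    return "executed"


# Usage: an outbox list stands in for a real mail system.
outbox: list[str] = []
send = Action("send_email", reversible=False,
              summary="Send Q3 report to alice@example.com")

# The reviewer rejects the preview, so nothing is sent.
rejected = run_with_gate(send, lambda: outbox.append("Q3 report"),
                         confirm=lambda preview: False)

# A reversible draft runs without asking anyone.
draft = Action("save_draft", reversible=True, summary="Save draft of Q3 report")
saved = run_with_gate(draft, lambda: outbox.append("draft"),
                      confirm=lambda preview: False)

print(rejected, saved, outbox)
```

In a real deployment the confirm callback would surface the preview to a person and log the decision, which ties the gate to the audit-log and rollback checks listed earlier.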

FAQ

What is WorkBench Revisited?

It is a 2026 follow-up to the WorkBench workplace-agent benchmark, with updated model runs, corrected data, and new analysis.

Source check

The arXiv page confirms the paper title, author, June 10, 2026 submission date, subjects, follow-up relationship to arXiv:2405.00823, and the 89%/2.5% top-line result for Claude Opus 4.8. The WorkBench GitHub repository corroborates the benchmark's purpose, the 2026 follow-up framing, committed result artifacts, reproducibility path, and a repository-level 92%/1.9% result for Claude Fable 5.

Sources & links

References, demos, and supporting links.

arXiv paper (arxiv.org), primary source
WorkBench GitHub repository (github.com)