#arxiv
Loot, blog posts and adjacent themes connected to this topic.
Related reads
PlanBench-XL tests whether agents can recover when tool paths break
PlanBench-XL is a June 2026 arXiv benchmark for long-horizon LLM tool-use agents, with 327 retail tasks, 1,665 tools, retrieval-limited visi…
AgentBench Shows Why AI Agent Accuracy Is Also a Compute Budget Problem
A KAIST paper and its AgentBench repository measure how dynamic reasoning changes AI agent latency, energy use, and infrastructure cost, not…
SIA Tests Self-Improving AI Across Agent Harnesses and Model Weights
A new arXiv paper and official implementation show SIA updating both an agent scaffold and model weights, with reported gains on LawBench, G…
CEO-Bench Tests Whether AI Agents Can Run a Startup for 500 Days
WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks
WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface…
CoDA-Bench tests whether coding agents can find the right data before writing code
CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer…
Agents' Last Exam tests AI agents on real professional workflows
Agents' Last Exam is a new Berkeley-led benchmark for computer-use AI agents, with long-horizon professional tasks, verifiable outcomes, pub…
New arXiv Paper Tests Compact Models Against LLMs for Multilingual Fact-Checking
A June 2026 arXiv paper from Factiverse reports that compact fine-tuned models can stay practical for multilingual fact-checking when latenc…
SkillOpt trains agent skills as editable artifacts, not model weights
SkillOpt is a Microsoft Research project and arXiv paper that treats natural-language agent skills as trainable external state, using scored…
JetBrains Mellum2 ships as an open MoE model for coding agents
JetBrains released Mellum2, an Apache-2.0 open-weight 12B Mixture-of-Experts model built for software engineering, routing, RAG, and low-lat…
LongTraceRL trains long-context reasoning from search-agent trajectories
LongTraceRL uses search-agent trajectories, tiered distractors, and entity-level rubric rewards to improve long-context reasoning across fiv…
ActCam brings zero-shot camera-path control to AI video generation
ActCam is a new arXiv paper and project release that combines character-motion transfer with per-frame camera control, aiming to make AI vid…