Topic

#arxiv

Loot, blog posts and adjacent themes connected to this topic. Follow the tag to keep it in your orbit.

Related reads

PlanBench-XL tests whether agents can recover when tool paths break

PlanBench-XL is a June 2026 arXiv benchmark for long-horizon LLM tool-use agents, with 327 retail tasks, 1,665 tools, retrieval-limited visi…

Knowledge & Learning

AgentBench Shows Why AI Agent Accuracy Is Also a Compute Budget Problem

A KAIST paper and its AgentBench repository measure how dynamic reasoning changes AI agent latency, energy use, and infrastructure cost, not…

Knowledge & Learning

SIA Tests Self-Improving AI Across Agent Harnesses and Model Weights

A new arXiv paper and official implementation show SIA updating both an agent scaffold and model weights, with reported gains on LawBench, G…

Knowledge & Learning

CEO-Bench Tests Whether AI Agents Can Run a Startup for 500 Days

Knowledge & Learning

WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks

WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface…

Knowledge & Learning

CoDA-Bench tests whether coding agents can find the right data before writing code

CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer…

Knowledge & Learning

Agents' Last Exam tests AI agents on real professional workflows

Agents' Last Exam is a new Berkeley-led benchmark for computer-use AI agents, with long-horizon professional tasks, verifiable outcomes, pub…

Knowledge & Learning

New arXiv Paper Tests Compact Models Against LLMs for Multilingual Fact-Checking

A June 2026 arXiv paper from Factiverse reports that compact fine-tuned models can stay practical for multilingual fact-checking when latenc…

Knowledge & Learning

SkillOpt trains agent skills as editable artifacts, not model weights

SkillOpt is a Microsoft Research project and arXiv paper that treats natural-language agent skills as trainable external state, using scored…

AI & Automation

JetBrains Mellum2 ships as an open MoE model for coding agents

JetBrains released Mellum2, an Apache-2.0 open-weight 12B Mixture-of-Experts model built for software engineering, routing, RAG, and low-lat…

Knowledge & Learning

LongTraceRL trains long-context reasoning from search-agent trajectories

LongTraceRL uses search-agent trajectories, tiered distractors, and entity-level rubric rewards to improve long-context reasoning across fiv…

Creative & Media

ActCam brings zero-shot camera-path control to AI video generation

ActCam is a new arXiv paper and project release that combines character-motion transfer with per-frame camera control, aiming to make AI vid…
