Topic

#arxiv

Loot, blog posts and adjacent themes connected to this topic. Follow the tag to keep it in your orbit.

Blog

Related reads

Knowledge & Learning

PlanBench-XL tests whether agents can recover when tool paths break

PlanBench-XL is a June 2026 arXiv benchmark for long-horizon LLM tool-use agents, with 327 retail tasks, 1,665 tools, retrieval-limited visi

Knowledge & Learning

AgentBench Shows Why AI Agent Accuracy Is Also a Compute Budget Problem

A KAIST paper and its AgentBench repository measure how dynamic reasoning changes AI agent latency, energy use, and infrastructure cost, not

Knowledge & Learning

SIA Tests Self-Improving AI Across Agent Harnesses and Model Weights

A new arXiv paper and official implementation show SIA updating both an agent scaffold and model weights, with reported gains on LawBench, G

Knowledge & Learning

CEO-Bench Tests Whether AI Agents Can Run a Startup for 500 Days

Knowledge & Learning

WorkBench Revisited Shows Why Workplace Agent Scores Need Source-Level Checks

WorkBench Revisited updates a workplace-agent benchmark with 2026 model runs, but the arXiv abstract and GitHub repository currently surface

Knowledge & Learning

CoDA-Bench tests whether coding agents can find the right data before writing code

CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer

Knowledge & Learning

Agents' Last Exam tests AI agents on real professional workflows

Agents' Last Exam is a new Berkeley-led benchmark for computer-use AI agents, with long-horizon professional tasks, verifiable outcomes, pub

Knowledge & Learning

New arXiv Paper Tests Compact Models Against LLMs for Multilingual Fact-Checking

A June 2026 arXiv paper from Factiverse reports that compact fine-tuned models can stay practical for multilingual fact-checking when latenc

Knowledge & Learning

SkillOpt trains agent skills as editable artifacts, not model weights

SkillOpt is a Microsoft Research project and arXiv paper that treats natural-language agent skills as trainable external state, using scored

AI & Automation

JetBrains Mellum2 ships as an open MoE model for coding agents

JetBrains released Mellum2, an Apache-2.0 open-weight 12B Mixture-of-Experts model built for software engineering, routing, RAG, and low-lat

Knowledge & Learning

LongTraceRL trains long-context reasoning from search-agent trajectories

LongTraceRL uses search-agent trajectories, tiered distractors, and entity-level rubric rewards to improve long-context reasoning across fiv

Creative & Media

ActCam brings zero-shot camera-path control to AI video generation

ActCam is a new arXiv paper and project release that combines character-motion transfer with per-frame camera control, aiming to make AI vid