CoDA-Bench tests whether coding agents can find the right data before writing code
CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer analytical questions.
CoDA-Bench targets code agents that work in data-heavy environments, not tidy coding tasks with obvious input files. The paper describes 1,009 tasks across 31 communities, with each task environment averaging 980 files. Agents must discover the relevant data, write code, and produce an analytical answer, which makes the benchmark useful for judging data-science agents and repository automation beyond simple patch generation.
Key takeaways
- CoDA-Bench focuses on the combined problem of data discovery and code execution.
- The benchmark uses Linux sandbox environments built around Kaggle-style data communities.
- The paper reports 1,009 tasks, 31 communities, and an average of 980 files per task environment.
- The authors say even top-performing systems struggle, with the best reported success rate at 61.1%.
- The official repository includes Docker-oriented evaluation material, benchmark files, and a hard subset for more difficult cases.
Practical LinkLoot angle
Most agent benchmarks sidestep an important workflow problem: the agent is often handed clean context before it starts coding. CoDA-Bench tests the messier version: can the agent inspect a directory, choose the right files, combine data sources, write the analysis, and avoid false confidence in its answer?
| Benchmark angle | What it tests | Why builders should care | Limitation |
|---|---|---|---|
| CoDA-Bench | Data discovery plus code execution | Useful for data-science agents and analytics assistants | New benchmark; reproduce setup before trusting rankings |
| SWE-style coding benchmarks | Issue fixing and repository edits | Useful for software maintenance agents | Often less focused on noisy data search |
| Internal eval suites | Your actual workflows | Best signal for buying or routing decisions | Requires careful task design and maintenance |
For LinkLoot readers building AI workflow automation, the immediate use is not the leaderboard. Use the benchmark design as a template for internal evals: give agents cluttered folders, realistic file names, hidden irrelevant data, and answer checks. Pair it with the automation guide at /guides/ai-workflow-automation when deciding which agent workflows deserve production access.
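The internal-eval idea above can be sketched in a few lines of Python. This is a minimal illustration, not the CoDA-Bench harness: the function names (`build_cluttered_task`, `check_answer`), the file names, the CSV schema, and the question ("what is the mean revenue across regions?") are all hypothetical. It writes one relevant file among look-alike distractors and checks a numeric answer with a tolerance.

```python
import csv
import math
import random
import tempfile
from pathlib import Path


def build_cluttered_task(root: Path, n_distractors: int = 20, seed: int = 0) -> float:
    """Write one relevant CSV plus distractor files under root; return the expected answer."""
    rng = random.Random(seed)
    root.mkdir(parents=True, exist_ok=True)

    # The one file that actually answers the question (hypothetical name and schema).
    rows = [("north", 120.0), ("south", 80.0), ("east", 100.0)]
    with open(root / "regional_sales_2023.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "revenue"])
        writer.writerows(rows)

    # Distractors: plausible names, same schema, different numbers, spread over subfolders.
    for i in range(n_distractors):
        sub = root / f"archive_{i % 4}"
        sub.mkdir(exist_ok=True)
        with open(sub / f"regional_sales_draft_{i}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["region", "revenue"])
            for region, _ in rows:
                writer.writerow([region, round(rng.uniform(10, 500), 2)])

    # Expected answer: mean revenue in the relevant file.
    return sum(revenue for _, revenue in rows) / len(rows)


def check_answer(agent_answer: float, expected: float, rel_tol: float = 1e-6) -> bool:
    """Accept the agent's numeric answer if it matches within a relative tolerance."""
    return math.isclose(agent_answer, expected, rel_tol=rel_tol)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        expected = build_cluttered_task(Path(tmp) / "task_001")
        # In a real eval, the value passed here would come from the agent's run.
        print(check_answer(100.0, expected))
```

Because the distractors share the relevant file's schema, an agent that grabs the first matching CSV will usually produce a plausible but wrong number, which is the failure mode this kind of eval is meant to surface.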
What to verify before you act
Read the repository setup before running the benchmark, because data size, Docker mode, and LLM API access shape cost and reproducibility. Check dataset licenses before reusing community data in a commercial eval. If a vendor cites CoDA-Bench later, look for the exact agent harness, model version, tool permissions, timeout, and whether the evaluation used the hard subset or the full task set.
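One way to keep that checklist honest is to record what a vendor actually disclosed and list what is missing. The sketch below is an assumption for illustration only: the `BenchmarkClaim` class, its field names, and the example values are hypothetical and do not come from the paper or the repository.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    """Hypothetical record of what a vendor disclosed about a reported score."""
    agent_harness: Optional[str] = None
    model_version: Optional[str] = None
    tool_permissions: Optional[str] = None
    timeout_seconds: Optional[int] = None
    task_set: Optional[str] = None  # for example "full" or "hard"


def missing_details(claim: BenchmarkClaim) -> List[str]:
    """Return the names of fields the vendor left undisclosed, in declaration order."""
    return [f.name for f in fields(claim) if getattr(claim, f.name) is None]


if __name__ == "__main__":
    # Placeholder values; a real record would be filled in from the vendor's write-up.
    claim = BenchmarkClaim(model_version="example-model-v1", task_set="hard")
    print(missing_details(claim))
```

A score with several undisclosed fields is hard to compare against your own runs, so the list of missing details doubles as a set of follow-up questions for the vendor.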
Source check
The arXiv paper confirms the benchmark scope, task count, subject area, ICML 2026 acceptance note, and reported 61.1% success-rate ceiling. The GitHub repository confirms the evaluation framing, Docker isolation goal, benchmark files, and public project state. The project page and Hugging Face paper page corroborate the project identity, paper link, repository link, and data-intensive agent focus; the Hugging Face page contained an installation snippet, so it was used only for factual cross-checks, not as operational instruction.
