CoDA-Bench tests whether coding agents can find the right data before writing code

Q: Why is CoDA-Bench different from normal coding benchmarks?

It adds noisy data discovery and file-system navigation instead of giving agents a clean coding task with obvious inputs.

Q: Is CoDA-Bench ready to run?

The official repository describes Docker evaluation material and benchmark files, but builders should check current setup, data size, and API cost before running it.

Image: OpenGraph image from the official CoDA-Bench GitHub repository (credit: RUC-DataLab, GitHub).

Knowledge & Learning | Jun 17, 2026

By @Zachas (author)

CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer analytical questions.

CoDA-Bench is a benchmark for code agents working in data-heavy environments rather than tidy coding tasks with obvious input files. The paper describes 1,009 tasks across 31 communities, with each task environment averaging 980 files. Agents must discover the relevant data, write code, and produce an analytical answer, which makes the benchmark useful for judging data-science agents and repository automation beyond simple patch generation.

Key takeaways

CoDA-Bench focuses on the combined problem of data discovery and code execution.
The benchmark uses Linux sandbox environments built around Kaggle-style data communities.
The paper reports 1,009 tasks, 31 communities, and an average of 980 files per task environment.
The authors say even top-performing systems struggle, with the best reported success rate at 61.1%.
The official repository includes Docker-oriented evaluation material, benchmark files, and a hard subset for more difficult cases.

Practical LinkLoot angle

Most agent benchmarks hide an important workflow problem: the agent often gets clean context before it starts coding. CoDA-Bench tests the messier version: can the agent inspect a directory, choose the right files, combine data sources, write the analysis, and avoid false confidence?

Benchmark angle	What it tests	Why builders should care	Limitation
CoDA-Bench	Data discovery plus code execution	Useful for data-science agents and analytics assistants	New benchmark; reproduce setup before trusting rankings
SWE-style coding benchmarks	Issue fixing and repository edits	Useful for software maintenance agents	Often less focused on noisy data search
Internal eval suites	Your actual workflows	Best signal for buying or routing decisions	Requires careful task design and maintenance

For LinkLoot readers building AI workflow automation, the immediate use is not the leaderboard. Use the benchmark design as a template for internal evals: give agents cluttered folders, realistic file names, hidden irrelevant data, and answer checks. Pair it with the automation guide at /guides/ai-workflow-automation when deciding which agent workflows deserve production access.
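The internal-eval idea above can be sketched in a few lines of Python. This is a minimal illustration, not CoDA-Bench's own tooling: the function names `build_noisy_task` and `check_answer`, the folder names, the `orders_2024.csv` file, and the question are all hypothetical. It plants one relevant CSV among look-alike distractors and pairs it with an exact-match answer check.

```python
import csv
import random
import tempfile
from pathlib import Path


def build_noisy_task(root: Path, n_distractors: int = 20, seed: int = 0) -> dict:
    """Create one relevant CSV hidden among distractor files; return the task spec."""
    rng = random.Random(seed)
    root.mkdir(parents=True, exist_ok=True)

    # The one file the agent actually needs (hypothetical name and schema).
    relevant = root / "sales" / "orders_2024.csv"
    relevant.parent.mkdir(parents=True, exist_ok=True)
    amounts = [rng.randint(10, 500) for _ in range(50)]
    with relevant.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])
        writer.writerows(enumerate(amounts, start=1))

    # Distractors: plausible names, same schema, different values.
    for i in range(n_distractors):
        folder = root / rng.choice(["archive", "tmp", "exports", "sales"])
        folder.mkdir(parents=True, exist_ok=True)
        with (folder / f"orders_backup_{i:03d}.csv").open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["order_id", "amount"])
            writer.writerows((j, rng.randint(10, 500)) for j in range(1, 51))

    return {
        "question": "What is the total order amount in the 2024 orders file?",
        "expected": sum(amounts),
    }


def check_answer(task: dict, agent_answer: str) -> bool:
    """Exact-match check on an integer answer; unparseable answers count as failures."""
    try:
        return int(agent_answer.strip()) == task["expected"]
    except ValueError:
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        task = build_noisy_task(Path(tmp) / "env")
        print(task["question"])
        print(check_answer(task, str(task["expected"])))
```

Because the distractors share the relevant file's schema, an agent that picks a file by column names alone will often return a confident but wrong total, which is the failure mode this kind of eval is meant to surface.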

What to verify before you act

Read the repository setup before running the benchmark because the data size, Docker mode, and LLM API access shape cost and reproducibility. Check dataset licenses before reusing community data in a commercial eval. If a vendor cites CoDA-Bench later, look for the exact agent harness, model version, tool permissions, timeout, and whether the evaluation used the hard subset or the full task set.
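One way to make that vendor check concrete is to record the run conditions in a small structure and compare two claims field by field. The sketch below is an assumption-laden illustration: `EvalRunRecord`, `differing_conditions`, and every value in the usage example are hypothetical and do not come from the CoDA-Bench paper or repository.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalRunRecord:
    """Conditions behind a single reported benchmark score (hypothetical schema)."""
    benchmark: str
    subset: str                       # e.g. "full" or "hard"
    agent_harness: str
    model_version: str
    tool_permissions: Tuple[str, ...]
    timeout_seconds: int
    reported_success_rate: float


def differing_conditions(a: EvalRunRecord, b: EvalRunRecord) -> List[str]:
    """Return the run conditions (not the score) on which two records disagree."""
    da, db = asdict(a), asdict(b)
    return [
        name for name in da
        if name != "reported_success_rate" and da[name] != db[name]
    ]


if __name__ == "__main__":
    # All values below are made up purely to show the comparison.
    run_a = EvalRunRecord("CoDA-Bench", "full", "harness-x", "model-1",
                          ("shell", "python"), 600, 0.50)
    run_b = EvalRunRecord("CoDA-Bench", "hard", "harness-x", "model-1",
                          ("shell", "python"), 600, 0.40)
    print(differing_conditions(run_a, run_b))  # prints ['subset']
```

If the list is non-empty, the two scores were produced under different conditions and should not be compared directly.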

Source check

The arXiv paper confirms the benchmark scope, task count, subject area, ICML 2026 acceptance note, and the best reported success rate of 61.1%. The GitHub repository confirms the evaluation framing, Docker isolation goal, benchmark files, and public project state. The project page and Hugging Face paper page corroborate the project identity, paper link, repository link, and data-intensive agent focus. The Hugging Face page contained an installation snippet, so it was used only for factual cross-checks, not as operational instruction.

FAQ

What is CoDA-Bench?

CoDA-Bench is a benchmark for code agents that must discover relevant data files, write code, and answer data-analysis questions.

Sources & links

References, demos, and supporting links.

arXiv paper (arxiv.org), primary source
Official GitHub repository (github.com)
Project page (coda-bench.github.io)
Hugging Face paper page (huggingface.co)