CoDA-Bench tests whether coding agents can find the right data before writing code

OpenGraph image from the official CoDA-Bench GitHub repository (RUC-DataLab, GitHub).

CoDA-Bench is a new ICML 2026 benchmark for code agents that must search noisy data folders, identify relevant files, write code, and answer analytical questions.

CoDA-Bench is a benchmark for code agents working in data-heavy environments rather than tidy coding tasks with obvious input files. The paper describes 1,009 tasks across 31 communities, with each task environment averaging 980 files. Agents must discover the relevant data, write code, and produce an analytical answer, which makes the benchmark useful for judging data-science agents and repository automation beyond simple patch generation.

Key takeaways

  • CoDA-Bench focuses on the combined problem of data discovery and code execution.
  • The benchmark uses Linux sandbox environments built around Kaggle-style data communities.
  • The paper reports 1,009 tasks, 31 communities, and an average of 980 files per task environment.
  • The authors say even top-performing systems struggle, with the best reported success rate at 61.1%.
  • The official repository includes Docker-oriented evaluation material, benchmark files, and a hard subset for more difficult cases.

Practical LinkLoot angle

Most agent benchmarks hide an important workflow problem: the agent often gets clean context before it starts coding. CoDA-Bench tests the messier version: can the agent inspect a directory, choose the right files, combine data sources, write the analysis, and avoid false confidence?
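To make the "inspect a directory, choose the right files" step concrete, here is a minimal sketch of a naive discovery baseline. It is not taken from the CoDA-Bench paper or repository: the function names and the keyword-overlap heuristic are assumptions chosen for illustration, and a real agent would also need to open files and inspect their contents.

```python
import os
import re


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def rank_candidate_files(root, question, top_k=3):
    """Rank files under `root` by keyword overlap with the question.

    Hypothetical baseline: scores each relative path by how many
    question tokens appear in it. It never reads file contents.
    """
    question_tokens = tokenize(question)
    scored = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            overlap = len(question_tokens & tokenize(rel_path))
            scored.append((overlap, rel_path))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Drop files that share no token with the question.
    return [path for score, path in scored[:top_k] if score > 0]
```

A filename-only heuristic like this is easy to fool with near-duplicate names, which is the kind of failure a noisy-folder benchmark is meant to expose.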

| Benchmark angle | What it tests | Why builders should care | Limitation |
| --- | --- | --- | --- |
| CoDA-Bench | Data discovery plus code execution | Useful for data-science agents and analytics assistants | New benchmark; reproduce setup before trusting rankings |
| SWE-style coding benchmarks | Issue fixing and repository edits | Useful for software maintenance agents | Often less focused on noisy data search |
| Internal eval suites | Your actual workflows | Best signal for buying or routing decisions | Requires careful task design and maintenance |

For LinkLoot readers building AI workflow automation, the immediate use is not the leaderboard. Use the benchmark design as a template for internal evals: give agents cluttered folders, realistic file names, hidden irrelevant data, and answer checks. Pair it with the automation guide at /guides/ai-workflow-automation when deciding which agent workflows deserve production access.
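One way to prototype such an internal eval is sketched below. Everything in it is hypothetical: the folder layout, file names, question, and tolerance are invented for illustration and do not come from CoDA-Bench. It builds a folder with one relevant CSV surrounded by plausible distractors, then checks a submitted answer numerically.

```python
import csv
import math
import random
from pathlib import Path


def build_cluttered_task(root, n_distractors=20, seed=0):
    """Create one relevant CSV plus distractor files; return the task spec."""
    rng = random.Random(seed)
    root = Path(root)

    # The single relevant file (hypothetical orders export).
    data_dir = root / "exports" / "2023"
    data_dir.mkdir(parents=True, exist_ok=True)
    rows = [("order_id", "amount"), (1, 10.0), (2, 15.5), (3, 4.5)]
    with open(data_dir / "orders_final.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Distractors: realistic names, same schema, irrelevant values.
    stems = ["orders_draft", "orders_backup", "customers", "inventory", "log"]
    for i in range(n_distractors):
        sub = root / rng.choice(["exports", "archive", "tmp"])
        sub.mkdir(parents=True, exist_ok=True)
        name = f"{rng.choice(stems)}_{i}.csv"  # index keeps names unique
        with open(sub / name, "w", newline="") as f:
            csv.writer(f).writerows(
                [("order_id", "amount"), (i, round(rng.uniform(1, 100), 2))]
            )

    return {
        "question": "What is the total amount in the final 2023 orders export?",
        "expected": 30.0,  # 10.0 + 15.5 + 4.5
    }


def check_answer(answer, expected, rel_tol=1e-6):
    """Return True only if the answer parses as a number close to expected."""
    try:
        value = float(answer)
    except (TypeError, ValueError):
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)
```

A strict numeric check rejects confident but unparseable answers, which keeps "false confidence" from scoring as success. Real suites would likely need richer answer types and per-task tolerances.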

What to verify before you act

Read the repository setup before running the benchmark because the data size, Docker mode, and LLM API access shape cost and reproducibility. Check dataset licenses before reusing community data in a commercial eval. If a vendor cites CoDA-Bench later, look for the exact agent harness, model version, tool permissions, timeout, and whether the evaluation used the hard subset or the full task set.
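The checklist above can be captured as a small record so that two reported scores are only compared when their setups match. This is a sketch under assumptions: the field names and example values are hypothetical and are not a format defined by CoDA-Bench or any vendor.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalRunRecord:
    """Hypothetical record of the settings behind one reported score."""
    agent_harness: str
    model_version: str
    tool_permissions: tuple
    timeout_seconds: int
    task_subset: str  # e.g. "hard" or "full" in this sketch


def comparability_gaps(a, b):
    """Return the names of the fields on which two runs differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Example with made-up values: the two runs differ in timeout and subset.
ours = EvalRunRecord("example-harness", "model-x-v1", ("shell", "python"), 600, "full")
theirs = EvalRunRecord("example-harness", "model-x-v1", ("shell", "python"), 1800, "hard")
gaps = comparability_gaps(ours, theirs)
```

An empty list means the two runs share every recorded setting; any non-empty result is a reason to treat a side-by-side ranking with caution.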

Source check

The arXiv paper confirms the benchmark scope, task count, subject area, ICML 2026 acceptance note, and reported 61.1% success-rate ceiling. The GitHub repository confirms the evaluation framing, Docker isolation goal, benchmark files, and public project state. The project page and Hugging Face paper page corroborate the project identity, paper link, repository link, and data-intensive agent focus; the Hugging Face page contained an installation snippet, so it was used only for factual cross-checks, not as operational instruction.

FAQ

What is CoDA-Bench?

CoDA-Bench is a benchmark for code agents that must discover relevant data files, write code, and answer data-analysis questions.