Use GeneBench-Pro to judge AI science agents beyond tidy benchmarks

Q: Is GeneBench-Pro fully open?

Not fully. OpenAI says 10 representative questions are open-sourced, while a 50-question subset is planned for Artificial Analysis.

Q: Which model leads GeneBench-Pro in OpenAI's results?

OpenAI reports GPT-5.6 Sol as its strongest model on the benchmark, with 28.7% at the highest reasoning level and 31.5% with Pro mode enabled.

Q: Can AI agents replace computational biology experts now?

No. OpenAI's own result shows the strongest reported setup still solves fewer than one third of the benchmark.

GeneBench-Pro coverage image from Gigazine's independent report. Image: Gigazine

AI & Automation, Jul 1, 2026

By @Zachas

OpenAI released GeneBench-Pro, a confirmed research benchmark for testing whether AI agents can handle messy computational biology analysis, not just clean workflow execution.

GeneBench-Pro tests whether AI agents can make judgment-heavy decisions in computational biology. It matters because it asks models to work through messy datasets, choose analysis paths, and return decision-ready answers rather than follow fixed instructions. Current results show progress but not reliability: OpenAI says GPT-5.6 Sol reaches 28.7% at its highest reasoning level and 31.5% with Pro mode enabled.

Source: Gigazine coverage of OpenAI's GeneBench-Pro release.

What changed

OpenAI published GeneBench-Pro on June 30, 2026, as a research-level benchmark for AI agents in genomics, quantitative biology, and translational medicine. It extends the earlier GeneBench work to harder tasks where the model must decide what the data can support, how to diagnose errors, and when to revise an analysis plan.

The benchmark contains 129 problems across 10 domains and 21 subdomains. OpenAI says each problem gives an agent a realistic dataset, experimental context, and a target estimand tied to a downstream decision. The model receives an isolated workspace with data files and a standard bioinformatics stack, then must produce a final answer that can be graded deterministically.

Benchmark detail	What it means for teams
129 problems	Broad enough to test more than one narrow biology workflow
10 domains, 21 subdomains	Covers genetics, omics, clinical interpretation, cancer genomics, and related areas
Synthetic data generation	OpenAI knows the target answer in advance, which reduces ambiguity in grading
10 representative questions	Public examples are available for inspection and reuse
50-question subset planned	Artificial Analysis is expected to receive an independent benchmarking subset
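
The deterministic grading described above can be pictured with a minimal sketch. Everything here is an assumption for illustration: the function names `grade_answer` and `benchmark_score`, the numeric-answer format, and the 5% relative tolerance are hypothetical and are not taken from OpenAI's announcement or paper. The only idea borrowed from the source is that synthetic data gives a known target, so a submitted answer can be checked by rule rather than by human judgment.

```python
import math


def grade_answer(submitted: float, target: float, rel_tol: float = 0.05) -> bool:
    """Return True when the submitted estimate is within rel_tol of the known target.

    Hypothetical grader: the 5% tolerance is an illustrative assumption,
    not a value from OpenAI's materials.
    """
    if math.isnan(submitted):
        # A missing or non-numeric result never counts as solved.
        return False
    return math.isclose(submitted, target, rel_tol=rel_tol)


def benchmark_score(results: list[bool]) -> float:
    """Fraction of problems graded as solved."""
    return sum(results) / len(results)


# Made-up targets and submissions, for illustration only.
graded = [
    grade_answer(0.43, 0.42),   # close enough to the target
    grade_answer(0.60, 0.42),   # too far from the target
    grade_answer(12.1, 12.0),   # close enough to the target
]
print(f"score: {benchmark_score(graded):.1%}")
```

Because the same inputs always produce the same pass or fail result, two evaluators running a grader of this kind on the same answers would report the same score, which is the property that makes third-party replication meaningful.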

Why this is early

This is early because the benchmark is new, the public release is partial, and third-party benchmarking is not complete yet. OpenAI says it is open-sourcing 10 representative GeneBench-Pro questions on Hugging Face and plans to provide a 50-question subset to Artificial Analysis for independent benchmarking.

Gigazine independently covered the release on July 1, 2026, and repeated the central claims: GeneBench-Pro tests whether AI can separate signal from noise, choose an analysis route, and handle the messy parts of biological data analysis. That confirms the announcement is public, but the strongest evidence still comes from OpenAI's own benchmark description and paper.

Key takeaways

GeneBench-Pro is about research judgment, not only code execution or factual recall.
GPT-5.6 Sol still solves fewer than one third of the benchmark at the strongest reported setting.
OpenAI argues the benchmark avoids common long-horizon evaluation problems by using synthetic data with known causal structure.
The public release is useful for inspection, but the broader benchmark is not fully open.
Teams should treat high scores as a signal of scientific-agent quality, not as proof that a model can replace human researchers.

Availability and access

Users cannot simply run the full benchmark as an open leaderboard today. OpenAI says 10 representative questions are being open-sourced on Hugging Face, with an interactive case-study interface for browsing. A 50-question subset is planned for Artificial Analysis, but public third-party results should be treated as pending until that evaluation is live.

The reported model results come from OpenAI's own evaluation. GPT-5.6 Sol reaches 28.7% at the highest reasoning level and 31.5% with Pro mode enabled. OpenAI also says a typical problem may take a human expert about 20 to 40 hours, while current inference costs can be only several dollars per problem.
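
To put those percentages in concrete terms, the sketch below converts them into approximate problem counts and totals the human-expert time implied by the 20 to 40 hour estimate. It assumes the reported score is a simple fraction of the 129 problems solved, which may not hold if OpenAI averages over repeated runs or awards partial credit, so treat the counts as rough. The helper name `approx_solved` is hypothetical.

```python
TOTAL_PROBLEMS = 129          # benchmark size reported by OpenAI
HUMAN_HOURS_RANGE = (20, 40)  # OpenAI's per-problem estimate for a human expert


def approx_solved(score_pct: float, total: int = TOTAL_PROBLEMS) -> int:
    """Approximate solved-problem count, assuming score = solved / total."""
    return round(score_pct / 100 * total)


for label, pct in [("highest reasoning level", 28.7), ("Pro mode", 31.5)]:
    solved = approx_solved(pct)
    print(f"{label}: about {solved} solved, {TOTAL_PROBLEMS - solved} unsolved")

low_hours = TOTAL_PROBLEMS * HUMAN_HOURS_RANGE[0]
high_hours = TOTAL_PROBLEMS * HUMAN_HOURS_RANGE[1]
print(f"human expert time for all problems: {low_hours} to {high_hours} hours")
```

Under that assumption the two settings correspond to roughly 37 and 41 solved problems, and the full benchmark represents about 2,580 to 5,160 hours of expert work.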

Practical LinkLoot angle

For builders, GeneBench-Pro is a useful filter for claims about "AI scientists." A model that performs well on short coding tasks can still fail when the job requires data-quality judgment, causal framing, or deciding whether a result is strong enough to support a downstream decision.

Use this benchmark as a procurement and workflow-design signal. If your agent handles scientific analysis, drug-discovery triage, genomics interpretation, or bioinformatics automation, ask whether its evaluation covers ambiguity, diagnostics, and revision loops. LinkLoot's AI workflow automation guide is the right follow-up for separating steps that can be automated now from steps that still need expert review.

What to verify before you act

Check the OpenAI announcement and paper for the exact benchmark scope before comparing it with coding or general reasoning benchmarks.
Wait for Artificial Analysis or another independent evaluator before treating the scores as market-wide model rankings.
Review which public examples are available on Hugging Face and whether they match your biology domain.
Check whether a model's result used high reasoning, Pro mode, a special harness, or extra test-time compute.
Keep human review in any workflow where a wrong biological conclusion could affect clinical, safety, investment, or research decisions.
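
One reason to be cautious with early third-party numbers is sample size. The sketch below computes an approximate 95% Wilson score interval for a pass rate measured on 50 questions versus 129. It assumes each question is an independent pass or fail trial and that the score is a plain pass rate; neither assumption is confirmed by the source, and the 31.5% input is used only as an example value.

```python
import math


def wilson_interval(pass_rate: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a pass rate observed on n questions.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    z2 = z * z
    denom = 1 + z2 / n_questions
    center = (pass_rate + z2 / (2 * n_questions)) / denom
    half_width = (z / denom) * math.sqrt(
        pass_rate * (1 - pass_rate) / n_questions + z2 / (4 * n_questions ** 2)
    )
    return center - half_width, center + half_width


for n in (50, 129):
    low, high = wilson_interval(0.315, n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
```

With these assumptions, a 31.5% score on 50 questions is compatible with a true rate anywhere from roughly 20% to 45%, while 129 questions narrows that to roughly 24% to 40%. Small gaps between models on the 50-question subset may therefore not be meaningful on their own.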

Source check

Confirmed by:

OpenAI's June 30 announcement states that GeneBench-Pro measures AI agents on judgment-heavy computational biology tasks.
OpenAI's announcement describes 129 problems across 10 domains and 21 subdomains, with deterministic grading based on synthetic data generation.
OpenAI reports GPT-5.6 Sol at 28.7% at the highest reasoning level and 31.5% with Pro mode enabled.

Early signal / context:

OpenAI says 10 representative questions are being open-sourced and a 50-question subset will go to Artificial Analysis, so independent benchmark reporting is still a watch item.
Gigazine confirms public coverage of the release and summarizes the main capability claim, but does not replace OpenAI's paper or future third-party evaluation.

FAQ

What is GeneBench-Pro?

GeneBench-Pro is OpenAI's benchmark for testing whether AI agents can make judgment-heavy computational biology decisions from messy datasets.

Sources & links

References, demos, and supporting links.

OpenAI GeneBench-Pro announcement (openai.com, primary source)
OpenAI GeneBench-Pro paper (cdn.openai.com)
Gigazine independent coverage (gigazine.net)