Has this been peer reviewed?

No. The source is an arXiv preprint from July 2, 2026, so the results should be read as early-stage research.

What is new here?

The focus is not the agent's final answer, but the target, evidence, and validation records stored inside the workspace.

Can this replace human reviewers?

No. The workflow can structure replication work, but humans still need to judge target selection, acceptance rules, and deviations.

Coding agents replicate ML papers with checkable evidence

What this is about

On July 2, 2026, Atharva Hans and Ilias Bilionis submitted “Coding-agents can replicate scientific machine learning papers” to arXiv. The paper is not another chatbot demo. It asks a concrete question: can a coding agent rebuild scientific machine-learning papers in a way that leaves evidence humans can inspect later?

The core is a workflow called Paper-replication. It makes the agent turn paper claims into individual targets, reconstruct experiments, store results with provenance, and mark work as complete only after validation checks pass.

What Paper-replication actually does

The workflow treats a paper not as text to summarize, but as a set of claims that can be checked. A statement such as “the relative error is below 5 percent” becomes a target. The agent then has to document the method, data flow, execution, and comparison so that a human can follow the trail afterwards.
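To make the idea of a claim becoming a target concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Target` schema, the field names, the choice of a relative L2 error, and the sample numbers are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One checkable claim taken from a paper (hypothetical schema)."""
    target_id: str
    claim: str
    max_relative_error: float  # 0.05 encodes "below 5 percent"


def relative_l2_error(predicted, reference):
    """Relative L2 error of predicted values against reference values."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    if den == 0:
        raise ValueError("relative error is undefined for an all-zero reference")
    return num / den


def check_target(target, predicted, reference):
    """Return a record that keeps the measured number, not only a pass/fail flag."""
    err = relative_l2_error(predicted, reference)
    return {
        "target_id": target.target_id,
        "claim": target.claim,
        "relative_error": err,
        "passed": err < target.max_relative_error,
    }


# Illustrative values only; they do not come from any of the replicated papers.
target = Target("T1", "the relative error is below 5 percent", 0.05)
record = check_target(target, predicted=[1.0, 2.0, 3.0], reference=[1.0, 2.0, 3.1])
print(record["passed"], round(record["relative_error"], 4))
```

Keeping the measured error in the record, rather than only a boolean, is what lets a later reader see how close a run came to the threshold.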

The authors implement this as a coding-agent skill. They published a GitHub repository, twelve generated case-study workspaces, analysis scripts, and skill files. In the evaluation, twelve independent replications ran across four scientific machine-learning papers. All twelve workspaces passed the completion gate, and all 158 recorded targets were matched with report coverage.
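A completion gate of this kind can be pictured with a short Python sketch. The rule below, where a workspace counts as complete only if every recorded target has the status "matched" and its ID appears in the report text, is an assumption made for illustration; the actual gate in the authors' skill files may apply different or additional checks.

```python
import re


def completion_gate(targets, report_text):
    """Return (ok, problems) for a workspace.

    Assumed rule: every target must be 'matched' and cited by ID in the report.
    """
    problems = []
    if not targets:
        problems.append("no targets recorded")
    for t in targets:
        if t["status"] != "matched":
            problems.append(f"{t['id']}: status is {t['status']!r}")
        # Match the ID as a whole word so that T1 is not satisfied by T10.
        if not re.search(rf"\b{re.escape(t['id'])}\b", report_text):
            problems.append(f"{t['id']}: not covered in report")
    return (not problems, problems)


# Hypothetical targets and report text.
targets = [
    {"id": "T1", "status": "matched"},
    {"id": "T2", "status": "unresolved"},
]
report = "T1 is supported by outputs/error_table.csv."

ok, problems = completion_gate(targets, report)
print(ok)
for line in problems:
    print(line)
```

Returning the list of problems alongside the verdict means a failed gate explains itself instead of simply blocking the run.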

Why it matters

Many AI demos end with a confident sentence: “I replicated it.” That is not very useful for science if nobody can see which number, script, or comparison supports the claim. This paper moves the standard from the agent’s final message to workspace evidence.

That fits a broader shift. In 2026, Nature published The AI Scientist, showing that agentic systems can already build full research pipelines. At the same time, researchers warn that automated science without inspectable checks can add noise to the literature. Paper-replication is interesting because it does not claim perfect reproduction. It builds an auditable file: which targets were chosen, which evidence was accepted, and where the runs diverged.

In plain language

Imagine an intern has to bake a complicated cake from a cookbook. A weak check is the intern saying at the end that the cake worked. A stronger check is recording every ingredient, photographing each step, measuring time and temperature, and comparing the result with the recipe. Paper-replication applies that second approach to coding agents.

A practical example

A research team reads a paper claiming that a physics model has an error below 5 percent. The agent turns that claim into a target, reconstructs the training run, stores scripts and outputs, and writes into the report which file supports the comparison. If a review team later checks 20 such targets, it does not have to trust the chat transcript. It can open the evidence chain target by target.
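One way a reviewer could walk such an evidence chain is sketched below in Python. It assumes, hypothetically, that each target records the file that supports it together with a SHA-256 hash taken when the result was stored; the file names, the record layout, and the sample value are invented for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def audit_evidence(workspace: Path, evidence: list[dict]) -> list[str]:
    """Return findings; an empty list means every recorded link still holds."""
    findings = []
    for entry in evidence:
        path = workspace / entry["file"]
        if not path.is_file():
            findings.append(f"{entry['target_id']}: missing {entry['file']}")
        elif sha256_of(path) != entry["sha256"]:
            findings.append(
                f"{entry['target_id']}: {entry['file']} changed since it was recorded"
            )
    return findings


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    result_file = workspace / "error_table.csv"
    # Illustrative content only, not a result from the paper.
    result_file.write_text("metric,value\nrelative_error,0.031\n")

    evidence = [
        {"target_id": "T1", "file": "error_table.csv", "sha256": sha256_of(result_file)},
        {"target_id": "T2", "file": "missing_plot.png", "sha256": ""},
    ]
    findings = audit_evidence(workspace, evidence)

print(findings)
```

The check says nothing about whether a number is scientifically right. It only confirms that the file a target points to exists and is unchanged, which is the precondition for a human to judge the comparison.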

In a realistic lab run, that could mean four papers, 30 to 50 technical claims per paper, twelve agent runs, and one folder per run. The important advance is not that every number becomes identical. The important advance is that differences remain visible instead of being hidden inside a smooth summary.

Scope and limits

The study covers four papers and twelve runs. That is a useful start, not proof that the method works across every field.
“Matched” means the recorded target is covered under the workflow’s rules with report evidence. It does not mean every possible claim in the source paper was replicated.
Replication still involves judgment. The authors report differences across runs in target decomposition, numerical fidelity, elapsed time, and acceptance rules.

SEO & GEO keywords

Coding agents, scientific replication, machine learning papers, arXiv 2607.02134, Paper-replication, Codex Skills, Claude Code Skills, reproducible research, AI in science, evidence workflow
