Agent-Written Pull Requests Need a Review Contract

TL;DR: Agent-written pull requests are turning code review into a context problem. The winning teams will not just ask agents for cleaner diffs. They will require every agent PR to carry the decision context a human reviewer needs: source of intent, scope boundary, evidence, risk, and the exact judgment being requested.
The strange thing about coding agents is how quickly they made a pull request feel less like the start of engineering work and more like the handoff after work already happened somewhere else.
For years, a PR was the first visible artifact after a human understood the task, made tradeoffs, wrote the code, ran the tests, and decided the change was worth review. Now that artifact can arrive from an agent that was handed an issue, given repository access, and allowed to work in the background.
That is not science fiction anymore. OpenAI describes Codex as a cloud software engineering agent that can work on multiple tasks in parallel, edit files, run tests, provide evidence from terminal logs, and propose pull requests for review.[1] GitHub describes Copilot cloud agent as an autonomous GitHub Actions-powered environment that can create a plan, change code on a branch, and optionally open a pull request.[2]
So the question changes.
It is no longer only "can AI write code?"
Mostly, yes.
The better question is: what does a human need in order to trust the pull request?
The Pull Request Is Becoming an Agent Handoff
The simple version of the workflow sounds clean.
Assign an issue. Wait. Review the pull request. Merge if it looks good.
That framing is useful because it makes agent delegation feel approachable. It is also missing the part where real engineering teams make decisions.
A pull request is not just a diff. It is a request for judgment. It says: this change solves the right problem, fits the current direction, respects the system, and deserves to enter the product.
Humans used to carry a lot of that judgment context in memory. The developer remembered the planning call. The reviewer remembered the tradeoff from last sprint. The PM remembered the customer thread. The engineering manager remembered why a similar idea was cut three months ago.
Agent workflows expose how fragile that was.
An agent can read repository instructions. It can follow an AGENTS.md. It can inspect tests and logs. It can work through a scoped implementation. But unless the review context is explicit, the agent starts from a smaller world than the team actually uses to decide whether work should ship.
That smaller world is usually the ticket.
And the ticket is often too small.
The Evidence Says Review Is Still Human
The current research points in a useful direction: agents are increasingly capable of getting work into PR flows, but merge authority still sits with people.
One 2026 study of 29,585 pull requests across OpenAI Codex, Copilot, Devin, Cursor, and Claude Code makes the split concrete: agents initiated more than 96% of "Collaborator" PRs, but less than 0.1% of merges were authorized by agents or automation without explicit human approval.[3]
That is the shape of the next few years: more agent-initiated work, but humans still carrying the judgment layer.
Another MSR 2026 study comparing AI coding agents found that task type influenced acceptance rates more than agent brand did. Documentation tasks reached 82.1% acceptance, while new features reached 66.1%.[4] That should not surprise anyone who has reviewed real product work. The harder a task is to judge against intent, hidden constraints, edge cases, and user impact, the less useful raw implementation speed becomes.
Failure studies make the same point from the other side. A January 2026 arXiv study of 33,000 agent-authored pull requests found that documentation, CI, and build-update tasks had the highest merge success, while performance and bug-fix tasks fared worst. PRs that were not merged tended to involve larger changes, more touched files, CI failures, weak reviewer engagement, duplicate work, unwanted features, and agent misalignment.[5]
That is not an argument against agents.
It is an argument for better handoffs.
A Bigger Prompt Is Not a Review Contract
The tempting response is to stuff more instructions into the agent.
Longer prompts. Bigger AGENTS.md files. More rules. More checklists. More warnings about what not to touch.
Some of that helps. Most of it does not scale.
OpenAI's harness engineering writeup makes the cleaner lesson explicit. The team says repository knowledge became the system of record, and that a short AGENTS.md worked better as a map than as a giant instruction manual. The deeper knowledge lived in structured docs: architecture, design docs, execution plans, generated references, and product specs.[6]
That is the right lesson for PR review too.
The agent does not need a bloated prompt that tries to remember the entire company. The reviewer does not need another generic AI summary that says files changed and tests passed. The PR needs a contract.
That contract should make five things visible before the reviewer opens the diff: where the intent came from, what scope the agent believed it had, what evidence it gathered, what risks remain, and what decision it wants from the human.
If any of those are missing, review turns into archaeology.
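Here is a minimal sketch of that contract as structured data. Everything in it is hypothetical: the ReviewContract type and its field names are illustrative, not a standard or an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewContract:
    """Hypothetical structure for the context an agent PR should carry."""

    # Where the intent came from: an issue, spec, escalation, failing test...
    source_of_intent: str = ""      # e.g. "bug report: ISSUE-1423"
    # What the agent believed it was allowed to touch, and what it left alone.
    in_scope: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)
    # Claims tied to inspectable artifacts: test names, log paths, screenshots.
    evidence: dict[str, str] = field(default_factory=dict)
    # What could still go wrong, stated plainly.
    risks: list[str] = field(default_factory=list)
    # The exact judgment being requested from the human reviewer.
    decision_requested: str = ""

    def missing_fields(self) -> list[str]:
        """Return the contract sections a reviewer would have to reconstruct."""
        missing = []
        if not self.source_of_intent:
            missing.append("source_of_intent")
        if not self.in_scope:
            missing.append("in_scope")
        if not self.evidence:
            missing.append("evidence")
        if not self.risks:
            missing.append("risks")
        if not self.decision_requested:
            missing.append("decision_requested")
        return missing
```

A CI step could refuse to mark an agent PR ready for review until missing_fields() comes back empty, turning the contract from etiquette into a gate.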
The Contract Starts With Source of Intent
The first job of an agent PR is to show where the work came from.
Not just the issue number. Not just the branch name. The actual source of intent.
Was this generated from a roadmap initiative, a bug report, a spec, a customer escalation, a failing test, a security finding, or a follow-up from a previous review?
That distinction changes the review. A bug fix should be judged against reproduction and regression risk. A product feature should be judged against user behavior and acceptance criteria. A refactor should be judged against blast radius and whether it preserves behavior. A cleanup task should be judged against scope discipline.
When every agent PR looks the same, reviewers have to rebuild the context themselves. When the source of intent is explicit, they can start with the right mental model.
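In code terms, the source of intent selects the review lens before the diff is opened. A small sketch with made-up category names:

```python
# Hypothetical mapping from source of intent to what the reviewer should
# judge the change against. The categories and criteria are illustrative.
REVIEW_LENS = {
    "bug_report": ["reproduction steps", "regression risk"],
    "product_feature": ["user behavior", "acceptance criteria"],
    "refactor": ["blast radius", "behavior preserved"],
    "cleanup": ["scope discipline"],
}


def review_checklist(source_of_intent: str) -> list[str]:
    """Pick the right mental model for the review, given where the work came from."""
    return REVIEW_LENS.get(source_of_intent, ["why does this change exist?"])
```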
This is where roadmap context matters, no product-ops sermon required. The roadmap is not decoration. It tells the reviewer whether the change belongs to an active bet, a stale idea, a local cleanup, or a customer-facing commitment.
Scope Boundary Is the Second Half of Trust
Agent-written pull requests need a sharper scope boundary than human-written ones.
Not because agents are bad. Because agents are tireless.
A human developer will often stop when the task feels complete or when the change starts becoming politically weird. An agent may keep polishing, expanding, or "helpfully" adjusting adjacent code unless the working boundary is obvious.
That makes scope notes part of the review surface. The PR should say what was intentionally changed, what was intentionally left alone, and where the agent noticed adjacent problems but did not touch them.
This one habit prevents a lot of review waste.
Reviewers should not have to guess whether a changed utility file is part of the fix or a drive-by refactor. They should not have to ask why a UI copy tweak appeared inside a backend bug fix. They should not have to reverse-engineer whether an agent followed the intended plan or wandered into a nearby improvement.
Scope is not bureaucracy. It is how humans keep the right to say, "this part is good, this part needs splitting, and this part should not be in the PR at all."
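That boundary can even be checked mechanically. A rough sketch, assuming the agent declares its scope as a list of glob patterns; the function and example paths here are hypothetical:

```python
from fnmatch import fnmatch


def out_of_scope_files(touched: list[str], declared_scope: list[str]) -> list[str]:
    """Return touched files that fall outside every declared scope pattern.

    `declared_scope` holds glob patterns the agent committed to in its
    contract, e.g. ["src/billing/*", "tests/billing/*"].
    """
    return [
        path for path in touched
        if not any(fnmatch(path, pattern) for pattern in declared_scope)
    ]


# Example: a backend bug fix that quietly edited UI copy gets flagged.
touched = ["src/billing/invoice.py", "tests/billing/test_invoice.py", "web/ui/strings.ts"]
scope = ["src/billing/*", "tests/billing/*"]
print(out_of_scope_files(touched, scope))  # ['web/ui/strings.ts']
```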
Evidence Needs to Be Reviewable, Not Decorative
Agents are good at producing evidence-looking text.
That is not the same thing as evidence.
A useful agent PR should connect claims to artifacts: tests run, logs captured, screenshots produced, type checks passed, failing cases reproduced, edge cases considered, and known gaps left open. If the agent says a test passed, the reviewer should be able to see which test. If it says a risk is low, the reviewer should know what assumption that confidence depends on.
Codex already points in this direction by providing evidence from terminal logs and test outputs so users can trace what happened during task completion.[1] That is the right instinct. Evidence should make the review shorter because the reviewer can verify the path, not just read the conclusion.
The bar is not "the agent sounded confident."
The bar is "the reviewer can inspect enough to make a decision."
The Metric Is Decision Quality, Not Agent Volume
Teams will be tempted to measure agent adoption by easy counters: number of agent PRs opened, lines changed, tasks delegated, time saved, or tokens spent.
Those numbers are not useless. They are just insufficient.
If you want to know whether agent-written PRs are improving software delivery, measure the quality of the human decision surface around them.
How often does an agent PR arrive with enough source context for review?
How often does a reviewer have to ask, "why are we doing this?"
How often does generated work map cleanly to the intended initiative, spec, bug, or customer problem?
How often do agent PRs pass CI but fail product review?
How much human time is spent reconstructing context before merge?
Those are the signals that show whether agents are improving the system or just increasing throughput inside a broken one.
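None of these signals require new infrastructure to compute. A sketch over hypothetical PR records, with illustrative field names rather than any platform's real API:

```python
# Each record describes one agent-written PR; the fields are made up.
prs = [
    {"has_source_context": True,  "why_questions": 0, "ci_passed": True, "product_review_passed": True},
    {"has_source_context": False, "why_questions": 3, "ci_passed": True, "product_review_passed": False},
    {"has_source_context": True,  "why_questions": 1, "ci_passed": True, "product_review_passed": True},
]

total = len(prs)
context_rate = sum(p["has_source_context"] for p in prs) / total
why_rate = sum(p["why_questions"] > 0 for p in prs) / total
ci_pass_product_fail = sum(p["ci_passed"] and not p["product_review_passed"] for p in prs) / total

print(f"arrived with source context: {context_rate:.0%}")               # 67%
print(f"triggered a 'why' question: {why_rate:.0%}")                    # 67%
print(f"passed CI, failed product review: {ci_pass_product_fail:.0%}")  # 33%
```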
Atlassian's 2025 State of Developer Experience report already points at the broader tension. The company surveyed 3,500 developers and managers and found that while more teams believe AI is giving them time back, they are also reporting greater organizational inefficiencies.[7] That is the paradox. AI can reduce local work and still increase system confusion if handoffs stay vague.
More agent work is not automatically more progress.
Better review contracts are progress.
Where One Horizon Fits
One Horizon is not useful here because it can say "AI" near a pull request.
It is useful because agent-written pull requests need context from outside the diff.
They need the task. The spec. The initiative. The previous discussion. The status signal. The reason this work mattered before an agent touched the code.
That is the layer we care about: connecting planning artifacts to execution so humans can review work against intent instead of reviewing code in a vacuum. When agents become another execution actor, the need for that shared record gets sharper. The reviewer should not have to interrogate five tools to know what decision they are being asked to make.
The next maturity step for AI coding is not "let agents merge everything."
It is much less flashy and much more valuable: make every agent-written PR easier to understand, easier to challenge, and easier to connect back to the work that matters.
That is what a review contract gives you.
And that is the kind of operating layer we are building at One Horizon.
Footnotes
1. OpenAI. "Introducing Codex." Published May 16, 2025, with later availability updates. OpenAI describes Codex as a cloud-based software engineering agent that can work on tasks in parallel, edit code, run tests, and propose pull requests for review. https://openai.com/index/introducing-codex/
2. GitHub Docs. "About GitHub Copilot cloud agent." GitHub describes Copilot cloud agent as an autonomous GitHub Actions-powered environment that can create a plan, make code changes on a branch, and optionally open a pull request. https://docs.github.com/en/copilot/concepts/agents/cloud-agent/about-cloud-agent
3. Chung and Hassan. "Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles." AIware 2026, OpenReview. Published March 28, 2026. https://openreview.net/forum?id=PXSlMpbbsn
4. Pinna, Gong, Williams, and Sarro. "Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance." MSR 2026 Mining Challenge. https://2026.msrconf.org/details/msr-2026-mining-challenge/16/Comparing-AI-Coding-Agents-A-Task-Stratified-Analysis-of-Pull-Request-Acceptance
5. Ehsani et al. "Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub." arXiv:2601.15195. Submitted January 21, 2026. https://arxiv.org/abs/2601.15195
6. OpenAI. "Harness engineering: leveraging Codex in an agent-first world." Published February 11, 2026. https://openai.com/index/harness-engineering/
7. Atlassian. "State of Developer Experience Report 2025." Atlassian surveyed 3,500 developers and managers and found teams perceive more AI time gains while reporting greater organizational inefficiencies. https://www.atlassian.com/teams/software-development/state-of-developer-experience-2025



