
Ever asked an agent to add a feature and ended up with 47 files you didn’t expect to change? Yeah… that sinking “what did it just touch?” feeling — I know it well. I’m Hanks, and I’ve been throwing AI tools into real projects for months: breaking stuff on purpose, tracking the fallout, and figuring out what actually sticks. Over the past couple of weeks, I put the Codex macOS App’s review pane to the test on real tasks, not demos, not toy examples. The question I kept asking myself: can I trust what the agent changed without reading every single line?
That’s exactly what I tested, and here’s what actually works — a workflow for scanning diffs, dropping comments, and iterating without letting things slip through to prod.

When Codex finishes a task, the review pane shows you exactly what changed. But here's the thing—not all changes need the same level of scrutiny.
I learned this the hard way after staging a refactor that looked clean in the diff but broke three integration tests I didn't know existed.
My pre-commit checklist now looks like this:
- Check the full scope of what changed, not just what the agent touched in its latest turn
- Run the whole test suite, integration tests included, not just the tests near the diff
- Flag anything suspicious with an inline comment instead of hand-editing it
- Stage only the changes I've actually verified
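The first two items have a quick terminal version for when I want a second opinion outside the app. The test command below is an assumption; substitute whatever your project actually runs:
# Quick pre-stage sanity pass:
git diff --stat     # how many files did the agent actually touch?
git diff            # read the changes themselves, not just the file list
npm test            # assumption: run the full suite, not just the tests near the diff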
The review pane defaults to showing uncommitted changes, but you can switch the scope to:
- the last turn, showing only what the agent changed in its most recent response
- all branch changes, showing everything that has accumulated since the branch diverged
This context switching is critical. I often start with "last turn" to see what the agent just did, then expand to "all branch changes" to catch cumulative drift.
One real example: I asked Codex to "add error handling to the API routes." It did—but also refactored the entire auth middleware. The "last turn" view only showed the error handling. The "all branch changes" view revealed the middleware rewrite I never asked for.
That's when I started reviewing in layers.
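Those layers have rough command-line counterparts, too. There's no exact CLI analog for "last turn," but for the other scopes this is what I reach for outside the app, assuming main is the base branch:
# Uncommitted changes: what's sitting in the working tree right now
git diff HEAD --stat
# All branch changes: everything committed since the branch diverged from main
git diff main...HEAD --stat
# Drop --stat to read the full diffs instead of the per-file summary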

The review workflow that stuck for me:
- Scan the diff: last turn first, then all branch changes
- Drop inline comments on the exact lines that need work
- Send a follow-up message telling Codex to address them and keep the scope minimal
- Verify: re-check the diff and run the tests before staging anything
Here's the part most guides skip: inline comments are anchored to specific lines, which means Codex can respond more precisely than if you just said "fix the bug."

Generic comments like "this looks wrong" get vague responses. Specific comments get fixes.
What doesn't work:
"Review this"
"Can you improve this?"
"Something's off here"
What does work: comments that name the exact problem and the fix you want, anchored to the line in question.
After leaving inline comments, I send a follow-up message like:
"Address the inline comments and keep the scope minimal."
This tells Codex to focus on the flagged issues without rewriting everything. You can also use AGENTS.md files to define team-specific review guidelines that Codex follows automatically.
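Here's a sketch of the kind of review guidelines I mean, appended from the terminal. The wording is mine, not an official format; the point is that the file lives in the repo and Codex reads it:
# Example: team review guidelines in AGENTS.md (illustrative wording)
cat >> AGENTS.md <<'EOF'
Code review guidelines:
- Keep changes scoped to the files named in the task.
- Call out any new dependency instead of adding it silently.
- Do not touch auth middleware or database queries unless the task asks for it.
EOF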
Real behavior I observed: If you leave 5 inline comments but don't send a follow-up message, Codex often ignores them. The comments are treated as review guidance, not direct instructions.
If you use the /review command, Codex will post inline comments directly in the review pane as part of its own code review process—basically reviewing its own work.
# Example: reviewing a data processing function
# Inline comment on line 23:
"This will fail if the API returns an empty array. Add a length check."
# Follow-up message in thread:
"Fix the inline comment on line 23, then run the test suite to confirm."
# Codex applies the fix and outputs test results in the same thread.
This iterative loop—comment, send instruction, verify—replaced most of my manual code edits. Instead of jumping into my editor to fix things myself, I guide Codex to fix them.

The review pane includes Git controls at three levels of granularity, and each level earns its keep in a different situation.
Example scenario: Codex refactored a utility file and added a new helper function. The refactor was good, but the helper function introduced a dependency I didn't want. So I staged the hunks from the refactor and reverted the helper.
This granular control means you don't have to accept or reject entire files—you can carve out the parts you trust.
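The same carve-out works from the command line if you ever need it; the file path here is made up for illustration:
# Keep the refactor, drop the unwanted helper (hypothetical path):
git add -p src/utils.js        # interactively stage only the refactor hunks
git restore -p src/utils.js    # interactively discard the unstaged helper hunks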

Git operations you can do without leaving the app:
# These all work from the review pane UI:
- Stage changes (selective or all)
- Unstage changes
- Revert to last commit
- Commit with message
- Push to remote
- Create pull request
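If you want to know what those buttons map to, or need to reproduce a step outside the app, the rough git equivalents look like this (gh is the GitHub CLI, my assumption for the pull request step):
# Rough CLI equivalents of the review pane controls:
git add -p                                        # stage selectively (-A for everything)
git restore --staged .                            # unstage
git restore --source=HEAD --staged --worktree .   # revert everything to the last commit
git commit -m "feat: add error handling to API routes"
git push -u origin HEAD
gh pr create --fill                               # open the pull request (GitHub CLI)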
The commit message field appears after staging. I typically write something like:
"feat: add error handling to API routes
Applied Codex suggestions with manual review.
Verified: unit tests pass, no config drift."
Including "Applied Codex suggestions" in commit messages helps when you're tracing back why certain changes were made.
After running hundreds of agent tasks, I identified patterns that always need manual review—no exceptions.
One time I almost shipped a disaster: Codex "optimized" a database query by removing a JOIN. The code looked cleaner. Tests passed (because they used mocked data).
In production, it would've caused N+1 queries and crashed the API under load.
The stop sign? Any change to ORM queries or raw SQL goes through manual load testing. This is especially critical with GPT-5.2-Codex, which can generate sophisticated code changes that require human validation for production systems.
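My "manual load testing" is deliberately low-tech. A quick smoke test with ApacheBench against a dev server is usually enough to surface an N+1 regression; the endpoint below is hypothetical:
# Hammer the endpoint the changed query serves (values are illustrative):
ab -n 500 -c 25 http://localhost:3000/api/orders
# Watch the database query log while it runs: an N+1 shows up as hundreds of
# near-identical SELECTs where there used to be one joined query.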
Before merging agent changes that touch these areas, I:
- run the relevant tests against realistic data, not just mocks
- load test anything that changes ORM queries or raw SQL
- read the diff line by line instead of skimming the summary
The review pane makes this easier because you can:
- switch scopes to catch cumulative drift, not just the last turn
- flag suspect lines with inline comments before anything gets staged
- stage only the hunks you've verified and revert the rest
When to reject entirely and start over: If more than 30% of the diff needs inline comments, the agent probably misunderstood the task. Better to clarify the prompt and re-run than to patch dozens of issues.
The Codex app launched February 2, 2026 for macOS. If you're testing agent-driven development inside real projects—not just demos—the review pane is where you'll spend most of your time after the agent finishes a task.
At Macaron, we've built workflows where agents handle multi-step tasks without breaking user context. If you're running similar experiments—where AI handles execution and you handle judgment—the review pane pattern (diff → comment → iterate) maps directly to how we structure task handoffs.
Try it with a real task. The review pane shows you exactly where the agent stayed on track and where it drifted.