
If you're the kind of developer who's tired of babysitting AI agents through multi-hour refactors, only to watch them lose context and drift into nonsense halfway through—yeah, I've been there too.
I've tested every major coding agent over the last year. Most of them break down the same way: they start strong, maintain coherence for 20-30 minutes, then quietly lose track of what they're supposed to be doing. You don't notice until you review the output and realize the last hour of work is unusable.
GPT-5.3 Codex dropped on February 5th, 2026, with OpenAI claiming it could handle tasks spanning days. I didn't believe it. So I threw it at the kind of work that usually exposes these models: dependency migrations, cross-file refactors, and test rewrites—the stuff that requires holding architectural context for hours without breaking.
Here's what I learned about actually using it for long-running tasks, including where it breaks and what "steering" actually means in practice.

Let's define terms. When I say "long-running," I'm not talking about a 10-minute script generation. I mean work that keeps an agent busy for hours: the dependency migrations, cross-file refactors, and test rewrites from my testing, like the Express migration and the React hooks migration you'll see throughout this piece.
The common thread: these aren't tasks you can just "prompt and forget." They require the agent to make consistent architectural decisions, remember constraints across hundreds of intermediate steps, and adapt when it hits blockers.
Most agents fail around the 90-minute mark. Context starts degrading, earlier decisions get forgotten, and you end up with code that looks correct but doesn't compile because the agent lost track of which interfaces it already changed.
GPT-5.3 Codex handled the Express migration without major context loss. It did lose track once—forgot that one route handler needed special error handling—but caught and fixed it during test execution. That's what "long-running" means: can it maintain the thread over hours, not minutes?

This is where most people screw up. They treat Codex like a chatbot: vague request → hope for the best. That doesn't work for multi-hour tasks.
What I learned: the quality of your task framing directly determines whether Codex finishes or spirals.
Here's my structure for tasks that actually complete:
1. Define the success state in executable terms
Bad prompt:
"Refactor this codebase to be more maintainable"
Good prompt:
"Migrate all database queries in /src/db/ to use the connection pool pattern.
Success criteria:
- All 23 query functions use pool.query() instead of direct connections
- Tests pass (npm test)
- No memory leaks in stress test (npm run stress-test)
- Execution time per query stays under 50ms"
The difference: the second prompt gives Codex a testable definition of "done." It knows when to stop, what to verify, and how to measure success.
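For reference, here's roughly the target shape of that migration; a minimal sketch assuming node-postgres (pg), with getUserById standing in as one hypothetical query function out of the 23.

import { Pool } from "pg";

// One shared pool for the whole module, sized for the app's concurrency needs.
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

// Previously each function did: new Client(...) -> connect() -> query() -> end().
export async function getUserById(id: string) {
  const result = await pool.query("SELECT * FROM users WHERE id = $1", [id]);
  return result.rows[0];
}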
2. Provide architectural guardrails
When working across multiple files, Codex needs to know what not to change. I learned this the hard way when it rewrote my entire auth middleware because I didn't specify boundaries.
Now I always include:
"Do NOT modify:
- /src/auth/* (authentication logic is frozen)
- /src/config/* (configuration is environment-specific)
- Any file with a @preserve comment at the top
DO modify:
- /src/api/routes/* (this is what needs updating)
- /tests/api/* (update tests to match new structure)"
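Guardrails in the prompt are only half of it; I also check them mechanically after the run. Here's a small sketch (my own script, not a Codex feature) that fails if the branch touched any frozen path, assuming the task ran on a branch cut from main:

import { execSync } from "node:child_process";

// Paths the prompt declared frozen; keep this list in sync with the prompt.
const frozenPaths = ["src/auth/", "src/config/"];

// Files changed on this branch relative to main.
const changed = execSync("git diff --name-only main...HEAD", { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

const violations = changed.filter((file) =>
  frozenPaths.some((frozen) => file.startsWith(frozen))
);

if (violations.length > 0) {
  console.error("Frozen files were modified:", violations);
  process.exit(1);
}

console.log(`Guardrails respected: ${changed.length} files changed, none frozen.`);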
3. Specify verification steps
This is critical. You need to tell Codex how to verify its own work, because it will make mistakes.
Template I use:
"After completing each phase:
1. Run linter (npm run lint)
2. Run tests (npm test)
3. If tests fail, analyze the failure and fix
4. Document what changed in CHANGELOG.md
5. Confirm no TypeScript errors (tsc --noEmit)"
On the React hooks migration, Codex caught 6 bugs during its own test runs that would otherwise have surfaced in my code review. It's not that it doesn't make errors—it does—but giving it verification steps means it finds and fixes them before you see the output.
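To make "run the checks" unambiguous, I sometimes hand Codex the verification steps as a script rather than prose; a sketch that assumes the npm scripts from the template exist in package.json:

import { execSync } from "node:child_process";

// The verification steps from the template, in executable order.
const steps = ["npm run lint", "npm test", "npx tsc --noEmit"];

for (const cmd of steps) {
  console.log(`\n>>> ${cmd}`);
  try {
    execSync(cmd, { stdio: "inherit" });
  } catch {
    // Non-zero exit: stop here so the failure gets analyzed and fixed first.
    console.error(`Verification failed at: ${cmd}`);
    process.exit(1);
  }
}

console.log("\nAll verification steps passed.");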

The biggest mistake I made early: trying to hand Codex an entire refactor as one giant task.
It works better when you break work into phases with checkpoints.
Here's how I structure it now:
Phase 1: Analysis and planning (15-30 minutes)
"Analyze the current codebase structure in /src/payments/.
Identify all files that handle payment processing.
Map dependencies between these files.
List any external APIs or services they call.
Generate a migration plan that maintains backward compatibility."
Codex outputs a plan. I review it. If the plan makes sense, I give it the green light to execute. If not, I correct the plan before it writes any code.
Phase 2: Isolated changes (1-2 hours per subsystem)
"Implement Phase 1 of the migration plan: payment validation logic.
Work only on files in /src/payments/validation/.
Keep /src/payments/processing/ unchanged for now.
Run validation tests after each file.
Stop if any test fails and report the failure."
This isolates risk. If something breaks, it's contained to one subsystem. I can review that subsystem's changes, approve them, then move to the next phase.
Phase 3: Integration and verification (30-60 minutes)
"Now integrate the updated validation logic with processing.
Update /src/payments/processing/ to use the new validation interface.
Run full test suite.
Fix any integration bugs.
Verify backward compatibility with existing API contracts."
By the time I reach Phase 3, the validation logic is already reviewed and working. Codex is just connecting pieces that are known to work independently.
This phased approach cut my review time by 60%. Instead of reviewing 2,000 lines of changes all at once, I review 400 lines at a time, approve, then move forward.

Alright, here's the part that confused me at first: when do you let Codex work independently, and when do you jump in?
I tested this systematically. Ran 15 tasks, varied my intervention frequency, tracked completion rate and code quality.
What I found: intervening too early destroys productivity. Intervening too late produces garbage.
The pattern that worked:
Let it run for the first 30-45 minutes without interruption. This is when Codex is loading context, exploring the codebase, and forming its execution plan. If you interrupt during this phase, it loses coherence. I know it's tempting to course-correct early mistakes, but resist.
Check progress at natural breakpoints. Codex surfaces progress updates every 10-15 minutes (you can see these in the Codex app interface). These are your intervention opportunities. If it's on track, let it continue. If it's drifting, redirect.
Intervene when you see architectural drift. This is the failure mode that matters. If Codex starts making decisions that conflict with your system's architecture, stop immediately. Example: I was running a database migration and Codex decided to introduce a new ORM layer. That wasn't in scope. I stopped, clarified constraints, and it corrected course.
Don't interrupt for minor code style issues. If it's using let instead of const, or naming variables differently than you would—let it finish. You can clean that up in review. Interrupting for style breaks its flow.
From my 15-task experiment, the breakpoint strategy produced the best results: intervene only when Codex surfaces a decision point or when you notice architectural drift. It kept Codex in flow while preventing major mistakes.
The steering feature became stable in the latest release. Here's how it works in practice.
When Codex is running, you can send corrections without killing the task. In the Codex CLI, pressing Enter during a running task sends a steering message immediately. Tab queues the message for after the current step finishes.
I use steering for three things:
1. Constraint clarification
Mid-task message:
"Actually, keep the existing error handling pattern.
Don't introduce new try-catch blocks unless the current code doesn't handle errors at all."
Codex adjusts without losing progress on the rest of the refactor.
2. Scope adjustment
Mid-task message:
"Skip the test file updates for now.
Focus on getting the implementation working first."
This speeds up iteration when I realize I defined scope too broadly.
3. Bug correction
Mid-task message:
"The TypeScript error on line 47 is because you're using the old interface.
Import the new PaymentRequest type from @/types/payment."
Codex incorporates the fix and continues. Without steering, I'd have to wait for it to finish, review all the code, then restart with corrections.
Real example from my Express migration:
I was 90 minutes in when Codex hit an error: one of the legacy route handlers used a callback-based auth check that wouldn't work with async/await. It paused and asked how to handle it.
I sent: "Wrap the auth check in a Promise and await it. Pattern: await new Promise((resolve, reject) => authCheck(req, (err, user) => err ? reject(err) : resolve(user)))"
Codex applied that pattern to all 7 affected routes, ran the tests, confirmed they passed, and kept going. The entire correction took 3 minutes instead of requiring a full restart.
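Spelled out as a full handler, the pattern looked roughly like this; a sketch assuming Express types, with authCheck and the route body as hypothetical stand-ins for the real code:

import type { Request, Response } from "express";

type User = { id: string };

// Hypothetical legacy check with a Node-style callback signature.
function authCheck(req: Request, cb: (err: Error | null, user?: User) => void): void {
  if (req.headers.authorization) {
    cb(null, { id: "user-from-token" });
  } else {
    cb(new Error("unauthorized"));
  }
}

// The wrapper from the steering message, applied to one route handler.
export async function getOrders(req: Request, res: Response) {
  try {
    const user = await new Promise<User>((resolve, reject) =>
      authCheck(req, (err, user) => (err ? reject(err) : resolve(user as User)))
    );
    res.json({ userId: user.id, orders: [] }); // placeholder payload
  } catch {
    res.status(401).json({ error: "unauthorized" });
  }
}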
The steering feature is what makes GPT-5.3 Codex different from earlier versions. Before this, mid-task corrections meant stopping, losing context, and starting over. Now it's more like working with a junior developer: you course-correct in real time, they adjust, and work continues.
Let's talk about the review process, because this is where most of my time goes.
When Codex finishes a multi-hour task, you're looking at potentially hundreds of file changes. You can't review every line—that defeats the purpose of automation. But you also can't merge blindly.
Here's the review workflow I've settled on:
1. Review the architectural decisions first
Before looking at individual files, I scan for structural changes.
These are the changes that propagate. If Codex made a bad architectural call, it affects everything downstream. Catch it here before reviewing implementation details.
Tool: I use git diff --stat to see which files changed and by how much. If I see 500+ line changes in a core module, I review that first.
2. Run the automated verification
Don't trust that Codex ran the tests correctly. Run them yourself.
npm run lint
npm test
npm run type-check
npm run build
On the React migration, tests passed in Codex's environment but failed in mine because of a dependency version mismatch. Caught it because I ran tests independently.
3. Spot-check high-risk areas
I don't review every file. I focus on the high-risk areas.
For the Express migration, I reviewed all 12 async database calls in detail. Everything else got a quick scan.
4. Use diffs intelligently
The Codex app's diff view groups changes by type: new files, modified files, deleted files. I review them in that order: new files first, then modifications, then deletions.
5. Test in isolation
For any subsystem Codex touched, I test it separately before merging:
# Isolate the payment module
cd src/payments
npm test -- --coverage
# Check integration points
npm run test:integration -- --grep "payment"
# Load test if performance-critical
npm run load-test:payments
This caught a race condition in the async migration that didn't show up in unit tests.
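If you want to flush out that kind of race deliberately, a concurrency-heavy test helps; a sketch in Vitest syntax, where processPayment and getBalance are hypothetical exports of the payments module:

import { describe, expect, it } from "vitest";
import { getBalance, processPayment } from "../src/payments"; // hypothetical exports

describe("payment processing under concurrency", () => {
  it("keeps the balance consistent when payments land in parallel", async () => {
    const accountId = "acct-test";
    const before = await getBalance(accountId);

    // Fire 20 payments at once; shared-state races tend to show up here,
    // not in sequential unit tests. Assumes each payment debits 10.
    await Promise.all(
      Array.from({ length: 20 }, () => processPayment(accountId, 10))
    );

    expect(await getBalance(accountId)).toBe(before - 20 * 10);
  });
});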
Review time scales with the size of the task, but even a 2-3 hour review for a task that took Codex 6 hours is a massive time save. I didn't write any code. I just verified that what it wrote is correct.
One more thing: when I find bugs during review, I don't fix them manually. I feed them back to Codex:
"Found a bug in src/payments/validate.ts line 34:
You're calling processPayment before validating the amount.
This will throw if amount is undefined.
Fix: move validation check before processPayment call."
Codex fixes it in 2 minutes. This keeps the entire task in one context instead of fragmenting across manual edits and agent-generated code.
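The shape of that fix is simple; a sketch with PaymentInput, validateAmount, and processPayment as hypothetical stand-ins for the real module:

type PaymentInput = { amount?: number; currency: string };

function validateAmount(input: PaymentInput): number {
  if (input.amount === undefined || input.amount <= 0) {
    throw new Error("Invalid payment amount");
  }
  return input.amount;
}

async function processPayment(amount: number, currency: string): Promise<void> {
  // ...call the payment provider with amount and currency...
}

// Before the fix, processPayment ran first and threw on undefined amounts.
// After the fix, validation happens up front and the failure is explicit.
export async function handlePayment(input: PaymentInput) {
  const amount = validateAmount(input);
  await processPayment(amount, input.currency);
}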
After running 15 long-running tasks through GPT-5.3 Codex, here's what stays in my workflow:
Dependency migrations: Codex handles these better than I do. Less error-prone, faster, and it actually reads changelogs to avoid breaking changes.
Test rewrites: When I need to migrate from Jest to Vitest, or update test patterns, Codex is faster and more thorough than doing it manually.
Boilerplate expansion: If I need to add the same pattern across 30 files (logging, error handling, metrics), Codex does it in 20 minutes instead of my spending an afternoon.
What I'm not delegating yet:
System design decisions. Codex can implement an architecture, but I'm not trusting it to invent one from scratch.
Security-critical code. I still write auth, payment, and PII handling logic myself. Too much at stake.
Greenfield projects with unclear requirements. Codex needs a defined target. If requirements are fuzzy, the back-and-forth takes longer than just coding.
The pattern I've settled on: I design the structure, define success criteria, and specify constraints. Codex executes. I review architecture-level decisions and test the output. If it passes, I ship.
It's not replacing me. It's compressing the execution layer so I can spend more time thinking about what to build instead of how to implement it.
At Macaron, we handle exactly this workflow—structured task delegation that runs without constant supervision. If you want to test how your multi-hour tasks hold up when you're not babysitting every step, try it with a real refactor and see if context actually stays intact.