
If you're the kind of developer who's tired of babysitting AI agents through multi-hour refactors, only to watch them lose context and drift into nonsense halfway through—yeah, I've been there too.
I've tested every major coding agent over the last year. Most of them break down the same way: they start strong, maintain coherence for 20-30 minutes, then quietly lose track of what they're supposed to be doing. You don't notice until you review the output and realize the last hour of work is unusable.
GPT-5.3 Codex dropped on February 5th, 2026, with OpenAI claiming it could handle tasks spanning days. I didn't believe it. So I threw it at the kind of work that usually exposes these models: dependency migrations, cross-file refactors, and test rewrites—the stuff that requires holding architectural context for hours without breaking.
Here's what I learned about actually using it for long-running tasks, including where it breaks and what "steering" actually means in practice.

Let's define terms. When I say "long-running," I'm not talking about a 10-minute script generation. I mean work that keeps an agent busy for hours: the dependency migrations, cross-file refactors, and test rewrites from my testing, like the Express migration and the React hooks migration you'll see throughout this piece.
The common thread: these aren't tasks you can just "prompt and forget." They require the agent to make consistent architectural decisions, remember constraints across hundreds of intermediate steps, and adapt when it hits blockers.
Most agents fail around the 90-minute mark. Context starts degrading, earlier decisions get forgotten, and you end up with code that looks correct but doesn't compile because the agent lost track of which interfaces it already changed.
GPT-5.3 Codex handled the Express migration without major context loss. It did lose track once—forgot that one route handler needed special error handling—but caught and fixed it during test execution. That's what "long-running" means: can it maintain the thread over hours, not minutes?

This is where most people screw up. They treat Codex like a chatbot: vague request → hope for the best. That doesn't work for multi-hour tasks.
What I learned: the quality of your task framing directly determines whether Codex finishes or spirals.
Here's my structure for tasks that actually complete:
1. Define the success state in executable terms
Bad prompt:
"Refactor this codebase to be more maintainable"
Good prompt:
"Migrate all database queries in /src/db/ to use the connection pool pattern.
Success criteria:
- All 23 query functions use pool.query() instead of direct connections
- Tests pass (npm test)
- No memory leaks in stress test (npm run stress-test)
- Execution time per query stays under 50ms"
The difference: the second prompt gives Codex a testable definition of "done." It knows when to stop, what to verify, and how to measure success.
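For reference, here's roughly the target shape of that migration; a minimal sketch assuming node-postgres (pg), with getUserById standing in as one hypothetical query function out of the 23.

import { Pool } from "pg";

// One shared pool for the whole module, sized for the app's concurrency needs.
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

// Previously each function did: new Client(...) -> connect() -> query() -> end().
export async function getUserById(id: string) {
  const result = await pool.query("SELECT * FROM users WHERE id = $1", [id]);
  return result.rows[0];
}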
2. Provide architectural guardrails
When working across multiple files, Codex needs to know what not to change. I learned this the hard way when it rewrote my entire auth middleware because I didn't specify boundaries.
Now I always include:
"Do NOT modify:
- /src/auth/* (authentication logic is frozen)
- /src/config/* (configuration is environment-specific)
- Any file with a @preserve comment at the top
DO modify:
- /src/api/routes/* (this is what needs updating)
- /tests/api/* (update tests to match new structure)"
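Guardrails in the prompt are only half of it; I also check them mechanically after the run. Here's a small sketch (my own script, not a Codex feature) that fails if the branch touched any frozen path, assuming the task ran on a branch cut from main:

import { execSync } from "node:child_process";

// Paths the prompt declared frozen; keep this list in sync with the prompt.
const frozenPaths = ["src/auth/", "src/config/"];

// Files changed on this branch relative to main.
const changed = execSync("git diff --name-only main...HEAD", { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

const violations = changed.filter((file) =>
  frozenPaths.some((frozen) => file.startsWith(frozen))
);

if (violations.length > 0) {
  console.error("Frozen files were modified:", violations);
  process.exit(1);
}

console.log(`Guardrails respected: ${changed.length} files changed, none frozen.`);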
3. Specify verification steps
This is critical. You need to tell Codex how to verify its own work, because it will make mistakes.
Template I use:
"After completing each phase:
1. Run linter (npm run lint)
2. Run tests (npm test)
3. If tests fail, analyze the failure and fix
4. Document what changed in CHANGELOG.md
5. Confirm no TypeScript errors (tsc --noEmit)"
On the React hooks migration, Codex caught 6 bugs during its own test runs that would otherwise have surfaced in my code review. It's not that it doesn't make errors—it does—but giving it verification steps means it finds and fixes them before you see the output.
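To make "run the checks" unambiguous, I sometimes hand Codex the verification steps as a script rather than prose; a sketch that assumes the npm scripts from the template exist in package.json:

import { execSync } from "node:child_process";

// The verification steps from the template, in executable order.
const steps = ["npm run lint", "npm test", "npx tsc --noEmit"];

for (const cmd of steps) {
  console.log(`\n>>> ${cmd}`);
  try {
    execSync(cmd, { stdio: "inherit" });
  } catch {
    // Non-zero exit: stop here so the failure gets analyzed and fixed first.
    console.error(`Verification failed at: ${cmd}`);
    process.exit(1);
  }
}

console.log("\nAll verification steps passed.");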

The biggest mistake I made early: trying to hand Codex an entire refactor as one giant task.
It works better when you break work into phases with checkpoints.
Here's how I structure it now:
Phase 1: Analysis and planning (15-30 minutes)
"Analyze the current codebase structure in /src/payments/.
Identify all files that handle payment processing.
Map dependencies between these files.
List any external APIs or services they call.
Generate a migration plan that maintains backward compatibility."
Codex outputs a plan. I review it. If the plan makes sense, I give it the green light to execute. If not, I correct the plan before it writes any code.
Phase 2: Isolated changes (1-2 hours per subsystem)
"Implement Phase 1 of the migration plan: payment validation logic.
Work only on files in /src/payments/validation/.
Keep /src/payments/processing/ unchanged for now.
Run validation tests after each file.
Stop if any test fails and report the failure."
This isolates risk. If something breaks, it's contained to one subsystem. I can review that subsystem's changes, approve them, then move to the next phase.
Phase 3: Integration and verification (30-60 minutes)
"Now integrate the updated validation logic with processing.
Update /src/payments/processing/ to use the new validation interface.
Run full test suite.
Fix any integration bugs.
Verify backward compatibility with existing API contracts."
By the time I reach Phase 3, the validation logic is already reviewed and working. Codex is just connecting pieces that are known to work independently.
This phased approach cut my review time by 60%. Instead of reviewing 2,000 lines of changes all at once, I review 400 lines at a time, approve, then move forward.

Alright, here's the part that confused me at first: when do you let Codex work independently, and when do you jump in?
I tested this systematically. Ran 15 tasks, varied my intervention frequency, tracked completion rate and code quality.
What I found: intervening too early destroys productivity. Intervening too late produces garbage.
The pattern that worked:
Let it run for the first 30-45 minutes without interruption. This is when Codex is loading context, exploring the codebase, and forming its execution plan. If you interrupt during this phase, it loses coherence. I know it's tempting to course-correct early mistakes, but resist.
Check progress at natural breakpoints. Codex surfaces progress updates every 10-15 minutes (you can see these in the Codex app interface). These are your intervention opportunities. If it's on track, let it continue. If it's drifting, redirect.
Intervene when you see architectural drift. This is the failure mode that matters. If Codex starts making decisions that conflict with your system's architecture, stop immediately. Example: I was running a database migration and Codex decided to introduce a new ORM layer. That wasn't in scope. I stopped, clarified constraints, and it corrected course.
Don't interrupt for minor code style issues. If it's using let instead of const, or naming variables differently than you would—let it finish. You can clean that up in review. Interrupting for style breaks its flow.
From my 15-task experiment, the breakpoint strategy produced the best results: intervene only when Codex surfaces a decision point or when you notice architectural drift. It kept Codex in flow while preventing major mistakes.
The steering feature became stable in the latest release. Here's how it works in practice.
When Codex is running, you can send corrections without killing the task. In the Codex CLI, pressing Enter during a running task sends a steering message immediately. Tab queues the message for after the current step finishes.
I use steering for three things:
1. Constraint clarification
Mid-task message:
"Actually, keep the existing error handling pattern.
Don't introduce new try-catch blocks unless the current code doesn't handle errors at all."
Codex adjusts without losing progress on the rest of the refactor.
2. Scope adjustment
Mid-task message:
"Skip the test file updates for now.
Focus on getting the implementation working first."
This speeds up iteration when I realize I defined scope too broadly.
3. Bug correction
Mid-task message:
"The TypeScript error on line 47 is because you're using the old interface.
Import the new PaymentRequest type from @/types/payment."
Codex incorporates the fix and continues. Without steering, I'd have to wait for it to finish, review all the code, then restart with corrections.
Real example from my Express migration:
I was 90 minutes in when Codex hit an error: one of the legacy route handlers used a callback-based auth check that wouldn't work with async/await. It paused and asked how to handle it.
I sent: "Wrap the auth check in a Promise and await it. Pattern: await new Promise((resolve, reject) => authCheck(req, (err, user) => err ? reject(err) : resolve(user)))"
Codex applied that pattern to all 7 affected routes, ran the tests, confirmed they passed, and kept going. The entire correction took 3 minutes instead of requiring a full restart.
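Spelled out as a full handler, the pattern looked roughly like this; a sketch assuming Express types, with authCheck and the route body as hypothetical stand-ins for the real code:

import type { Request, Response } from "express";

type User = { id: string };

// Hypothetical legacy check with a Node-style callback signature.
function authCheck(req: Request, cb: (err: Error | null, user?: User) => void): void {
  if (req.headers.authorization) {
    cb(null, { id: "user-from-token" });
  } else {
    cb(new Error("unauthorized"));
  }
}

// The wrapper from the steering message, applied to one route handler.
export async function getOrders(req: Request, res: Response) {
  try {
    const user = await new Promise<User>((resolve, reject) =>
      authCheck(req, (err, user) => (err ? reject(err) : resolve(user as User)))
    );
    res.json({ userId: user.id, orders: [] }); // placeholder payload
  } catch {
    res.status(401).json({ error: "unauthorized" });
  }
}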
The steering feature is what makes GPT-5.3 Codex different from earlier versions. Before this, mid-task corrections meant stopping, losing context, and starting over. Now it's more like working with a junior developer: you course-correct in real time, they adjust, and work continues.
Let's talk about the review process, because this is where most of my time goes.
When Codex finishes a multi-hour task, you're looking at potentially hundreds of file changes. You can't review every line—that defeats the purpose of automation. But you also can't merge blindly.
Here's the review workflow I've settled on:
1. Review the architectural decisions first
Before looking at individual files, I scan for structural changes.
These are the changes that propagate. If Codex made a bad architectural call, it affects everything downstream. Catch it here before reviewing implementation details.
Tool: I use git diff --stat to see which files changed and by how much. If I see 500+ line changes in a core module, I review that first.
2. Run the automated verification
Don't trust that Codex ran the tests correctly. Run them yourself.
npm run lint
npm test
npm run type-check
npm run build
On the React migration, tests passed in Codex's environment but failed in mine because of a dependency version mismatch. Caught it because I ran tests independently.
3. Spot-check high-risk areas
I don't review every file. I focus on the high-risk areas.
For the Express migration, I reviewed all 12 async database calls in detail. Everything else got a quick scan.
4. Use diffs intelligently
The Codex app's diff view groups changes by type: new files, modified files, deleted files. I review them in that order: new files first, then modifications, then deletions.
5. Test in isolation
For any subsystem Codex touched, I test it separately before merging:
# Isolate the payment module
cd src/payments
npm test -- --coverage
# Check integration points
npm run test:integration -- --grep "payment"
# Load test if performance-critical
npm run load-test:payments
This caught a race condition in the async migration that didn't show up in unit tests.
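If you want to flush out that kind of race deliberately, a concurrency-heavy test helps; a sketch in Vitest syntax, where processPayment and getBalance are hypothetical exports of the payments module:

import { describe, expect, it } from "vitest";
import { getBalance, processPayment } from "../src/payments"; // hypothetical exports

describe("payment processing under concurrency", () => {
  it("keeps the balance consistent when payments land in parallel", async () => {
    const accountId = "acct-test";
    const before = await getBalance(accountId);

    // Fire 20 payments at once; shared-state races tend to show up here,
    // not in sequential unit tests. Assumes each payment debits 10.
    await Promise.all(
      Array.from({ length: 20 }, () => processPayment(accountId, 10))
    );

    expect(await getBalance(accountId)).toBe(before - 20 * 10);
  });
});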
Review time scales with the size of the task, but even a 2-3 hour review for a task that took Codex 6 hours is a massive time save. I didn't write any code. I just verified that what it wrote is correct.
One more thing: when I find bugs during review, I don't fix them manually. I feed them back to Codex:
"Found a bug in src/payments/validate.ts line 34:
You're calling processPayment before validating the amount.
This will throw if amount is undefined.
Fix: move validation check before processPayment call."
Codex fixes it in 2 minutes. This keeps the entire task in one context instead of fragmenting across manual edits and agent-generated code.
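The shape of that fix is simple; a sketch with PaymentInput, validateAmount, and processPayment as hypothetical stand-ins for the real module:

type PaymentInput = { amount?: number; currency: string };

function validateAmount(input: PaymentInput): number {
  if (input.amount === undefined || input.amount <= 0) {
    throw new Error("Invalid payment amount");
  }
  return input.amount;
}

async function processPayment(amount: number, currency: string): Promise<void> {
  // ...call the payment provider with amount and currency...
}

// Before the fix, processPayment ran first and threw on undefined amounts.
// After the fix, validation happens up front and the failure is explicit.
export async function handlePayment(input: PaymentInput) {
  const amount = validateAmount(input);
  await processPayment(amount, input.currency);
}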
After running 15 long-running tasks through GPT-5.3 Codex, here's what stays in my workflow:
Dependency migrations: Codex handles these better than I do. Less error-prone, faster, and it actually reads changelogs to avoid breaking changes.
Test rewrites: When I need to migrate from Jest to Vitest, or update test patterns, Codex is faster and more thorough than doing it manually.
Boilerplate expansion: If I need to add the same pattern across 30 files (logging, error handling, metrics), Codex does it in 20 minutes instead of my spending an afternoon.
What I'm not delegating yet:
System design decisions. Codex can implement an architecture, but I'm not trusting it to invent one from scratch.
Security-critical code. I still write auth, payment, and PII handling logic myself. Too much at stake.
Greenfield projects with unclear requirements. Codex needs a defined target. If requirements are fuzzy, the back-and-forth takes longer than just coding.
The pattern I've settled on: I design the structure, define success criteria, and specify constraints. Codex executes. I review architecture-level decisions and test the output. If it passes, I ship.
It's not replacing me. It's compressing the execution layer so I can spend more time thinking about what to build instead of how to implement it.
At Macaron, we handle exactly this workflow—structured task delegation that runs without constant supervision. If you want to test how your multi-hour tasks hold up when you're not babysitting every step, try it with a real refactor and see if context actually stays intact.