
Hey fellow AI tinkerers — if you've been watching GPT-5.4's launch benchmarks and wondering "okay, but what does this actually change about my workday," this one's for you.
I've been running GPT-5.4 through real workflows since launch day. Not the polished demos. Actual tasks with messy inputs, incomplete context, and the kind of edge cases that make AI tools fall over. Here's what I found.

The practical shift with GPT-5.4 isn't any single capability — it's convergence. GPT-5.4 brings reasoning, coding, and agentic workflows into one frontier model, which means you're no longer routing "think hard" tasks to one model, coding to another, and computer automation to a third.
Before GPT-5.4, a multi-step work task — say, research a competitor, build a comparison spreadsheet, and email a summary — required you to stitch together tools manually. The research output didn't automatically feed the spreadsheet, and the spreadsheet didn't automatically draft the email. GPT-5.4 is the first OpenAI general-purpose model where all three steps can happen in a single agent session.
That's the change that matters for work. Not any single benchmark number.
GPT-5.4 is OpenAI's first general-purpose model with native computer-use capabilities in Codex and the API, enabling agents to operate computers and carry out multi-application workflows. This isn't a wrapper around an existing tool — it's the same model that does your reasoning also issuing the mouse clicks.
The mechanism: the model takes a screenshot, decides what to do, issues a mouse or keyboard command, takes another screenshot, and repeats. No DOM manipulation required. No API surface needed. If a human can see it on screen and click it, GPT-5.4 can in principle do the same.
In practice, this opens up automation for legacy software, internal portals, and any tool that has never had an API. More on the reliability gaps below.
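The observe-decide-act loop described above can be sketched in a few lines. This is a hypothetical skeleton, not the actual Codex implementation — `take_screenshot`, `decide_action`, and `perform` are stand-ins for whatever screen-capture and input layer you wire up:

```python
# Hypothetical sketch of the screenshot -> decide -> act loop.
# take_screenshot, decide_action, and perform are placeholders you supply.

def run_computer_agent(take_screenshot, decide_action, perform, max_steps=50):
    """Loop until the model signals completion or the step budget runs out."""
    for step in range(max_steps):
        screenshot = take_screenshot()        # observe the current screen
        action = decide_action(screenshot)    # model picks click/type/done
        if action["type"] == "done":
            return {"status": "done", "steps": step}
        perform(action)                       # issue the mouse/keyboard event
    return {"status": "budget_exhausted", "steps": max_steps}
```

The `max_steps` budget matters in practice: without it, a confused agent on a shifting page will loop indefinitely.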


This is GPT-5.4's single strongest validated use case. On an internal benchmark of spreadsheet modeling tasks that a junior investment banking analyst might do, GPT-5.4 achieves a mean score of 87.3%, compared to 68.4% for GPT-5.2.
What "87.3%" means in plain language: build a three-statement model from a prompt, with correct formula logic, proper formatting, and citations. That's a task that takes a junior analyst 2–4 hours. GPT-5.4 handles the mechanical assembly of it.
ChatGPT for Excel is now in beta — an Excel add-in that brings GPT-5.4 directly into workbooks to help build and update models, run scenarios, and generate outputs based on cells and formulas. Google Sheets integration is listed as coming soon.
How to set it up: For the ChatGPT for Excel beta, you need Plus, Pro, Business, Enterprise, Edu, or Teachers plan (US, Canada, Australia at launch). Open the ChatGPT pane inside Excel, describe what you need in plain language, and the model reads your existing ranges and builds or updates the model in place.
But watch out: GPT-5.4 excels at structured, well-specified modeling tasks. Give it an ambiguous brief ("make the model more realistic") and it improvises assumptions you'll need to audit. Always specify your assumptions explicitly. It also doesn't yet handle circular references in Excel gracefully — break them manually before handing off.
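One way to force explicit assumptions is to assemble the prompt from a structured dict, so nothing is left for the model to improvise. A minimal sketch — the field names and wording are illustrative, not an official schema:

```python
def build_model_prompt(task, assumptions):
    """Assemble a modeling prompt with every assumption stated up front."""
    lines = [f"Task: {task}", "Use ONLY these assumptions; do not invent others:"]
    for name, value in assumptions.items():
        lines.append(f"- {name}: {value}")
    lines.append("Flag any input the assumptions above do not cover.")
    return "\n".join(lines)

prompt = build_model_prompt(
    "Build a three-statement model for FY2026",
    {"revenue growth": "8% YoY", "gross margin": "42%", "tax rate": "21%"},
)
```

The closing "flag any input not covered" line is the audit hook: it turns silent improvisation into an explicit list you can review.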

On a set of presentation evaluation prompts, human raters preferred presentations from GPT-5.4 68.0% of the time over those from GPT-5.2, citing stronger aesthetics, greater visual variety, and more effective use of image generation.
This is a real, measurable improvement — not just "it writes better bullet points." The model now understands visual hierarchy, layout variation, and when to use a chart vs. a table vs. a text slide. For first-draft decks from a brief or a document, it saves 1–2 hours of outline-and-structure work.
How to set it up: In ChatGPT, paste your source material (brief, data, document) and prompt:
Build a 10-slide executive presentation on [topic].
Slide 1: Title and key message.
Slides 2–8: One key point per slide, with supporting data and a suggested visual.
Slide 9: Risks and limitations.
Slide 10: Next steps and owner.
Use a professional, minimalist layout. Generate speaker notes for each slide.
But watch out: GPT-5.4 generates presentation content, not a finished PowerPoint file unless you're using a connected tool or API workflow. The visual output quality still depends heavily on your theme and template — it won't fix a bad deck template. Run a human review pass on any slide that includes external data citations.
On Online-Mind2Web, GPT-5.4 achieves 92.8% success using screenshot-based browser interaction alone. For form-heavy workflows — government portals, insurance systems, CRM data entry, vendor onboarding forms — this is the use case where computer use earns its cost.

One validated real-world example: Mainstay tested GPT-5.4 across roughly 30,000 HOA and property tax portals, achieving a 95% first-attempt success rate and 100% completion within three tries — roughly 3x faster and using 70% fewer tokens than previous computer-use models.
How to set it up: Via the Responses API with computer use enabled:
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.4-2026-03-05",
    tools=[{"type": "computer_use"}],
    input=[{
        "role": "user",
        "content": "Navigate to [URL], fill in the following fields: [field: value pairs], and submit the form. Screenshot before submitting.",
    }],
)
For repetitive multi-record data entry, batch the task with a structured input list and have the model iterate through records, with a confirmation screenshot after each submission.
But watch out: Pages with CAPTCHAs, two-factor authentication steps, or modal dialogs that shift layout between sessions break the flow. Always test on a staging environment before running on production forms. Add explicit human confirmation gates for irreversible submissions.
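For the multi-record batching described above, it helps to build one instruction per record rather than one giant prompt, so a single bad record doesn't poison the whole session. A sketch — the URL and field names are hypothetical:

```python
def build_entry_instructions(url, records):
    """One instruction string per record, each ending with a confirmation screenshot."""
    instructions = []
    for i, record in enumerate(records, start=1):
        fields = "; ".join(f"{k}: {v}" for k, v in record.items())
        instructions.append(
            f"Record {i}: open {url}, fill in ({fields}), submit, "
            f"then screenshot the confirmation page."
        )
    return instructions

batch = build_entry_instructions(
    "https://portal.example.com/entry",  # hypothetical portal URL
    [{"name": "Acme LLC", "parcel": "12-345"},
     {"name": "Birch HOA", "parcel": "98-765"}],
)
```

Feed each string as its own user turn (or its own API call) and check the confirmation screenshot before moving to the next record.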
GPT-5.4 improved by 17% absolute over GPT-5.2 on BrowseComp, with GPT-5.4 Pro reaching 89.3% — described as a new state of the art for persistent web research. Standard GPT-5.4 hits 82.7%. For knowledge workers who spend significant time aggregating information from multiple sources, this is the most useful day-to-day improvement in the model.
GPT-5.4 was improved on real-world tasks including financial modeling, scenario analysis, data extraction, and long-form research, and can produce structured, cited outputs that export to PDF or Microsoft Word.
How to set it up: For a multi-source research task in ChatGPT:
Research [topic] using the following sources: [source list or URLs].
Produce a structured summary with:
1. Key findings (with citations)
2. Conflicting data points between sources
3. Open questions not answered by the sources
4. Recommended next steps
Format: use headers, bullet points, and a source reference list at the end.
But watch out: GPT-5.4 is 33% less likely to make false individual claims than GPT-5.2, but "less likely" isn't "never." For any research output that feeds a high-stakes decision, treat it as a first draft requiring expert review, not a finished deliverable. GPT-5.4 Pro's 89.3% BrowseComp score is significantly better than standard's 82.7% — for critical research tasks, the Pro tier is worth the cost.
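Before spending expert review time, it's worth a cheap structural check that all four sections of the research prompt above actually came back. A minimal sketch, with section names mirroring that prompt:

```python
REQUIRED_SECTIONS = [
    "Key findings",
    "Conflicting data points",
    "Open questions",
    "Recommended next steps",
]

def missing_sections(summary_text):
    """Return the required section headers absent from a research summary."""
    lower = summary_text.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lower]
```

An empty return list doesn't mean the content is right — it just means you're reviewing a complete draft, not re-prompting for missing sections.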
Playwright Interactive, an experimental Codex skill, lets GPT-5.4 visually debug web and Electron apps — even testing applications while building them. The loop: GPT-5.4 writes code, Playwright runs it in the browser, the model sees the visual output, identifies what's broken, and fixes it — without you manually pointing out every failure.
On CodeRabbit's evaluation across 300 pull requests, GPT-5.4 identified 254 of 300 bugs (84.7%), compared to 200–207 for other frontier models.
This is the workflow that changes frontend development in practice. Not the raw code-writing capability — the self-correction loop.
How to set it up: In Codex, enable the Playwright Interactive skill. Describe what you're building, then prompt the model to:
Build [component/feature]. After each significant change, run a Playwright test to verify:
- Visual rendering matches the spec
- Key interactions (click, input, navigation) work as expected
- No console errors
If a test fails, diagnose the issue from the screenshot and fix before proceeding.
But watch out: Playwright Interactive is an experimental skill — it's not stable for production CI/CD pipelines yet. GPT-5.4 leads on structured frontend tasks; for multi-file architectural reasoning across large codebases, Claude Opus 4.6 still has the edge. Use GPT-5.4 for component-level work, not full-stack architectural refactors.
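The self-correction loop itself is simple control flow, whichever tool runs the checks. This is a stubbed sketch of that loop, not the Playwright Interactive skill — `run_checks` and `apply_fix` stand in for the browser run and the model's patch step:

```python
def build_fix_loop(run_checks, apply_fix, max_rounds=5):
    """Re-run checks after each fix pass; stop when clean or out of rounds."""
    for round_num in range(1, max_rounds + 1):
        failures = run_checks()      # e.g. Playwright assertions + console errors
        if not failures:
            return {"clean": True, "rounds": round_num}
        for failure in failures:
            apply_fix(failure)       # model patches the code for each failure
    return {"clean": False, "rounds": max_rounds}
```

The `max_rounds` cap is the important part: a model that keeps "fixing" the same failure will otherwise burn tokens indefinitely.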
This is a lower-stakes application of computer use, but one of the highest practical-ROI ones for knowledge workers. Email triage, meeting scheduling, and calendar blocking are high-frequency, low-cognitive-value tasks that eat a disproportionate amount of the workday.
GPT-5.4 can navigate Gmail or Outlook via computer use, read your inbox, draft context-aware replies, and interact with calendar apps to check availability and create events — all without a custom API integration.
How to set it up:
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.4-2026-03-05",
    tools=[{"type": "computer_use"}],
    input=[{
        "role": "user",
        "content": """
1. Open Gmail. Find unread emails from the last 24 hours marked important.
2. For each: draft a reply maintaining my usual direct, concise tone.
3. Do NOT send — save each as a draft. Screenshot the draft list when done.
""",
    }],
)
But watch out: Never give an agent unsupervised send access for email. Draft-only mode is the right default. For calendar management, explicitly define working hours and buffer preferences in the prompt — the model will schedule back-to-back meetings without breaks if you don't specify otherwise.
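Draft-only mode is worth enforcing in the harness too, not just in the prompt. A hypothetical action filter you could put between the model's proposed action and execution — the action shape and `name` field are assumptions, not an official schema:

```python
BLOCKED_ACTIONS = {"send", "delete", "submit"}

def filter_action(action):
    """Block irreversible email actions; pass everything else through."""
    if action.get("name", "").lower() in BLOCKED_ACTIONS:
        raise PermissionError(f"Blocked irreversible action: {action['name']}")
    return action
```

Belt-and-suspenders: the prompt says "do NOT send" and the harness refuses to execute a send even if the model tries anyway.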

This is where GPT-5.4's convergence architecture pays off most. GPT-5.4 scored a record 83% on OpenAI's GDPval test for knowledge work tasks, and took the lead on Mercor's APEX-Agents benchmark for professional services work.
A real example: read a PDF contract, extract key terms and dates, upload a summary to a shared drive, and log the entry in a CRM — four steps, three different applications, one agent session. GPT-5.4 can hold context across all four steps without losing state mid-task, which prior models consistently failed at.
How to set it up: The key prompt pattern for multi-step agent tasks:
Task: [Full workflow description]
Steps:
1. [Action] from [source]
2. Extract: [specific data points]
3. Upload to [destination] with filename format: [format]
4. Log the following fields in [system]: [field list]
Constraints:
- Confirm before any irreversible action (send, delete, submit)
- If step N fails, pause and report — do not proceed to step N+1
- Screenshot the completed state of each step
The "pause and report on failure" constraint is the most important part. Multi-step agent runs without explicit failure handling silently produce incomplete outputs that look complete.
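The pause-and-report constraint maps directly onto a step runner that halts at the first failure instead of plowing ahead. A minimal sketch, assuming each step is a named callable:

```python
def run_steps(steps):
    """Run (name, fn) steps in order; stop at the first failure and report it."""
    completed = []
    for name, fn in steps:
        try:
            fn()
        except Exception as exc:
            return {"status": "paused", "failed_step": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"status": "done", "completed": completed}
```

The report tells you exactly which steps finished, so a retry can resume at the failed step instead of re-running the whole workflow.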
But watch out: GPT-5.4 supports up to 1 million tokens of context, but input tokens above 272K are charged at double the rate. Long multi-step sessions with large documents accumulate context fast. Monitor token usage per session and use summarization passes between major steps to keep context lean.
For API access, authenticate with your OpenAI API key and include "tools": [{"type": "computer_use"}] in your Responses API call to enable computer use.
Three patterns I keep coming back to:
The spec-first pattern — describe the desired output before the steps. The model backtracks from the output spec to choose its own best path.
The checkpoint pattern — ask for a screenshot or structured summary after each major step. This forces the model to verify its own state, catching errors before they compound.
The constraint-first pattern — list what NOT to do before what to do. For agentic tasks, constraint violations (sending before confirming, skipping steps, hallucinating field values) are the main failure mode. State them explicitly upfront.
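The three patterns compose into one prompt shape: output spec first, constraints before steps, a checkpoint after each step. A sketch of a builder for that shape — the wording is illustrative, not a required format:

```python
def build_agent_prompt(output_spec, steps, constraints):
    """Spec-first, then constraints, then checkpointed steps."""
    parts = [f"Desired output: {output_spec}", "Do NOT:"]
    parts += [f"- {c}" for c in constraints]
    parts.append("Steps (screenshot or summarize state after each):")
    parts += [f"{i}. {s}" for i, s in enumerate(steps, start=1)]
    return "\n".join(parts)

prompt = build_agent_prompt(
    "a CSV of extracted contract terms, logged in the CRM",
    ["Read the PDF contract", "Extract key terms and dates", "Log fields in the CRM"],
    ["send or submit anything without confirmation", "invent field values"],
)
```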

GPT-5.4 is excellent at executing well-specified tasks. It struggles with tasks that require genuine judgment about ambiguous situations — "is this contract clause unusual enough to flag?" or "should I accept this meeting given everything else on my plate?" These are tasks where the model's output is a first draft for human decision-making, not a final action.
For any task where the right answer depends on context the model can't see (relationship history, organizational politics, risk appetite), treat GPT-5.4 as a fast researcher and drafter, not a decision-maker.
The OSWorld score is 75.0% — which means roughly 1 in 4 desktop tasks fails or produces an incorrect result on a controlled benchmark. In real production environments with messier interfaces, that failure rate is higher. For long overnight agent runs, expect failures at some steps and design your workflows for graceful recovery, not perfect execution.
Specific reliability gaps I've observed: dynamic interfaces that shift layout between screenshots cause mis-clicks; multi-tab workflows where the model loses track of which tab is active; and any interface that uses non-standard custom UI components that don't behave like standard HTML elements.
For teams running high-volume agent workflows, the 272K token context cliff is real. Sessions that drift above that threshold double the input token cost, which can make multi-step, document-heavy workflows significantly more expensive than the headline $2.50/M input rate suggests. See the full breakdown in GPT-5.4 pricing explained.
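The cliff is easy to put in numbers using the article's figures ($2.50/M headline input rate, double above 272K tokens):

```python
BASE_RATE = 2.50 / 1_000_000   # $ per input token (headline rate)
CLIFF = 272_000                # tokens above this are billed at double

def input_cost(tokens):
    """Input cost in dollars, with tokens above the cliff billed at 2x."""
    below = min(tokens, CLIFF)
    above = max(tokens - CLIFF, 0)
    return below * BASE_RATE + above * BASE_RATE * 2

# input_cost(250_000) -> $0.625; input_cost(500_000) -> $1.82
```

A 500K-token session costs nearly three times a 250K one, not two — which is why the summarization passes between steps pay for themselves.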
Here's the honest version. For the 7 tasks above, GPT-5.4 isn't replacing your judgment — it's replacing your mechanical execution time. A three-statement financial model that takes a junior analyst 3 hours to build from scratch takes GPT-5.4 a few minutes to assemble, but still requires a senior analyst 30–60 minutes to review, validate assumptions, and correct errors.
The ROI calculation: if your time is worth $100/hour and GPT-5.4 saves 2 hours of mechanical work per day at a cost of $5–20 in API calls, the math is obvious. The constraint isn't the model — it's your ability to write tight specs, review outputs efficiently, and design workflows with appropriate human checkpoints.
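Made explicit, the back-of-envelope math from that paragraph:

```python
def daily_net_roi(hourly_rate, hours_saved, api_cost):
    """Net dollars per day: value of time saved minus API spend."""
    return hourly_rate * hours_saved - api_cost

# With the article's numbers: $100/hr, 2 hrs saved, $5-20 in API calls
low = daily_net_roi(100, 2, 20)   # -> 180
high = daily_net_roi(100, 2, 5)   # -> 195
```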
The tasks where GPT-5.4 genuinely surprises me: multi-source research synthesis (it finds connections across sources that I'd miss in a manual pass), and browser automation for legacy software (things I'd resigned to doing manually because "there's no API" can now be automated).
The tasks where I still prefer doing it myself: anything where the brief is ambiguous and the iteration cost of a wrong first draft exceeds the time saved, and any task where I'd be uncomfortable explaining my process to a stakeholder if they asked "how did you do this?"
I'd recommend starting with the task that costs you the most time per week and has a clear, verifiable output. For most knowledge workers, that's either research synthesis or spreadsheet work. Both have well-defined success criteria, which means you can evaluate GPT-5.4's output quality quickly.
A 30-minute test protocol that works:
1. Pick one real task from the last month that you completed yourself and still have the output for.
2. Write a tight spec: inputs, constraints, and what "done" looks like.
3. Run GPT-5.4 against the spec and time the full cycle, including your review-and-correction pass.
4. Compare the model's output against your own on quality, errors, and total time spent.
That comparison gives you a real ROI signal, not a benchmark number. Run it twice more with different tasks before you commit to redesigning your workflow around the model.
One more thing worth saying: GPT-5.4 is built to handle the professional execution layer — the spreadsheets, the forms, the research, the code. If you're also looking for something that handles the personal side of your life — your routines, your goals, the tasks that don't fit neatly into a workflow — that's a different kind of tool. At Macaron, we've been building for exactly that side: the personal AI that actually remembers what matters to you and helps you act on it.