GPT-5.4 Computer Use: What It Can Actually Do

Hey fellow AI tinkerers — if you've been watching the agentic AI space with one eyebrow permanently raised, this one's for you.

I'll be honest: when I first heard "GPT-5.4 can use your computer," I assumed it was another demo-ware announcement. I've seen too many "autonomous AI agent" reveals that look incredible in a scripted screen recording and collapse the second you put a real task in front of them. So I kept my expectations low, opened the docs, and started poking around.

Here's what I actually found — and where this thing still drives me up the wall.

What "Computer Use" Actually Means in GPT-5.4

How It Works: Screenshot → Mouse + Keyboard Instructions

GPT-5.4 computer use isn't magic. The loop is pretty mechanical once you understand it: the model takes a screenshot of the current screen state, decides what to do next (click, type, scroll, drag), issues that command, takes another screenshot, and repeats. That's the whole thing. No DOM manipulation, no browser API hooks — just visual interpretation and action.

According to OpenAI's official launch announcement, GPT-5.4 supports two modes that can be combined: writing code that controls the computer through libraries like Playwright, and issuing direct mouse and keyboard commands in response to screenshots. The screenshot loop above describes the second mode. That dual-mode approach is what sets it apart from single-path browser automation tools.

Here's a minimal API call to kick off a computer-use session:

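import pyautogui  # local input driver that performs the model's actions
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5.4-2026-03-05",
    # Tool type as given in this post; check the current API reference
    # for the exact name and any required options (display size, environment).
    tools=[{"type": "computer_use"}],
    input=[{
        "role": "user",
        "content": "Open the browser, go to our CRM, and export this week's leads as a CSV."
    }]
)
# Assumption: actions arrive as "computer_call" items in response.output,
# as with the Responses API computer-use tool; adjust if the schema differs.
for item in response.output:
    if item.type != "computer_call":
        continue
    action = item.action
    if action.type == "click":
        pyautogui.click(action.x, action.y)
    elif action.type == "type":
        pyautogui.write(action.text)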
One thing worth knowing: each API call is stateless unless you manage conversation history yourself. Long automation sessions require explicit state management in your agent code — the model doesn't "remember" what it did two steps ago unless you pass that context forward.
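
To make the state-management point concrete, here's a minimal sketch of a loop that carries its own history forward. The names `run_session`, `decide`, `execute`, and `capture` are hypothetical placeholders, not OpenAI API calls; `decide` stands in for whatever model call you make, and the history format is an assumption for illustration.

```python
from typing import Callable, Optional

# Hypothetical shapes for illustration: an action is a plain dict such as
# {"type": "click", "x": 10, "y": 20}; None means the task is considered done.
Action = dict
Decide = Callable[[list], Optional[Action]]


def run_session(
    task: str,
    decide: Decide,
    execute: Callable[[Action], None],
    capture: Callable[[], str],
    max_steps: int = 20,
) -> list:
    """Run a screenshot -> action loop, passing the full history to every call."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Record the current screen so the next decision can see it.
        history.append({"role": "screen", "content": capture()})
        action = decide(history)  # the model only knows what is in `history`
        if action is None:
            break
        execute(action)
        # Record what was done so later steps can refer back to it.
        history.append({"role": "action", "content": action})
    return history
```

The `max_steps` cap is there so a confused agent can't loop forever; in a real session you'd probably also trim or summarize old screenshots to keep the context small.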

What's Different from Previous Tool-Use Approaches

Before GPT-5.4, computer use in AI systems basically meant one of three things: browser-specific scripts (brittle), DOM-based agents (limited to web), or wrapper tools that needed constant human confirmation. GPT-5.4 operates at the visual layer — it sees what a human sees and acts the way a human acts. No API dependency. No DOM structure requirement.

The practical difference: it can navigate legacy enterprise software, PDF-heavy portals, and custom internal tools that have no API surface at all.

What GPT-5.4 Can Do Right Now

Browser Tasks: Forms, Emails, Navigation

This is the highest-reliability use case. Filling multi-step forms, navigating authenticated portals, extracting data from web pages without scraping APIs — GPT-5.4 handles all of this. One real-world data point: Mainstay tested the model across roughly 30,000 HOA and property tax portals (notoriously messy, decade-old interfaces), and GPT-5.4 achieved a 95% first-attempt success rate with 100% completion within three tries — roughly 3x faster and using 70% fewer tokens than previous computer-use models.

Desktop Tasks: Calendar, Spreadsheets, File Ops

On spreadsheet modeling tasks benchmarked against junior investment banking analyst work, GPT-5.4 scored 87.3% — up from GPT-5.2's 68.4%. Calendar management, file organization, and moving data between applications all land in this category. The model handles multi-application workflows reasonably well as long as the interface isn't actively adversarial (more on that below).

OSWorld Score 75.0% — What That Means in Plain English

OSWorld-Verified is the benchmark that matters here. It puts AI models in front of real operating systems with real applications — tasks like "find and open the most recently modified spreadsheet" or "send this document to the right folder across three open apps." Human experts score 72.4% on this test. GPT-5.4 scores 75.0%, making it the first frontier model to exceed human baseline on autonomous desktop task completion. For context, GPT-5.2 scored 47.3% on the same benchmark — so this is a generational jump, not an incremental one.

Real-World Demo Examples

Theme Park Simulator Built via Playwright

OpenAI's flagship demo: a complete isometric theme park simulation game built from a single prompt using Playwright Interactive for browser playtesting. The output included tile-based path placement, ride and scenery construction, guest pathfinding with queueing logic, and live park metrics (funds, guest count, happiness, cleanliness). Playwright ran automated browser tests throughout the build cycle — verifying path placement, camera navigation, guest responses, and UI indicators as the game was being constructed. This is the "build and test in the same loop" workflow that previously required a separate QA pass.

Tactical Turn-Based RPG

The RPG demo, built by developer Cory Ching using GPT-5.4 and Codex with Playwright Interactive, is the more technically interesting showcase. The game runs in the browser on Phaser, featuring turn-based combat on a grid map with movement, positioning, and encounter flow — developed iteratively over multiple turns, with the AI writing code, testing it in the browser, and refining the combat system and visual style in a continuous loop. The key insight here isn't the game itself. It's the loop: AI writes → AI tests visually → AI fixes → repeat. That feedback cycle is what changes the workflow.

What the Demos Actually Show About the Coding + Visual QA Loop

Both demos reveal the same underlying capability: GPT-5.4 can now close the feedback loop between writing code and verifying it works. A model that writes frontend code and then watches it run and catches its own bugs is a fundamentally different tool from one that just generates output. It's not reliable enough to run unsupervised on production code yet. But as a coding + QA assistant that can iterate without you manually pointing out every breakage, this is genuinely new.
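
The write, test, fix cycle can be sketched in a few lines. This is not how OpenAI or Codex implement it; `iterate_until_passing`, `write_code`, and `run_checks` are hypothetical names, where `write_code` stands in for a model call and `run_checks` for something like a Playwright-driven browser test that returns a list of failure messages.

```python
from typing import Callable


def iterate_until_passing(
    write_code: Callable[[str], str],
    run_checks: Callable[[str], list],
    spec: str,
    max_rounds: int = 5,
) -> tuple:
    """Alternate between generating code and checking it until no failures remain.

    Returns (code, rounds_used, remaining_failures).
    """
    prompt = spec
    code = ""
    failures: list = []
    for round_number in range(1, max_rounds + 1):
        code = write_code(prompt)
        failures = run_checks(code)  # e.g. browser tests that return failure messages
        if not failures:
            return code, round_number, []
        # Feed the observed failures back into the next generation step.
        prompt = spec + "\nFix these failures:\n" + "\n".join(failures)
    return code, max_rounds, failures
```

The important design choice is that the failures go back into the prompt. Without that, each round is just a blind retry and not a feedback loop.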

GPT-5.4 vs Claude Opus 4.6 on Computer Use

This is the comparison most developers actually care about in March 2026. Here's what the benchmarks look like:

| Benchmark | GPT-5.4 | Claude Opus 4.6 | Notes |
| --- | --- | --- | --- |
| OSWorld-Verified (desktop tasks) | 75.0% | 72.7% | Human baseline: 72.4% |
| SWE-Bench Verified (coding) | ~57.7% | 80.8% | Opus leads significantly |
| BrowseComp (web research) | 82.7% | 84.0% | Opus slight edge |
| GDPval (knowledge work) | 83.0% | 78.0% | GPT-5.4 leads |
| Pricing (input / output per 1M tokens) | $2.50 / $15 | $5 / $25 | GPT-5.4 cheaper |
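
To see what the pricing row means for a real bill, here's a small sketch that turns per-1M-token prices into a cost estimate. The prices are copied from the comparison above; the `estimate_cost` helper and the example token counts are assumptions for illustration, not measured usage.

```python
# Prices in USD per 1M tokens, copied from the comparison above.
PRICES = {
    "GPT-5.4": {"input": 2.50, "output": 15.00},
    "Claude Opus 4.6": {"input": 5.00, "output": 25.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one workload for the given model."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical workload: 2M input tokens (mostly screenshots) and 0.5M output tokens.
for name in PRICES:
    print(name, estimate_cost(name, 2_000_000, 500_000))
```

For that made-up workload the estimate comes to $12.50 for GPT-5.4 and $22.50 for Opus 4.6. Screenshot-heavy sessions skew toward input tokens, so the input price tends to matter more than it does for plain chat.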

Where Each Model Still Struggles

GPT-5.4's computer use edge is real but narrow on OSWorld — a 2.3-point gap over Opus 4.6 isn't a moat. For pure coding tasks, Opus 4.6's SWE-Bench lead of roughly 23 percentage points is significant. The practical routing logic for March 2026: use GPT-5.4 for computer use agents, form automation, and document-heavy professional workflows; use Opus 4.6 for production code, debugging, and complex web research. Neither model wins everything.
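
If you want that routing in code, a lookup table is enough to start. The category names and model labels below are illustrative assumptions, not official API model identifiers.

```python
# Task categories follow the routing suggested in this post; the label
# strings are illustrative and are not official API model identifiers.
ROUTES = {
    "computer_use": "gpt-5.4",
    "form_automation": "gpt-5.4",
    "document_workflow": "gpt-5.4",
    "production_code": "claude-opus-4.6",
    "debugging": "claude-opus-4.6",
    "web_research": "claude-opus-4.6",
}


def pick_model(task_category: str, default: str = "gpt-5.4") -> str:
    """Return the model label for a task category, falling back to a default."""
    return ROUTES.get(task_category, default)
```

Keeping the routing in data instead of scattered `if` statements makes it cheap to update when the next round of benchmarks lands.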

Limitations & Risks

What It Still Can't Reliably Do

- Highly dynamic interfaces: Pages that shift layout between sessions, or apps that use non-standard UI frameworks, still cause mis-clicks and navigation failures.
- Long overnight workflows: Multi-hour agentic runs require careful state management. A dropped API connection mid-task leaves systems in incomplete states with no built-in recovery.
- Specialized domain knowledge: On highly specialized topics with limited training data, the model still guesses. Human validation remains essential for high-stakes outputs.
- Mobile automation: Android and iOS automation requires emulators or device management bridges. There's no direct mobile control path yet.
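
Since there's no built-in recovery for long runs, you have to bring your own. One common approach is to checkpoint after every completed step. This is a minimal sketch, assuming your agent's state fits in a JSON-serializable dict; `save_checkpoint` and `load_checkpoint` are hypothetical helper names.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write the last completed step atomically so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> tuple:
    """Return (next_step, state); start from step 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as handle:
        data = json.load(handle)
    return data["step"] + 1, data["state"]
```

On restart, the agent calls `load_checkpoint` and resumes from the returned step. This only covers your own bookkeeping: whatever the agent half-finished on screen still needs a fresh screenshot and a sanity check before it continues.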

Security and Permission Considerations

This is the part that gets under-discussed. When you give an AI model mouse and keyboard access to your computer, you're expanding the attack surface significantly. The main risks per OpenAI's system card: prompt injection from malicious web pages, data exfiltration through connected tools, and destructive actions triggered by hidden instructions in content the agent browses.

Minimum viable security setup for computer-use agents:

- Run inside Docker containers with limited file system mounts
- Use a dedicated low-privilege OS user account
- Never run on your primary machine with access to personal files
- Set explicit confirmation policies for irreversible actions (send, delete, submit)
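
The confirmation policy in the last item can be sketched as a small gate in front of your action executor. The `kind` field and the `guarded_execute` helper are hypothetical; a real agent would map its own action schema onto something like this.

```python
from typing import Callable

# Hypothetical action labels; map your agent's own action schema onto these.
IRREVERSIBLE = {"send", "delete", "submit"}


def guarded_execute(
    action: dict,
    execute: Callable[[dict], None],
    confirm: Callable[[dict], bool],
) -> bool:
    """Run `action`, asking `confirm` first when the action is irreversible.

    Returns True if the action was executed, False if it was blocked.
    """
    if action.get("kind") in IRREVERSIBLE and not confirm(action):
        return False
    execute(action)
    return True
```

Here `confirm` could be a terminal prompt, a Slack approval, or anything else that puts a human in front of the action before it fires.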

GPT-5.4 is classified as "High cyber capability" under OpenAI's Preparedness Framework — which is both a capability signal and a warning about the expanded risk surface.

When You Still Need a Human in the Loop

Any task involving irreversible external actions (sending emails to real recipients, submitting financial transactions, deleting files) needs human confirmation before execution. The 25% failure rate implied by the 75.0% OSWorld score comes from a controlled benchmark, so expect higher failure rates in production environments with messier interfaces and more edge cases. For anything high-stakes, treat GPT-5.4 computer use as a fast draft executor that a human reviews before the final action fires.

Verdict: Is This Actually Useful Yet?

Honestly? Yes — but with conditions.

It's useful now if you're a developer building automation agents for web-based workflows (form filling, data extraction, portal navigation), or if you need to operate software that has no API and no other automation path. The OSWorld score isn't hype. It's a real capability jump that didn't exist six months ago.

It's not ready for unsupervised production deployment on sensitive systems, overnight agentic runs without monitoring, or anything where a 25% failure rate is unacceptable.

The theme park and RPG demos are impressive, but they're also ideal conditions — structured tasks, forgiving interfaces, iterative workflows. Real enterprise environments are messier. The insurance portal test by Pace (navigating 20-year-old hyper-dense enterprise UIs without hallucinating a click) is a more honest signal: GPT-5.4 handled it, but those are exactly the edge cases where human review still matters.

The Playwright Interactive loop — where the model writes code, watches it run, and fixes its own bugs — is the capability I keep coming back to. That feedback cycle changes how AI-assisted development works in practice. Not magic. Not AGI. But a real change in the workflow.

At Macaron, we built our agent around the same problem this capability exposes: ideas get stuck between the conversation and actual execution. If you want to test whether your workflow plans can be turned into real, trackable tasks without switching between tools or losing context, try running one real project through Macaron and see how far it actually gets.
