Project Genie Tutorial: Step-by-Step Guide to Creating Your First World (2026)

Hello, everyone. I'm Anna. Several days ago, I wanted a short, moving background for a two-minute update I was sending to a client, and my brain refused to storyboard anything. I opened Genie because it looked like a low-effort way to point at an idea and say "that, make that move." I expected another shiny tab I'd forget. It wasn't. At least not immediately.

This is the Project Genie tutorial I wish I'd had before I poked at it on a Tuesday afternoon in January 2026: simple steps, a few prompt patterns, what "real-time" actually feels like, and the small frictions that made me pause. If you're curious about AI as a quiet helper rather than a workflow empire, this is for you.

Before you start

Access checklist (US / plan / age)

I tested Project Genie in January 2026 from the US on both desktop (MacBook Air M3) and iPhone 15 Pro. Access may still be rolling out, and a queue shows up during peak hours. A few practical notes from my setup:

  • Region: US access worked for me; friends outside the US saw a waitlist. If you're traveling, VPNs sometimes trigger extra verification.
  • Plan: The free tier queued my generations (10–90 second waits). A paid plan moved me faster and unlocked longer captures. If you're just testing, free is fine. Google AI Ultra subscribers get priority access and the highest usage limits.
  • Age: Expect an 18+ minimum with standard "don't upload people without consent" rules. It's the usual content policy stuff: nothing surprising, but worth skimming.

I didn't see any hard-gated "pro" features you have to pay for to learn the tool, just speed and duration differences. Helpful if you're impatient; survivable if you're not.

Setup tips (device / browser / latency expectations)

  • Browser: Chrome and Edge behaved best. Safari worked but occasionally stuttered when I started recording. On mobile, the app felt smoother than mobile web.
  • Device: Anything recent with a decent GPU helps. My MacBook Air stayed cool; my older Intel laptop felt like it wanted a nap.
  • Network: Stable Wi‑Fi > cellular. On 5G, it worked, but the first few seconds of "real-time" control lagged.
  • Latency expectations: First load of a new world took ~8–20 seconds for me; subsequent remixes were quicker. Live control had a small delay; think a beat behind a gamepad, not instant twitch gaming.

If any of that sounds annoying, it is, briefly. But once a world spins up, it's mostly smooth sailing unless your network hiccups.

Create a world

Text vs image prompt

Genie accepts two kinds of prompts:

  • Text: Fastest way to sketch an idea. "A tiny explorer in a mossy terrarium, soft morning light, gentle camera follow." Good for mood and action.
  • Image: You upload a reference and describe motion. This is helpful when you want a specific character silhouette or color palette to carry through.

In my runs, text-first gave me more varied ideas; image+text gave me more consistency. If you're stuck, start with text, then anchor the next version with an image.

A starter prompt template you can copy

When I find a prompt that actually works, I don’t want it to vanish into a notes app I’ll never open again. That’s why we built Macaron. We use it to save working prompts, tweak them lightly, and pull them back out when we need motion now, not another blank page. If you’re collecting small, useful prompts instead of building a giant workflow, try Macaron here.

I lean on this fill‑in template when my brain feels like cold oatmeal:

"[Character] doing [clear action] in [environment], [lighting], [camera movement]. Keep [style] with [color palette]. Avoid [things you don't want]."

Example I used: "Tiny paper sailboat navigating puddles between city cobblestones, golden hour light, slow lateral camera drift. Keep handmade stop‑motion style with warm browns and teal highlights. Avoid faces, logos, or text."
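If you'd rather keep the template in code than in a notes app, here's a minimal sketch of the same fill-in pattern in Python. The slot names are my own shorthand, not anything Genie defines; it just assembles the string you paste into the prompt box.

```python
# Minimal builder for the fill-in template above.
# Slot names are my own shorthand, not anything Genie defines.
TEMPLATE = (
    "{character} doing {action} in {environment}, {lighting}, {camera}. "
    "Keep {style} with {palette}. Avoid {avoid}."
)

def build_prompt(**slots: str) -> str:
    """Fill every slot; raises KeyError if one is missing."""
    return TEMPLATE.format(**slots)

print(build_prompt(
    character="tiny paper sailboat",
    action="navigating puddles",
    environment="city cobblestones",
    lighting="golden hour light",
    camera="slow lateral camera drift",
    style="handmade stop-motion style",
    palette="warm browns and teal highlights",
    avoid="faces, logos, or text",
))
```

Changing one keyword argument at a time is the code version of the one-variable-at-a-time advice later in this post.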

Small adjustment that helped: swapping vague words like "beautiful" for specifics like "soft backlight" or "handheld wobble." The model responds better to the concrete stuff.

Character + environment prompt patterns

I stopped overthinking once I split my prompts into two buckets: who is moving, and what the world is made of.

  • Character: role, material, motion verbs, scale
    • Role + material: "miniature diver," "felt fox," "paper astronaut"
    • Verbs with posture: "tiptoes," "wades," "skids," "peers over"
    • Scale in context: "thumb‑sized," "towering over mugs," "ant‑level"
  • Environment: texture, layout, physics, light
    • Texture: "moss‑thick glass terrarium," "dusty garage workbench," "rain on acrylic"
    • Layout: "narrow corridor with puddle pockets," "stepping‑stone tiles," "overhead cables"
    • Physics cues: "slow splash rings," "paper crumples when wet," "dust motes drift"
    • Light: "neon edge glow," "overcast blue cast," "golden rim light"

Two working patterns:

  1. Inversion: make the character's material clash softly with the environment.
  • "Clay penguin in a stainless‑steel kitchen, reflections exaggerate movement, shallow DOF."
  2. Echo: match material and world so everything coheres.
  • "Felt raccoon in a felt forest; fibers visible; soft, pillowy shadows."

When outputs looked mushy, it was usually because I piled on too many style words. Cutting back to 2–3 style anchors ("stop‑motion, warm film grain, handheld") made a noticeable difference.
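If the buckets help you too, they're easy to keep in code. Here's a tiny composer, with word lists taken from this post; nothing in it talks to Genie, it only builds the prompt string.

```python
# Two-bucket prompt composer: who is moving vs. what the world is made of.
# Word lists are examples from this post; swap in your own.
STYLE_ANCHORS = ["stop-motion", "warm film grain", "handheld"]  # keep it to 2-3

def compose(character: str, environment: str, physics: str, light: str) -> str:
    anchors = ", ".join(STYLE_ANCHORS)
    return f"{character} in {environment}, {physics}, {light}. Keep {anchors}."

# Echo pattern: material matches the world.
print(compose("felt raccoon", "a felt forest with visible fibers",
              "soft pillowy shadows", "overcast blue cast"))

# Inversion pattern: material clashes softly with the world.
print(compose("clay penguin", "a stainless-steel kitchen",
              "reflections exaggerate movement", "neon edge glow"))
```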

Explore controls + what “real-time” means

The first time "real-time" loaded, I moved the character and… nothing happened for a second. Then everything jumped forward like it had decided to catch up. Not broken, just the system warming up.

What it felt like in practice:

  • Controls: I saw a simple joystick/direction pad and a jump/interaction button. Keyboard arrows worked on desktop; touch controls worked on mobile. Camera generally followed the character with a gentle drift.
  • Response: There's input prediction but also server generation, so the first 2–3 seconds can feel like pushing through syrup. After that, it settles into a steady rhythm.
  • Edits on the fly: Nudge the prompt while you're in the world; small tweaks like "more rain," "slower camera," or "softer shadows" applied within ~5–10 seconds.
  • Reroll vs remix: Reroll keeps the prompt, changes the seed. Remix keeps the physics/motion vibe but lets you edit character or environment. I started to rely on remix when I liked the mood but wanted a different protagonist.

I wouldn't call it a game. It's more like puppeteering a scene while the world renders just ahead of you. If you expect console responsiveness, you'll be grumpy; if you expect a creative toy with live-ish feedback, it's fun. For more technical details on how Genie 3 works, see Google DeepMind's research blog.

On days when my ideas are thin, I open the gallery and borrow momentum. Here's how that went.

  • Gallery: I picked a seed labeled "soft neon alley, cat courier." Hitting Remix carried over the alley layout, light spill, and general physics. Swapping the character to "wind‑up tin robot" worked better than changing everything at once. One variable at a time is kinder to the model (and my patience).
  • Randomizer: The dice icon is chaos with manners. It shuffles character, palette, and environment. I used it to get unstuck, then trimmed back the prompt. Two random presses, then edits; that rhythm got me to usable results faster than writing from scratch.
  • Preserve the spine: If you like the way puddles behave or how the camera breathes, keep those words. "Keep slow ripple rings." "Keep handheld cam with slight sway." Those anchors survive remixing more often than specific nouns.

Small delight: sometimes the randomizer pulled a color palette I'd never pick. I saved two of those because they made my usual "warm + cozy" style look less predictable.
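You can fake the dice at your desk, too. Here's a small stand-in, using my own word lists, that rerolls exactly one bucket at a time; it mirrors the one-variable rhythm above, and Genie's actual randomizer is a black box to me.

```python
import random

# Offline stand-in for the dice button: shuffle one variable, keep the rest.
BUCKETS = {
    "character": ["wind-up tin robot", "felt fox", "paper astronaut"],
    "palette": ["warm browns + teal", "neon pink + slate", "overcast blues"],
    "environment": ["soft neon alley", "mossy terrarium", "dusty workbench"],
}

def shuffle_one(parts: dict, key: str) -> dict:
    """Reroll a single bucket, leaving the other prompt parts stable."""
    rerolled = dict(parts)
    rerolled[key] = random.choice(BUCKETS[key])
    return rerolled

parts = {"character": "cat courier", "palette": "warm browns + teal",
         "environment": "soft neon alley"}
print(shuffle_one(parts, "character"))  # change one variable at a time
```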

Download videos + share safely

Exporting is pleasantly boring.

  • Capture length: Free gave me short clips; paid let me run longer. I recorded 8–15 seconds most of the time, long enough to feel alive, short enough to stay crisp.
  • Resolution/aspect: Square and vertical worked. Horizontal too, but vertical looked nicer for motion. If you're posting to a story, 9:16 is the easy button.
  • Watermark: I saw a small corner mark on free exports. Paid exports offered a toggle. If attribution matters to you, leave it on; it saves questions later.
  • Sound: No native audio in my tests. I added ambient tracks after export (see the sketch after this list). Honestly better that way.
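Since exports came out silent for me, the audio pass is easy to script. A minimal sketch assuming you have ffmpeg installed; the filenames are placeholders for your own clip and track.

```python
import subprocess

# Mux an ambient track onto a silent export with ffmpeg.
# Assumes ffmpeg is on PATH; clip.mp4 / ambience.m4a are placeholder names.
subprocess.run([
    "ffmpeg", "-y",          # -y: overwrite the output if it exists
    "-i", "clip.mp4",        # silent video export
    "-i", "ambience.m4a",    # ambient audio track
    "-c:v", "copy",          # keep the video stream as-is (no re-encode)
    "-c:a", "aac",           # encode the audio to AAC
    "-shortest",             # stop at the shorter of the two inputs
    "clip_with_audio.mp4",
], check=True)
```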

Sharing safely (my common‑sense list):

  • Don't use real people's faces or someone else's brand elements. The model sometimes echoes styles; keep it generic if you're sharing publicly.
  • If you're using this for client work, check license terms and content policies. "Looks fine on social" is not the same as "rights are handled."
  • Metadata: I keep a tiny note with prompt + date for each exported clip (sketch below). Later, when someone asks "how'd you make that," I don't have to reconstruct it from memory.
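My "tiny note" is literally a JSON sidecar next to each clip. A minimal sketch; the .json convention and field names are mine, not anything the export produces.

```python
import json
from datetime import date
from pathlib import Path

def save_note(clip_path: str, prompt: str) -> None:
    """Write a prompt + date sidecar next to the exported clip."""
    note = {"prompt": prompt, "date": date.today().isoformat()}
    Path(clip_path).with_suffix(".json").write_text(json.dumps(note, indent=2))

save_note("clip_with_audio.mp4",
          "Tiny paper sailboat navigating puddles between city cobblestones ...")
```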

Troubleshooting + FAQ

A few things I bumped into, plus what helped:

  1. It feels laggy at first.
  • Give it one full generation to "warm up." If it stays laggy, drop resolution once, or switch Wi‑Fi bands (5 GHz was noticeably snappier for me).
  2. My outputs look mushy/over‑stylized.
  • Remove vague adjectives. Keep 2–3 style anchors. Add one concrete lens note ("handheld wobble," "macro depth of field").
  3. The character keeps changing between takes.
  • Lock a short identity phrase early: "thumb‑sized felt fox with blue scarf." Reuse it exactly. Consistency beats cleverness here.
  4. Prompts are ignored (e.g., it keeps adding text/logos).
  • Add "avoid text or logos" up front. If it persists, reroll the seed or remix from a gallery item that doesn't have those artifacts.
  5. Crashes or stalls mid‑session.
  • Save early exports. Keep sessions under ~10 minutes before refreshing. This reduced hiccups on my Intel laptop.

One last nudge: if you hit a wall, remix someone else's seed and change one noun. It's the lowest‑effort way to learn how Genie likes to be asked.

I'll keep using it, mostly when I want movement without planning. And I'm curious whether the input delay shrinks as they roll more updates. If it does, I'll probably get a little braver with the puppeteering.

Hi, I'm Anna, an AI exploration blogger! After three years in the workforce, I caught the AI wave—it transformed my job and daily life. While it brought endless convenience, it also kept me constantly learning. As someone who loves exploring and sharing, I use AI to streamline tasks and projects: I tap into it to organize routines, test surprises, or deal with mishaps. If you're riding this wave too, join me in exploring and discovering more fun!

Apply to become Macaron's first friends