Are AI-Generated Recipes Actually Good?


At some point I started using AI to figure out what to do with whatever was in my fridge on a Tuesday. And honestly? It works often enough that I kept doing it. But "often enough" is doing a lot of work in that sentence, because there have also been genuinely bad results — including a batch of muffins with a crumb so dense it knocked against the counter like wood.

So I ran a more deliberate test: a set of AI-generated recipes across different categories, cooked from scratch, compared against what a decent recipe should actually produce. Here's what I found.


What We Tested and How

Recipe Types Tested

The test covered five categories:

  • Simple weeknight meals — a one-pan chicken dish, a pasta with pantry ingredients, a stir-fry from whatever was available
  • Baking — blueberry muffins, a simple chocolate cake, chocolate chip cookies
  • Dietary restriction substitutions — making a standard curry vegan, adapting a pasta for gluten-free
  • Unusual ingredient combinations — prompts like "use up this half-head of cabbage, some leftover rice, and fish sauce"
  • Multi-step techniques — a braised short rib, homemade pasta dough

Tools used: ChatGPT (GPT-4o), Claude, and DishGen as a dedicated recipe-tool comparison point. Each recipe was generated from a plain-language prompt with no additional constraints unless dietary restrictions were part of the test.

What "Good" Means Here

A recipe passes if it's accurate (quantities and timing produce the described result), cookable (doesn't require equipment or technique that's unrealistic without warning), and actually tasty (the dish is worth eating, not just edible). All three matter. A recipe that produces an edible result with awful texture and no flavor fails the same as one that gives you the wrong bake time.


Where AI-Generated Recipes Hold Up

Simple Weeknight Meals


This is where AI genuinely earns its keep. The one-pan chicken dish — thighs, canned tomatoes, olives, garlic, white wine, served over rice — came out well on the first attempt from both ChatGPT and Claude. The quantities were reasonable, the timing was accurate within a few minutes, and the result tasted like something a person had thought about, not a generic "chicken in sauce" template.

The stir-fry prompt ("chicken thighs, broccoli, soy sauce, ginger, whatever else makes sense") worked just as well. Both models suggested sesame oil, rice vinegar, and a cornstarch slurry for the sauce unprompted — details that distinguish a good stir-fry from a watery one.

For simple cooking with common proteins and pantry staples, AI-generated recipes are reliable enough that I'd follow them without significant skepticism. The techniques are standard, the ratios are familiar from the training data, and there's enough margin for error in cooking that small inaccuracies don't derail the dish.

Ingredient Substitution Suggestions

This is probably where AI adds the most real-world value beyond a recipe search. "Make this curry vegan" with a chicken tikka masala base produced a genuinely good response: swap the chicken for chickpeas and paneer (noting paneer isn't vegan, offering tofu as the fully vegan option), replace the cream with full-fat coconut milk in the same quantity, adjust the seasoning timing since coconut milk can curdle with prolonged high heat.

That last detail — the coconut milk curdling warning — is exactly the kind of thing a recipe search wouldn't surface unless you already knew to look for it. Claude in particular tends to add these technique notes without being prompted, which makes its substitution responses more useful than a simple ingredient swap list.

The gluten-free pasta adaptation was equally solid: use a specific 70/30 blend of rice flour and tapioca starch rather than just "gluten-free flour," add xanthan gum if the flour blend doesn't include it, rest the dough longer to allow full hydration. Again — specific and actually helpful.

Dietary Restriction Filtering

Strict constraint handling works reliably when you're explicit. "Vegan, gluten-free, no nuts, high protein" returned a chickpea and lentil bowl that met all four constraints without the recipe quietly including parmesan or assuming nut-free meant seed-free too. When I tested for common failures — asking for a "dairy-free" recipe and checking whether the AI would sneak in butter — both ChatGPT and Claude caught it. DishGen was slightly less reliable here, occasionally offering "optional parmesan" in what was supposed to be a strict dairy-free dish.

The caveat: you need to be explicit. "Make this vegan" is less reliable than "make this vegan, which means no meat, dairy, eggs, honey, or gelatin." General-purpose AI handles edge cases better when they're specified rather than assumed.


Where They Fall Apart

Baking — Where Precision Matters Most

This is the category where AI recipes fail most consistently, and the failures are specific enough to explain.

The blueberry muffin test was the clearest example. ChatGPT's recipe called for folding the oil, egg, and milk into the dry ingredients — so far standard — and then, in a separate step, stirring in the vanilla extract. That's not how batter works. You can't work vanilla extract into a thick, partially-mixed muffin batter without significantly overmixing it. The result was a dense, cornbread-like crumb that didn't rise properly and baked too hard at 400°F for 20–25 minutes.

The chocolate chip cookies had a different problem: the ratio of butter to flour was off — too much fat for the amount of dry ingredients, which produced cookies that spread flat and greasy rather than holding any structure. Both issues point to the same underlying problem: AI generates recipes by pattern-matching on text, not by understanding the chemistry of what happens when ingredients interact under heat.

A CNN investigation into AI-generated baking recipes documented similar failures, including a recipe that was physically impossible to execute as written — the steps assumed a shape that the ingredient couldn't produce. The author, an experienced baker, worked through it several times before concluding it simply didn't work.

Baking is chemistry. A 10% error in flour quantity changes texture. A wrong temperature by 25°F changes crust. A missed step in the mixing order changes structure. AI doesn't test recipes before generating them, and that gap shows most clearly here.

Portion and Timing Accuracy

Across cooking recipes, timing instructions were consistently optimistic. "Sauté the onions until translucent, about 3 minutes" — it's 7 minutes if you're actually doing it right, closer to 10 if the pan was cold. "Braise for 2 hours until tender" — the short ribs needed closer to 3 hours at the specified temperature.

This isn't random error. AI generates timing from what it's seen written down, and recipe writing systematically underestimates prep and cook times because it's written by people who already know what "nearly done" looks like. That institutional bias transfers to AI output.

Portion sizes also drifted. "Serves 4" on a pasta dish produced what was generously 2 adult servings if served as a main. On the stir-fry, "serves 4" was closer to accurate. The inconsistency suggests portion calibration isn't reliable without context about serving style.

Complex Multi-Step Techniques

The braised short rib recipe was structurally sound — sear, mirepoix, deglaze, braise — but missed technique details that matter for the result. No mention of patting the meat dry before searing (critical for proper browning). No warning that the braising liquid should barely simmer, not boil (boiling makes the meat tough). No note about resting before serving.

These aren't optional tips. They're the difference between restaurant-quality braised short rib and something that's cooked through but chewy. A recipe that omits them isn't wrong, exactly — it just sets you up to make avoidable mistakes.

The homemade pasta dough was worse. The recipe listed correct ingredients and ratios, then described the kneading process as "knead until smooth, about 5 minutes." Pasta dough requires 8–10 minutes of firm kneading minimum to develop proper gluten structure, and the endpoint isn't "smooth" — it's specific tactile feedback that a text description can't capture. Following it exactly produced dough that tore during rolling.

Unusual Ingredient Combinations

The fridge-clearing prompt ("half-head of cabbage, leftover rice, fish sauce, whatever else") actually worked. The suggestion — a fried rice variant with cabbage, fish sauce, scrambled egg, and scallions — was reasonable and produced a genuinely good result.

Where unusual combinations failed was in less familiar territory. A prompt asking for something interesting with preserved lemon, tahini, and white beans produced a recipe that was technically functional but flavor-wise confused — the preserved lemon dominated everything else, and no adjustment to quantity would have fixed a structural balance problem. The recipe was plausible on paper, but clearly hadn't been tasted.


Which AI Generates the Best Recipes?

General-Purpose AI vs Dedicated Recipe Tools

For cooking (not baking), ChatGPT and Claude produced consistently better results than DishGen in this testing. The general-purpose models handle mid-conversation adjustments better — "make this with less sodium" or "I don't have white wine, what else works?" — and add technique context more naturally.


DishGen was faster and more format-consistent — it outputs a structured recipe immediately without any back-and-forth. For simple, single-prompt generation, that speed is useful. For anything with constraints or follow-up adjustments, the general models are more practical.


Quick Comparison of Output Quality

| Scenario | ChatGPT | Claude | DishGen |
| --- | --- | --- | --- |
| Weeknight cooking | ✅ Reliable | ✅ Reliable | ✅ Reliable |
| Dietary substitutions | ✅ Strong | ✅ Strongest (adds technique notes) | ⚠️ Occasional slips |
| Baking | ❌ Unreliable | ❌ Unreliable | ❌ Unreliable |
| Fridge-clearing | ✅ Good | ✅ Good | ✅ Good |
| Complex techniques | ⚠️ Incomplete | ⚠️ Incomplete | ❌ Thin |
| Timing accuracy | ⚠️ Optimistic | ⚠️ Optimistic | ⚠️ Optimistic |

No tool escapes the baking problem. It's not a ChatGPT issue or a DishGen issue — it's an AI issue.


When to Trust an AI Recipe (and When Not To)

Trust it for: Weeknight cooking with standard proteins and vegetables. Ingredient substitution suggestions, especially for dietary restrictions. Generating ideas when you have an unusual combination of leftovers. Quick flavor profile guidance ("what herbs go with this?").

Verify before committing to: Any timing instruction — add 30–50% to prep estimates and check the dish rather than trusting the clock. Portion sizes — if it says "serves 4" and you're feeding four adults as a main, make more than you think you need.

Don't follow without a tested source: Baking recipes. Anything requiring specific ratios (bread hydration, leavening amounts, custard ratios). Techniques you've never done before and won't recognize when they go wrong. Food safety-critical steps — NPR's reporting on AI recipe safety issues documented cases where AI-generated recipes produced dangerous results from unsafe combinations, including one that generated chlorine gas from a prompted ingredient list.


The honest rule: use AI recipes the way you'd use a recipe from a stranger on the internet. It might be good. It might need adjustment. It might not work at all. Treat it as a starting point that requires your own judgment to execute, not a tested source you can follow blindly.


Verdict

For weeknight cooking, fridge-clearing improvisation, and dietary substitutions, AI-generated recipes are genuinely useful — often enough that they've become part of how I actually cook. The results aren't perfect, but they're good enough often enough to be worth trying.

For baking, don't trust them without a cross-reference. The failure mode is specific and consistent: ratio errors and missing technique details that produce results nothing like what the recipe describes. If you're baking from an AI recipe, at minimum verify the flour-to-fat ratio against a known-good recipe for the same type of item before starting.

For anything complex — braised proteins, laminated pastry, fermented dough — use AI for inspiration and technique questions, not as your primary recipe. The gaps in what it knows to tell you are exactly the gaps that will derail a dish.

The technology is useful. It's just not tested, and that matters more in some contexts than others.


FAQ

Are AI recipes safe to follow?

For standard cooking, yes — the food safety risks are low. For unusual ingredient combinations, apply normal judgment: if the prompt involves anything you'd be cautious about (raw proteins, preservation, unfamiliar ingredients at high quantities), don't rely on AI to catch safety issues. As NPR reported, some AI recipe generators have produced genuinely dangerous output from irresponsible prompts. For everyday cooking, this isn't a practical concern, but it's worth knowing the limitation exists.

Which AI generates the most reliable recipes?

For cooking, Claude tends to add the most useful technique context unprompted, which makes its recipes more reliable in practice — not because the base recipe is more accurate, but because it flags more of what can go wrong. ChatGPT is comparable and slightly better at adapting mid-conversation. Neither is meaningfully better than the other on baking, where both fail at similar rates. Dedicated tools like DishGen are faster but less flexible for adjustments.



This article reflects hands-on recipe testing conducted in early 2026. Results will vary by tool version and prompt specificity.

Hey — I'm Jamie. I try the things that promise to make everyday life easier, then write honestly about what actually stuck. Not in a perfect week — in a normal one, where the plan fell apart by Thursday and you're figuring it out as you go. I've been that person. I write for that person.
