
Hey fellow code-workspace builders — if you've ever set up an automation and spent the next week wondering if it's quietly breaking things, this one's for you.
I've been running scheduled automations on my own repos for three weeks. Not demos. Real commits, real PRs, real edge cases I didn't predict.
I don't trust them blindly. I run them because I've built enough guardrails to catch drift, and I needed to know: can this pattern survive without me babysitting every run?
Three weeks in, here's what stuck: what actually worked, what broke, and the setup that's still running without constant supervision.
I started by mixing up skills and automations. Skills felt like automations, automations seemed like skills with timers. After running both for a few weeks, here's the actual difference:
Skills are reusable workflows. They're bundles of instructions, scripts, and context that teach Codex how to do something your way—whether that's deploying to Vercel, converting Figma designs to code, or running your lint standards. Once you create a skill, it works across the Codex app, CLI, and IDE extensions. You can check skills into your repository so your entire team uses the same pattern.
Automations are scheduled runs. They combine instructions (and optionally skills) with a schedule. When an automation finishes, results land in a review queue. That's where you decide if it worked, if it needs adjustment, or if you want to continue the work.

The first time I set up an automation, I assumed it would "just work." It ran fine, generated a report, then broke on the next run because dependencies had changed. That's when I realized: automations need skills underneath them to stay stable, and skills need real-world testing to be trustworthy.
Here's a pattern that worked:
```yaml
# Example: Daily dependency check automation
schedule: "0 9 * * *"  # 9 AM daily
instructions: |
  Check for outdated npm packages.
  Use the dependency-audit skill.
  Generate a summary with version deltas.
  If critical vulnerabilities exist, flag for immediate review.
skills:
  - dependency-audit
review_queue: true
```
The `review_queue: true` flag is critical. Without it, the automation would commit changes directly. With it, results sit in the queue until I decide what to do.
The safest pattern I've found for automations comes down to approval gates. I've run automations that generated clean reports, and I've run automations that tried to delete entire directories. The difference? Gates at the right places.
Here's where I always put human review:
Anything that writes to main or production? Always gated.

Here's a configuration example for a safe automation setup:
```toml
# config.toml - Project-level rules

[automations.ci-summary]
schedule = "0 */4 * * *"  # Every 4 hours
read_only = true
allowed_commands = [
  "npm test",
  "git log --oneline -10"
]
review_required = true

[automations.dependency-update]
schedule = "0 2 * * 1"  # Monday 2 AM
read_only = false
allowed_commands = [
  "npm outdated",
  "npm update --dry-run"
]
elevated_permissions = [
  "npm install"  # Requires approval
]
review_required = true
```
Notice the `read_only` and `elevated_permissions` fields. These control what the automation can do without asking.
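To make the effect of those fields concrete, here's a minimal TypeScript sketch of the gating behavior I want from any automation runner. This is not Codex's actual implementation: the field names mirror the config above, but the types and the `decideCommand` helper are my own illustration.

```typescript
// Sketch only: how read_only / allowed_commands / elevated_permissions
// could translate into a per-command decision. Hypothetical types, not Codex internals.
type Decision = "run" | "needs-approval" | "reject";

interface AutomationPolicy {
  readOnly: boolean;
  allowedCommands: string[];
  elevatedPermissions: string[];
}

function decideCommand(policy: AutomationPolicy, command: string): Decision {
  // Commands on the elevated list always wait for a human.
  if (policy.elevatedPermissions.includes(command)) return "needs-approval";
  // Anything explicitly allowed runs unattended.
  if (policy.allowedCommands.includes(command)) return "run";
  // A read-only automation gets nothing else; a writable one still asks first.
  return policy.readOnly ? "reject" : "needs-approval";
}

// Example: the dependency-update policy from the config above.
const dependencyUpdate: AutomationPolicy = {
  readOnly: false,
  allowedCommands: ["npm outdated", "npm update --dry-run"],
  elevatedPermissions: ["npm install"],
};

console.log(decideCommand(dependencyUpdate, "npm outdated")); // "run"
console.log(decideCommand(dependencyUpdate, "npm install"));  // "needs-approval"
console.log(decideCommand(dependencyUpdate, "rm -rf dist"));  // "needs-approval"
```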

I asked around: what automations do people actually keep running? Three patterns came up repeatedly, and I've tested all three in my own workflow.
The first is a test summary. It runs tests and linters after every commit, then summarizes failures in the review queue. It doesn't auto-fix anything; it just tells you what broke and why.
Why this works: It's read-only by design. The worst thing it can do is generate a noisy report.
Setup example:
```yaml
# .codex/automations/test-summary.yaml
schedule: "*/15 * * * *"  # Every 15 minutes
instructions: |
  Run npm test and npm run lint.
  If failures occur, summarize:
    - Which tests failed
    - Error messages
    - Files that triggered the failures
  Generate a concise report in the review queue.
skills:
  - test-runner
review_queue: true
```
I've been running this for two weeks. It catches regressions fast, and because it only reads output, there's no risk of it breaking anything.
The second is a dependency audit. It checks for outdated dependencies and security vulnerabilities, then generates a report with version deltas and CVE details.
Why this works: It surfaces actionable data without making changes. You decide which updates to apply.
Setup example:
```yaml
# .codex/automations/dependency-audit.yaml
schedule: "0 9 * * 1"  # Monday 9 AM
instructions: |
  Run npm outdated and npm audit.
  Summarize:
    - Packages with available updates
    - Security vulnerabilities (with severity)
    - Recommended actions
  Flag critical vulnerabilities for immediate review.
skills:
  - dependency-checker
review_queue: true
```
I run this weekly. The reports are consistent, and I've caught two critical vulnerabilities before they hit production.
The third is a documentation generator. It builds API documentation from inline code comments, then commits the output to a docs/ branch for review.
Why this works: It operates on a separate branch, so it can't break the main codebase. You review the generated docs before merging.
Setup example:
```yaml
# .codex/automations/doc-generator.yaml
schedule: "0 0 * * 5"  # Friday midnight
instructions: |
  Scan all .js files for JSDoc comments.
  Generate API documentation using the doc-generator skill.
  Commit output to the docs-branch branch.
  Create a PR for review.
skills:
  - doc-generator
review_queue: true
branch: "docs-branch"
```
This has saved me hours of manual doc updates. The quality isn't perfect, but it's good enough that I only need to fix edge cases.

Automations break. Skills drift. The question is: how do you know when it happens?
I track three metrics:
The first is false positive rate. Every time an automation flags something as broken when it's actually fine, that's a false positive. I track this manually: if I review an output and think "this is wrong," I mark it.
Target: < 5% false positive rate. If it's higher, the automation isn't ready for daily use.
How to measure: Review the first 20 runs. Count how many outputs were incorrect. If more than 1 in 20 is wrong, adjust the instructions or skills.
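If you want the 1-in-20 rule to be mechanical, a few lines of TypeScript are enough. This is a sketch of my own bookkeeping, assuming a hand-maintained review log; `ReviewedRun`, `reviewLog`, and the 5% check are illustrative, not part of any Codex API.

```typescript
// Hypothetical review log: one entry per automation run I reviewed by hand.
interface ReviewedRun {
  automation: string;
  markedWrong: boolean; // true if I looked at the output and thought "this is wrong"
}

const reviewLog: ReviewedRun[] = [
  { automation: "test-summary", markedWrong: false },
  { automation: "test-summary", markedWrong: true },
  { automation: "test-summary", markedWrong: false },
];

// False positive rate over the first N reviewed runs (default 20).
function falsePositiveRate(runs: ReviewedRun[], sample = 20): number {
  const window = runs.slice(0, sample);
  if (window.length === 0) return 0;
  const wrong = window.filter((r) => r.markedWrong).length;
  return wrong / window.length;
}

const rate = falsePositiveRate(reviewLog);
if (rate > 0.05) {
  console.log(`FP rate ${(rate * 100).toFixed(0)}% - adjust the instructions or skills`);
}
```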
The second is drift: the automation starts producing different outputs for the same inputs. This usually means dependencies have changed, or the underlying model is behaving differently.
Signal: Compare this week's outputs to last week's. If the format or content has shifted, investigate.
Fix: Pin dependencies in your skills. Use explicit version numbers in package.json and skill configurations.
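One cheap way to catch drift before you notice it by eye is to hash a normalized copy of each report and compare it to the previous run for the same inputs. A sketch under the assumption that reports are plain text; the `fingerprint` helper and its normalization rules are mine and would need tuning per automation.

```typescript
import { createHash } from "node:crypto";

// Strip the parts of a report that are allowed to change run-to-run
// (timestamps, durations), so the hash only moves when the substance does.
function fingerprint(report: string): string {
  const normalized = report
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<timestamp>")
    .replace(/\b\d+(\.\d+)?\s?ms\b/g, "<duration>");
  return createHash("sha256").update(normalized).digest("hex");
}

// Compare this week's output against last week's for the same inputs.
function hasDrifted(lastWeek: string, thisWeek: string): boolean {
  return fingerprint(lastWeek) !== fingerprint(thisWeek);
}

const previousReport = "3 packages outdated (checked 2024-01-08T09:00:00Z)";
const currentReport = "3 packages outdated (checked 2024-01-15T09:00:00Z)";

if (hasDrifted(previousReport, currentReport)) {
  console.log("Output drift detected - check dependencies and skill versions");
}
```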
The third is regressions. If an automation ran successfully for weeks and then suddenly fails, that's a regression, usually triggered by a change in the codebase, a dependency update, or a shift in how the underlying model behaves.
Tracking: Keep a log of every automation run. If a failure occurs, diff the current codebase against the last successful run to find the trigger.
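Here's roughly what that log looks like for me, reduced to a sketch: record the commit SHA on every run, and when a run fails, diff against the last SHA that succeeded. The log path, the `RunRecord` shape, and both helpers are my own conventions, nothing Codex prescribes.

```typescript
import { execSync } from "node:child_process";
import { appendFileSync, readFileSync } from "node:fs";

const LOG_PATH = ".codex/run-log.jsonl"; // hypothetical location

interface RunRecord {
  automation: string;
  commit: string;
  succeeded: boolean;
  timestamp: string;
}

// Append one record per automation run.
function logRun(automation: string, succeeded: boolean): void {
  const record: RunRecord = {
    automation,
    commit: execSync("git rev-parse HEAD").toString().trim(),
    succeeded,
    timestamp: new Date().toISOString(),
  };
  appendFileSync(LOG_PATH, JSON.stringify(record) + "\n");
}

// On a failure, show what changed since the last successful run.
function diffSinceLastSuccess(automation: string): string {
  const records: RunRecord[] = readFileSync(LOG_PATH, "utf8")
    .trim()
    .split("\n")
    .map((line) => JSON.parse(line));
  const lastGood = records
    .reverse()
    .find((r) => r.automation === automation && r.succeeded);
  if (!lastGood) return "No successful run on record.";
  return execSync(`git diff --stat ${lastGood.commit} HEAD`).toString();
}
```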
The reliability tracking table I use is simple: one row per automation, with its false positive rate, whether drift has shown up, and a status of Stable or Needs Tuning.
If an automation stays in "Needs Tuning" for more than two weeks, I either fix it or remove it. No point running something unreliable.
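In code form, that table is nothing more than this shape plus the rule from the paragraph above. The field names, the `Status` values, and the two-week cutoff as written here are my own framing, not a Codex feature.

```typescript
type Status = "stable" | "needs-tuning" | "retired";

// One row per automation in the reliability table (hypothetical shape).
interface ReliabilityRow {
  automation: string;
  falsePositiveRate: number; // from the review queue, first ~20 runs
  driftDetected: boolean;    // output fingerprint changed for the same inputs
  needsTuningSince?: string; // ISO date the row first slipped below target
}

const TWO_WEEKS_MS = 14 * 24 * 60 * 60 * 1000;

function evaluate(row: ReliabilityRow, now = new Date()): Status {
  const healthy = row.falsePositiveRate <= 0.05 && !row.driftDetected;
  if (healthy) return "stable";
  // Unreliable for more than two weeks: fix it or remove it.
  if (
    row.needsTuningSince &&
    now.getTime() - Date.parse(row.needsTuningSince) > TWO_WEEKS_MS
  ) {
    return "retired";
  }
  return "needs-tuning";
}
```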

After three weeks of real use, those three patterns are what survived.
The biggest surprise? Automations didn't replace my workflow—they compressed the repetitive parts so I could focus on the decisions that actually matter.
At Macaron, we're watching how agents handle delegation differently than traditional tools. Skills and automations turn conversations into workflows that run on their own—but only when the review layer stays tight. If you're building systems that need to survive more than a demo, test them like this: schedule them, let them fail, then build the guardrails that keep them stable.