it is 8 am Sunday morning. i was about to open “Zen and the Art of Motorcycle Maintenance” — a book with one simple truth: if you truly care to understand something, you go slow. that’s not laziness. that’s how you avoid breaking things you can’t fix.
i didn’t get to the book. let me tell you why.
mid-week, power cut in my building. everything stopped — laptop, internet, work. i sat there useless. and something settled in: this is what invisible infrastructure feels like when it’s gone. i work in cloud. i don’t touch bare metal. but right then, i felt it in the most physical way possible.
that feeling stayed with me. and it connected to something else i’d been thinking about.
because this week i also saw a post from an infrastructure engineer.
they let an agentic AI tool run terraform apply on their behalf. production database — gone.
no rollback. no plan to recover. just gone.
reading that felt exactly like sitting in my dark apartment. that same sudden, total stop.
except this time, it wasn’t weather. it was a choice someone made.
here’s what i keep thinking about.
a city’s electrical grid and a multi-account AWS architecture are the same system wearing different clothes.
both start with a central hub — a substation, a Transit Gateway — and fan out through distribution layers into isolated zones. both have a monitoring plane: SCADA watches telemetry and triggers breakers; CloudWatch and GuardDuty do the same. both use isolation as protection: protection relays cut off a faulty segment; Security Groups and NACLs do the same for compromised workloads.
a city of 25,000 people needs 300–800 distribution transformers. each one has its own protection scheme, its own metering, its own load profile. a 100-account AWS org has thousands of security group rules, route table entries, IAM bindings. each one is a potential blast radius.
both systems fail the same way: not because one component breaks, but because relationships between components break. a misconfigured relay cascades into a district blackout. a missing Transit Gateway route silently black-holes traffic for an entire business unit.
now: would you let an AI autonomously manage the city power grid?
in terraform, the blast radius isn’t abstract.
lifecycle { prevent_destroy = true } is your circuit breaker — the guard that stops the current.
module optionality — the ability to conditionally enable or disable a module — is your protection relay.
your modules themselves are the substations: foundational, structural, load-bearing.
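here’s roughly what that circuit breaker looks like in practice. a minimal sketch, assuming the AWS provider: the resource name, identifier, sizes and the db_password variable are all made up for illustration.

```hcl
# hypothetical variable, only here so the sketch is self-contained
variable "db_password" {
  type      = string
  sensitive = true
}

# hypothetical production database
resource "aws_db_instance" "orders" {
  identifier        = "orders-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  username          = "app"
  password          = var.db_password

  # second breaker, enforced by the RDS API itself rather than by terraform
  deletion_protection = true

  lifecycle {
    # any plan that would destroy this resource fails with an error
    prevent_destroy = true
  }
}
```

worth knowing where this breaker stops working: per the terraform docs, prevent_destroy only protects a resource while its block is still in the configuration. delete the block and the guard goes with it. that’s why the sketch also sets deletion_protection, which is checked on the AWS side.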
one innocent line:
enable = false
in your centralized backup module — and you’ve silently removed the data safety net for your entire organisation.
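to make that concrete, here’s a sketch of how such a toggle is often wired. this is an assumption about the module’s internals, not a real module: the path, the resource names and the schedule are hypothetical, and the enable variable just mirrors the line above.

```hcl
# --- modules/central-backup/main.tf (hypothetical module) ---

variable "enable" {
  type    = bool
  default = true
}

resource "aws_backup_vault" "central" {
  # the whole resource exists only while enable is true
  count = var.enable ? 1 : 0
  name  = "central-backup-vault"
}

resource "aws_backup_plan" "daily" {
  count = var.enable ? 1 : 0
  name  = "daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.central[0].name
    schedule          = "cron(0 3 * * ? *)"
  }
}

# --- root configuration: the one innocent line lives here ---

module "central_backup" {
  source = "./modules/central-backup" # hypothetical path
  enable = false
}
```

with enable flipped to false, count goes from 1 to 0 and the plan lists the vault and the backup plan as resources to destroy. the diff in the pull request is one line. the diff in the plan is the whole module.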
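run interactively, terraform apply stops and waits for you to type yes. run with -auto-approve, which is how pipelines and agents typically run it, it doesn’t ask if you’re sure. it does what you told it to do.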
the neurosurgeon doesn’t hand over the drill because AI could finish the operation in 30 minutes instead of six hours. speed is not the measure. consequence is.
i’ve worked with terraform for about nine years. here’s what that experience actually teaches you:
plan output is not a guarantee of what apply will produce.
sometimes apply fails after a perfectly clean plan.
the network is unreliable by definition.
asynchronous operations can leave real resources out of sync with what state says.
this isn’t a criticism of terraform. it’s how complex systems work. Murphy’s Law doesn’t care about your CI/CD pipeline. everything that can go wrong, will go wrong.
the platform engineer, like the grid operator, is invisible infrastructure. you’re what enables everyone else’s work. you only become visible when something breaks.
so what does this mean for AI?
it doesn’t mean don’t use it. it means: don’t relinquish the switch.
the primary directive is simple: disable autonomous terraform apply in any agentic mode.
plan output requires human eyes before apply runs. always.
that one constraint removes the worst failure mode: a destructive change that runs before anyone has read it.
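one way to make that constraint structural rather than a promise, sketched under assumptions: give whatever identity the agent runs as a role that can read but not change. the variable and role names here are hypothetical; ReadOnlyAccess is the AWS-managed policy.

```hcl
# hypothetical: the ARN of whatever identity the AI agent authenticates as
variable "agent_principal_arn" {
  type = string
}

# trust policy: only the agent's principal may assume the role
data "aws_iam_policy_document" "agent_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = [var.agent_principal_arn]
    }
  }
}

resource "aws_iam_role" "agent_plan_only" {
  name               = "agent-plan-only" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.agent_trust.json
}

# read-only permissions: enough to refresh and plan, not enough to apply
resource "aws_iam_role_policy_attachment" "agent_read_only" {
  role       = aws_iam_role.agent_plan_only.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

with only this role, terraform plan should still be able to refresh and diff most resources, but the create, update and delete calls an apply would make are denied by IAM. two caveats: ReadOnlyAccess is broad on the read side, and depending on your backend, state locking may need a narrow extra permission.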
there are workflows where AI genuinely helps in infrastructure work — code generation, plan review, module scaffolding, drift detection analysis. none of them require autonomous execution.
i’ll go into specifics in the next post: what those workflows look like, and how to structure the boundaries.
but the rule of thumb fits on one line: AI reads and suggests. humans decide and apply.
because if you call yourself an infrastructure engineer, you carry the weight of the grid. act like it.