it is 8 am Sunday morning.

i was about to open “Zen and the Art of Motorcycle Maintenance.” the book has one quiet truth: if you truly care to understand something, you go slow. not because you’re lazy. because that’s how you avoid breaking things you can’t fix. the fast-and-careless move belongs to someone who doesn’t plan to be around for the consequences.

i didn’t get to the book. let me tell you why.


mid-week, a power cut hit my building. everything stopped — laptop, internet, work. i sat there with nothing to do and nowhere to go. and something settled in, the way things do when you’re forced to sit still.

this is what invisible infrastructure feels like when it’s gone.

i work in the cloud. i don’t touch bare metal. but right then, i felt it in the most physical way possible. the moment electricity disappeared, my whole professional world disappeared with it. not metaphorically. literally.

that feeling stayed with me. and it connected to something i’d been thinking about for weeks.


because this week i also saw a post from an infrastructure engineer.

they let an agentic AI tool run terraform apply on their behalf. production database — gone. no rollback. no recovery path. just gone.

reading that felt exactly like sitting in my dark apartment. that same sudden, total stop. except this time, it wasn’t weather.

it was a choice someone made.


here’s what i keep coming back to.

a city’s electrical grid and a multi-account AWS architecture are the same system wearing different clothes.

both start with a central hub — a substation, a Transit Gateway — and fan out through distribution layers into isolated zones. both have a monitoring plane: SCADA watches telemetry and triggers breakers; CloudWatch and GuardDuty do the same. both use isolation as protection: protection relays cut off a faulty segment; Security Groups and NACLs do the same for compromised workloads.

a city of 25,000 people needs 300–800 distribution transformers. each one has its own protection scheme, its own metering, its own load profile. a 100-account AWS org has thousands of security group rules, route table entries, IAM permission bindings. each one is a potential blast radius.

both systems fail the same way. not because one component breaks — because the relationships between components break. a misconfigured protection relay cascades into a district blackout. a missing route in a Transit Gateway route table silently black-holes traffic for an entire business unit.
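the missing-route failure is worth seeing in code. a minimal sketch, assuming a Transit Gateway route table managed in terraform — the names, CIDRs, and IDs here are illustrative, not from a real environment:

```hcl
# one route resource stands between "reachable" and "black-holed".
resource "aws_ec2_transit_gateway_route" "to_shared_services" {
  destination_cidr_block         = "10.20.0.0/16"
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.spokes.id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared.id
}

# delete this resource -- or never write it -- and nothing errors.
# the plan is small, the apply succeeds, and every spoke behind this
# route table quietly loses its path to 10.20.0.0/16.
# traffic just stops arriving.
```

that’s the black hole: not a failure event, just an absent relationship between components.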

now ask yourself: would you let an AI autonomously manage the city power grid?


in terraform, the blast radius isn’t abstract.

lifecycle { prevent_destroy = true } is your circuit breaker — the guard that stops the current from flowing somewhere destructive. module optionality, the ability to conditionally enable or disable a module, is your protection relay. your modules themselves are the substations: foundational, structural, load-bearing.
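the circuit breaker is one block of code. a minimal sketch — the database resource and its arguments are illustrative, not a full working configuration:

```hcl
# hypothetical production database; names are illustrative
resource "aws_db_instance" "prod" {
  identifier     = "prod-main"
  engine         = "postgres"
  instance_class = "db.r6g.large"
  # ... other required arguments elided ...

  lifecycle {
    # the circuit breaker: any plan that would destroy this
    # resource fails with an error instead of queueing a delete.
    prevent_destroy = true
  }
}
```

note what this buys you: the failure happens at plan time, loudly, before anything irreversible runs — exactly where a protection relay should trip.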

consider one innocent line:

```hcl
enable = false
```

put that in your centralized AWS Backup module. you’ve just silently removed the data safety net for your entire organisation. no alarms. no warnings. terraform will show a few destroys buried in an otherwise routine plan. the apply will succeed. the backup jobs will stop running. you’ll find out three months later when you need a restore.
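here’s how that one flag cascades, sketched with a typical count-based toggle inside the module. the variable and resource names are illustrative, and the module is simplified to two resources:

```hcl
variable "enable" {
  type    = bool
  default = true
}

resource "aws_backup_vault" "this" {
  count = var.enable ? 1 : 0
  name  = "org-backup-vault"
}

# when enable flips to false, count drops to 0 and terraform plans
# a destroy for the vault and the plan -- two lines of "to destroy"
# buried in whatever else that apply happens to be doing.
resource "aws_backup_plan" "this" {
  count = var.enable ? 1 : 0
  name  = "org-daily-backups"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.this[0].name
    schedule          = "cron(0 3 * * ? *)"
  }
}
```

the toggle is legitimate — module optionality is your protection relay. the danger is that disabling and destroying are the same operation, expressed as one boolean.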

terraform apply doesn’t ask if you’re sure. it does what you told it to do.

the neurosurgeon doesn’t hand over the drill because AI could finish the operation in 30 minutes instead of six hours. speed is not the measure. consequence is.


i’ve worked with terraform for about nine years. here’s what that actually teaches you:

plan output is not a guarantee of what apply will produce. i’ve watched perfectly clean plans fail on apply — sometimes due to provider quirks, sometimes due to race conditions, sometimes because the real world had drifted from the state file in ways terraform couldn’t see until it tried to act.

the network is unreliable by definition. asynchronous operations can leave real resources inconsistent with the state file. these aren’t bugs. they’re properties of distributed systems.

Murphy’s Law doesn’t care about your CI/CD pipeline. everything that can go wrong, will go wrong. the only question is whether a human was watching when it happened — or whether the agentic tool had already moved on to the next action in the queue.

the platform engineer, like the grid operator, is invisible infrastructure. you’re what enables everyone else’s work. you only become visible when something breaks. when you’re invisible, that means you’re doing it right. and doing it right requires judgment — the kind that comes from understanding the system, not just executing commands against it.


so what does this actually mean for AI?

it doesn’t mean don’t use it.

it means: don’t relinquish the switch.

one constraint eliminates most of the blast radius: disable autonomous terraform apply in any agentic mode. plan output requires human eyes before apply runs. always. that’s it. that’s the line.

there are workflows where AI genuinely helps in infrastructure work — code generation, plan review, module scaffolding, drift detection analysis. none of them require autonomous execution. the value is in the suggestion, the review, the second pair of eyes that never gets tired. not in the execution.

i’ll go into the specifics in the next post: what those workflows look like, and how to structure safe boundaries for each one.

but the rule fits on one line:

AI reads and suggests. humans decide and apply.

because if you call yourself an infrastructure engineer, you carry the weight of the grid. the lights are on for your users because you made good decisions, in sequence, with attention.

act like it.