Configuration as Code
Treat system configuration (servers, networks, databases, secrets, Kubernetes manifests, monitoring rules) as source files: version-controlled, reviewed, tested, automatically applied.
Why?
Complicated setups invite mistakes and skipped steps. A wiki page titled “How to install AwesomeApp” fails because humans forget. Make the process automatic so there is no opportunity for mistakes.
Ingredients:
- Declarative: you write the desired state, not the commands, the tool figures out the diff
- Idempotent: running it twice lands in the same state, no “I already created that” failures
- Versioned: every change is a commit with diff, author, and reason; rollback is `git revert` + reapply
- Reviewed: infra changes go through PR review, just like app code
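The declarative + idempotent combination can be sketched in a few lines of Python: a hypothetical `ensure_file` resource that converges a file to its desired content. You state the result, the tool computes the diff, and a second run is a no-op.

```python
import os
import tempfile

def ensure_file(path: str, desired: str) -> bool:
    """Converge `path` to `desired` content. Returns True if a change was made.

    Idempotent: running it twice lands in the same state; the second run
    detects the state already matches and does nothing.
    """
    try:
        with open(path) as f:
            current = f.read()
    except FileNotFoundError:
        current = None
    if current == desired:
        return False              # already converged: nothing to do
    with open(path, "w") as f:    # apply the diff
        f.write(desired)
    return True

# First run applies the change; second run is a no-op.
path = os.path.join(tempfile.mkdtemp(), "app.conf")
print(ensure_file(path, "port = 8080\n"))  # True  (diff applied)
print(ensure_file(path, "port = 8080\n"))  # False (idempotent no-op)
```

Note there is no “I already created that” failure: existing state that matches is simply accepted.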
Tools:
- Cloud provisioning: Terraform, Pulumi, CloudFormation, AWS CDK
- Server config: Ansible, Puppet, Chef, Salt (Ansible dominates)
- Container orchestration: Kubernetes YAML + Helm / Kustomize
- Secrets: sealed-secrets, SOPS, Vault
Why it’s non-negotiable at scale:
- Reproducibility: rebuild a cluster from scratch with one command
- Auditing: regulators, security reviews
- Drift detection: tools can flag manual (“ClickOps”) changes
- Blast radius: PR review catches the `rm -rf` before it runs
The anti-pattern is “some poor soul’s SSH history”: a server configured manually, with institutional knowledge in one person’s head. That server is an outage waiting to happen.
From ECE459 L34
Sendmail, Apache, and nginx configs are notoriously complicated. The environment is also configuration. Principles:
- Version control configuration alongside code
- Code review changes to config
- Test configs: do they produce expected files, services, or behaviours?
- Aim for modular services that compose
- Refactor Puppet manifests and Chef recipes like regular code
- Run through continuous builds
Convergence
Unlike app code, config for a long-running service won’t “terminate”; instead it should converge: after an initial burst of changes, the work (and CPU usage) of applying them trends downward toward a steady state.
Terraform
Canonical example for cloud infra. You declare desired state (services, permissions, GitHub repo access, group membership) and `terraform apply` converges on it.
`terraform plan` shows a dry run, including the expected cost delta, guarding both against a $10k/day surprise and against a “small” change turning out huge. Caveats: state can change between plan and apply, and unique IDs are only known after a resource is created.
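The plan/apply split can be sketched in Python (a toy stand-in for the real tool; resource names are made up): plan is a pure diff of desired state against recorded state, and apply replays that diff.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compute the diff a tool like Terraform would show before applying."""
    return {
        "create": {k: desired[k] for k in desired.keys() - actual.keys()},
        "destroy": sorted(actual.keys() - desired.keys()),
        "update": {k: desired[k] for k in desired.keys() & actual.keys()
                   if desired[k] != actual[k]},
    }

def apply(desired: dict, actual: dict, diff: dict) -> dict:
    # Caveat from the notes: real state can drift between plan and apply;
    # this sketch assumes it has not.
    new_state = {k: v for k, v in actual.items() if k not in diff["destroy"]}
    new_state.update(diff["create"])
    new_state.update(diff["update"])
    return new_state

desired = {"web": {"size": "m5.large"}, "db": {"size": "db.r5.xlarge"}}
actual  = {"web": {"size": "m5.small"}, "old_worker": {"size": "t2.micro"}}
diff = plan(desired, actual)
print(diff["destroy"])          # ['old_worker'] -- a destructive change to review!
print(apply(desired, actual, diff) == desired)  # True: state converged
```

The same diff doubles as drift detection: anything in `actual` that plan wants to change is a manual (“ClickOps”) edit someone made outside the tool.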
Destructive changes
A misconfigured plan once destroyed GitHub groups, which some members mistook for a sign they were being fired. Reverting the PR re-creates the group but not its membership. A destroyed database comes back empty. You took backups, right?
Common infrastructure
Treat each piece of infra as having an interface, communicate only through it. Reduces coupling, lets you scale parts independently.
Avoid not-invented-here syndrome. Prefer existing tools:
- Storage: MongoDB, S3
- Naming / discovery: Consul
- Monitoring: Prometheus
Roll-your-own
Roll your own only when it’s your core competence. Never roll your own crypto or security without experts plus external audits. If the tool you want doesn’t exist, either you’re first or it’s a bad idea. Big platforms (AWS) launch services constantly, so patience can be cheaper than effort.
Naming
“Two hard things in computing: cache invalidation, naming things, and off-by-one errors.”
Debate: meaningful vs. meaningless names. billing describes the service but clashes with the team name and the replacement service (billing2?). Made-up mythological names carry morale value (“Avengers, assemble”).
The real fix: a directory tool (OpsLevel) that maps names to owners, docs, and maturity info (deprecations, security patches, test coverage). Jeff saw teams “X Infrastructure” vs. “X Operations” where ~35% of queries bounced between the two.
Servers as cattle, not pets
Reproducible deployment beats manual SSH surgery. Minimize manual intervention toward zero, which enables auto-scaling.
Example: Kubernetes handles deploys, rollouts and reverts, placement, health checks, replacement of dead instances, and load balancing.
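What Kubernetes does for health checks and replacement can be sketched as a reconciliation loop (hypothetical Python, not the real API): compare the declared replica count against observed pod health and emit corrective actions.

```python
def reconcile(desired_replicas: int, pods: list) -> list:
    """One pass of a Kubernetes-style control loop (sketch): replace dead
    instances and scale toward the declared replica count."""
    actions = []
    healthy = [p for p in pods if p["status"] == "healthy"]
    for p in pods:
        if p["status"] != "healthy":
            actions.append(f"delete {p['name']}")      # dead instance: remove it
    for _ in range(desired_replicas - len(healthy)):
        actions.append("create replacement pod")       # converge on the count
    return actions

pods = [{"name": "web-1", "status": "healthy"},
        {"name": "web-2", "status": "crashloop"}]
print(reconcile(3, pods))
# ['delete web-2', 'create replacement pod', 'create replacement pod']
```

No human SSHed anywhere: the loop runs continuously, which is exactly what makes the instances cattle.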
Canarying (“test in prod”)
Two flavours:
- Deploy new alongside old, route small fraction of traffic to new, grow or shrink based on metrics
- Upgrade a subset of instances, run tests, reintroduce
Subset flow: stage → remove canaries from service → upgrade → automated tests → reintroduce → watch.
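The first flavour, grow-or-shrink by metrics, might look like this sketch (the thresholds and error-rate numbers are illustrative, not from the notes):

```python
import random

def route(canary_fraction: float) -> str:
    """Send a small, adjustable fraction of traffic to the canary build."""
    return "canary" if random.random() < canary_fraction else "stable"

def adjust(fraction: float, canary_error_rate: float,
           baseline_error_rate: float) -> float:
    """Grow the canary's share while its metrics look healthy; shrink it to
    zero (full rollback) the moment it looks worse than the baseline."""
    if canary_error_rate <= baseline_error_rate * 1.1:
        return min(1.0, fraction * 2)   # healthy: double the share
    return 0.0                          # unhealthy: pull the canary

f = 0.01
for errors in (0.002, 0.002, 0.003):    # three healthy observation windows
    f = adjust(f, errors, baseline_error_rate=0.003)
print(round(f, 2))  # 0.08 -- canary share grew from 1% to 8%
```

The key property is that rollback is cheap at every point: the stable fleet is still serving the bulk of traffic until the canary has earned 100%.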
Design for rollback
A schema migration that’s destructive or shared across servers can make rollback impossible. “100k dev records vs. 10M prod records” is the classic trap: a migration that finishes instantly in dev can lock tables for hours in prod.
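One rollback-friendly approach is the expand/contract pattern, sketched here with SQLite (table and column names are made up): the “expand” step is purely additive, so both old and new server versions run against the same schema and a revert needs no schema change. Only after the rollout is proven does a later “contract” migration drop the old column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: additive, rollback-safe. Old code simply ignores the new columns.
conn.execute("ALTER TABLE users ADD COLUMN given_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN family_name TEXT")

# Backfill (in prod, do this in batches -- 10M rows is where a
# one-shot UPDATE starts holding locks for a long time).
conn.execute("""UPDATE users SET
                  given_name  = substr(fullname, 1, instr(fullname, ' ') - 1),
                  family_name = substr(fullname, instr(fullname, ' ') + 1)""")

row = conn.execute("SELECT given_name, family_name FROM users").fetchone()
print(row)  # ('Ada', 'Lovelace')
```

Contrast with the destructive version: `ALTER TABLE users DROP COLUMN fullname` in the same deploy means the old binary crashes on rollback.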
Containers
Evolution:
- Hand-installed binaries: painful
- Packages: dependencies + install script, but multiple services on one host produce RPM/JAR/classloader/DLL hell
- VMs: isolation at cost of per-app guest OS (install the same security patch 10 times)
- Containers: isolation with a shared kernel, built from a spec listing the needed libraries and tools, shared read-only where possible. Effectively a lightweight VM.
Build goal: a container ready for Kubernetes to deploy.
Don't skip upgrades
Containers make it possible to never upgrade. Before, OS patches delivered shared-library updates to every app on the host; now an image can pin an ancient, insecure libfoo forever. Don’t.