Configuration as Code
Treat system configuration (servers, networks, databases, secrets, Kubernetes manifests, monitoring rules) as source files: version-controlled, reviewed, tested, automatically applied.
Why?
Complicated setups invite mistakes and skipped steps. A wiki page titled “How to install AwesomeApp” fails because humans forget. Make the process automatic so there is no opportunity for mistakes.
Ingredients:
- Declarative: you write the desired state, not the commands, the tool figures out the diff
- Idempotent: running it twice lands in the same state, no “I already created that” failures
- Versioned: every change is a commit with diff, author, and reason; rollback is `git revert` + reapply
- Reviewed: infra changes go through PR review, just like app code
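The declarative + idempotent combination can be sketched in a few lines of Python: a hypothetical `ensure_file` resource that converges a file to its desired content. You state the result, the tool computes the diff, and a second run is a no-op.

```python
import os
import tempfile

def ensure_file(path: str, desired: str) -> bool:
    """Converge `path` to `desired` content. Returns True if a change was made.

    Idempotent: running it twice lands in the same state; the second run
    detects the state already matches and does nothing.
    """
    try:
        with open(path) as f:
            current = f.read()
    except FileNotFoundError:
        current = None
    if current == desired:
        return False              # already converged: nothing to do
    with open(path, "w") as f:    # apply the diff
        f.write(desired)
    return True

# First run applies the change; second run is a no-op.
path = os.path.join(tempfile.mkdtemp(), "app.conf")
print(ensure_file(path, "port = 8080\n"))  # True  (diff applied)
print(ensure_file(path, "port = 8080\n"))  # False (idempotent no-op)
```

Note there is no “I already created that” failure: existing state that matches is simply accepted.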
Tools:
- Cloud provisioning: Terraform, Pulumi, CloudFormation, AWS CDK
- Server config: Ansible, Puppet, Chef, Salt (Ansible dominates)
- Container orchestration: Kubernetes YAML + Helm / Kustomize
- Secrets: sealed-secrets, SOPS, Vault
Why it’s non-negotiable at scale:
- Reproducibility: rebuild a cluster from scratch with one command
- Auditing: regulators, security reviews
- Drift detection: tools can flag manual (“ClickOps”) changes
- Blast radius: PR review catches the `rm -rf` before it runs
The anti-pattern is “some poor soul’s SSH history”: a server configured manually, with institutional knowledge in one person’s head. That server is an outage waiting to happen.
From ECE459 L34
Sendmail, Apache, and nginx configs are notoriously complicated. The environment is also configuration. Principles:
- Version control configuration alongside code
- Code review changes to config
- Test configs: do they produce expected files, services, or behaviours?
- Aim for modular services that compose
- Refactor Puppet manifests and Chef recipes like regular code
- Run through continuous builds
Convergence
Unlike app code, config for a long-running service won’t “terminate”; instead it should converge: after an initial burst of changes, the work (and CPU usage) of applying them trends downward toward a steady state.
Terraform
Canonical example for cloud infra. You declare desired state (services, permissions, GitHub repo access, group membership) and `terraform apply` converges on it.
`terraform plan` shows a dry run, including the expected cost delta, guarding both against a $10k/day surprise and against a “small” change turning out huge. Caveats: state can change between plan and apply, and unique IDs are only known after a resource is created.
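The plan/apply split can be sketched in Python (a toy stand-in for the real tool; resource names are made up): plan is a pure diff of desired state against recorded state, and apply replays that diff.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compute the diff a tool like Terraform would show before applying."""
    return {
        "create": {k: desired[k] for k in desired.keys() - actual.keys()},
        "destroy": sorted(actual.keys() - desired.keys()),
        "update": {k: desired[k] for k in desired.keys() & actual.keys()
                   if desired[k] != actual[k]},
    }

def apply(desired: dict, actual: dict, diff: dict) -> dict:
    # Caveat from the notes: real state can drift between plan and apply;
    # this sketch assumes it has not.
    new_state = {k: v for k, v in actual.items() if k not in diff["destroy"]}
    new_state.update(diff["create"])
    new_state.update(diff["update"])
    return new_state

desired = {"web": {"size": "m5.large"}, "db": {"size": "db.r5.xlarge"}}
actual  = {"web": {"size": "m5.small"}, "old_worker": {"size": "t2.micro"}}
diff = plan(desired, actual)
print(diff["destroy"])          # ['old_worker'] -- a destructive change to review!
print(apply(desired, actual, diff) == desired)  # True: state converged
```

The same diff doubles as drift detection: anything in `actual` that plan wants to change is a manual (“ClickOps”) edit someone made outside the tool.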
Destructive changes
A misconfigured plan once destroyed GitHub groups, which some members mistook for a sign they were being fired. Reverting the PR re-creates the group but not its membership. A destroyed database comes back empty. You took backups, right?
Common infrastructure
Treat each piece of infra as having an interface, communicate only through it. Reduces coupling, lets you scale parts independently.
Avoid not-invented-here syndrome. Prefer existing tools:
- Storage: MongoDB, S3
- Naming / discovery: Consul
- Monitoring: Prometheus
Roll-your-own
Roll your own only when it’s your core competence. Never roll your own crypto or security without experts plus external audits. If the tool you want doesn’t exist, either you’re first or it’s a bad idea. Big platforms (AWS) launch services constantly, so patience can be cheaper than effort.
Naming
“Two hard things in computing: cache invalidation, naming things, and off-by-one errors.”
Debate: meaningful vs. meaningless names. billing describes the service but clashes with the team name and the replacement service (billing2?). Made-up mythological names carry morale value (“Avengers, assemble”).
The real fix: a directory tool (OpsLevel) that maps names to owners, docs, and maturity info (deprecations, security patches, test coverage). Jeff saw teams “X Infrastructure” vs. “X Operations” where ~35% of queries bounced between the two.
Servers as cattle, not pets
Reproducible deployment beats manual SSH surgery. Minimize manual intervention toward zero, which enables auto-scaling.
Example: Kubernetes handles deploys, rollouts and reverts, placement, health checks, replacement of dead instances, and load balancing.
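What Kubernetes does for health checks and replacement can be sketched as a reconciliation loop (hypothetical Python, not the real API): compare the declared replica count against observed pod health and emit corrective actions.

```python
def reconcile(desired_replicas: int, pods: list) -> list:
    """One pass of a Kubernetes-style control loop (sketch): replace dead
    instances and scale toward the declared replica count."""
    actions = []
    healthy = [p for p in pods if p["status"] == "healthy"]
    for p in pods:
        if p["status"] != "healthy":
            actions.append(f"delete {p['name']}")      # dead instance: remove it
    for _ in range(desired_replicas - len(healthy)):
        actions.append("create replacement pod")       # converge on the count
    return actions

pods = [{"name": "web-1", "status": "healthy"},
        {"name": "web-2", "status": "crashloop"}]
print(reconcile(3, pods))
# ['delete web-2', 'create replacement pod', 'create replacement pod']
```

No human SSHed anywhere: the loop runs continuously, which is exactly what makes the instances cattle.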
Canarying (“test in prod”)
Two flavours:
- Deploy new alongside old, route small fraction of traffic to new, grow or shrink based on metrics
- Upgrade a subset of instances, run tests, reintroduce
Subset flow: stage → remove canaries from service → upgrade → automated tests → reintroduce → watch.
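The first flavour, grow-or-shrink by metrics, might look like this sketch (the thresholds and error-rate numbers are illustrative, not from the notes):

```python
import random

def route(canary_fraction: float) -> str:
    """Send a small, adjustable fraction of traffic to the canary build."""
    return "canary" if random.random() < canary_fraction else "stable"

def adjust(fraction: float, canary_error_rate: float,
           baseline_error_rate: float) -> float:
    """Grow the canary's share while its metrics look healthy; shrink it to
    zero (full rollback) the moment it looks worse than the baseline."""
    if canary_error_rate <= baseline_error_rate * 1.1:
        return min(1.0, fraction * 2)   # healthy: double the share
    return 0.0                          # unhealthy: pull the canary

f = 0.01
for errors in (0.002, 0.002, 0.003):    # three healthy observation windows
    f = adjust(f, errors, baseline_error_rate=0.003)
print(round(f, 2))  # 0.08 -- canary share grew from 1% to 8%
```

The key property is that rollback is cheap at every point: the stable fleet is still serving the bulk of traffic until the canary has earned 100%.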
Design for rollback
A schema migration that’s destructive or shared across servers can make rollback impossible. “100k dev records vs. 10M prod records” is the classic trap: a migration that finishes instantly in dev can lock tables for hours in prod.
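One rollback-friendly approach is the expand/contract pattern, sketched here with SQLite (table and column names are made up): the “expand” step is purely additive, so both old and new server versions run against the same schema and a revert needs no schema change. Only after the rollout is proven does a later “contract” migration drop the old column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: additive, rollback-safe. Old code simply ignores the new columns.
conn.execute("ALTER TABLE users ADD COLUMN given_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN family_name TEXT")

# Backfill (in prod, do this in batches -- 10M rows is where a
# one-shot UPDATE starts holding locks for a long time).
conn.execute("""UPDATE users SET
                  given_name  = substr(fullname, 1, instr(fullname, ' ') - 1),
                  family_name = substr(fullname, instr(fullname, ' ') + 1)""")

row = conn.execute("SELECT given_name, family_name FROM users").fetchone()
print(row)  # ('Ada', 'Lovelace')
```

Contrast with the destructive version: `ALTER TABLE users DROP COLUMN fullname` in the same deploy means the old binary crashes on rollback.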
Containers
Evolution:
- Hand-installed binaries: painful
- Packages: dependencies + install script, but multiple services on one host produce RPM/JAR/classloader/DLL hell
- VMs: isolation at cost of per-app guest OS (install the same security patch 10 times)
- Containers: isolation with a shared kernel, built from a spec listing the needed libraries and tools, shared read-only where possible. Effectively a lightweight VM.
Build goal: a container ready for Kubernetes to deploy.
Don't skip upgrades
Containers make it possible to never upgrade. Before, OS patches delivered shared-library updates to every app on the host; now an image can pin an ancient, insecure libfoo forever. Don’t.