DevOps Operations
Keeping running services healthy: monitoring, alerting, incident response, security. The “ops” side of DevOps, covered in ECE459 L35.
Monitoring and Alerting
Health checks are usually too basic (“does the app respond?”). A good one exercises a real workflow (can a user log in and see their data?). What to actually watch:
- CPU load
- Memory utilization
- Disk space and I/O
- Network traffic
- Clock skew
- Queue lengths
- Application response times
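A workflow-level health check can be sketched as below. Everything here is hypothetical: `client`, the `/login` and `/api/me/data` endpoints, and `FakeClient` are illustrative stand-ins, not a real API.

```python
# Hypothetical workflow-level health check: log in as a synthetic user and
# verify their data is readable, rather than just pinging the front page.
# `client` stands in for whatever HTTP client your stack provides.

def workflow_health_check(client) -> bool:
    """Return True only if a real user journey succeeds end to end."""
    resp = client.post("/login", {"user": "healthcheck", "password": "..."})
    if resp.get("status") != 200:
        return False
    data = client.get("/api/me/data")
    return data.get("status") == 200 and data.get("body") is not None

# A fake client to show the check in action (no real network needed here):
class FakeClient:
    def post(self, path, payload):
        return {"status": 200}

    def get(self, path):
        return {"status": 200, "body": {"items": [1, 2, 3]}}

print(workflow_health_check(FakeClient()))  # True
```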
Multi-system deployments need a dashboard that’s summary-level, detailed enough to detect problems but not a wall of data.
Niall Murphy’s three responses to a detected issue:
- Alerts: a human must act now
- Tickets: a human must act in hours or days
- Logging: forensic only, no immediate action expected
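Murphy's three-way split can be expressed as a tiny triage function. This is a sketch of the classification, not any particular monitoring system's API; the two boolean inputs are assumptions about what your pipeline can tell you.

```python
# Sketch of the three responses: does a human need to act, and how soon?
from enum import Enum

class Response(Enum):
    ALERT = "page a human now"
    TICKET = "a human acts within hours or days"
    LOG = "record for forensics; no action expected"

def triage(needs_human: bool, urgent: bool) -> Response:
    if not needs_human:
        return Response.LOG
    return Response.ALERT if urgent else Response.TICKET

print(triage(needs_human=True, urgent=True))    # Response.ALERT
print(triage(needs_human=True, urgent=False))   # Response.TICKET
print(triage(needs_human=False, urgent=False))  # Response.LOG
```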
Logs-as-tickets
Having humans routinely scan logs for errors is a bad pattern: if a log line requires action, write code to detect that condition and file a ticket automatically.
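A minimal version of that scanner is sketched below. The error pattern and `file_ticket` hook are illustrative; a real system would plug into its actual ticketing API.

```python
# Sketch: instead of a human eyeballing logs, a small scanner turns
# matching error lines into tickets. `file_ticket` is a hypothetical
# callback into your ticketing system.
import re

ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")

def scan_log(lines, file_ticket):
    for lineno, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            file_ticket(f"log line {lineno}: {line.strip()}")

tickets = []
scan_log(
    ["INFO startup ok", "ERROR disk 97% full", "DEBUG cache warm"],
    tickets.append,
)
print(tickets)  # ['log line 2: ERROR disk 97% full']
```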
Alert judiciously
Too-frequent alerts become ignored fire alarms. If one does go off, you’ll assume it’s another false alarm, and you might be dead.
High CPU isn’t always bad: 80% sustained may just be “close to max.” The dog-that-didn’t-bark matters too: unusually low activity (nobody logging in) can mean something broke.
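The dog-that-didn't-bark case can be detected statistically: flag when activity drops far below its historical baseline, not only when it spikes. The three-standard-deviation threshold below is an illustrative choice, not a recommendation.

```python
# Sketch: flag unusually LOW activity (e.g., nobody logging in) against a
# historical baseline. Thresholds here are illustrative.
from statistics import mean, stdev

def too_quiet(recent_logins: int, history: list, k: float = 3.0) -> bool:
    """True when the latest count is more than k std-devs below the mean."""
    mu, sigma = mean(history), stdev(history)
    return recent_logins < mu - k * sigma

history = [100, 95, 110, 105, 98, 102, 97, 103]
print(too_quiet(2, history))   # True: nobody is logging in; investigate
print(too_quiet(99, history))  # False: normal traffic
```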
Incident Reports and Postmortems
An incident report (post-mortem) identifies root causes and lessons. Without it, the same problem recurs.
Must contain:
- Timeline: strictly facts, when the problem started, was noticed, actions taken, resolution
- Root cause vs. proximate cause (see below)
- Action items: short and long term, realistic
- Lessons learned, even small ones
Must not contain: irrelevant detail, speculation, blaming.
Root vs. proximate
Deadlock: proximate cause is thread A locks X then waits for Y, while B locks Y then waits for X. Root cause is “we don’t acquire locks in a consistent order.” Toyota’s Five Whys keeps you from stopping at “I wrote a bug.”
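Fixing the root cause means enforcing one global lock order. A sketch, using object identity as the canonical order; real code might order by a stable resource key instead.

```python
# Acquire locks in a single canonical order (here, by id) so the
# A-waits-for-B / B-waits-for-A cycle can never form.
import threading

def locked_pair(lock_a, lock_b):
    """Acquire two locks in a canonical order, regardless of caller order."""
    first, second = sorted((lock_a, lock_b), key=id)
    first.acquire()
    second.acquire()
    return first, second

x, y = threading.Lock(), threading.Lock()

# Thread A asks for (x, y); thread B asks for (y, x). Both actually take
# them in the same order, so neither can hold one and wait on the other.
a_first, a_second = locked_pair(x, y)
a_second.release(); a_first.release()
b_first, b_second = locked_pair(y, x)
print(a_first is b_first)  # True: both callers agree on who goes first
b_second.release(); b_first.release()
```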
Blameless postmortems
Blame incentivizes cover-ups, blame-shifting, and denial. Studies from medicine show that a blame culture produces worse patient outcomes than a learning culture.
Security
Always-available internet services have huge attack surface. Two worst categories:
- Code execution / injection: attackers run their code on your platform
- Information exposure: PII leakage, GDPR fines (top offender: Amazon Europe, ~746M EUR)
Cryptomining on free CI
Attackers hide mining containers behind innocuous names like linux88884474. The payout (about 137 coin) is terrible ROI for a paying user, but free money when GitHub foots the bill.
PII in logs is a disaster waiting to happen. It is often caused, ironically, by overly strict security policies: developers who are blocked from querying the production database dump everything into logs to debug, and an attacker reading those logs gets a one-stop shop.
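One mitigation is to scrub likely PII before it ever reaches a log file. A sketch using the standard-library `logging.Filter` hook; the two regexes (emails and 16-digit card-like numbers) are illustrative, not a complete PII taxonomy.

```python
# Scrub likely PII from log messages before any handler writes them out.
import logging
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{16}\b"), "<card>"),
]

def redact(msg: str) -> str:
    for pattern, repl in PII_PATTERNS:
        msg = pattern.sub(repl, msg)
    return msg

class RedactPII(logging.Filter):
    """Rewrite each record's message so handlers only see redacted text."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg, record.args = redact(record.getMessage()), ()
        return True

logger = logging.getLogger("app")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactPII())
logger.warning("login failed for alice@example.com card 4111111111111111")
# Logged as: login failed for <email> card <card>
```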
Dependency vulnerability scanners work transitively. False positives happen when the vulnerable API isn’t actually used. Modularization improves scanner precision [ADL24].
Supply-chain attack: xz / ssh (2024)
Backdoored xz/liblzma targeted openssh. Notable for this course because it was discovered via performance profiling: Andres Freund noticed auth was slower (0.3s → 0.8s) and dug in.
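The lesson for this course: measuring performance can surface security problems. A sketch of the underlying idea, comparing a timed run against a recorded baseline; the thresholds and baseline values are illustrative assumptions, not what Freund actually did.

```python
# Sketch: flag a timing regression against a recorded baseline, the kind
# of anomaly that exposed the xz backdoor. Numbers are illustrative.
import time

def time_once(fn) -> float:
    """Wall-clock one call."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def regressed(sample_s: float, baseline_s: float, factor: float = 2.0) -> bool:
    """Flag runs that take `factor` times longer than the baseline."""
    return sample_s > factor * baseline_s

# The sshd case (0.3 s -> 0.8 s) is well past a 2x threshold:
print(regressed(0.8, 0.3))  # True

# Measuring a stand-in workload against an assumed 10 s baseline:
elapsed = time_once(lambda: sum(range(100_000)))
print(regressed(elapsed, baseline_s=10.0))  # False: nowhere near 20 s
```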