DevOps Operations

Keeping running services healthy: monitoring, alerting, incident response, security. The “ops” side of DevOps, covered in ECE459 L35.

Monitoring and Alerting

Health checks are usually too basic (“app responds?”). A good one tests real workflows (can the user log in and see their data?). What to actually watch:

  • CPU load
  • Memory utilization
  • Disk space and I/O
  • Network traffic
  • Clock skew
  • Queue lengths
  • Application response times
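
A workflow-style health check (log in, then fetch data) might look like the sketch below. Everything here is an assumption for illustration: the `log_in` and `fetch_data` callables, the synthetic account, and the 2-second budget are hypothetical, not from the lecture.

```python
import time
from typing import Callable, Dict, Optional


def workflow_health_check(
    log_in: Callable[[str, str], Optional[str]],
    fetch_data: Callable[[str], Optional[dict]],
    user: str = "healthcheck-user",       # hypothetical synthetic account
    password: str = "not-a-real-secret",  # hypothetical credential
    max_seconds: float = 2.0,             # assumed response-time budget
) -> Dict[str, object]:
    """Run login + data fetch as one check and time the whole workflow."""
    start = time.monotonic()
    try:
        token = log_in(user, password)
        if token is None:
            return {"healthy": False, "reason": "login failed"}
        data = fetch_data(token)
        if not data:
            return {"healthy": False, "reason": "no data returned"}
    except Exception as exc:  # any failure in the workflow counts as unhealthy
        return {"healthy": False, "reason": f"exception: {exc}"}
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        return {"healthy": False, "reason": "too slow", "seconds": elapsed}
    return {"healthy": True, "reason": "ok", "seconds": elapsed}


# Stand-ins so the sketch runs without a real service.
def fake_log_in(user: str, password: str) -> Optional[str]:
    return "token-123"


def fake_fetch_data(token: str) -> Optional[dict]:
    return {"items": [1, 2, 3]} if token == "token-123" else None
```

In a real deployment the two callables would hit the actual login and data endpoints, so the check fails when the user-visible workflow fails, not just when the process is down.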

Multi-system deployments need a summary-level dashboard: detailed enough to detect problems, but not a wall of data.

Niall Murphy’s three responses to a detected issue:

  • Alerts: a human must act now
  • Tickets: a human must act in hours or days
  • Logging: forensic only, no immediate action expected
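
The three responses above can be read as a small decision rule. A minimal sketch, assuming two hypothetical yes/no inputs (`needs_human`, `urgent`) that a real system would have to derive from its own monitoring data:

```python
from enum import Enum


class Response(Enum):
    ALERT = "alert"    # a human must act now
    TICKET = "ticket"  # a human must act in hours or days
    LOG = "log"        # forensic only


def choose_response(needs_human: bool, urgent: bool) -> Response:
    """Map a detected issue to one of the three responses."""
    if not needs_human:
        return Response.LOG
    return Response.ALERT if urgent else Response.TICKET
```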

Logs-as-tickets

Having a human routinely scan logs for errors is a bad pattern. Write code to scan the logs instead.
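
A minimal sketch of such a scanner. The log line format (`... ERROR component: message`) and the threshold of 3 are assumptions made for the example; adapt the pattern to whatever the real logs look like.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed line format: "<timestamp> ERROR <component>: <message>"
ERROR_LINE = re.compile(r"\bERROR\b\s+(?P<component>\w+):")


def scan_log(lines: Iterable[str], ticket_threshold: int = 3) -> List[str]:
    """Return the components with at least ticket_threshold ERROR lines."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("component")] += 1
    return sorted(c for c, n in counts.items() if n >= ticket_threshold)


sample = [
    "2024-01-01T00:00:01 ERROR db: connection refused",
    "2024-01-01T00:00:02 INFO web: request served",
    "2024-01-01T00:00:03 ERROR db: connection refused",
    "2024-01-01T00:00:04 ERROR auth: bad token",
    "2024-01-01T00:00:05 ERROR db: timeout",
]
components_needing_tickets = scan_log(sample)
```

The returned component names are what would be turned into tickets, so nobody has to read the log by hand.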

Alert judiciously

Too-frequent alerts become ignored fire alarms. When a real one does go off, you’ll assume it’s another false alarm, and you might be dead.

High CPU isn’t always bad: a sustained 80% may just mean the machine is well utilized and running close to its maximum. The dog-that-didn’t-bark matters too: unusually low activity (nobody logging in) can mean something broke.
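
Both ideas can be sketched as simple rules: alert only on sustained readings above a limit, and also when activity drops far below normal. The 95% limit, the window of 3 samples, and the 20% fraction are illustrative assumptions, not recommended values.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, min_samples: int) -> bool:
    """True only if the last min_samples readings all exceed threshold."""
    if len(samples) < min_samples:
        return False
    return all(s > threshold for s in samples[-min_samples:])


def unusually_quiet(current: float, baseline: float, fraction: float = 0.2) -> bool:
    """True if current activity is below a fraction of the usual baseline."""
    return baseline > 0 and current < baseline * fraction


# A single spike does not alert; a sustained run above the limit does.
spiky_cpu = [50.0, 99.0, 60.0, 99.0]
pinned_cpu = [99.0, 98.0, 97.0]
```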

Incident Reports and Postmortems

An incident report (postmortem) identifies root causes and lessons. Without it, the same problem recurs.

Must contain:

  • Timeline: strictly facts (when the problem started, when it was noticed, what actions were taken, when it was resolved)
  • Root cause vs. proximate cause (see below)
  • Action items: short and long term, realistic
  • Lessons learned, even small ones

Must not contain: irrelevant detail, speculation, blaming.

Root vs. proximate

Deadlock: proximate cause is thread A locks X then waits for Y, while B locks Y then waits for X. Root cause is “we don’t acquire locks in a consistent order.” Toyota’s Five Whys keeps you from stopping at “I wrote a bug.”
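
Fixing the root cause means enforcing one global lock order. A minimal sketch, assuming a hypothetical `OrderedLock` wrapper that gives each lock a rank at creation time; callers may request locks in any order and they are always acquired by rank.

```python
import threading


class OrderedLock:
    """A lock with a global rank (hypothetical helper for this sketch)."""
    _next_rank = 0
    _rank_guard = threading.Lock()

    def __init__(self, name: str) -> None:
        self.name = name
        self.lock = threading.Lock()
        with OrderedLock._rank_guard:
            self.rank = OrderedLock._next_rank
            OrderedLock._next_rank += 1


def acquire_in_order(*locks: OrderedLock) -> list:
    """Acquire all locks sorted by rank, regardless of argument order."""
    ordered = sorted(locks, key=lambda l: l.rank)
    for l in ordered:
        l.lock.acquire()
    return ordered


def release_all(ordered: list) -> None:
    for l in reversed(ordered):
        l.lock.release()


def demo() -> int:
    x, y = OrderedLock("X"), OrderedLock("Y")
    counter = {"value": 0}

    def worker(first: OrderedLock, second: OrderedLock) -> None:
        for _ in range(1000):
            held = acquire_in_order(first, second)
            try:
                counter["value"] += 1  # protected by both locks
            finally:
                release_all(held)

    # Thread A asks for (X, Y), thread B asks for (Y, X): the deadlock recipe.
    a = threading.Thread(target=worker, args=(x, y))
    b = threading.Thread(target=worker, args=(y, x))
    a.start(); b.start()
    a.join(); b.join()
    return counter["value"]
```

With raw `x.lock` / `y.lock` acquired in the order each thread names them, the two workers could deadlock; sorting by rank removes that possibility.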

Blameless postmortems

Blame incentivizes cover-ups, blame-shifting, and denial. Articles from the medical field report that a blame culture leads to worse patient outcomes than a learning culture.

Security

Always-available internet services have a huge attack surface. Two worst categories:

  • Code execution / injection: attackers run their code on your platform
  • Information exposure: PII leakage, GDPR fines (one of the largest: Amazon Europe, ~746M EUR in 2021)

Cryptomining on free CI

Attackers hide mining containers behind innocuous names like linux88884474. The payoff is small (about 137 coin): a terrible ROI for anyone paying for the compute, but free money when GitHub foots the bill.

PII in logs is a disaster waiting to happen. It is often caused by overly strict security policies: developers who are blocked from accessing the database dump everything to the logs so they can debug, and now an attacker who can read the logs gets a one-stop shop.
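
One possible mitigation (not something these notes prescribe) is to redact obvious PII before a record is written. A minimal sketch using Python's standard `logging.Filter`; the email regex is a rough assumption and will not catch every form of PII.

```python
import logging
import re

# Rough email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL.sub("[REDACTED]", text)


class RedactEmails(logging.Filter):
    """Logging filter that redacts emails in the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so PII passed via args is also caught.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, just with the redacted message


logger = logging.getLogger("pii-demo")  # hypothetical logger name
logger.addFilter(RedactEmails())
```

After this, a call like `logger.warning("login failed for %s", email)` reaches the handlers with the address already replaced.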

Dependency vulnerability scanners work transitively: they flag vulnerable packages anywhere in the dependency tree, not just direct dependencies. False positives happen when the vulnerable API isn’t actually used. Modularization improves scanner precision [ADL24].
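
The transitive part can be sketched as a graph walk. The package names and the advisory set below are made up for the example; real scanners work from lockfiles and advisory databases.

```python
from typing import Dict, List, Set


def transitive_deps(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """All packages reachable from root, direct or indirect."""
    seen: Set[str] = set()
    stack = list(graph.get(root, []))
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue  # already visited; also guards against cycles
        seen.add(pkg)
        stack.extend(graph.get(pkg, []))
    return seen


def vulnerable_deps(graph: Dict[str, List[str]], root: str, advisories: Set[str]) -> Set[str]:
    """Packages in root's dependency tree that have a known advisory."""
    return transitive_deps(graph, root) & advisories


# Hypothetical dependency graph: app -> web -> parser.
graph = {"app": ["web"], "web": ["parser"], "parser": []}
flagged = vulnerable_deps(graph, "app", advisories={"parser", "unrelated"})
```

Here `parser` is flagged even though `app` never depends on it directly. The walk is package-level only, so it cannot tell whether the vulnerable API is ever called, which is where the false positives come from.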

Supply-chain attack: xz / ssh (2024)

A backdoor in xz/liblzma targeted OpenSSH. Notable for this course because it was discovered via performance profiling: Andres Freund noticed ssh authentication was slower (about 0.3s → 0.8s) and dug in.
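
The same kind of slowdown can be caught automatically by comparing timings against a baseline. This is only an illustrative sketch, not how the backdoor was actually found; the 25% tolerance and the helper names are assumptions.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], runs: int = 5) -> float:
    """Median wall-clock time of fn over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def is_regression(baseline: float, measured: float, tolerance: float = 0.25) -> bool:
    """True if measured exceeds baseline by more than the tolerance."""
    return measured > baseline * (1 + tolerance)


# The slowdown from the notes (0.3s -> 0.8s) is well outside a 25% tolerance.
xz_style_slowdown = is_regression(baseline=0.3, measured=0.8)
```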