Dependable Memory Hierarchy

A memory hierarchy that is fast but undependable is not very attractive. Of the eight great ideas in computer architecture, the one that addresses this is dependability via redundancy.

Reliability is a measure of continuous service accomplishment (or, equivalently, of the time to failure) from a reference point.

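We define two terms:

  • mean time to failure (MTTF): the average time a component operates before it fails
  • annual failure rate (AFR): the percentage of devices expected to fail in a year for a given MTTF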

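A large MTTF can be misleading: an MTTF of 1,000,000 hours is roughly 114 years, which suggests a device that practically never fails. AFR gives better intuition, because it states what fraction of a large population of devices is expected to fail each year.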
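To make the relationship between MTTF and AFR concrete, here is a minimal Python sketch. It assumes the common approximation AFR = (hours in a year) / MTTF, which is reasonable only when MTTF is much longer than one year. The function names, the 1,000,000-hour MTTF, and the 100,000-disk population are illustrative assumptions, not values from these notes.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_failure_rate(mttf_hours: float) -> float:
    """Approximate fraction of devices expected to fail in one year.

    Assumes MTTF is much larger than one year, so the simple ratio
    hours-per-year / MTTF is a good approximation.
    """
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_YEAR / mttf_hours


def expected_failures_per_year(mttf_hours: float, device_count: int) -> float:
    """Expected number of failed devices per year in a population."""
    return annual_failure_rate(mttf_hours) * device_count


if __name__ == "__main__":
    mttf = 1_000_000  # hours (hypothetical disk rating)
    disks = 100_000   # hypothetical population size
    print(f"MTTF in years: {mttf / HOURS_PER_YEAR:.1f}")
    print(f"AFR: {annual_failure_rate(mttf):.3%}")
    print(f"Expected failures per year: {expected_failures_per_year(mttf, disks):.0f}")
```

Under these assumptions a single disk looks as if it lasts more than a century, yet a large installation should still expect to replace hundreds of disks every year, which is the intuition AFR is meant to provide.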

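Availability is a measure of service accomplishment with respect to the alternation between accomplishment and interruption:

  Availability = MTTF / (MTTF + MTTR)

  • Where MTTR is the mean time to repair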
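As a rough illustration of how MTTF and MTTR combine, the sketch below uses the standard definition availability = MTTF / (MTTF + MTTR) and converts the result into expected downtime per year. The function names and the sample figures (a 1,000,000-hour MTTF and 24-hour versus 1-hour repair times) are assumptions chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of interruption per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    mttf = 1_000_000  # hours (hypothetical)
    for mttr in (24, 1):  # hypothetical repair times in hours
        a = availability(mttf, mttr)
        print(f"MTTR={mttr:>2} h  availability={a:.6f}  "
              f"downtime={downtime_hours_per_year(a):.3f} h/year")
```

The comparison shows that availability can be raised either by increasing MTTF or by shortening MTTR, since both appear in the same ratio.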

There are three ways to improve MTTF:

  1. Fault avoidance: Preventing fault occurrence by construction.
  2. Fault tolerance: Using redundancy to allow the service to comply with the service specification despite faults occurring.
  3. Fault forecasting: Predicting the presence and creation of faults, allowing the component to be replaced before it fails.