Dependable Memory Hierarchy
A memory hierarchy that is fast but not dependable is not very attractive. Of the 8 Great Ideas in Computer Architecture, the one that targets dependability is Redundancy.
Reliability is a measure of continuous service accomplishment (or, equivalently, of the time to failure) from a reference point.
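We define two terms for reliability:
- mean time to failure (MTTF): the average time a component operates before it fails
- annual failure rate (AFR): the percentage of devices expected to fail in a year for a given MTTF
When MTTF gets very large it can be misleading, so AFR often gives better intuition.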
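To make the AFR intuition concrete, here is a minimal Python sketch. It assumes devices run continuously (8,760 hours per year) and uses the simple ratio AFR = hours per year / MTTF, which is a good approximation when MTTF is much longer than a year. The 1,000,000-hour MTTF and the 1,000-disk fleet are illustrative values, not figures from these notes.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes 24/7 operation, ignores leap years


def annual_failure_rate(mttf_hours: float) -> float:
    """Approximate fraction of devices that fail per year for a given MTTF.

    Uses AFR = hours per year / MTTF, which assumes MTTF >> 1 year.
    """
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    mttf = 1_000_000  # hours (illustrative)
    fleet_size = 1_000  # number of disks (illustrative)
    afr = annual_failure_rate(mttf)
    print(f"AFR = {afr:.3%}")  # prints "AFR = 0.876%"
    print(f"Expected failures per year: {afr * fleet_size:.2f}")  # prints "8.76"
```

An MTTF of a million hours (over a century) sounds reassuring, but in a fleet of 1,000 such disks roughly 9 would still be expected to fail every year, which is why AFR is the more intuitive number.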
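Service interruption is measured as the mean time to repair (MTTR). The mean time between failures is $\text{MTBF} = \text{MTTF} + \text{MTTR}$, and availability, the fraction of time the service is actually delivered, is
$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$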
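The MTTR-based quantities can be computed directly. This is a minimal Python sketch assuming MTBF = MTTF + MTTR and availability = MTTF / (MTTF + MTTR); the helper that converts availability into downtime per year is a hypothetical convenience added for illustration, and the 24-hour MTTR is an assumed value.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes 24/7 service, ignores leap years


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: time to fail plus time to repair."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is delivered: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Hypothetical helper: expected minutes per year the service is down."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    mttf, mttr = 1_000_000, 24  # hours (illustrative)
    a = availability(mttf, mttr)
    print(f"MTBF = {mtbf(mttf, mttr)} hours")
    print(f"Availability = {a:.6f}")
    print(f"Downtime = {downtime_minutes_per_year(a):.1f} minutes/year")
```

Because MTTR appears in the denominator, availability can be improved either by failing less often (larger MTTF) or by repairing faster (smaller MTTR).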
There are three ways to improve MTTF:
- Fault avoidance: Preventing fault occurrence by construction.
- Fault tolerance: Using redundancy to allow the service to comply with the service specification despite faults occurring.
- Fault forecasting: Predicting the presence and creation of faults, allowing the component to be replaced before it fails.
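As one illustration of fault tolerance through redundancy, the sketch below stores three copies of a word and reads it back through a bitwise majority vote (triple modular redundancy). This particular scheme is an assumed example chosen for clarity, not a technique these notes prescribe; the function name `majority_vote` is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a word.

    Each result bit is 1 if at least two of the three input bits are 1, so
    any number of bit errors confined to a single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    stored = 0b1011_0010
    corrupted = stored ^ 0b0100_0001  # two bits flipped in one copy
    recovered = majority_vote(stored, corrupted, stored)
    print(recovered == stored)  # prints "True"
```

The service still returns the correct value despite a fault, at the cost of tripling storage; errors that hit the same bit in two copies would not be corrected.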