Fault Tolerance

Hemal Shah introduced me to these.

Before talking about fault tolerance, it is important to understand the distinction between these three terms:

  1. Failure: Visible to users; something crashes or misbehaves
  2. Fault: Something seems odd; it may or may not lead to a failure
  3. Error: Lowest level; the root cause, and the thing we need to fix

Fault tolerance is the ability of a system, such as an operating system, to respond to the failure of a hardware or software component and keep operating.
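
A minimal sketch of the idea, assuming a simple fallback strategy (one of several ways to tolerate a failure; others include retries and redundancy). The names `read_with_fallback`, `broken_disk`, and `replica` are hypothetical and only serve the illustration:

```python
def read_with_fallback(primary, backup):
    """Return primary() if it works, otherwise fall back to backup()."""
    try:
        return primary()
    except OSError:
        # The primary component failed; respond by using the backup
        # instead of letting the failure reach the caller.
        return backup()


def broken_disk():
    # Hypothetical component that simulates a hardware failure.
    raise OSError("disk unavailable")


def replica():
    # Hypothetical backup component that still works.
    return "data from replica"


print(read_with_fallback(broken_disk, replica))  # prints "data from replica"
```

The caller never sees the `OSError`: the component failed, but the system as a whole kept responding.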

From SE465

  • Error: a mistake committed by a person
  • Fault: the result of an error in the software documentation, code, etc.
  • Failure: occurs when a fault executes
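
The chain above can be sketched with a tiny example. The function `abs_value` is hypothetical and deliberately wrong: the programmer's error (forgetting to negate) left a fault in one branch, and a failure shows up only when that branch executes.

```python
def abs_value(x):
    """Intended to return the absolute value of x."""
    if x < 0:
        return x  # Fault: should be `-x` (result of the programmer's error)
    return x


# The faulty line is never executed, so there is no failure.
print(abs_value(5))   # prints 5 (correct)

# The faulty line executes, and the wrong output is a visible failure.
print(abs_value(-5))  # prints -5 (expected 5)
```

This also matches the first list: the fault is always present in the code, but it may or may not lead to a failure depending on the input.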

Software testing: exercise the software by executing it on test cases to gain (or reduce) confidence in the system.
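
A rough sketch of what "exercising the software with test cases" looks like in practice. The function under test (`clamp`), the cases, and the `run_tests` helper are all hypothetical; a real project would more likely use a test framework such as `unittest` or `pytest`.

```python
def clamp(x, low, high):
    """Hypothetical function under test: limit x to the range [low, high]."""
    return max(low, min(x, high))


# Each test case pairs the inputs with the expected output.
test_cases = [
    ((5, 0, 10), 5),    # value already in range
    ((-3, 0, 10), 0),   # value below the range
    ((42, 0, 10), 10),  # value above the range
]


def run_tests(func, cases):
    """Execute func on every case and return the list of failing cases."""
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


failed = run_tests(clamp, test_cases)
print(f"{len(test_cases) - len(failed)} passed, {len(failed)} failed")
```

Every passing case adds a little confidence in the system; any entry in `failed` reduces it and points at a fault to investigate.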