Fault Tolerance
Hemal Shah introduced me to these.
Before talking about fault tolerance, it is important to understand the distinction between these three terms:
- Failure: What users can see, e.g. something crashes
- Fault: Something seems odd; it may or may not lead to a failure
- Error: The lowest level and the root cause; this is what we need to solve
Fault tolerance is a process that enables an operating system to respond to a failure in hardware or software.
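As a rough illustration of "responding to a failure", here is a minimal sketch in Python of one common approach: falling back to a redundant component. The names `read_with_fallback`, `broken_disk`, and `healthy_replica` are made up for this note, and the disk failure is simulated with an `OSError`; a real system would likely add retries, logging, and health checks.

```python
def read_with_fallback(primary, replica):
    """Try the primary source; if it fails, answer from the replica instead."""
    try:
        return primary()
    except OSError:
        # The primary component failed. Tolerate it by switching to the
        # redundant copy so the caller still gets an answer.
        return replica()


def broken_disk():
    # Hypothetical component that always fails (simulated hardware failure).
    raise OSError("simulated disk failure")


def healthy_replica():
    # Hypothetical redundant component that works.
    return "data from replica"


if __name__ == "__main__":
    print(read_with_fallback(broken_disk, healthy_replica))
```

The caller never sees the `OSError`: the component failed, but the system as a whole kept working, which is the point of fault tolerance.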
From SE465:
- Error: people commit errors
- Fault: a fault is the result of an error in the software documentation, code, etc.
- Failure: a failure occurs when a fault executes
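A small Python sketch of the error, fault, failure chain may help. The function `absolute` is a made-up example, not something from the course: the programmer's error (forgetting to negate) leaves a fault in the code, and a failure is only observed for inputs that make the faulty line execute.

```python
def absolute(x):
    """Intended to return the absolute value of x."""
    if x >= 0:
        return x
    # Fault: this line should be `return -x`. It is the result of the
    # programmer's error and sits dormant until a negative input arrives.
    return x


if __name__ == "__main__":
    # The faulty line does not execute, so no failure is observed.
    print(absolute(3))
    # The faulty line executes, so the output is wrong: a failure.
    print(absolute(-3))
```

A test suite that only uses non-negative inputs would never execute the fault, so it would never reveal the failure.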
Software testing: exercising the software with test cases to gain (or lose) confidence in the system. It is execution-based: the software is actually run on the test cases.