Resilience concepts
High Availability (HA) 🏷️¶
TL;DR
- Goal: Maximizing a system's uptime 📈.
- User disruption is acceptable ✅.
- More budget friendly 🏷️
- With Spare infrastructure ready to switch to you can minimize outages 👌.
- cope through disaster.
- e.g. 4x4 repearing a tire hole with a spare tire
- spare tire.
- some user disruption while doing so.
Fault-Tolerance (FT) 💰¶
^6c176e
- Goal:
- Work through failure of some its components 🩹
- ****minimize outages**** ⚡.
- User disruption not allowed or acceptable 🚫.
- More expensive 💸.
- Outages must be minimized and the system needs levels of redundancy.
- Mission or life critical situations 🔫
- operate through disaster.
- e.g. Resilient systems on large planes
- (like extra engines than it needs to so it can operate through failure)
Disaster Recovery (DR) ☠️¶
- what to plan for ? and do when disaster occurs ?
- *e.g.* DR Process are pilot or passanger ejection systems.
- Goal: DR is a process designed to keep the non replaceable parts safe.
- Set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural 🍃 or human-induced disaster 🤡.
- DR can largely be automated to eliminate the time for recovery and errors.
This involves:
- Pre-planning
- Ensure plans are in place for extra hardware
- [DO NOT STORE] backups at the same site as the system.
- DR Processes
- Cloud machines ready when needed.
- Run periodic DR Testing.
- all the process and tools should be properly documented.
- all the parties involved should run periodic DR Testing from time to time to minimize human error.
- all logins to key systems need to be available for staff at time of disaster.
- This is designed to keep the crucial and non replaceable parts of the system in place.
- Used when HA and FT don't work.