Reliability Design Principles

🦄 05 Design Principles for Reliability

There are six design principles for Reliability in the cloud:

1. Test recovery procedures: In an on-premises environment, testing is often conducted to prove that the system works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your system fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This exposes failure pathways that you can test and rectify before a real failure occurs, reducing the risk that a component fails in a way that has never been tested.

2. Automatically recover from failure: By monitoring a system for key performance indicators (KPIs), you can trigger automation when a threshold is breached. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.

3. Scale horizontally to increase aggregate system availability: Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall system. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure.

4. Refine operations procedures frequently: As you use operations procedures, look for opportunities to improve them. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.

5. Stop guessing capacity: A common cause of failure in on-premises systems is resource saturation, when the demands placed on a system exceed the capacity of that system (this is often the objective of denial-of-service attacks). In the cloud, you can monitor demand and system utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning.

6. Manage change in automation: Changes to your infrastructure should be made using automation. The changes that then need to be managed are changes to the automation itself.
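
Principles 2 and 5 both come down to comparing a monitored metric with a threshold or target and acting on the result. The sketch below is one possible way to express that logic in Python; it is not tied to any cloud provider's API. The function names, the default of three consecutive breaches, and the instance limits are all assumptions made for illustration.

```python
import math


def should_recover(samples, threshold, consecutive=3):
    """Return True when the most recent `consecutive` samples all exceed `threshold`.

    Hypothetical helper: requiring several breaches in a row avoids
    triggering recovery on a single noisy data point.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)


def desired_capacity(current, observed_util, target_util, minimum=1, maximum=20):
    """Estimate how many instances would bring average utilization near the target.

    Hypothetical helper: scales the current instance count in proportion to
    observed/target utilization, rounds up, and clamps to [minimum, maximum].
    The default limits are illustrative assumptions.
    """
    if current < 1 or target_util <= 0:
        raise ValueError("current must be >= 1 and target_util must be > 0")
    raw = current * observed_util / target_util
    return max(minimum, min(maximum, math.ceil(raw)))
```

For example, with 4 instances running at 90% average utilization against a 60% target, `desired_capacity(4, 90, 60)` returns 6. In a real system the recovery action and the scaling call would themselves be defined in version-controlled automation, in line with principle 6.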