Recovery Strategies

DR Solution 1 - Backup and Restore cheat sheet

  • Disaster recovery method with the longest RPO/RTO

  • For lower priority use cases

  • Primarily use Amazon S3 and AWS Storage Gateway

  • Preparation phase:

    • Take backups of current systems

    • Store backups in Amazon S3

    • Document the procedure to restore from backup on AWS:

      • Know which AMI to use; build your own as needed

      • Know how to restore system from backups

      • Know how to switch to new system

      • Know how to configure the deployment

  • In case of disaster:

    • Retrieve backups from Amazon S3

    • Bring up required infrastructure:

      • Amazon EC2 instances with prepared AMIs, ELB, etc.

      • Use AWS CloudFormation to automate deployment of core networking

    • Restore system from backup

    • Switch over to the new system:

      • Adjust DNS records to point to AWS
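
The "bring up required infrastructure" step above can be sketched as a small template generator. This is a minimal illustration, not a complete recovery stack: the resource names RecoveryVpc and RecoverySubnet, the function core_network_template, and the CIDR ranges are hypothetical, and a real template would also need routing, security groups, and the instances themselves.

```python
import json


def core_network_template(vpc_cidr: str, subnet_cidr: str) -> dict:
    """Build a minimal CloudFormation template: one VPC with one subnet."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Core networking for the recovery environment (sketch)",
        "Resources": {
            # Hypothetical logical names; pick names that match your runbook.
            "RecoveryVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": vpc_cidr, "EnableDnsSupport": True},
            },
            "RecoverySubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    # Ref ties the subnet to the VPC declared above.
                    "VpcId": {"Ref": "RecoveryVpc"},
                    "CidrBlock": subnet_cidr,
                },
            },
        },
    }


# Example CIDR ranges are assumptions for illustration only.
template = core_network_template("10.0.0.0/16", "10.0.1.0/24")
template_body = json.dumps(template, indent=2)
```

Keeping the template in version control alongside the restore procedure means the networking layer can be recreated the same way every time; the resulting JSON string is the kind of value you would hand to CloudFormation as the template body when creating a stack.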

AWS Storage Gateway cheat sheet

  • Connects an on-premises software appliance (or the AWS Storage Gateway Hardware Appliance) with cloud-based storage to provide seamless, secure integration between your on-premises IT environment and the AWS storage infrastructure

  • Supports industry-standard storage protocols that work with your existing applications

  • Integrated with Amazon CloudWatch, AWS CloudTrail, AWS KMS, IAM, and more

  • Virtual tape library (VTL): virtual tapes stored in Amazon S3 or Amazon S3 Glacier

  • Gateway-cached volumes: store primary data in Amazon S3 and retain frequently accessed data locally, giving substantial cost savings on primary storage and low-latency access to that data

  • Gateway-stored volumes: store primary data locally and asynchronously back up point-in-time snapshots of this data to Amazon S3
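
The difference between the two volume modes shows up most clearly in how much local disk each one needs. The sketch below is a rough sizing illustration only: the function local_capacity_gib and the hot_fraction parameter are hypothetical, and it ignores upload buffers and any other overhead a real gateway requires.

```python
def local_capacity_gib(dataset_gib: float, hot_fraction: float, mode: str) -> float:
    """Estimate the on-premises capacity needed for the data itself.

    mode "cached": primary data lives in Amazon S3, only hot data stays local.
    mode "stored": primary data lives locally, snapshots go to Amazon S3.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if mode == "cached":
        return dataset_gib * hot_fraction
    if mode == "stored":
        return dataset_gib
    raise ValueError(f"unknown mode: {mode!r}")


# Assumed example: a 1000 GiB dataset where a quarter of it is accessed often.
cached_need = local_capacity_gib(1000, 0.25, "cached")
stored_need = local_capacity_gib(1000, 0.25, "stored")
```

With these assumed numbers, the cached mode needs local space for the hot quarter of the data while the stored mode needs space for all of it, which is the trade-off between local footprint and fully local access.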

DR Solution 2 - Pilot Light

  • Low cost, but RPO/RTO of tens of minutes

  • Best for core application services

  • Based on a replicated but scaled-down infrastructure that is not running; your application fails over to it once it is activated

  • Preparation phase:

    • Set up Amazon EC2 instances to replicate or mirror data

    • Ensure that you have all supporting custom software packages available in AWS

    • Create and maintain AMIs of key servers where fast recovery is required

    • Regularly run these servers, test them, and apply any software updates and configuration changes

    • Consider automating the provisioning of AWS resources

  • In case of disaster:

    • Automatically bring up resources around the replicated core data set

    • Scale the system as needed to handle current production traffic

    • Switch over to the new system:

      • Adjust DNS records to point to AWS

  • Objectives:

    • RTO: As long as it takes to detect a need for disaster recovery and automatically scale up the replacement system

    • RPO: Depends on replication type
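
The pilot light objectives above reduce to simple arithmetic, sketched here. The class PilotLightTimings and all of the figures are hypothetical; measure your own detection, scale-up, and replication times before relying on any estimate.

```python
from dataclasses import dataclass


@dataclass
class PilotLightTimings:
    """Measured or estimated timings, all in minutes (assumed inputs)."""

    detection_min: float        # time to notice that recovery is needed
    scale_up_min: float         # time to start and scale the replacement system
    replication_lag_min: float  # 0 for synchronous replication

    def rto_min(self) -> float:
        # RTO: detection time plus automated scale-up time.
        return self.detection_min + self.scale_up_min

    def rpo_min(self) -> float:
        # RPO: bounded by how far the replica trails the primary.
        return self.replication_lag_min


# Illustrative numbers only.
timings = PilotLightTimings(detection_min=5, scale_up_min=20, replication_lag_min=1)
estimated_rto = timings.rto_min()
estimated_rpo = timings.rpo_min()
```

Separating the two terms makes it easier to see where to invest: faster detection and automated provisioning lower the RTO, while the replication type alone drives the RPO.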

DR Solution 3 - Fully Working Low-Capacity Standby

  • More expensive, but an RPO/RTO of minutes

  • Best for business-critical services

  • Can take some production traffic at any time, not just during disaster recovery

  • Cost footprint smaller than full disaster recovery

  • Preparation:

    • Like pilot light, but components are active 24/7

    • Not scaled for production traffic

    • Best practice: Continuous testing with a statistical subset of production traffic to a disaster recovery site

  • In case of disaster:

    • Immediately fail over the most critical production load; adjust DNS records to point to AWS

    • Scale the system automatically to handle all production load

  • Objectives:

    • RTO: For critical load, as long as it takes to fail over. For all other load, as long as it takes to scale further.

    • RPO: Depends on replication type
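
Both the continuous testing with a subset of traffic and the DNS failover can be expressed with weighted DNS records. The sketch below builds a change batch in the shape that the Amazon Route 53 ChangeResourceRecordSets API accepts; it only constructs the data and does not call AWS. The helper names and the example.com host names are hypothetical, and because resolvers cache DNS answers, the actual traffic split is only approximate.

```python
def weighted_record(name: str, target: str, set_id: str, weight: int, ttl: int = 60) -> dict:
    """Describe one weighted CNAME record set."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,  # distinguishes records that share a name
        "Weight": weight,
        "TTL": ttl,               # a short TTL lets a failover take effect sooner
        "ResourceRecords": [{"Value": target}],
    }


def traffic_split_batch(name: str, primary: str, standby: str, standby_percent: int) -> dict:
    """Build a change batch sending roughly standby_percent of traffic to standby."""
    if not 0 <= standby_percent <= 100:
        raise ValueError("standby_percent must be between 0 and 100")
    records = [
        weighted_record(name, primary, "primary", 100 - standby_percent),
        weighted_record(name, standby, "standby", standby_percent),
    ]
    return {
        "Comment": f"Send {standby_percent}% of traffic to standby",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records],
    }


# Hypothetical host names. Normal operation: a small test share goes to AWS.
test_batch = traffic_split_batch(
    "app.example.com.", "onprem.example.com", "standby.example.com", 5
)
# Disaster: shift all traffic to the standby.
failover_batch = traffic_split_batch(
    "app.example.com.", "onprem.example.com", "standby.example.com", 100
)
```

Using the same function for the everyday test split and for the failover means the failover path is exercised continuously rather than only during a disaster.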

DR Solution 4 - Multi-site active-active

  • Most expensive, but a real-time RPO/RTO

  • Best for achieving as close to 100% availability as possible

  • Can take all production load at any moment

  • Preparation:

    • Similar to low-capacity standby

    • Scales in and out fully with production load

  • In case of disaster: Immediately fail over all production load

  • Objectives:

    • RTO: As long as it takes to fail over

    • RPO: Depends on replication type
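
The four strategies trade cost against recovery time, so a first-pass choice can be driven by the RTO target. The sketch below is illustrative only: the function choose_strategy is hypothetical, and the cutoffs of 1, 10, and 60 minutes are assumptions loosely based on the "real-time", "minutes", and "tens of minutes" ranges described above, not official thresholds.

```python
def choose_strategy(rto_target_min: float) -> str:
    """Suggest the cheapest strategy that plausibly meets an RTO target.

    The numeric cutoffs are assumed for illustration and should be
    replaced with figures from your own testing.
    """
    if rto_target_min < 0:
        raise ValueError("rto_target_min must not be negative")
    if rto_target_min < 1:
        return "multi-site active-active"   # near real-time, most expensive
    if rto_target_min < 10:
        return "low-capacity standby"       # minutes
    if rto_target_min < 60:
        return "pilot light"                # tens of minutes
    return "backup and restore"             # longest RTO, lowest cost


suggestion = choose_strategy(30)
```

A real decision would also weigh the RPO target, budget, and the criticality of each workload; this only captures the RTO dimension.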

Best practices for being prepared

Start Simple

  • Backups in AWS are a first step

  • Incrementally improve RTO/RPO as a continuous effort

Check for software/license issues

  • Verify that, in the event of a disaster, you have the licenses available to deploy your application to new instances quickly.

Game day exercise

  • Test a critical system, or even an entire Region, going offline. What if an entire fleet were to crash?

  • Ensure backups are available
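
One check worth automating for a game day is whether the most recent backup is recent enough to meet the RPO. The sketch below assumes you can obtain the timestamp of the last successful backup from your own tooling; the function backup_is_fresh and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_fresh(
    last_backup_at: datetime, rpo: timedelta, now: Optional[datetime] = None
) -> bool:
    """Return True if the newest backup is no older than the RPO allows."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_backup_at <= rpo


# Illustrative values: a 24-hour RPO checked at a fixed point in time.
check_time = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)   # 9 hours old
stale = datetime(2023, 12, 31, 3, 0, tzinfo=timezone.utc)  # 57 hours old

recent_ok = backup_is_fresh(recent, timedelta(hours=24), now=check_time)
stale_ok = backup_is_fresh(stale, timedelta(hours=24), now=check_time)
```

A freshness check only shows that a backup exists; the game day itself should still include an actual restore to confirm the backup is usable.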