7 min read · Apr 25, 2017

Ultimate Survival Secrets to Avoid Disasters With Strong Disaster Plan for AWS Cloud

Disaster is inevitable. We keep hearing this.

Many companies have gone out of business after losing mission-critical data when a disaster hit their cloud. Even with tight security checks in place, outages happen. Hence, in this digital world, having a disaster recovery plan in place is pivotal to any organization’s success. The best you can do is be prepared for any disaster. It’s easy!

Here are 5 ‘must-address’ AWS Cloud DR Management challenges for any AWS cloud user:

1. Automated Backup Scripts

Backup is a periodic operation that takes place in the background while normal business operations continue. Manual invocation and orchestration do not scale for cloud applications because the process is very error-prone. Hence, the backup process must be automated. However, a custom automated backup procedure is tedious to build, as it involves scripting, allocating, and provisioning distributed cloud resources. Businesses need a smart way to tackle it.

2. Continuous Operations Monitoring

Managing the production backup system is an important consideration for a DR system. The DR system must monitor the backup operation and collect and collate logs from the production systems. It should analyze the collected logs and alert the DevOps team to any anomalies and errors so that the team can take remedial action immediately. However, this is easier said than done.

3. Management Of Various Permissions And Authorization Policies

The security concerns that apply to the production systems apply to the backup resources as well. Because these backup systems might span multiple regions, they add complexity: permissions and authorization policies must be managed across all of the additional resources. The administration console for the DR system should provide tools for assigning the rights to invoke backups, access business data, and carry out recovery across different zones, regions, and accounts.

4. Cost Management

In addition to data consistency and the other challenges listed above, the DR solution should take advantage of the elastic capabilities of AWS and dynamically allocate or de-allocate storage and processing resources to keep costs manageable.

5. Ease Of Management

One of the top challenges of Backup-as-a-Service and Disaster-Recovery-as-a-Service management, and one that many businesses fail to overcome, is the highly distributed nature of these services. The administration console has to abstract away the complexity of manual cloud copy operations and of designing and organizing the network elements for distributed data replication. Providing an easy interface that lets DevOps quickly initiate day-to-day scheduled tasks is a challenging proposition for custom ad-hoc solutions: without effective tools for DevOps, such solutions are risky and costly, and they do not scale well.

You must also check these Do’s and Don’ts of DR on AWS Cloud: Having A Disaster Recovery Plan Is Pivotal — The Do’s And Don’ts On AWS Cloud

Practice these to back up data in the cloud with ease.

The AWS cloud supports many well-accepted DR architectures and helps make your infrastructure DR-ready by eliminating single points of failure. The best way to avoid being hit by a disaster is to have a Disaster Recovery Plan.

If you’re looking for a partner to help you with the best approaches towards strategizing disaster recovery plans for AWS cloud effectively, Botmetric can help.

Botmetric runs an automated Disaster Recovery backup audit checklist to ensure business continuity and data backup during a DR event:

ELB Optimisation

Botmetric provides a list of ELBs that either have only one Availability Zone enabled or have EC2 instances distributed unevenly across Availability Zones. We recommend that you maintain approximately equal numbers of instances in each Availability Zone for better fault tolerance.

RDS Multi AZ

Botmetric provides a list of DB instances deployed in a single Availability Zone and recommends that you use the Amazon RDS Multi-AZ feature to protect your applications from the failure of a single location. With Multi-AZ enabled, Amazon RDS keeps a standby replica in another Availability Zone and switches to it automatically, which provides data redundancy, eliminates I/O freezes, and minimizes latency spikes during system backups.
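
As a rough illustration of this kind of check, the sketch below filters DB instance records for those without Multi-AZ. It assumes records shaped like the `DBInstances` entries returned by boto3's `describe_db_instances` (`DBInstanceIdentifier`, `MultiAZ`). The function name and the sample data are hypothetical, and this is not Botmetric's implementation.

```python
def single_az_db_instances(db_instances):
    """Return identifiers of DB instances that do not have Multi-AZ enabled."""
    return [
        db["DBInstanceIdentifier"]
        for db in db_instances
        if not db.get("MultiAZ", False)
    ]


# Hypothetical sample data, shaped like describe_db_instances()["DBInstances"].
sample_dbs = [
    {"DBInstanceIdentifier": "orders-db", "MultiAZ": True},
    {"DBInstanceIdentifier": "reports-db", "MultiAZ": False},
]
print(single_az_db_instances(sample_dbs))
```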

ELB Connection Draining

Connection Draining, an ELB feature, lets in-progress requests complete before a back-end instance is deregistered, which makes it easier to manage the capacity behind your ELB. If a back-end instance fails its health check, the load balancer stops sending new requests to the unhealthy EC2 instance but allows the existing requests to complete.

Botmetric identifies load balancers that do not have connection draining configured and recommends that you enable it, so that in-progress requests are handled gracefully when Auto Scaling terminates an instance or an unhealthy instance is removed.
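
A minimal sketch of such a check, assuming each load balancer's attributes have already been fetched and look like the `LoadBalancerAttributes` structure that boto3's Classic ELB `describe_load_balancer_attributes` returns (with a `ConnectionDraining` entry holding `Enabled` and `Timeout`). The function name and the load balancer names are made up for illustration.

```python
def elbs_without_connection_draining(attributes_by_elb):
    """Return the names of load balancers whose connection draining is off."""
    flagged = []
    for name, attrs in attributes_by_elb.items():
        draining = attrs.get("ConnectionDraining", {})
        if not draining.get("Enabled", False):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical attributes keyed by load balancer name.
sample_attrs = {
    "web-elb": {"ConnectionDraining": {"Enabled": True, "Timeout": 300}},
    "api-elb": {"ConnectionDraining": {"Enabled": False, "Timeout": 300}},
}
print(elbs_without_connection_draining(sample_attrs))
```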

ELB Cross Zone

By default, your load balancer evenly distributes incoming requests across its enabled Availability Zones.

To make sure your ELB distributes incoming requests evenly across all back-end instances, irrespective of their AZs, enable cross-zone load balancing. Botmetric identifies which load balancers should be configured to use the cross-zone load balancing option.

However, as a best practice, we recommend that you distribute your EC2 instances evenly across AZs for higher fault tolerance.
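
To see why this matters, here is a small worked example under a deliberately simplified model: without cross-zone balancing, traffic is split equally between zones and then equally between the instances inside each zone; with it, traffic is split equally between all instances. The function name and the fleet layout are assumptions for illustration only.

```python
from fractions import Fraction


def per_instance_share(instances_per_az, cross_zone):
    """Return the traffic share of one instance in each Availability Zone."""
    if any(n < 1 for n in instances_per_az.values()):
        raise ValueError("every enabled zone needs at least one instance")
    total = sum(instances_per_az.values())
    zone_count = len(instances_per_az)
    if cross_zone:
        # Every instance gets an equal slice, whichever zone it is in.
        return {az: Fraction(1, total) for az in instances_per_az}
    # Each zone gets an equal slice, shared by the instances in that zone.
    return {az: Fraction(1, zone_count * n) for az, n in instances_per_az.items()}


# Hypothetical uneven fleet: one instance in one zone, four in the other.
fleet = {"us-east-1a": 1, "us-east-1b": 4}
for mode in (False, True):
    shares = per_instance_share(fleet, cross_zone=mode)
    print("cross-zone" if mode else "per-zone", {az: str(s) for az, s in shares.items()})
```

In this example the lone instance in the first zone handles half of all requests without cross-zone balancing, while each of the other four handles one eighth. With cross-zone balancing every instance handles one fifth.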

EC2 Availability Zone

Amazon EC2 is hosted worldwide in several regions and each of these regions has isolated locations called Availability Zones. It is recommended to host instances in multiple locations rather than one single location. In case of any disaster (though very rare), if you have hosted all your instances in a single location and that particular location is affected by any failure, none of your instances will be available.

Botmetric identifies regions where either all instances sit in the same Availability Zone or instances span multiple zones but are unevenly distributed. It then gives you smart recommendations and lets you fix the uneven distribution in seconds with its ‘Click-To-Fix’ button.

Auto Scaling Group

Auto Scaling regularly runs health checks on the instances in your Auto Scaling group and reports any instance that is unhealthy. Botmetric diagnoses all your EC2 instances and recommends that you set the health check type to ‘ELB’ if you use a load balancer with your Auto Scaling group. If you do not use a load balancer with the group, keep the default health check type, ‘EC2’.
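
A minimal sketch of that rule, assuming group records shaped like the entries boto3's `describe_auto_scaling_groups` returns (`AutoScalingGroupName`, `HealthCheckType`, `LoadBalancerNames`, `TargetGroupARNs`). The function names and sample groups are hypothetical.

```python
def recommended_health_check_type(group):
    """'ELB' when the group sits behind a load balancer, otherwise 'EC2'."""
    uses_load_balancer = bool(
        group.get("LoadBalancerNames") or group.get("TargetGroupARNs")
    )
    return "ELB" if uses_load_balancer else "EC2"


def groups_with_wrong_health_check(groups):
    """Return names of groups whose health check type differs from the rule."""
    return [
        g["AutoScalingGroupName"]
        for g in groups
        if g.get("HealthCheckType") != recommended_health_check_type(g)
    ]


# Hypothetical groups: "web" is behind an ELB but still uses EC2 checks.
sample_groups = [
    {"AutoScalingGroupName": "web", "HealthCheckType": "EC2",
     "LoadBalancerNames": ["web-elb"], "TargetGroupARNs": []},
    {"AutoScalingGroupName": "workers", "HealthCheckType": "EC2",
     "LoadBalancerNames": [], "TargetGroupARNs": []},
]
print(groups_with_wrong_health_check(sample_groups))
```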

Auto Scaling Group resource Audit

An Auto Scaling group resource ensures that your applications have enough capacity to handle the current traffic demands. To make your applications highly available and fault tolerant, you should use Auto Scaling Group. What’s even more important is the fact that implementation of Auto Scaling does not incur any additional cost — you only pay for the Amazon EC2 resources you use.

Botmetric runs an audit and identifies which auto scaling group is associated with a deleted load balancer or which launch configuration is associated with a deleted Amazon Machine Image (AMI).
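
One way to picture this audit is as a comparison between what the groups reference and what still exists. The sketch below assumes the group and launch configuration records follow the boto3 Auto Scaling response shapes (`LoadBalancerNames`, `LaunchConfigurationName`, `ImageId`) and that the sets of existing load balancer names and AMI IDs were collected separately. All names and IDs are hypothetical.

```python
def groups_with_deleted_load_balancers(groups, existing_elb_names):
    """Map group name -> load balancers it references that no longer exist."""
    existing = set(existing_elb_names)
    findings = {}
    for group in groups:
        missing = [n for n in group.get("LoadBalancerNames", []) if n not in existing]
        if missing:
            findings[group["AutoScalingGroupName"]] = missing
    return findings


def launch_configs_with_deleted_ami(launch_configs, existing_image_ids):
    """Return launch configuration names that point at a missing AMI."""
    existing = set(existing_image_ids)
    return [
        lc["LaunchConfigurationName"]
        for lc in launch_configs
        if lc["ImageId"] not in existing
    ]


# Hypothetical inventory.
audit_groups = [
    {"AutoScalingGroupName": "web", "LoadBalancerNames": ["web-elb", "old-elb"]},
    {"AutoScalingGroupName": "workers", "LoadBalancerNames": []},
]
audit_configs = [
    {"LaunchConfigurationName": "web-v1", "ImageId": "ami-111"},
    {"LaunchConfigurationName": "web-v2", "ImageId": "ami-222"},
]
print(groups_with_deleted_load_balancers(audit_groups, ["web-elb"]))
print(launch_configs_with_deleted_ami(audit_configs, ["ami-222"]))
```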

Route53 High TTL RR Set

This check examines resource record sets that can benefit from having a lower time-to-live (TTL) value. A long TTL can cause unnecessary delays in rerouting traffic.

Botmetric identifies whether a resource record set has a TTL greater than 60 seconds and whether it is associated with a health check. It also checks whether its routing policy is set to ‘Failover’.
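
One possible reading of this check, sketched below, flags a record set when its TTL exceeds 60 seconds and it either has a health check attached or uses failover routing. That combination is an interpretation, not a documented rule. The records are assumed to look like the `ResourceRecordSets` entries from boto3's `list_resource_record_sets` (`Name`, `Type`, `TTL`, `HealthCheckId`, `Failover`), and the sample domain names are placeholders.

```python
def high_ttl_record_sets(record_sets, max_ttl=60):
    """Return (name, type, ttl) for failover-relevant records with a long TTL."""
    flagged = []
    for record in record_sets:
        ttl = record.get("TTL")  # alias records carry no TTL of their own
        if ttl is None or ttl <= max_ttl:
            continue
        if record.get("HealthCheckId") or record.get("Failover"):
            flagged.append((record["Name"], record["Type"], ttl))
    return flagged


# Hypothetical record sets.
sample_records = [
    {"Name": "app.example.com.", "Type": "A", "TTL": 300,
     "Failover": "PRIMARY", "HealthCheckId": "hc-1"},
    {"Name": "app.example.com.", "Type": "A", "TTL": 60, "Failover": "SECONDARY"},
    {"Name": "static.example.com.", "Type": "CNAME", "TTL": 3600},
]
print(high_ttl_record_sets(sample_records))
```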

Volume Snapshot

Take regular snapshots of your EBS volumes; snapshots are incremental backups stored in Amazon S3. Botmetric provides a list of EBS volumes that either have no snapshot or lack a recent one. It recommends that you take regular snapshots of the required volumes for disaster recovery purposes. With Botmetric’s DevOps Cloud Automation, you can schedule a job that automatically takes EBS volume snapshots based on specified instance or volume tags.
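
As a sketch of how such a list could be derived, the function below compares each volume against the start time of its newest snapshot. It assumes snapshot records shaped like boto3's `describe_snapshots` entries (`VolumeId`, `StartTime`). The one-day freshness window, the function name, and the volume IDs are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def volumes_needing_snapshot(volume_ids, snapshots, now, max_age=timedelta(days=1)):
    """Return volumes with no snapshot, or none newer than max_age."""
    latest = {}
    for snap in snapshots:
        volume_id = snap["VolumeId"]
        if volume_id not in latest or snap["StartTime"] > latest[volume_id]:
            latest[volume_id] = snap["StartTime"]
    return [
        v for v in volume_ids
        if v not in latest or now - latest[v] > max_age
    ]


# Hypothetical inventory, evaluated at a fixed point in time.
now = datetime(2017, 4, 25, 12, 0, tzinfo=timezone.utc)
sample_snapshots = [
    {"VolumeId": "vol-aaa", "StartTime": now - timedelta(hours=3)},
    {"VolumeId": "vol-bbb", "StartTime": now - timedelta(days=5)},
]
print(volumes_needing_snapshot(["vol-aaa", "vol-bbb", "vol-ccc"], sample_snapshots, now))
```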

RDS Backup

Amazon RDS has an automatic backup feature that enables point-in-time recovery for your DB instance and allows you to restore it to any second during your retention period, up to the last five minutes. Botmetric provides a list of RDS instances that either have no backup or have a backup retention period below the recommended level. The retention period for automated backups can be set to as long as thirty-five days, so you can keep more than a month of backups. Botmetric’s DevOps Cloud Automation feature schedules a job that automatically backs up your RDS data based on specified instance tags.
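
A small sketch of the retention check, assuming DB instance records shaped like boto3's `describe_db_instances` entries, where a `BackupRetentionPeriod` of 0 means automated backups are disabled. The seven-day recommended level is a placeholder chosen for the example, since the recommended level is not stated here, and the function name is hypothetical.

```python
def rds_backup_findings(db_instances, recommended_days=7):
    """Map DB identifier -> finding for missing or short backup retention."""
    findings = {}
    for db in db_instances:
        name = db["DBInstanceIdentifier"]
        days = db.get("BackupRetentionPeriod", 0)
        if days == 0:
            findings[name] = "automated backups disabled"
        elif days < recommended_days:
            findings[name] = f"retention of {days} days is below {recommended_days}"
    return findings


# Hypothetical DB instances.
sample_backups = [
    {"DBInstanceIdentifier": "orders-db", "BackupRetentionPeriod": 14},
    {"DBInstanceIdentifier": "reports-db", "BackupRetentionPeriod": 1},
    {"DBInstanceIdentifier": "scratch-db", "BackupRetentionPeriod": 0},
]
print(rds_backup_findings(sample_backups))
```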

S3 Access Configuration

Botmetric identifies S3 buckets that don’t have correct logging configurations enabled. By default, Amazon S3 buckets and all their objects are private; only the owner can grant Read/Write permissions to other resources and users. When logging is first enabled, the configuration is automatically validated. However, later modifications can result in logging failures. To avoid this, write an access policy for each of the permissions that you grant to other resources.

EC2 Instance BackUp

Just like EBS volume snapshots, regularly back up your instance using Amazon EBS snapshots.

EC2 Instance Scheduled Retirement

When an EC2 instance reaches its scheduled retirement date, it is automatically stopped or terminated by AWS and you will have no more access to the data on that retired instance. If your instance’s root device is an EBS volume, you can replace the instance by creating an AMI of your instance, and launching a new instance from the AMI. If you are unaware of the process, we advise you to reach out to us. We would love to guide you through the process.

Apart from these audits, we highly recommend that you periodically copy your data backups across AWS regions.

With Botmetric, you can do so by scheduling a job for cross-region copy:

Copy EBS Volume snapshot (based on volume tags) across regions
Copy RDS snapshot (based on RDS tags) across regions
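
To make the tag-based selection concrete, the sketch below picks completed EBS snapshots carrying a given tag and builds the keyword arguments (`SourceRegion`, `SourceSnapshotId`, `Description`) that boto3's EC2 `copy_snapshot` call accepts. It only plans the copies and never calls AWS. It also assumes the tag sits on the snapshot itself, which is a simplification of the volume-tag approach above, and the tag key, regions, and snapshot IDs are made up.

```python
def has_tag(resource, key, value):
    """True if the resource carries the given tag key/value pair."""
    return any(
        t["Key"] == key and t["Value"] == value
        for t in resource.get("Tags", [])
    )


def plan_cross_region_copies(snapshots, source_region, tag_key, tag_value):
    """Build copy_snapshot keyword arguments for tagged, completed snapshots."""
    return [
        {
            "SourceRegion": source_region,
            "SourceSnapshotId": snap["SnapshotId"],
            "Description": f"DR copy of {snap['SnapshotId']} from {source_region}",
        }
        for snap in snapshots
        if snap.get("State") == "completed" and has_tag(snap, tag_key, tag_value)
    ]


# Hypothetical snapshots in the source region.
sample_snaps = [
    {"SnapshotId": "snap-001", "State": "completed",
     "Tags": [{"Key": "dr-copy", "Value": "true"}]},
    {"SnapshotId": "snap-002", "State": "pending",
     "Tags": [{"Key": "dr-copy", "Value": "true"}]},
    {"SnapshotId": "snap-003", "State": "completed", "Tags": []},
]
for request in plan_cross_region_copies(sample_snaps, "us-east-1", "dr-copy", "true"):
    print(request)
```

Each planned request could then be passed as keyword arguments to `copy_snapshot` on an EC2 client created in the destination region, since the copy is issued from the region that receives it.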

These are just a few of the automated audits that Botmetric performs. With timely automation of DR backup tasks and the right strategy in place, you can safely make your AWS Cloud Infrastructure DR ready.

Check your DR Policy Compliance with Botmetric. Sign up here: Botmetric 14-day free trial
We hope you’ll make the most of these tips to ‘DR-Proof’ your AWS Cloud Infrastructure. Feel free to share them with your peers. You can connect with us on Twitter, LinkedIn, and Facebook to catch up on more updates on AWS and Botmetric!

Cheers.