BangDB High Availability

Introduction

Business continuity requires services to remain available 99.99% to 99.999% of the time. This demands resilience against disruptions and unforeseen incidents. Businesses need a plan and the capabilities to withstand outages from any cause (application faults, machine failures, natural disasters, network issues, etc.). Therefore, to achieve 99.999% uptime, we need mechanisms for backup and recovery, failover, and load balancing, so that users see zero or near-zero disruption.

Benefits of High Availability (HA)

High availability not only protects the business from losing data but also provides mechanisms for safeguarding other critical aspects of the business.

Availability Problems

Availability problems can be grouped into the following categories:

Planned outages

Planned outages are scheduled maintenance windows in which the application, servers, or machines are updated, for example for upgrades, security patches, or dependent-library updates. To do this without affecting end users, we can switch production traffic to a standby environment, upgrade the primary environment, and then switch back. The same steps can then be repeated for the other environments. A data backup should be taken before starting the process.
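The switch-upgrade-switch-back sequence can be sketched as a small orchestration function. This is only an illustration of the ordering: `backup`, `route_traffic_to`, and `upgrade` are hypothetical callables supplied by the operator, not BangDB APIs.

```python
def run_planned_maintenance(backup, route_traffic_to, upgrade, primary, standby):
    """Upgrade `primary` while `standby` serves users, then switch back.

    All three callables are hypothetical hooks provided by the caller.
    If `upgrade` raises, traffic deliberately stays on the standby.
    """
    backup(primary)             # 1. take a data backup before changing anything
    route_traffic_to(standby)   # 2. users are now served by the standby
    upgrade(primary)            # 3. patch or upgrade the idle primary
    route_traffic_to(primary)   # 4. switch production back to the primary


if __name__ == "__main__":
    run_planned_maintenance(
        backup=lambda env: print("backup", env),
        route_traffic_to=lambda env: print("traffic ->", env),
        upgrade=lambda env: print("upgrade", env),
        primary="primary",
        standby="standby",
    )
```

Leaving traffic on the standby when the upgrade fails is a design choice of this sketch; it avoids routing users to a half-upgraded environment.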

Unplanned outages

Unplanned outages can happen for many reasons, known or unknown. Whatever the cause, a plan should be in place to recover from the outage and return to a consistent point for business operation.

Human error, software issues, hardware failures, environment problems, and natural disasters are among the possible causes of such outages.

A replicated environment for the application, along with a data backup and recovery strategy, is important for addressing them.

Disaster Recovery

A disaster recovery strategy defines the plan for copying the data and other important state to a remote location and recovering the application state from that copy. It is important to set up the procedure and leverage BangDB's built-in features to achieve this smoothly. The important points to consider, which have direct implications for the plan and strategy, are the following.

Recovery point objective (RPO) - affects the tolerance of data loss
Recovery time objective (RTO) - affects the time to recover
Frequency of backup - affects the bandwidth
Analysis of application, load, size of data etc.

Efficient backups

Backups are at the center of disaster recovery, so a clear backup path should be defined based on the needs. The choice has cost, performance, recovery-time, and other implications, and should therefore be defined in consultation with the business owner.

The important inputs are the RPO and RTO, the associated cost and budget, the performance of the application, the data size, and so on. We must analyze these to arrive at the best possible backup strategy.

Load balancing

Load balancing not only provides an abstraction over the entire application but also lets us manage the workload and balance it across servers. We can use one or more of the following techniques to achieve this.

Front-end load balancer for routing requests
Request distribution using servers
Distributed application and database
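As an illustration of request distribution, here is a minimal round-robin balancer that skips servers marked unhealthy. It is a generic sketch, not BangDB's load balancer; the class name, its methods, and the server names are all hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotate requests across servers, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy server available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["db-1", "db-2", "db-3"])
    lb.mark_down("db-2")
    print([lb.pick() for _ in range(4)])
```

A production balancer would also run health checks to call `mark_down` and `mark_up` automatically; here they are left to the caller.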

High availability criteria

Uptime requirement

A typical target is 99.999% uptime, but the actual requirement depends on the application and business operations. Here is a general view of different uptime levels and the downtime each allows.

99% uptime means about 87.6 hours of downtime per year (about 7.3 hours per month)
99.9% uptime means about 8.8 hours of downtime per year (about 44 minutes per month)
99.99% uptime means about 53 minutes of downtime per year (about 4.4 minutes per month)
99.999% uptime means about 5.3 minutes of downtime per year (about 26 seconds per month)
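The downtime budgets above follow directly from the uptime percentage. A short sketch of the arithmetic, assuming a 365-day year and twelve equal months (the function name is only illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Allowed downtime per year, in hours, for a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


for pct in (99.0, 99.9, 99.99, 99.999):
    yearly = downtime_hours_per_year(pct)
    monthly_minutes = yearly * 60 / 12
    print(f"{pct}% uptime: {yearly:.2f} h/year, {monthly_minutes:.1f} min/month")
```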

An HA plan can be created based on the uptime requirement, which also has a direct impact on the associated cost and budget.

Recovery Point Objective

The RPO is how much data loss the organization is willing to tolerate in the event of an incident. It therefore determines how frequently data backups should be taken. Options for meeting it include replicating data files and checkpointed data along with the write-ahead log, and disk-level or block-level syncing and replication.
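To make the link between RPO and backup frequency concrete, the sketch below checks whether a periodic backup schedule fits within an RPO. It assumes each backup is a snapshot of the state at the moment it starts, and that no write-ahead log is shipped between backups; under those assumptions, the worst-case loss is one backup interval plus the time the backup takes to complete. The function names are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Worst-case window of lost writes for periodic snapshot backups.

    Assumption: a failure just before a running backup completes forces a
    restore from the previous snapshot, taken one interval earlier.
    """
    return backup_interval + backup_duration


def meets_rpo(rpo, backup_interval, backup_duration=timedelta(0)):
    """True if the schedule's worst-case loss fits within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=15)
    print(meets_rpo(rpo, timedelta(minutes=10), timedelta(minutes=2)))
    print(meets_rpo(rpo, timedelta(hours=1)))
```

Continuous replication of the write-ahead log shrinks the loss window well below the backup interval, which is why it is listed alongside periodic backups.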

Recovery Time Objective

The RTO is how quickly the data should be recovered after an incident. It includes the time to copy the data back to the servers and, if the configuration is set up for it, the time to recover the data by replaying logs.
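A rough recovery-time estimate adds the two components mentioned above: copy time and log-replay time. The sketch below is a back-of-the-envelope model; the throughput figures in the usage example are made-up placeholders that would need to be measured on the real system.

```python
def estimated_recovery_seconds(backup_bytes, copy_bytes_per_sec,
                               log_bytes=0, replay_bytes_per_sec=None):
    """Estimate recovery time as copy time plus optional log-replay time."""
    if copy_bytes_per_sec <= 0:
        raise ValueError("copy_bytes_per_sec must be positive")
    seconds = backup_bytes / copy_bytes_per_sec
    if log_bytes:
        if not replay_bytes_per_sec or replay_bytes_per_sec <= 0:
            raise ValueError("replay_bytes_per_sec must be positive when replaying logs")
        seconds += log_bytes / replay_bytes_per_sec
    return seconds


if __name__ == "__main__":
    # Hypothetical figures: 100 GB backup copied at 100 MB/s,
    # plus 10 GB of log replayed at 50 MB/s.
    total = estimated_recovery_seconds(100e9, 100e6, log_bytes=10e9,
                                       replay_bytes_per_sec=50e6)
    print(f"estimated recovery: {total / 60:.1f} minutes")
```

Comparing this estimate with the RTO shows whether a restore-from-backup plan is sufficient or whether a hot standby is needed.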

Resilience Requirement

What level of resilience does the application require? The options, from most to least persistent, are:

Persist everything: application state, data, log, communications
Persist application state, data, log
Persist application state, data
Application state
Nothing

Automated failover and switchover

BangDB can switch to a standby automatically. This needs to be set up in advance; once deployed, the cluster can switch to the standby autonomously, or manually if needed.

System performance

High availability has performance implications, which stem from the replication mechanism, the number of slaves, the distribution of writes and reads, the size and frequency of backups, the overall data size, and so on. It is therefore imperative to factor the application's performance needs into the plan when defining the strategy.

System overview for various HA plans

BangDB can be deployed with replicas in place and the entire system behind a load balancer. The replicas can be Active-Active or Active-Passive. Operations can be replicated continuously in sync, or through file-based sync-up in batch mode. The diagram below describes the deployment at a high level.

We can also direct reads and writes to different servers, for example writes to the master and reads to a slave.
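A read/write split of this kind can be sketched as a small router placed in front of the servers. This is a generic illustration, not a BangDB client API; the class name, the operation labels "read" and "write", and the server names are assumptions.

```python
from itertools import cycle


class ReadWriteRouter:
    """Send writes to the master and spread reads across the slaves."""

    def __init__(self, master, slaves=()):
        self.master = master
        slaves = list(slaves)
        # With no slaves configured, reads fall back to the master.
        self._slave_ring = cycle(slaves) if slaves else None

    def route(self, operation):
        if operation == "write":
            return self.master
        if operation == "read":
            if self._slave_ring is None:
                return self.master
            return next(self._slave_ring)
        raise ValueError(f"unknown operation: {operation!r}")


if __name__ == "__main__":
    router = ReadWriteRouter("master-1", ["slave-1", "slave-2"])
    for op in ("write", "read", "read", "read"):
        print(op, "->", router.route(op))
```

Note that reads served by a slave may lag behind the master when replication is not synchronous, so this split suits workloads that tolerate slightly stale reads.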

For higher availability, we can set up multiple masters, or shard the data for scale and then define replicas for each of these masters, as shown in the image below.

Automatic failover can be achieved using the built-in mechanism, in which a slave detects a dead master, promotes itself to master, and makes the remaining servers its slaves.
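The general idea of detecting a dead master and promoting a replacement can be sketched with heartbeat timestamps. This is not BangDB's built-in mechanism, whose details are not described here. It is a simplified model that assumes every server sees the same heartbeat table and that the live slave with the lowest identifier is promoted, so all survivors reach the same decision.

```python
def elect_master(master, slaves, last_heartbeat, now, timeout):
    """Return (master, slaves) after checking heartbeats.

    last_heartbeat maps each server id to the time it was last heard from.
    A server is considered dead once it has been silent for more than `timeout`.
    """
    def alive(server):
        return now - last_heartbeat[server] <= timeout

    if alive(master):
        return master, list(slaves)  # nothing to do

    live_slaves = [s for s in slaves if alive(s)]
    if not live_slaves:
        raise RuntimeError("no live slave available for promotion")

    # Assumed tie-break rule: promote the lowest id so every node agrees.
    new_master = min(live_slaves)
    return new_master, [s for s in live_slaves if s != new_master]


if __name__ == "__main__":
    heartbeats = {"m1": 0.0, "s1": 9.0, "s2": 10.0}
    print(elect_master("m1", ["s2", "s1"], heartbeats, now=10.0, timeout=3.0))
```

A real deployment also has to guard against network partitions, where a master that is merely unreachable keeps accepting writes; that is outside the scope of this sketch.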

The entire system can then be deployed in multiple availability zones for higher availability.