BangDB Disaster Recovery
The Disaster Recovery plan is composed of a number of sections that document resources and procedures to be used in the event that a disaster occurs. Each supported computing platform has a section containing specific recovery procedures. There are also sections that document the personnel that will be needed to perform the recovery tasks and an organizational structure for the recovery process.
Goal of the plan
- To minimize interruptions to the normal operations.
- To limit the extent of disruption and damage.
- To minimize the economic impact of the interruption.
- To establish alternative means of operation in advance.
- To train personnel with emergency procedures.
- To provide for smooth and rapid restoration of service.
Overview of the BangDB Platform and System
BangDB is a converged database designed to align with emerging data trends and the challenges they bring, helping users build modern applications in an accelerated manner. At a high level, BangDB combines the following:
- Multi-model database
- Stream and Time-series data processing
- Graph Processing
Most use cases today demand some or all of the above features. Different tools and systems exist for each area, and it is up to the developers, users, or organizations to take those different pieces and stitch them together before building a business application on top. Building such a platform takes a lot of resources, time, and money. Further, building and then maintaining applications on such a system is hard. Finally, having different silos for different areas makes it hard to scale and perform at the same time.
Therefore, there is a need for a converged platform which natively supports these different high-level functionalities and allows users to focus on building the application. The platform acts like an off-the-shelf, no-code platform for building applications. BangDB is currently offered in the following editions:
- Community Edition
This document is intended for developers, sysadmins, DevOps engineers, CIOs, and other groups managing a BangDB deployment. It provides a blueprint for managing the deployment and ensuring that business operations and data are resilient to any unforeseen and untoward issues or incidents: data and applications stay secure from disruptions, and data is never lost in any case.
Purpose of Disaster Recovery plan
Data is at the core of any application and business. It is imperative and critical that data is never lost in any scenario. Data loss can occur in the real world for many different reasons, which fall into the following categories:
- Catastrophic failure of system, infrastructure
- Calamities and natural disaster
- Human error
Any of the above disasters can happen, and the plan should be ready to tackle them. This document details the steps for the concerned people to ensure that data is always secure, persisted, and recoverable in any scenario.
A backup and disaster recovery plan and strategy is necessary to protect mission-critical data against these types of attacks and incidents. With a strategy in place, the business and the concerned people can be at peace as far as data protection is concerned, as they will have a plan in case of such disasters.
Database systems are the most important and critical piece of an application. It is of paramount importance for organizations to safeguard their data and have a plan in place to recover it under any condition. However, creating the plan is not straightforward: it can range from simple to complex depending on the criteria, context, and requirements. Therefore, a lot of inherent support is sought from the database itself.
BangDB has worked, and continues to work, on simplifying this process while still providing a guarantee of data recovery and persistence.
A backup system takes a snapshot of the data at a point in time and secures it for future recovery. When we need to recover data, we simply use the snapshot to return to the point at which the backup was taken. However, many factors may be important when defining the best strategy for an organization. It is also important to take the various database configurations and strategies into account: most of the time the database configuration influences the backup strategy, and at the same time the backup considerations may require the database configuration to be set in a particular manner.
There are a few important points for an organization to consider when setting up its disaster recovery plan:
- Recovery point objective
- Recovery time objective
- Cost & Budget
BangDB was designed with the first two, the recovery point and recovery time objectives, as goals. BangDB has several elements in place that make it possible, through simple configuration, to align with the needs of the organization. Because the setup of these configurations matters, the two objectives should be defined before deciding on the backup strategy.
There are tradeoffs that the application owner or the organization has to work with; BangDB provides the flexibility and means to deal with the requirements as needed.
The other important aspect is the maintenance and resource cost of managing the backups. These need to be evaluated in the context of the entire application or organization.
|Point of consideration|Description|
|---|---|
|Recovery point objective|How much data loss the organization is willing to tolerate in the event of an incident|
|Recovery time objective|How quickly the data should be recovered after an incident. This includes the time to copy the data back to the servers and, if configured that way, to recover the data by replaying logs|
|Isolation|Backups should be kept separate from the physical production system, ideally in multiple places|
|Backup types|Back up the entire db files; back up only the WAL files; or back up both db and WAL files|
|Restore process|A very important part of the strategy; BangDB offers several choices and options here. Restore can be done using the inbuilt auto-recovery process or by restoring the entire db files|
|System state during backup|In case of a distributed deployment with shards, the backup must remain consistent from the data and application point of view|
|Backup mechanism|Copying of files (db and/or WAL files); block/disk sync, snapshots, or datadump|
|Replication|Replication of data also provides a natural backup|
BangDB backup strategies
Data dump strategy

Steps to take backup:
- Take the dump using datadump() from a client or the CLI
- Copy the dump file to network storage, the cloud, or elsewhere

Steps to recover:
- Copy the dump file to the data folder
- Start the database

Pros:
- Simple to take the dump
- It dumps the database files only; log (WAL) files are not required
- Recovery isn't required, as the db files are in a proper state when dumped, so BangDB starts without entering recovery mode
- Good for small databases with less load and fewer operations

Cons:
- It takes a full db dump every time, which is a heavy operation and not suitable for frequent dumps
- The entire set of db files needs to be copied to the network location every time
- Not suitable for busy, mid-to-large databases that need frequent backups
Copy the DB and log files strategy

Steps to take backup:
- Copy the db files (all files under the /data folder) and the WAL files (all files under the /logdir folder) to a secure location

Steps to recover:
- Copy the backup files into the /data and /logdir folders
- Start the server; the server performs auto recovery

Pros:
- Very simple operation
- Recovery is simple and automated; nothing extra is required on the user's side
- In case of a crash, machine breakdown, or any other failure, the DB always auto-recovers
- It is based on the Write Ahead Log (WAL) mechanism coupled with checkpointed db data, so it recovers very fast (average 50 MB/sec). If checkpointing is frequent, a few seconds are enough to recover the data (only the portion after the last checkpoint mark is replayed)

Cons:
- Both the /data and /logdir folders need to be copied; it is better to sync the mounted filesystem
BangDB provides datadump functionality, which dumps the entire database (or selected tables) to disk. This dump file can then be copied to a secure location for future needs. The database doesn't need to be stopped; the dump is performed like any other operation.
Steps to take backup
Steps to recover
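The backup and recovery steps described above can be sketched as two small POSIX shell helpers. This is an illustrative sketch only: the directory layout and the `.dump` file extension are assumptions, and the dump itself is produced by datadump() from a client or the CLI (not shown here).

```shell
# Sketch of the datadump backup/restore flow. The dump itself is taken
# with datadump() from a client or the CLI; these helpers only cover the
# file movement. Paths and the .dump extension are assumptions.

backup_dump() {
  data_dir=$1; backup_dir=$2
  mkdir -p "$backup_dir"
  cp "$data_dir"/*.dump "$backup_dir"/     # move the dump off the db host
}

restore_dump() {
  backup_dir=$1; data_dir=$2
  mkdir -p "$data_dir"
  cp "$backup_dir"/*.dump "$data_dir"/     # put the dump back in the data folder
  # then start the database; it loads the dump without entering recovery mode
}
```

In practice the backup destination would be network storage or a cloud bucket, ideally in a different location from the production system.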
In this strategy, the underlying db and log files are continuously copied to a secure location. Typically, a daemon or some other service run by the user keeps taking backups of the db and log files. This doesn't disturb the database at all and depends only on network bandwidth. Ideally there should be a dedicated network line for this copying, so that normal connections are not used, as the files can be large. Block-level copying or syncing can also be set up, and many other efficient tools and techniques can keep the mounted disk in sync with a network disk: filesystem snapshots, filesystem copy tools, or other enterprise tools can be used to keep these two folders synced with the backup folders.
The database doesn't need to be stopped, and the copy can be done anytime, as required or configured.
Steps to take backup
Steps to recover
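As a minimal sketch of the file-copy strategy above, the two folders can be mirrored to a backup location with a small shell function. All paths are assumptions, and plain `cp` is used here for portability; in practice rsync, block-level sync, or filesystem snapshots would be more efficient.

```shell
# Keep a backup location in sync with the live /data and /logdir folders.
# Plain cp is used for portability; rsync or snapshot tooling is more
# efficient in practice. Paths are assumptions.

sync_dir() {
  src=$1; dest=$2
  rm -rf "$dest"
  mkdir -p "$dest"
  cp -R "$src"/. "$dest"/
}

# Run periodically (e.g. from cron or a daemon), ideally over a dedicated link:
#   sync_dir /var/bangdb/data   /mnt/backup/bangdb/data
#   sync_dir /var/bangdb/logdir /mnt/backup/bangdb/logdir
# To recover: copy both folders back and start the server; it auto-recovers.
```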
The recovery time depends on the database checkpointing configuration. BangDB's default checkpointing frequency is aggressive, so it works for most heavy-load scenarios as well. However, it can be tweaked as per requirement to balance the RPO and RTO tradeoffs.
To set these parameters correctly, use the following:

CHKPNT_ENABLED = 1      // enabled by default
CHKPNT_FREQ = 933700    // default value in microseconds [almost 1 sec]
Notes on checkpointing
If checkpointing is ON, BangDB keeps syncing data from the buffer pool (page cache) to the disk; it checks its internal stats and figures out whether data needs to be written to the disk or not. Since WAL is in place in BangDB, checkpointing is not mandatory: WAL itself guarantees data durability, and when required the DB can recover the data from the WAL files using auto recovery mode. WAL replay runs at around 50 MB/sec, so the time taken for recovery depends on how much data is to be recovered.
When checkpointing is ON, the db recovers data from the last checkpoint mark, i.e., only the data that has not yet been written to the db files is replayed. Hence the amount of data to be recovered can be reduced to a great extent, and the size of the db doesn't matter. Depending on the checkpointing setting and the number of operations hitting the database, if on average 1 GB of data is not checkpointed, then recovery would take ~1000/50 = ~20 seconds. At this point the data is recovered and the db is in a consistent state; after this, the DB may perform a few more operations to get some stats ready.
Therefore, it's important to set the checkpoint frequency appropriately: more frequent checkpointing causes the db to do more data-sync work, while less frequent checkpointing increases recovery time. Please note that the checkpoint frequency doesn't force the db to checkpoint every interval; it's up to the db to take the call, and it uses several parameters to decide whether to write data or not.
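The recovery-time arithmetic above can be captured in a tiny helper. This is a back-of-the-envelope sketch: the ~50 MB/sec replay speed is the figure quoted above, not a guarantee, and real recovery time also depends on disk and network speed.

```shell
# Estimate WAL-replay recovery time: un-checkpointed data (in MB) divided
# by the ~50 MB/sec replay speed quoted above. Integer arithmetic only.

estimate_recovery_seconds() {
  uncheckpointed_mb=$1
  echo $(( uncheckpointed_mb / 50 ))
}

# e.g. ~1 GB (1000 MB) of un-checkpointed data:
#   estimate_recovery_seconds 1000    # prints 20
```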
BangDB backup strategies comparison
| |File System Snapshots (default configuration)|File System Snapshots (customized configuration)|Datadump|
|---|---|---|---|
|Recovery Point Objective|Default snapshot frequency along with checkpoint frequency, for both data and logdir. Works well for mid-to-large databases with an above-average number of operations of all sorts. Around 1 sec for checkpoint frequency and 1 to 6 hours of backup frequency (it could also be continuous, using other tools)|Customized configuration settings based on requirements, for extremely high performance and less frequent snapshots. The checkpoint setting could be a few seconds, with a backup frequency of every 6 to 24 hours. Only the data files could be copied (without log files), but that demands a higher RPO tolerance|Once a day is good, but suitable only for small databases with few operations|
|Recovery Time Objective|Depends on many things outside the database as well, for example where the backup sits on the network, latency, size, etc. But from the DB perspective this works fast: more frequent snapshots, faster recovery time|If customized, it depends on the frequency of checkpointing and backup. If only db files are copied, recovery time is fastest, but the RPO needs to be adjusted accordingly|No loss of time from the DB perspective for recovery; only network latency and the size of the dump file matter|
|Isolation|Depends on how far away the snapshots are kept and the latency to copy the data from the remote server to the db server|Depends on how far away the snapshots are kept and the latency to copy the data from the remote server to the db server|Depends on how far away the snapshots are kept and the latency to copy the data from the remote server to the db server|
|Performance Impact|The default setting is optimal for high DB performance, RPO, and RTO, so most of the time it is good for mid-to-large databases. But copying the data back and forth takes time; syncing at disk level can be efficient, but the cost needs to be considered|Parameters can be customized and set as needed. High runtime performance implies less frequent checkpointing and hence a longer recovery time|Overall recovery time is highest in this case (the entire dump must be copied back), and runtime performance is lowest while the full dump is being taken|
|Restore process|BangDB auto-recovers the data; no other steps are needed|BangDB auto-recovers the data; no other steps are needed|Recovery is not needed|
Details of BangDB backup mechanism
High level of BangDB checkpoint and backup operations
This section describes how BangDB checkpoints the data and the steps involved in backing up the files and data.
Disaster Recovery with different availability zones
BangDB can be deployed across multiple availability zones, with the files and data backed up accordingly. This also ensures higher availability.
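A simple way to realize this is to replicate the same backup set to storage mounted from different availability zones. The sketch below assumes each zone's backup target is locally mounted; all paths are illustrative assumptions.

```shell
# Sketch: keep copies of the same backup (data + logdir snapshot) in
# several availability zones by copying to multiple mounted locations.
# Paths are assumptions.

replicate_backup() {
  src=$1; shift
  for dest in "$@"; do
    rm -rf "$dest"
    mkdir -p "$dest"
    cp -R "$src"/. "$dest"/
  done
}

# e.g. push the current backup to two zone-mounted targets:
#   replicate_backup /mnt/backup/bangdb /mnt/az1/bangdb /mnt/az2/bangdb
```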
Critical Timeframe, System Impacts and Statements
- Findings: what happened and when it occurred
- Stakeholders informed about the incident
- Team and clients informed if they are somewhat vulnerable to an outage (in case high availability is not set up, or the entire availability setup is at risk)
- Time assessment for data recovery
- Start the recovery process and update stakeholders
- Collect the evidence of the incident: core dump if any, db logs, syslogs, binaries. Forward it to the concerned team for investigation. Take a backup of the db and log files if possible
- Escalate as per the plan, if necessary, for possible outages and for outages that have occurred