bangdb.config
Configuration for BangDB
BangDB exposes several parameters that let users configure it for optimal and efficient operation and performance.
Let's first categorise these config params for better understanding and then go into each of them. We will also provide recommendations for each of these config params.
- Config that affects execution or running of BangDB
- Config specific when BangDB runs as a server
- AI/ML related server and db config
- Advanced config to tune core of BangDB
Before we cover this in detail, it is good to see what command line arguments the server takes if you run bangdb-server-2.0 directly. Here are these options; they are self explanatory.
Usage: -i [master | slave] -r [yes | no] -t [yes | no] -d [dbname] -s [IP:PORT] -m [IP:PORT] -p [IP] -b [yes | no] -c [tcp | http | hybrid] -w [HTTP_PORT] -v
Options:
-i: defines the server's identity [master | slave], default is SERVER_TYPE (master) as defined in bangdb.config
-r: defines replication state [yes | no], default is ENABLE_REPLICATION (0) as defined in bangdb.config
-t: defines if transaction is enabled (yes) or disabled (no) [yes | no], default is no
-d: defines the dbname, default is BANGDB_DATABASE_NAME (mydb) as defined in bangdb.config
-s: defines IP:Port of this server, default is SERVER_ID:SERV_PORT as defined in bangdb.config
-m: defines IP:Port of the master (required only for slave, as it declares the master with this option), default is MASTER_SERVER_ID:MASTER_SERV_PORT as defined in bangdb.config
-p: defines public IP of the server (required for master and slave to expose their own public IP)
-b: defines if the server is to be run in the background as a daemon, default is foreground
-c: defines if the server runs as a tcp server, an http (rest) server, or both (hybrid), default is tcp server
-w: defines the http port when the server runs in http or hybrid mode
-v: prints the alpha-numeric version of the executable
Hence, to run a master with the other values as defined in bangdb.config, issue the following command:
./bangdb-server-2.0 -s 192.168.1.5:10101
If we add -c hybrid -w 18080, then the server runs in hybrid (tcp and http(s)) mode with 18080 as the http(s) port.
To run a slave for this master with the other values at defaults:
./bangdb-server-2.0 -i slave -s 192.168.1.6:10102 -m 192.168.1.5:10101
etc…
The command line args can be provided only when you run the server directly from the executable, bangdb-server-2.0. If you run the BangDB server using the script bangdb-server, then it's not possible to provide these command line args. However, you may set all of these in the bangdb.config file and then run using either method.
Let's see how these params can be set using the bangdb.config:
Set master or slave
SERVER_TYPE is the config param to set whether this server is master or slave: 0 for master, 1 for slave.
Set whether replication is ON or OFF
ENABLE_REPLICATION is the config param to set it.
1 for ON and 0 for OFF
Set db name
BANGDB_DATABASE_NAME is the param. By default it is mydb.
Set the (this) server ip and port
SERVER_ID for IP address, SERV_PORT for port. We can use the ip address or the name of the server:
SERVER_ID = 127.0.0.1
SERV_PORT = 10101
Set the master's ip and port
This is mainly for a slave, as it has to know where the master is. MASTER_SERVER_ID for the ip address of the master, MASTER_SERV_PORT for the port of the master.
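Putting the params above together, a bangdb.config fragment for a slave pointing at the master from the earlier command line example might look like this (the addresses are illustrative):

```ini
# identity of this server: 0 = master, 1 = slave
SERVER_TYPE = 1
# replication: 1 = ON, 0 = OFF
ENABLE_REPLICATION = 1
BANGDB_DATABASE_NAME = mydb
# this server's address
SERVER_ID = 192.168.1.6
SERV_PORT = 10102
# the master's address (needed by the slave)
MASTER_SERVER_ID = 192.168.1.5
MASTER_SERV_PORT = 10101
```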
Run the server in the background
Need to use the -b command line argument; this can't be set using bangdb.config as of now: -b yes
Run the server with transaction
Need to use the -t command line argument; this can't be set using bangdb.config as of now: -t yes
A. Config that affects BangDB execution
The following config params are for the DB, whether it runs in embedded or server mode. Param values are set without any quotes, for both numerical and string values.
SERVER_DIR
The dir where the db files will be created. Please edit it with a suitable dir location; default is the local dir. Note: this can be provided as an input param while creating a database using DBParam.
BANGDB_LOG_DIR
Log dir. This is where the database write ahead log files will be kept. Default is the local dir. Note: this can be provided as an input param while creating a database using DBParam.
BUFF_POOL_SIZE_HINT
Memory budget for the DB, defined in MB. Once set, BangDB will not use more memory than this. If it is handling more data than the size of the buffer pool, then it flushes dirty pages as required. BangDB holds a patent on managing the buffer pool in a manner which is very efficient, keeps performance in an acceptable range even in adverse conditions, and tends to degrade gracefully.
We should select this carefully as it has a direct implication on performance. The max limit for the buffer pool size on a machine is ~13TB and the min limit is 500MB.
The ideal value of course depends on the use case, but if it's a dedicated BangDB server then the buffer pool should be RAM size minus 3-4 GB. Therefore, on a 16GB machine, 11-12 GB would be a good number.
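For example, on a dedicated 16GB server, the sizing rule above (RAM minus 3-4 GB, value in MB) would give:

```ini
# 16GB RAM minus ~4GB for the OS and other processes
BUFF_POOL_SIZE_HINT = 12288
```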
BangDB's buffer pool is very efficient and performant, and implements several novel techniques for high performance. BangDB holds a patent for adaptive prefetching in the buffer pool, and also a patent for the buffer pool and page cache.
BANGDB_APP_LOG
When set to 1, BangDB logs using syslog (the /var/log/syslog file). When set to 0, BangDB flushes the logs to standard output (terminal). When set to 2, it flushes to the log file maintained by BangDB. The preferred value is 2, as BangDB implements a high performance logging mechanism.
DB_APP_LOG_SIZE_MB
This sets the size of the applog (when BANGDB_APP_LOG = 2). When the applog file gets full, it creates another one and keeps rolling.
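A sketch of the app-log settings described above; the size value here is illustrative, not a documented default:

```ini
# 0 = stdout, 1 = syslog, 2 = BangDB's own log file (preferred)
BANGDB_APP_LOG = 2
# roll the applog file at this size (MB); illustrative value
DB_APP_LOG_SIZE_MB = 256
```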
BANGDB_APP_LOG_LEVEL
This sets the log level, following options:
BANGDB_DATABASE_NAME
You may leave it at the default value here. Please note, you can always pass the dbname through the command line or using the API.
CEP_BUFFER_SW_SIZE
BangDB provides complex event processing (CEP) support, so we can look for a given complex pattern in streaming data. These pattern analyses are state-based queries which run in a sliding window. Most CEP offerings in the market use an in-memory model, which is a bit inefficient: if we run a few queries over a period of time and event ingestion is moderately high, memory becomes insufficient and the system starts dropping events. To get rid of this bottleneck, BangDB's CEP buffer is backed by a table, and this table runs in a sliding window. This param sets the sliding window size for the buffer table for CEP related items. The size is defined in seconds; default is 86400 (1 day).
BANGDB_PERSIST_TYPE
This is a table config param, it basically tells whether the table should be backed by file on the disk or is it going to be in-memory. This should be set by using TableEnv type
BANGDB_INDEX_TYPE
This is a table config param, it defines the index type (primary key arrangement type) for the table. This should be set by using TableEnv type
BANGDB_LOG
This is to set the database log; it is different from the app log, which is for db debug and error logging.
BangDB supports write ahead logging (WAL) for every write operation. WAL also ensures atomicity, transaction support and durability. It further allows BangDB to recover from a crash in an automated manner.
BangDB holds a patent for its efficient write ahead log.
LOG_BUF_SIZE
If BANGDB_LOG is set to 1 (ON), then we can set the size of the log file. This is an mmap area, and the WAL keeps rotating as it gets filled.
The default value (128MB) is good in most cases; however if the buffer pool size is large (for larger servers), for example 64GB or more, then 256MB is a better choice for the WAL size.
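For a larger server (64GB+ RAM), the recommendation above would look like this in bangdb.config:

```ini
# WAL enabled
BANGDB_LOG = 1
# WAL mmap buffer size in MB; 128 is the default, 256 for large buffer pools
LOG_BUF_SIZE = 256
```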
DAT_SIZE
This denotes the maximum size of data in KB. This applies only to normal key value or document data (for NORMAL and WIDE tables, see here for details). If the size is less than MAX_RESULTSET_SIZE (see below), then BangDB sets it to MAX_RESULTSET_SIZE; it can't be more than MAX_RESULTSET_SIZE. However, for larger data (LARGE table), we can deal with large data sizes, for example hundreds of MBs or GBs, up to a 20GB file/data.
KEY_SIZE
This is again a config param for table and not for db. This sets the default value for keysize when not specified using TableEnv. This should be set by using TableEnv type
MAX_RESULTSET_SIZE
BangDB supports scan methods for running range queries. These scan methods return a ResultSet, which has the list of keys/vals/docs as needed by the query. MAX_RESULTSET_SIZE defines the max size of such resultsets.
KEY_COMP_FUNCTION_ID
Since BangDB arranges keys in order, it uses two key comparison methods.
- Lexicographical
- Quasi lexicographical
Default value is 2.
BANGDB_AUTOCOMMIT
When BangDB is run in transaction mode, if auto commit is off (0) then an explicit transaction is required (begin, commit/abort); otherwise implicit non-transactional single ops can be run in the usual manner. This can be set/unset later whenever required.
BANGDB_TRANSACTION_CACHE_SIZE
BangDB supports transaction using Optimistic Concurrency Control (OCC). OCC demands size of memory kept aside for transaction related operations. BANGDB_TRANSACTION_CACHE_SIZE defines that size in the memory.
Most of the time the default size is good enough, but if you are going to club too many operations into a single transaction then the size should be increased. Note that BangDB supports many concurrent transactions, and that has little implication on this size; it is mainly for a large number of operations in a single transaction.
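A sketch of running with explicit transactions as described above; the cache size value is illustrative since the document does not state the default:

```ini
# explicit transactions required (begin, commit/abort)
BANGDB_AUTOCOMMIT = 0
# memory set aside for OCC bookkeeping; increase when a single
# transaction clubs many operations (illustrative value)
BANGDB_TRANSACTION_CACHE_SIZE = 256
```

Note that transaction mode itself is enabled with the -t command line argument when running the executable directly.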
TEXT_WORD_SIZE
BangDB supports reverse indexing, hence we need the maximum size of a token/word. TEXT_WORD_SIZE defines the same. The default is good from a logical perspective.
MAXTABLE
BangDB supports several thousands of tables; in fact it is only limited by the number of open file fds on the system, which is around 1M. But to optimise the running of BangDB, it is good to define this reasonably. The default value 16384 is good; however you may increase it as needed.
PAGE_SIZE_BANGDB
BangDB's page size can be configured. Default is 16KB, which is a good fit for most scenarios; however you may increase or decrease it as needed.
MASTER_LOG_BUF_SIZE
To maintain the WAL, the DB needs a masterlog for various housekeeping. MASTER_LOG_BUF_SIZE is the size of the master log. The default 4MB is good for many cases; however if you intend to have a large DB (a few TBs), then increase the size. Typically for a few TB of DB size, 4 - 16MB is good enough.
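Collecting the defaults discussed in this section (the PAGE_SIZE_BANGDB value assumes the page size is given in bytes; check the units in your shipped bangdb.config):

```ini
# max number of tables; default is good for most deployments
MAXTABLE = 16384
# page size, 16KB default (assuming bytes as the unit)
PAGE_SIZE_BANGDB = 16384
# master log size in MB; consider 4 - 16 for multi-TB databases
MASTER_LOG_BUF_SIZE = 4
```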
B. Config specific when BangDB runs as a server
The following are the configurations when BangDB runs as a server; hence these are server specific config params.
SERVER_TYPE
When BangDB runs as a server, it may run as master or slave. SERVER_TYPE defines whether it's master (0) or slave (1). We can pass this as a command line arg as well when we run the server directly.
./bangdb-server-2.0 -i master
./bangdb-server-2.0 -i slave
Or we can set the SERVER_TYPE param in the config file for the db; this is needed when we run BangDB using the script (bangdb-server).
ENABLE_REPLICATION
We can run BangDB Server with replication ON (1) or OFF (0). If OFF then slaves can't be attached. We can do this with command line arg as well.
./bangdb-server-2.0 -r yes
./bangdb-server-2.0 -r no
SERVER_ID
This sets the ip address or name of the server. We can do this with command line arg as well.
./bangdb-server-2.0 -s 127.0.0.1:10101
SERV_PORT
This sets the port of the server. We can do this with command line arg as well.
./bangdb-server-2.0 -s 127.0.0.1:10101
MASTER_SERVER_ID
When a server is a slave of another server, we need to tell this server about the master. This param tells the server the ip address of the master. We can do this using a command line arg as well.
./bangdb-server -m 127.0.0.1:10101
MASTER_SERV_PORT
When a server is a slave of another server, we need to tell this server about the master. This param tells the server the port of the master. We can do this using a command line arg as well.
./bangdb-server -m 127.0.0.1:10101
MAX_SLAVES
This is for the master, to set the limit on the number of slaves.
OPS_REC_BUF_SIZE
BangDB allows read/write operations to continue even when a slave is syncing with the master. This happens using the ops record buffer while syncing with a slave is in progress. OPS_REC_BUF_SIZE sets the size in MB for the ops record buffer. The default is good for most cases.
PING_FREQ
Master and slaves check each other's liveness using UDP based ping pong. PING_FREQ sets the frequency of the ping pong. The default value of 10 sec is good enough; however you may increase or decrease the frequency as needed.
PING_THRESHOLD
How many pings or pongs should fail before one can conclude that the other server is unreachable or down? PING_THRESHOLD defines that. The default of 5 times in a row is good enough.
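The liveness defaults described above, as they would appear in bangdb.config (the MAX_SLAVES value is illustrative since the document does not state a default):

```ini
# cap on the number of attached slaves (illustrative value)
MAX_SLAVES = 8
# ping-pong frequency in seconds
PING_FREQ = 10
# consecutive missed pings before declaring the peer down
PING_THRESHOLD = 5
```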
CLIENT_TIME_OUT
All clients connect to the server using tcp. The BangDB server handles tens of thousands of such concurrent connections. However, the user may define whether the server should time out connections on which no requests have been received for some period of time.
CLIENT_TIME_OUT defines the same in number of seconds. Default is 720 seconds
NUM_CONNECTIONS_IN_POOL
This is for clients only. It sets the number of connections with the server to keep in the pool for performance and efficiency purposes. Default is 48; however you may increase it as needed, with no performance impact* due to this.
SLAB_ALLOC_MEM_SIZE
BangDB Server uses pre allocated slabs for run time memory requirements. SLAB_ALLOC_MEM_SIZE defines the same in MB. default value of 256MB is good enough
TLS_IDENTITY
BangDB can run in secure mode as well and clients have to connect using the secure channel. TLS_IDENTITY can be set (reset) by the user for security purpose
TLS_PSK_KEY
BangDB can run in secure mode as well and clients have to connect using the secure channel. TLS_PSK_KEY can be set (reset) by the user for security purpose
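A sketch of enabling the secure channel with a pre-shared key; the identity and key values below are placeholders you must replace with your own:

```ini
# TLS identity and pre-shared key (placeholders, set your own)
TLS_IDENTITY = my_identity
TLS_PSK_KEY = 0123456789abcdef
```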
BANGDB_SYNC_TRAN
If set then BangDB will sync forcefully with the filesystem after flush. Ideally it should be OFF (0), but in case of hard need, you may set it ON (1)
BANGDB_SIGNAL_HANDLER_STATE
There are various signal handlers set already, but for few extra ones, user may add the handlers. Ideally not required, but still user may switch them ON
LISTENQ
Queue size for the listen() call, default 10000 is quite a good number
MAX_CLIENT_EVENTS
Maximum number of concurrent connections to the server. The server can handle the default of 10000, but you may change it to a lower number as suitable.
SERVER_STAGE_OPTION
Stage option; it tells the server how many stages to create to handle the clients and their requests. There are two types of stages supported as of now:
- two stages, one for handling clients and the other for handling the requests
- four stages, one for handling clients, one for read, one for ops and finally one for write
SERVER_OPS_WORKERS
If SERVER_STAGE_OPTION = 2, then this defines how many workers to allocate for db operations. Default 0 is fine; it allows the db to select the number of workers best suited for the given server configuration.
SERVER_READ_WORKERS
If SERVER_STAGE_OPTION = 2, then this defines how many workers to allocate for read (network). Default 0 is fine; it allows the db to select the number of workers best suited for the given server configuration.
SERVER_WRITE_WORKERS
If SERVER_STAGE_OPTION = 2, then this defines how many workers to allocate for write (network). Default 0 is fine; it allows the db to select the number of workers best suited for the given server configuration.
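A sketch of the stage/worker settings above; leaving the worker counts at 0 lets the db pick values suited to the server configuration:

```ini
# stage option as described above (the worker params below apply when set to 2)
SERVER_STAGE_OPTION = 2
# 0 = let the db choose suitable worker counts
SERVER_OPS_WORKERS = 0
SERVER_READ_WORKERS = 0
SERVER_WRITE_WORKERS = 0
```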
EXT_PROG_RUN_CHLD_PROCESS
For IE (information extraction) or ML/DL related activities, BangDB may run external code such as python or c. This flag tells whether the external libs or code run in the same process or in a separate process, for safety purposes.
The default is to run in a separate process: 0 runs in a separate process, 1 allows the db to run in the same process in case running in a separate process fails.
C. AI/ML related server and db config
Checkout this discussion on ML to know more on this
BRS_ACCESS_KEY
BangDB supports large data as well. This large data could be binary object data or files. While large object data is written into a LARGE table, files are stored in BRS.
BRS stands for BangDB Resource Server. BRS is like S3 and supports similar concepts and APIs. BangDB can run as BRS or as DB + BRS, depending on configuration (as described below).
Users may create buckets and store files in them. To access these buckets, the user may define the access key using this param. The access key can also be defined using the request json when creating such buckets.
BRS_SECRET_KEY
As with BRS_ACCESS_KEY above, this param applies to BRS buckets. To access these buckets, the user may define the secret key using this param. The secret key can also be defined using the request json when creating such buckets.
BRS_DATABASE_NAME
When BangDB runs as a separate BRS instance, it can have a different DB name, whereas if BRS runs as part of the DB then it shares the same name as the DB's database.
BRS_SERVER_ID
When BangDB runs as a separate BRS instance, it has a different IP, whereas if BRS runs as part of the DB then it shares the DB's IP. Using this param you may set the server IP address accordingly.
BRS_SERVER_PORT
When BangDB runs as a separate BRS instance, it has a different port, whereas if BRS runs as part of the DB then it shares the DB's port. Using this param you may set the server port accordingly.
BRS_ML_BUCKET_NAME
This sets the default bucket that's created by the DB at the start. You may use this (along with the default access key and secret key) to store files in this bucket.
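Putting the BRS params together; all values below are placeholders, and when BRS runs as part of the DB the name, IP and port mirror the DB's own:

```ini
# access/secret keys for buckets (placeholders, set your own)
BRS_ACCESS_KEY = my_access_key
BRS_SECRET_KEY = my_secret_key
# same as the DB's database when BRS runs as part of the DB
BRS_DATABASE_NAME = mydb
BRS_SERVER_ID = 127.0.0.1
BRS_SERVER_PORT = 10101
# default bucket created at start (placeholder name)
BRS_ML_BUCKET_NAME = ml_bucket
```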
ML_TRAINING_SERVER_IP
BangDB can run as a separate ML training server or as part of the DB. When it runs as part of the DB, it shares the DB's IP; otherwise it has its own IP. Using this param, you may set the IP of the training server accordingly.
ML_TRAINING_SERVER_PORT
BangDB can run as a separate ML training server or as part of the DB. When it runs as part of the DB, it shares the DB's port; otherwise it has its own port. Using this param, you may set the port of the training server accordingly.
ML_PRED_SERVER_IP
BangDB can run as a separate ML prediction server or as part of the DB. When it runs as part of the DB, it shares the DB's IP; otherwise it has its own IP. Using this param, you may set the IP of the prediction server accordingly.
ML_PRED_SERVER_PORT
BangDB can run as a separate ML prediction server or as part of the DB. When it runs as part of the DB, it shares the DB's port; otherwise it has its own port. Using this param, you may set the port of the prediction server accordingly.
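A sketch for a setup where training and prediction run as part of the DB, so the addresses mirror the DB's own (values are placeholders):

```ini
# training server address (same as the DB when run as part of it)
ML_TRAINING_SERVER_IP = 127.0.0.1
ML_TRAINING_SERVER_PORT = 10101
# prediction server address (same as the DB when run as part of it)
ML_PRED_SERVER_IP = 127.0.0.1
ML_PRED_SERVER_PORT = 10101
```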
BANGDB_ML_SERVER_TYPE
This is to set up the ML cluster, including BRS. For any server, this param defines what type the server is as far as ML is concerned.
TRAINING_PREDICT_FILES_LOC
Since BangDB trains and predicts concurrently, it could hog memory as we do more of these operations, especially training. Also, for performance reasons it keeps models loaded in memory. Therefore it is important that we put a limit on the memory it can use.
TRAIN_PRED_MEM_BUDGET sets the amount of memory ML can use. The loaded models are kept in an LRU list, and the DB auto loads or unloads them depending on the usage pattern.
MAX_CONCURRENT_PRED_MODEL
This param sets how many models can be trained or kept in the LRU list. The default 32 is good for most scenarios; however edit it as required.
D. Advanced config to tune core of BangDB
The following config params tune the internal working of BangDB's core. Therefore we need to be really sure before editing these. Let's go through and understand these params as well.
PAGE_SPLIT_FACTOR
BangDB uses a B+Tree* variant which keeps keys in sorted order. When a page splits, keys need to be transferred from one page to another. This variable decides the split factor.
The simple rule is: if the ingestion of data is going to be mostly sequential (not random) or semi sequential, then a higher value is better. Else keep the default.
As of now this applies to the entire db; however it should really be per table. We will make it table specific in an upcoming release.
LOG_FLUSH_FREQ
This is the frequency of log flush initiation. It's tuned for higher performance in general cases; however, you may play with the number and set what works best for you.
CHKPNT_ENABLED
This enables checkpointing of the WAL. 0 means no checkpointing, else yes. It's recommended to keep it ON, but for higher performance in certain cases you may turn it off as well.
CHKPNT_FREQ
If checkpointing is ON, then what's the frequency? Again, this is set for good performance in general; however you may choose to edit it for experimentation and select the right value.
LOG_SPLIT_CHECK_FREQ
WAL maintains an append-only rolling log file. The DB checks at a certain frequency whether the log file needs a split. The value is selected for higher performance for general use cases; however you may experiment and pick the right value.
LOG_RECLAIM_FREQ
BangDB generates WAL log files for durability and crash recovery, along with atomicity and transactions. However, it writes close to 2.2X - 4X more data to the WAL than the ingested data, which may result in a large amount of logs on the filesystem; this may cause disk-full scenarios and the db could go down.
To avoid this, BangDB keeps checking for, and reclaiming, log files that are no longer needed by the db, even in the case of DB crash and recovery. It's a very complex but very important process. Therefore, we should set this value properly to ensure the DB runs without filling the disk with log files.
LOG_RECLAIM_ACTION
This tells the DB what steps to take when it finds that WAL logs can be reclaimed.
LOG_RECLAIM_DIR
If LOG_RECLAIM_ACTION = 1, then this tells the DB which directory the logs should be reclaimed (moved) to. Ideally, when we wish to keep the log files rather than delete them, the reclaim folder should be on a network or other disk with large capacity.
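A sketch of reclaiming (moving) WAL logs to a roomy disk instead of deleting them; the directory path and the frequency value are illustrative:

```ini
# how often to check for reclaimable WAL files (illustrative value)
LOG_RECLAIM_FREQ = 60
# 1 = move reclaimed logs to LOG_RECLAIM_DIR rather than delete them
LOG_RECLAIM_ACTION = 1
# put the reclaim dir on a large-capacity disk (placeholder path)
LOG_RECLAIM_DIR = /mnt/bigdisk/bangdb_logs
```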
BUF_FLUSH_RECLAIM_FREQ
This is for the buffer pool; it defines the frequency, in microseconds, of the buffer cache dirty page flusher and the buffer cache memory reclaimer.
SCATTER_GATHER_MAX
Maximum number of pages to look at for scatter gather. Put 0 to select the system supported number (suggested); otherwise put any number, but if it's more than the system supported one then it will be corrected to the system supported value. Ideally there is no need to change this.
MAX_NUM_TABLE_HEADER_SLOT
This has implications on the length of the chain of pages for a slot. If there are too many tables (more than 10,000), then reduce this number a bit, else leave it as default. A higher number with a large number of tables would increase the memory overhead for the DB.
MIN_DIRTY_SCAN
How many pages to scan to find dirty pages. This is tuned for higher performance; however change it as per your need after experimentation. Be sure before changing.
MIN_UPDATED_SCAN
How many pages to scan to find updated pages? Be sure before changing.
IDX_FLUSH_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation
DAT_FLUSH_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation
IDX_RECLAIM_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation
DAT_RECLAIM_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation
PAGE_WRITE_FACTOR
This in a way denotes how fast data is written, but we should not change it unless confident after experiment
PAGE_READ_FACTOR
This in a way denotes how fast data is read, but we should not change it unless confident after experiment
IDX_DAT_NORMALIZE
This normalises the idx vs dat pages; helpful when we favour one over the other.
PREFETCH_BUF_SIZE
The pre-fetch buffer max size, defined in MB. The DB treats this as the max limit for pre-fetching pages into the pool.
PREFETCH_SCAN_WINDOW_NUM
Size of window for prefetch scan
PREFETCH_EXTENT_NUM
To what extent pages would be pre fetched