AI in BangDB
Overview
This document describes the AI capabilities of BangDB. There are two aspects to it: the Machine Learning part, which deals with training models within BangDB, and the part covering vector indexing, RAG workflows, and automatic bot creation. Let's look at the ML part first.
Training Infrastructure
There are three components of the BangDB ML infra, namely
- Training
- Prediction
- BRS (Resource Server)
The entire infrastructure can be deployed in the following 3 different combinations
- (a) Training, Prediction and BRS on 3 different servers
- (b) Training and Prediction on 1 server, BRS on another one
- (c) All 3 on a single server
While option (c) is good for test/dev purposes, option (a) should be the most common choice in a production environment
The structure of the infra can be defined using the following.
- bangdb.config - BANGDB_ML_SERVER_TYPE
- train_pred_brs_info, a structure which takes IP:port for each of the 3 servers
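To make the shape of this configuration concrete, here is a minimal sketch of what train_pred_brs_info could look like. The actual layout is defined inside BangDB; the field names and the is_single_server helper below are assumptions for illustration, based only on the "IP:port for each of the 3 servers" description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: one IP:port endpoint per role (training, prediction, BRS).
struct endpoint {
    std::string ip;
    int port;
};

struct train_pred_brs_info {
    endpoint train;  // training server
    endpoint pred;   // prediction server
    endpoint brs;    // resource server (BRS)

    // Deployment option (c): all three roles share one endpoint.
    bool is_single_server() const {
        return train.ip == pred.ip && pred.ip == brs.ip &&
               train.port == pred.port && pred.port == brs.port;
    }
};
```

Deployment options (a) and (b) then simply correspond to how many distinct endpoints appear across the three fields.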
Interfaces & APIs
The following basic interface is available from the user's perspective
client_ml_helper

//creates a bucket
//i/p - {"bucket_name":"myname", "access_key":"akey", "secret_key":"skey"}
int create_bucket(char *bucket_info);

//sets the bucket - which bucket to use for any operation within this class
//i/p - {"bucket_name":"myname", "access_key":"akey", "secret_key":"skey"}
void set_bucket(char *bucket_info);

//key is the key of the file using which BRS will store it
//fpath is the full path of the file on the local fs, iop is a flag for the put operation
int upload_file(char *key, char *fpath, insert_options iop);

//train req : depends on what kind of algo is requested; the format changes for different types
int train_model(char *req);

//req : {"account_id":"AACCEEGGIILLNN", "model_name":"my_model1"}
char *get_model_status(char *req);

//req : {"account_id":, "model_name":}
int del_model(char *req);

//req : {"account_id":, "model_name":}
int del_train_request(char *req);

//req : depends on algo etc., user ought to provide the right one
char *predict(char *req);

//get training requests for a given account - all training requests
//req : {account_id:"aacid"}
resultset *get_training_requests(char *req);

//count models for a given account, all the models
//req : {account_id:"aacid"}
long count_models(char *req);

//re-inits the model data manager in case we would like to change the
//IP:PORT info for BRS; useful because BRS will mostly be separate and mostly static,
//but may change due to load etc., as BRS can scale linearly
//req : {"bucket_info", "brs_ip", "brs_port"}
int reinit_mdm(char *req);

//how many objects are using this reference
int get_ref_count();

//get the handle of BRS - useful only for embd, as a client should never bother about this
bangdb_resource_manager *get_brs();

//tests if BRS is local to the BE server
bool is_brs_local();

void clean_ml_helper();
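A typical client session chains these calls: create (or select) a bucket, upload the training file, then submit the training request. Since the real class lives inside BangDB, the sketch below uses a stand-in stub (clearly not the shipped class, and with upload_file simplified to drop insert_options) purely to make the expected call order visible.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative stub only: the real client_ml_helper ships with BangDB and talks
// to BRS / training servers. This stand-in just records the call order.
struct client_ml_helper_stub {
    std::vector<std::string> calls;
    int create_bucket(const char *) { calls.push_back("create_bucket"); return 0; }
    void set_bucket(const char *)   { calls.push_back("set_bucket"); }
    int upload_file(const char *, const char *) { calls.push_back("upload_file"); return 0; }
    int train_model(const char *)   { calls.push_back("train_model"); return 0; }
};

// Typical flow: bucket -> upload -> train.
int run_flow(client_ml_helper_stub &ml) {
    const char *bucket =
        "{\"bucket_name\":\"myname\", \"access_key\":\"akey\", \"secret_key\":\"skey\"}";
    if (ml.create_bucket(bucket) != 0) return -1;
    ml.set_bucket(bucket);
    // file key follows the model_name__account_id__training_file_name convention
    if (ml.upload_file("my_model1__AACCEEGGIILLNN__train.csv", "/tmp/train.csv") != 0)
        return -1;
    if (ml.train_model("{\"account_id\":\"AACCEEGGIILLNN\", \"model_name\":\"my_model1\"}") != 0)
        return -1;
    return 0;
}
```

After train_model returns (immediately, since training is async in the client-server case), the caller would poll get_model_status until the training state reaches a terminal value.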
For developer, we have following interfaces
Development Interfaces, classes
- iq_train_predict
- model_data_manager
- pred_housekeep
- iqconvert
- ml_bangdb
Out of these, iq_train_predict is the interface which we need to implement for every new algo we add. For example, we have svm_train_predict for SVM, and similarly ie_train_predict for IE, etc.
iqconvert is for converting the format of a file from f1 to f2.
pred_housekeep keeps the state of any request, training info, etc. It also provides locking APIs for safe handling of parallel trainings or predictions.
model_data_manager manages the models. It interfaces with BRS to get or put data (any data).
Finally, ml_bangdb and ie_bangdb are collections of helper functions.
Details of these are defined below.
iq_train_predict

void set_housekeep(void *hkeep);
char *train_model(char *param_list);
char *predict(char *str, void *arg = NULL);
char *get_status(char *model_detail);
void close_trainer();

We just need to implement the above five APIs to add any new algo
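To show where each of the five responsibilities lives, here is a sketch of wiring in a new algorithm. The five method names come from the interface above; the class body is a guessed shape (the real interface may differ), and mean_train_predict is a deliberately toy "model" that just predicts the mean of its training labels.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Assumed shape of the iq_train_predict interface (names from the doc).
class iq_train_predict {
public:
    virtual ~iq_train_predict() {}
    virtual void set_housekeep(void *hkeep) = 0;
    virtual char *train_model(char *param_list) = 0;
    virtual char *predict(char *str, void *arg = nullptr) = 0;
    virtual char *get_status(char *model_detail) = 0;
    virtual void close_trainer() = 0;
};

// Toy implementation: "trains" by averaging space-separated numbers.
class mean_train_predict : public iq_train_predict {
    double mean_ = 0;
    std::string status_ = "ML_BANGDB_TRAINING_STATE_LIMBO";
    std::string out_;
public:
    void set_housekeep(void *hkeep) override { (void)hkeep; }
    char *train_model(char *param_list) override {
        // a real impl would parse the JSON request, fetch the file via BRS,
        // reformat, tune params, and train
        double sum = 0; int n = 0;
        for (char *tok = strtok(param_list, " "); tok; tok = strtok(nullptr, " ")) {
            sum += atof(tok); ++n;
        }
        mean_ = n ? sum / n : 0;
        status_ = "ML_BANGDB_TRAINING_STATE_TRAINING_DONE";
        return const_cast<char *>(status_.c_str());
    }
    char *predict(char *, void *) override {
        out_ = std::to_string(mean_);
        return const_cast<char *>(out_.c_str());
    }
    char *get_status(char *) override { return const_cast<char *>(status_.c_str()); }
    void close_trainer() override {}
};
```

The infra would instantiate such a class per algo and drive it through these five calls, with pred_housekeep supplied via set_housekeep for state tracking and locking.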
Python Support
Given that Python is the leading language for ML and many new, interesting capabilities keep coming from its ecosystem, we needed built-in support for executing such Python code. However, the following are the conditions we wanted to apply.
- Python runs in a single-threaded manner, but we want parallel execution
- Run Python code in a separate process; if process creation fails, then run it in a thread
- Read return data from the Python process
- Keep the status of the process for reporting
- Python 2.7 and 3 support
- BangDB should compile and run with or without Python - provide a switch
Currently SVM doesn't require Python, but IE may need it. Therefore, BangDB needs to be compiled accordingly.
Input file format
For training, and even for prediction, users may want to send data in different formats. Therefore, we needed a simple mechanism to handle this. We have a separate interface, "iqconvert", defined with the following APIs
int convert(char *infile, char *outfile);
int convert(FILE *finfile, FILE *foutfile);
Currently it's implemented for CSV-to-libsvm and JSON-to-libsvm converters. Developers should implement this interface for new conversion logic as appropriate
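The real iqconvert implementations operate on whole files; the sketch below (generic, not BangDB's code) shows just the per-line transformation such a CSV-to-libsvm converter performs, assuming rows of the form "label,f1,f2,..." and the sparse libsvm line format "label index:value ..." with 1-based indices and zero-valued features omitted.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

// Convert one CSV row ("label,f1,f2,...") into one libsvm row
// ("label 1:f1 2:f2 ...", skipping zero features since libsvm is sparse).
std::string csv_line_to_libsvm(const std::string &csv) {
    std::istringstream in(csv);
    std::ostringstream out;
    std::string field;
    int idx = 0;
    while (std::getline(in, field, ',')) {
        if (idx == 0) {
            out << field;               // first column is the label
        } else if (atof(field.c_str()) != 0.0) {
            out << ' ' << idx << ':' << field;
        }
        ++idx;
    }
    return out.str();
}
```

A file-level convert(infile, outfile) would simply apply this to every line.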
Algo
- Classification - Single, multi-label - class & value - Done
- Regression - Linear & Logistic - Done
- Knowledge base training - any given domain - Done
- NER model - name entity model - Done
- Relation models - set of models for all relations - Done
- Ontology - Done
- Kmeans - Done, portal integration pending
- Random Forest
- Kernel ridge regression - also find the right function etc
- N Bayes, GM, Dynamic Bayesian Network - PA
- Gradient Boosting Algorithm
- DNN, Convolutional Neural Network
- ResNet, loss and metric
- Semantic segmentation
- FHOG, BR
- Image search, face search
Param Tuning
One of the limitations of the previous implementation was that it did not allow tuning of params. We know that for the same data and algo, if params are tuned properly, efficacy can increase from around 20% to 95%+. Therefore, it's important that we allow the user to tune params. However, this is optional, and the user may switch the tuning on or off as needed. The default is on, and it's recommended to keep it on.
Data Normalization
For the same data file, different fields can have different dimensions. Therefore, treating them all on a single scale would be self-defeating and yield poor results. We can now switch on the normalization option when we send a request. It's highly recommended to do so.
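To illustrate why this matters, here is a generic min-max scaling sketch (not BangDB's internal code): each column is mapped to [0,1], so a field measured in thousands cannot drown out one measured in fractions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Min-max normalize each column of a row-major matrix into [0,1].
void min_max_normalize(std::vector<std::vector<double>> &rows) {
    if (rows.empty()) return;
    size_t ncols = rows[0].size();
    for (size_t c = 0; c < ncols; ++c) {
        double lo = rows[0][c], hi = rows[0][c];
        for (auto &r : rows) { lo = std::min(lo, r[c]); hi = std::max(hi, r[c]); }
        double range = hi - lo;
        for (auto &r : rows)
            r[c] = (range == 0) ? 0.0 : (r[c] - lo) / range;   // constant column -> 0
    }
}
```

After this, a feature spanning 1000-3000 and one spanning 0.1-0.3 contribute on equal footing to distance- or margin-based algos such as SVM.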
Memory Budget
We know that training especially is memory intensive, and the server may crash if this is not handled properly. Similarly, when we run the prediction server, to handle predictions faster we must have models loaded in memory. But how many models? What if we need more than the given amount of memory can handle? To address all of this, we have a memory budget concept implemented in the ML infra.
1. Training happens within the given memory budget. It never exceeds the given amount. In the future we may add speed as a param for training and adjust the memory accordingly. As of now, the user needs to explicitly define the memory budget.
2. The model manager also works within the memory budget. It manages all required models within the given amount.
LRU for ensuring memory budget constraints
When we have more objects to deal with than the given memory allows, we need a way to de-allocate some objects in favor of newer ones. Therefore, BangDB ML Infra implements its own LRU, which keeps objects in memory as long as possible; when it needs more room, it de-allocates or invalidates the ones accessed the longest ago. In other words, it tries to keep recently accessed objects in memory as much as possible. This is a very lightweight implementation, but performance is high and it's thread-safe.
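The scheme described above can be sketched as follows. This is not BangDB's internal implementation, just a minimal thread-safe LRU that evicts by memory budget rather than entry count: each object carries a size, and inserting past the budget evicts least-recently-used objects first.

```cpp
#include <cassert>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

// Thread-safe LRU keyed by name, bounded by a byte budget instead of a count.
class budget_lru {
    struct entry { std::string key; size_t size; };
    std::list<entry> order_;   // front = most recently used, back = eviction candidate
    std::unordered_map<std::string, std::list<entry>::iterator> index_;
    size_t budget_, used_ = 0;
    std::mutex mtx_;
public:
    explicit budget_lru(size_t budget_bytes) : budget_(budget_bytes) {}

    void put(const std::string &key, size_t size) {
        std::lock_guard<std::mutex> g(mtx_);
        auto it = index_.find(key);
        if (it != index_.end()) {                    // replace existing entry
            used_ -= it->second->size;
            order_.erase(it->second);
            index_.erase(it);
        }
        while (used_ + size > budget_ && !order_.empty()) {  // evict LRU tail
            used_ -= order_.back().size;
            index_.erase(order_.back().key);
            order_.pop_back();
        }
        order_.push_front({key, size});
        index_[key] = order_.begin();
        used_ += size;
    }

    bool touch(const std::string &key) {             // access: move to MRU position
        std::lock_guard<std::mutex> g(mtx_);
        auto it = index_.find(key);
        if (it == index_.end()) return false;        // already evicted or never present
        order_.splice(order_.begin(), order_, it->second);
        return true;
    }

    size_t used() { std::lock_guard<std::mutex> g(mtx_); return used_; }
};
```

In the model manager's case, "size" would be a model's in-memory footprint, and a touch corresponds to a prediction request hitting that model.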
BRS Integration
We no longer have any dependency on S3; it has been removed completely. We have BRS for the same purpose, and the ML Infra is seamlessly integrated with it.
BSM Integration
For BSM, not much has changed from the outside: we still use svm_predictor and most of the APIs are the same. However, there are subtle changes internally, which align with the new implementation. Therefore, BSM sees no impact from a caller's or usage perspective. Existing code for Get, Put, scan, etc. should just work as it used to.
Conventions
A. Any file name must have the model_name and account_id associated with it. For example, a model key would be the following.
- model_name__account_id [ joined by two underscores "__" ]
- A training file name would be;
model_name__account_id__training_file_name
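The naming convention above amounts to joining the parts with double underscores; the helper functions below are hypothetical (not part of the BangDB API), shown only to pin down the exact key shapes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers mirroring the key conventions above.
std::string model_key(const std::string &model, const std::string &account) {
    return model + "__" + account;                  // model_name__account_id
}

std::string training_file_key(const std::string &model, const std::string &account,
                              const std::string &fname) {
    return model_key(model, account) + "__" + fname; // model_name__account_id__file
}
```

Note that because "__" is the separator, model names and account ids themselves should avoid containing double underscores.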
B. Training or prediction uses the local dir /tmp/BRS_DATA for temporary files. We must create this folder on the server, else training or prediction will fail.
C. A training request returns immediately. This keeps training non-blocking for the client-server case. However, training is blocking in the embedded case, since there it should force the user to wait.
D. Since training is async and returns immediately, we need a mechanism to get the state of the training at any given time. The states of training requests are maintained and can be queried by calling the get_status API (as described above). The following are the states the ML infra maintains within BangDB; these are self-explanatory
enum ML_BANGDB_TRAINING_STATE
{
    ML_BANGDB_TRAINING_STATE_INVALID_INPUT = 10,
    ML_BANGDB_TRAINING_STATE_NOT_PRSENT,
    ML_BANGDB_TRAINING_STATE_ERROR_PARSE,
    ML_BANGDB_TRAINING_STATE_ERROR_FORMAT,
    ML_BANGDB_TRAINING_STATE_ERROR_BRS,
    ML_BANGDB_TRAINING_STATE_ERROR_TUNE,
    ML_BANGDB_TRAINING_STATE_ERROR_TRAIN,
    ML_BANGDB_TRAINING_STATE_LIMBO,
    ML_BANGDB_TRAINING_STATE_BRS_GET_PENDING,
    ML_BANGDB_TRAINING_STATE_BRS_GET_DONE,
    ML_BANGDB_TRAINING_STATE_REFORMAT_DONE,
    ML_BANGDB_TRAINING_STATE_SCALE_TUNING_DONE,
    ML_BANGDB_TRAINING_STATE_BRS_MODEL_UPLOAD_PENDING,
    ML_BANGDB_TRAINING_STATE_TRAINING_DONE,
    ML_BANGDB_TRAINING_STATE_DEPRICATED,
};
E. Users should use the helper class (bangdb_ml_helper) for everything, including uploading and downloading files. Users should not use BRS directly for uploading or downloading files; the BangDB model manager is aware of this and takes care of all BRS interactions. Therefore, it's very simple to just use the 7-9 APIs given in the bangdb_ml_helper class and not worry about what else to use.
F. Training request format for SVM
API - char *train_model(char *param_list);

param_list =
{
    "account_id": "id",
    "algo_type": "SVM",
    "algo_param": {
        "svm_type": 1, "kernel": 2, "degree": 3, "gamma": 0.2, "cost": 1.1,
        "cache_size": 50, "probability": 0, "termination_criteria": 0.001,
        "nu": 0.5, "coef0": 0.1
    },
    "attr_list": [ {"name":"a1", "position":1}, {"name":"a2", "position":2} ... ],
    "training_details": {"training_source": infile, "training_source_type": FILE, "file_size_mb": 110},
    "scale": Y/N,
    "tune_param": Y/N,
    "attr_type": NUM/STR,
    "re_format": JSON,
    "model_name": "my_model1"
}
G. Prediction request for SVM
API - char *predict(char *str, void *arg = NULL);

ex1: str = {account_id, attr_type: NUM, data_type: event, re_arrange: N, re_format: N, model_name: model_name, data: "1 1:1.2 2:3.2 3:1.1"}
ex2: str = {account_id, attr_type: NUM, data_type: FILE, re_arrange: N, re_format: N, model_name: model_name, data: inputfile}
ex3: str = {account_id, attr_type: NUM, data_type: event, re_arrange: N, re_format: JSON, model_name: model_name, data: {k1:v1, k2:v2, k3:v3}}
etc...