Overview

This document describes the AI capabilities of BangDB. There are two aspects to it: the Machine Learning part, which deals with the training of models within BangDB, and the part covering vector indexing, RAG workflows, and auto-bot creation. Let's look at the ML part first.

Training Infrastructure

There are three components of the BangDB ML infra, namely

  • Training
  • Prediction
  • BRS (Resource Server)

The entire infrastructure can be deployed in the following 3 combinations:

  • (a) Training, Prediction and BRS on 3 different servers
  • (b) Training and Prediction on 1 server, BRS on another one
  • (c) All 3 on a single server

While option (c) is good for test/dev purposes, option (a) should be the one mostly used in production environments.

The structure of the infra is defined using the following:

  • bangdb.config - BANGDB_ML_SERVER_TYPE
  • train_pred_brs_info, a structure which takes IP:port for each of the 3 servers

Interfaces & APIs

The following basic interface is available from the user's perspective.

client_ml_helper

  • //creates bucket
    //i/p - {"bucket_name":"myname", "access_key":"akey", "secret_key":"skey"}
    int create_bucket(char *bucket_info);

  • //sets bucket - which bucket to use for any operation within this class
    //i/p - {"bucket_name":"myname", "access_key":"akey", "secret_key":"skey"}
    void set_bucket(char *bucket_info);
  • //key is the key of the file using which the brs will store it
    //fpath is full path of the file on local fs, iop is flag for operation put
    int upload_file(char *key, char *fpath, insert_options iop);
  • //train req : It depends on what kind of algo is requested. Format changes for different types
    int train_model(char *req);
  • //req : {"account_id":"AACCEEGGIILLNN", "model_name":"my_model1"}
    char *get_model_status(char *req);
  • //{"account_id":, "model_name":}
    int del_model(char *req);
  • //{"account_id":, "model_name":}
    int del_train_request(char *req);
  • //req : depends on algo etc, user ought to provide the right one
    char *predict(char *req);
  • //get training requests for a given account - all training requests
    //req : {account_id:"aacid"}
    resultset *get_training_requests(char *req);
  • //count models for a given account, all the models
    //req : {account_id:"aacid"}
    long count_models(char *req);
  • //This is to re-init the model data manager in case we would like to change the
    //IP:PORT info for BRS, useful because BRS mostly will be separate and mostly static
    //but may change due to load etc, as BRS can scale linearly
    //req : {"bucket_info", "brs_ip", "brs_port"}
    int reinit_mdm(char *req);
  • //how many objects are using this reference
    int get_ref_count();
  • //get the handle of BRS - useful only for embd as client should never bother about this
    bangdb_resource_manager *get_brs();
  • //this is to test if brs is local to the DB server
    bool is_brs_local();
  • void clean_ml_helper();

For developers, we have the following interfaces.

Development Interfaces, classes

  • iq_train_predict
  • model_data_manager
  • pred_housekeep
  • iqconvert
  • ml_bangdb

Out of these, iq_train_predict is the interface which we need to implement for every new algo we add. For example, we have svm_train_predict for SVM, and similarly ie_train_predict for IE.

iqconvert is for converting the format of a file from f1 to f2.

pred_housekeep keeps the state of any request, training info, etc. It also provides locking APIs for safe handling of parallel trainings or predictions.

model_data_manager manages the models. It interfaces with BRS to get or put data (any data).

Finally, ml_bangdb or ie_bangdb are collections of helper functions.

Details of these are given below.

iq_train_predict

    void set_housekeep(void *hkeep);
    char *train_model(char *param_list);
    char *predict(char *str, void *arg = NULL);
    char *get_status(char *model_detail);
    void close_trainer();

We just need to implement the above five APIs to add any new algo.

Python Support

Given that Python is the leading language for ML, and that many new and interesting capabilities keep coming from its ecosystem, we needed built-in support for executing such Python code. However, the following are the conditions that we wanted to apply:

  • Python runs in a single-threaded manner, but we want parallel execution
  • Run Python code in a separate process. If process creation fails, then run in a thread
  • Read return data from python process
  • Keep status of the process for reporting
  • Python 2.7 and 3 support
  • BangDB should compile and run with or without python - provide a switch

Currently SVM doesn't require Python, but IE may need it. Therefore, BangDB needs to be compiled accordingly.

Input file format

For training, and even for prediction, users may like to send data in different formats. Therefore, we needed a simple mechanism to handle this. We have a separate interface, "iqconvert", which has the following APIs:

    int convert(char *infile, char *outfile);
    int convert(FILE *finfile, FILE *foutfile);
    

Currently it's implemented for the CSV-to-libsvm and JSON-to-libsvm converters. Developers should implement this interface for new conversion logic as appropriate.

Algo

  1. Classification - Single, multi-label - class & value - Done
  2. Regression - Linear & Logistic - Done
  3. Knowledge base training - any given domain - Done
  4. NER model - named entity recognition model - Done
  5. Relation models - set of models for all relations - Done
  6. Ontology - Done
  7. Kmeans - Done, portal integration pending
  8. Random Forest
  9. Kernel ridge regression - also find the right function etc
  10. N Bayes, GM, Dynamic Bayesian Network - PA
  11. Gradient Boosting Algorithm
  12. DNN, Convolutional Neural Network
  13. ResNet, loss and metric
  14. Semantic segmentation
  15. FHOG, BR
  16. Image search, face search

Param Tuning

One of the limitations of the previous implementation was the inability to tune the params. We know that for the same data and algo, if params are tuned properly, efficacy can increase from 20% to 95%+. Therefore, it's important that we allow the user to tune params. However, this is optional, and the user may switch the tuning on/off as needed. The default is on, and it's recommended to keep it on.

Data Normalization

For the same data file, different fields can have different dimensions. Therefore, applying a single scale across all fields would be self-defeating and yield poor results. We can now switch on the normalization option when sending a request. It's highly recommended to do so.

Memory Budget

We know that training especially is memory-intensive, and the server may crash if this is not handled properly. Similarly, when we run a prediction server, to be able to handle predictions faster we must have models loaded in memory. But how many models? What if we need more than we can handle with the given amount of memory? To address all this, we have the memory budget concept implemented in the ML infra.

1. Training happens within the given memory budget. It never exceeds the given amount. In future we may add speed as a param for training and adjust the memory accordingly. As of now, the user needs to explicitly define the memory budget

2. The model manager also works within a memory budget. It manages all required models within the given amount

LRU for ensuring memory budget constraints

When we have more objects to deal with than the given memory allows, we must have a way to de-allocate a few objects in favor of newer ones. Therefore, BangDB ML infra implements its own LRU, which keeps objects in memory as much as possible; when it needs more room, it de-allocates or invalidates the ones which were accessed earliest. In other words, it tries to keep the recently accessed objects in memory as much as possible. This is a very lightweight implementation, yet performance is high and it's thread safe.

BRS Integration

We no longer have a dependency on S3; it has been removed completely. We have BRS for the same purpose, and the ML infra is seamlessly integrated with it.

BSM Integration

For BSM, not much has changed from the outside: we still use svm_predictor and most of the APIs are the same. However, there are subtle changes internally which align with the new implementation. Therefore, BSM sees no impact from the caller or usage perspective. Existing code for Get, Put, scan, etc. should just work as before.

Conventions

A. Any file name must have the model_name and account_id associated with it. For example, the model key would be the following:

    model_name__account_id [ joined by two underscores "__" ]

    Training file name would be;
    model_name__account_id__training_file_name

B. Training and prediction use the local dir /tmp/BRS_DATA for dealing with temporary files. We must create this folder on the server, else training or prediction will fail.

C. A training request returns immediately. This makes training non-blocking for the client-server case. However, training is blocking for the embedded case, since there it should force the user to wait.

D. Since training is async and returns immediately, we need a mechanism to get the state of the training at any given time. The states of training requests are maintained and can be queried by calling the get_status API (as described above). Following are the states that the ML infra maintains within BangDB; these are self-explanatory:

  enum ML_BANGDB_TRAINING_STATE
    {
        ML_BANGDB_TRAINING_STATE_INVALID_INPUT = 10,
        ML_BANGDB_TRAINING_STATE_NOT_PRSENT,
        ML_BANGDB_TRAINING_STATE_ERROR_PARSE,
        ML_BANGDB_TRAINING_STATE_ERROR_FORMAT,
        ML_BANGDB_TRAINING_STATE_ERROR_BRS,
        ML_BANGDB_TRAINING_STATE_ERROR_TUNE,
        ML_BANGDB_TRAINING_STATE_ERROR_TRAIN,
        ML_BANGDB_TRAINING_STATE_LIMBO,
        ML_BANGDB_TRAINING_STATE_BRS_GET_PENDING,
        ML_BANGDB_TRAINING_STATE_BRS_GET_DONE,
        ML_BANGDB_TRAINING_STATE_REFORMAT_DONE,
        ML_BANGDB_TRAINING_STATE_SCALE_TUNING_DONE,
        ML_BANGDB_TRAINING_STATE_BRS_MODEL_UPLOAD_PENDING,
        ML_BANGDB_TRAINING_STATE_TRAINING_DONE,
        ML_BANGDB_TRAINING_STATE_DEPRICATED,
     };

E. Users should use the helper class (bangdb_ml_helper) to deal with everything, including uploading and downloading files. Users should not use BRS directly for uploading or downloading files; the BangDB model manager is aware of this and takes care of all BRS interactions. Therefore, it's very simple to just use the 7-9 APIs given in the bangdb_ml_helper class and not worry about what to use.

F. Training request format for SVM

API - char *train_model(char *param_list);

  param_list = 
        {
          "account_id": "id",
          "algo_type": "SVM",
          "algo_param": {"svm_type": 1, "kernel": 2, "degree": 3, "gamma": 0.2, "cost": 1.1, "cache_size": 50, 
          "probability": 0, "termination_criteria": 0.001, "nu": 0.5, "coef0": 0.1},
          "attr_list":[{"name":"a1", "position":1}, {"name":"a2", "position":2} ... ],
          "training_details":{"training_source": infile, "training_source_type": FILE, "file_size_mb": 110},
          "scale":Y/N,
          "tune_param": Y/N,
          "attr_type" : NUM/STR,
          "re_format":JSON,
          "model_name" :"my_model1"
        }

G. Prediction request for SVM

   API - char *predict(char *str, void *arg = NULL);

        str =

        ex1: {account_id, attr_type: NUM, data_type:event, re_arrange:N, re_format:N, model_name: model_name, 
              data:"1 1:1.2 2:3.2 3:1.1"}
        ex2: {account_id, attr_type: NUM, data_type:FILE, re_arrange:N, re_format:N, model_name: model_name, 
              data:inputfile}
        ex3: {account_id, attr_type: NUM, data_type:event, re_arrange:N, re_format:JSON, model_name: model_name, 
              data:{k1:v1, k2:v2, k3:v3}}
              etc...