1. Classification - Single, multi-label - class & value - Done
  2. Regression - Linear & Logistic - Done
  3. Knowledge base training - any given domain - Done
  4. NER model - named entity recognition - Done
  5. Relation models - set of models for all relations - Done
  6. Ontology - Done
  7. Kmeans - Done, portal integration pending
  8. Random Forest
  9. Kernel ridge regression - also find the right kernel function, etc.
  10. Naive Bayes, GM, Dynamic Bayesian Network - PA
  11. Gradient Boosting Algorithm
  12. DNN, Convolutional Neural Network
  13. ResNet, loss and metric
  14. Semantic segmentation
  15. FHOG, BR
  16. Image search, face search

Param Tuning

One of the limitations of the previous implementation was that parameters could not be tuned. We know that for the same data and algorithm, properly tuned parameters can lift efficacy from around 20% to 95% or more, so it is important to let the user tune them. Tuning is optional, however, and can be switched on or off per request as needed. The default is on, and it is recommended to keep it on.
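
In the request format of section F below, tuning is controlled by the tune_param field, for example (value illustrative):

    param_list = { ..., "tune_param": "Y", ... }    [ "N" switches tuning off ]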

Data Normalization

For the same data file, different fields can have different dimensions (scales). Applying a single scale to all of them would be self-defeating and yield poor results. The normalization option can now be switched on when sending a request, and doing so is highly recommended.
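
In the same request format (section F below), normalization is controlled by the scale field:

    param_list = { ..., "scale": "Y", ... }    [ "N" skips normalization ]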

Memory Budget

We know that training in particular is memory intensive, and the server may crash if this is not handled properly. Similarly, for the prediction server to serve predictions quickly, models must be loaded in memory. But how many models? What if we need more models than the given amount of memory can hold? To address all of this, the ML infra implements the concept of a memory budget.

1. Training happens within the given memory budget and never exceeds it. In the future we may add speed as a training parameter and adjust memory accordingly; as of now, the user needs to define the memory budget explicitly.

2. The model manager also works within the memory budget, managing all required models within the given amount.

LRU for ensuring memory budget constraints

When there are more objects to deal with than the given memory allows, we need a way to de-allocate a few objects in favor of newer ones. BangDB ML infra therefore implements its own LRU, which keeps objects in memory as long as possible and, when more room is needed, de-allocates or invalidates the least recently accessed ones. In other words, it tries to keep recently accessed objects in memory as much as possible. The implementation is very lightweight, yet high-performance and thread-safe.
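
For illustration only, a minimal single-threaded sketch of such a budget-bound LRU (the names and byte accounting here are assumptions, not the actual BangDB code; locking is omitted):

    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>

    // Sketch of a budget-bound LRU: keeps recently accessed models in
    // memory and evicts the least recently accessed ones when admitting
    // a new model would exceed the memory budget.
    class ModelLRU {
        struct Entry { std::string key; size_t bytes; };
        size_t budget_;
        size_t used_ = 0;
        std::list<Entry> lru_;  // front = most recently accessed
        std::unordered_map<std::string, std::list<Entry>::iterator> pos_;
    public:
        explicit ModelLRU(size_t budget_bytes) : budget_(budget_bytes) {}

        // Record an access; admit the object if it is not cached yet.
        void touch(const std::string& key, size_t bytes) {
            auto it = pos_.find(key);
            if (it != pos_.end()) {  // hit: move to the front
                lru_.splice(lru_.begin(), lru_, it->second);
                return;
            }
            // Evict least recently accessed entries until the new
            // object fits within the budget.
            while (used_ + bytes > budget_ && !lru_.empty()) {
                used_ -= lru_.back().bytes;
                pos_.erase(lru_.back().key);
                lru_.pop_back();
            }
            lru_.push_front({key, bytes});
            pos_[key] = lru_.begin();
            used_ += bytes;
        }
    };

The real implementation additionally guards access for thread safety, and a production version would also reject a single object larger than the whole budget.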

BRS Integration

The dependency on S3 has been removed completely; BRS now serves the same purpose, and the ML infra is seamlessly integrated with it.

BSM Integration

For BSM, not much has changed from the outside: we still use svm_predictor and most of the APIs are the same. There are subtle internal changes to align with the new implementation, but from a caller or usage perspective BSM is not impacted. Existing code for Get, Put, scan, etc. should just work as it used to.

Conventions

A. Any file name must have the model_name and account_id associated with it. For example, the model key would be the following:

    model_name__account_id [ joined by two underscores "__" ]

    The training file name would be:
    model_name__account_id__training_file_name
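
For illustration, hypothetical helpers that compose these keys (the function names are not part of the actual API):

    #include <string>

    // Illustrative only: compose keys per convention A.
    std::string model_key(const std::string& model_name,
                          const std::string& account_id) {
        return model_name + "__" + account_id;
    }

    std::string training_file_key(const std::string& model_name,
                                  const std::string& account_id,
                                  const std::string& training_file_name) {
        return model_name + "__" + account_id + "__" + training_file_name;
    }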

B. Training and prediction use the local directory /tmp/BRS_DATA for temporary files. This folder must be created on the server, otherwise training or prediction will fail.

C. A training request returns immediately, which keeps training non-blocking in the client-server case. Training is blocking in the embedded case, however, since there the user is expected to wait.

D. Since training is asynchronous and returns immediately, we need a mechanism to get the state of a training request at any given time. The states of training requests are maintained and can be queried by calling the get_status API (as described above). The ML infra maintains the following states within BangDB; they are self-explanatory, and a polling sketch follows the enum.

  enum ML_BANGDB_TRAINING_STATE
  {
      ML_BANGDB_TRAINING_STATE_INVALID_INPUT = 10,
      ML_BANGDB_TRAINING_STATE_NOT_PRSENT,
      ML_BANGDB_TRAINING_STATE_ERROR_PARSE,
      ML_BANGDB_TRAINING_STATE_ERROR_FORMAT,
      ML_BANGDB_TRAINING_STATE_ERROR_BRS,
      ML_BANGDB_TRAINING_STATE_ERROR_TUNE,
      ML_BANGDB_TRAINING_STATE_ERROR_TRAIN,
      ML_BANGDB_TRAINING_STATE_LIMBO,
      ML_BANGDB_TRAINING_STATE_BRS_GET_PENDING,
      ML_BANGDB_TRAINING_STATE_BRS_GET_DONE,
      ML_BANGDB_TRAINING_STATE_REFORMAT_DONE,
      ML_BANGDB_TRAINING_STATE_SCALE_TUNING_DONE,
      ML_BANGDB_TRAINING_STATE_BRS_MODEL_UPLOAD_PENDING,
      ML_BANGDB_TRAINING_STATE_TRAINING_DONE,
      ML_BANGDB_TRAINING_STATE_DEPRICATED,
  };
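
As referenced above, a hedged sketch of polling until training reaches a terminal state (the get_status signature here is an assumption for illustration; use the actual get_status API as documented):

    #include <chrono>
    #include <string>
    #include <thread>

    // Assumed declaration for illustration; see the get_status API above.
    ML_BANGDB_TRAINING_STATE get_status(const std::string& model_key);

    // Poll until training finishes or fails. Error states come first in
    // the enum (values 10 onward), so anything at or below ERROR_TRAIN
    // is treated as a terminal failure.
    ML_BANGDB_TRAINING_STATE wait_for_training(const std::string& model_key) {
        for (;;) {
            ML_BANGDB_TRAINING_STATE s = get_status(model_key);
            if (s == ML_BANGDB_TRAINING_STATE_TRAINING_DONE ||
                s <= ML_BANGDB_TRAINING_STATE_ERROR_TRAIN)
                return s;
            // Polling interval is arbitrary here.
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }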

E. Users should use the helper class (bangdb_ml_helper) for everything, including uploading and downloading files. Users should not call BRS directly for uploads or downloads; the BangDB model manager is aware of these files and takes care of all BRS interactions. It is therefore simplest to just use the 7-9 APIs in the bangdb_ml_helper class and not worry about what to use underneath.

F. Training request format for SVM

API - char *train_model(char *param_list);

  param_list =
        {
          "account_id": "id",
          "algo_type": "SVM",
          "algo_param": {"svm_type": 1, "kernel": 2, "degree": 3, "gamma": 0.2, "cost": 1.1, "cache_size": 50,
                         "probability": 0, "termination_criteria": 0.001, "nu": 0.5, "coef0": 0.1},
          "attr_list": [{"name": "a1", "position": 1}, {"name": "a2", "position": 2} ... ],
          "training_details": {"training_source": "infile", "training_source_type": "FILE", "file_size_mb": 110},
          "scale": "Y/N",
          "tune_param": "Y/N",
          "attr_type": "NUM/STR",
          "re_format": "JSON",
          "model_name": "my_model1"
        }
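
Putting the above together, a hedged end-to-end example of a training call (the account id, file name, and parameter values are illustrative only):

    const char *param_list = R"({
        "account_id": "acct_1",
        "algo_type": "SVM",
        "algo_param": {"svm_type": 1, "kernel": 2, "degree": 3, "gamma": 0.2,
                       "cost": 1.1, "cache_size": 50, "probability": 0,
                       "termination_criteria": 0.001, "nu": 0.5, "coef0": 0.1},
        "attr_list": [{"name": "a1", "position": 1}, {"name": "a2", "position": 2}],
        "training_details": {"training_source": "my_model1__acct_1__train.txt",
                             "training_source_type": "FILE", "file_size_mb": 110},
        "scale": "Y",
        "tune_param": "Y",
        "attr_type": "NUM",
        "re_format": "JSON",
        "model_name": "my_model1"
    })";
    // Returns immediately (training is async); poll with get_status as in D above.
    char *resp = train_model((char *)param_list);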

G. Prediction request for SVM

   API - char *predict(char *str, void *arg = NULL);

        str =

        ex1: {account_id, attr_type: NUM, data_type: event, re_arrange: N, re_format: N, model_name: model_name,
              data: "1 1:1.2 2:3.2 3:1.1"}
        ex2: {account_id, attr_type: NUM, data_type: FILE, re_arrange: N, re_format: N, model_name: model_name,
              data: inputfile}
        ex3: {account_id, attr_type: NUM, data_type: event, re_arrange: N, re_format: JSON, model_name: model_name,
              data: {k1: v1, k2: v2, k3: v3}}
              etc.
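
A hedged single-event prediction call matching ex1 above (values illustrative only):

    const char *req = R"({
        "account_id": "acct_1",
        "attr_type": "NUM",
        "data_type": "event",
        "re_arrange": "N",
        "re_format": "N",
        "model_name": "my_model1",
        "data": "1 1:1.2 2:3.2 3:1.1"
    })";
    // Second argument defaults to NULL.
    char *result = predict((char *)req);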