ML Helper (Embedded)

BangDB ML Helper (Embedded) offers several APIs to help simplify the ML related activities. The type offers features from Training model, prediction, versioning of model, deployment to managing large files and binary objects related to ML. Check out the few real world examples for to learn more or try them out on BangDB.

C++

Java

static BangDBMLHelper *getInstance(BangDBDatabase *bdb, long mem_budget = 0);

To get instance of the ml helper, we call this api. It takes BangDBDatabase as required argument. It takes mem_budget as optional parameter. The mem_budget defines the amount memory we allocate for ML related activities, bangdb will always respect this budget.

This is important when we run db and ml on same box or in embedded mode or when multiple users are using it and we wish to server all of them or when we wish to ensure ml memory overflow doesn't create problem for the users Upon success it returns reference to the ml_helper else NULL.

int createBucket(char *bucket_info);

All intermediate files, models or training/ testing related files are stored within BRS (bangdb resource server) in some bucket. This API allows us to create bucket as defined by the bucket_info which looks like following:

{access_key:, secret_key:, bucket_name:}

void setBucket(char *bucket_info);

This is similar to create_bucket, but if there is existing bucket with the name then it will change that to this bucket.

long uploadFile(char *key, char *fpath, insert_options iop);

This is any ML related file that we wish to further use for training or testing or prediction. Key is the id for the file and fpath takes the path to the file including the file name. Please note it uploads in the default bucket. To load in a particular bucket, please use other API described below It returns 0 for success else -1 for error. Please see AI section for more information.

char *trainModel(char *req);

This is to train a model. It takes a training request and returns status of the training request. The training request looks like following:

{
   "schema-name":"id",
   "algo_type":"SVM",
   "algo_param":{
      "svm_type":1,
      "kernel":2,
      "degree":3,
      "gamma":0.2,
      "cost":1.1,
      "cache_size":50,
      "probability":0,
      "termination_criteria":0.001,
      "nu":0.5,
      "coef0":0.1
   },
   "attr_list":[
      {
         "name":"a1",
         "position":1
      },
      {
         "name":"a2",
         "position":2
      }
   ],
   "training_details":{
      "training_source":"infile",
      "training_source_type":"FILE",
      "file_size_mb":110,
      "train_speed":1
   },
   "scale":"Y/N",
   "tune_param":"Y/N",
   "attr_type":"NUM/STR",
   "re_format":"JSON",
   "custom_format":{
      "name":"ts_rollup",
      "fields":{
         "ts":"ts",
         "quantity":"qty",
         "entityid":"eid"
      },
      "aggr_type":2,
      "gran":1
   },
   "model_name":"my_model1",
   "udf":{
      "name":"udf_name",
      "udf_logic":1,
      "bucket_name":"udf_bucket"
   }
}

Please see AI section for more information.

char *getModelStatus(char *req);

This is to get the status of the model when training request is fired. Req input parameter is like following:

req = {"schema-name":, "model_name": }

And the return value is like following:

{"schema-name":, "model_name":, "train_start_ts":, "train_end_ts":, "train_state":}

The train_state actually tells the status of the model. The value for train_state are as following:

enum ML_BANGDB_TRAINING_STATE {
   //error
   ML_BANGDB_TRAINING_STATE_INVALID_INPUT = 10, 
   ML_BANGDB_TRAINING_STATE_NOT_PRSENT,
   ML_BANGDB_TRAINING_STATE_ERROR_PARSE,
   ML_BANGDB_TRAINING_STATE_ERROR_FORMAT,
   ML_BANGDB_TRAINING_STATE_ERROR_BRS,
   ML_BANGDB_TRAINING_STATE_ERROR_TUNE,
   ML_BANGDB_TRAINING_STATE_ERROR_TRAIN,
   ML_FILE_TYPE_ERROR_VAL_TESTDATA,
   ML_FILE_TYPE_ERROR_VAL_TRAINDATA,
   ML_BANGDB_TRAINING_STATE_LIMBO,

   //intermediate states
   ML_BANGDB_TRAINING_STATE_BRS_GET_PENDING,
   ML_BANGDB_TRAINING_STATE_BRS_GET_DONE,
   ML_BANGDB_TRAINING_STATE_REFORMAT_DONE,
   ML_BANGDB_TRAINING_STATE_SCALE_TUNING_DONE,
   ML_BANGDB_TRAINING_STATE_BRS_MODEL_UPLOAD_PENDING,
   //training done
   ML_BANGDB_TRAINING_STATE_TRAINING_DONE, //25
   ML_BANGDB_TRAINING_STATE_DEPRICATED,
};

The above is true for ML related model status. For IE (Information Extraction) related model status use following:

enum IE_BANGDB_TRAINING_STATE {
   //error
   IE_BANGDB_TRAINING_STATE_INVALID_INPUT = 10,
   IE_BANGDB_TRAINING_STATE_NOT_PRSENT,
   IE_BANGDB_TRAINING_STATE_ERROR_BRS,
   IE_BANGDB_TRAINING_STATE_ERROR_HELPER_FILES,
   IE_BANGDB_TRAINING_STATE_ERROR_BRS_FEATURE_EX,
   IE_BANGDB_TRAINING_STATE_ERROR_BRS_HELP_FILES,
   IE_BANGDB_TRAINING_STATE_ERROR_PRE_NER_TRAIN,
   IE_BANGDB_TRAINING_STATE_LIMBO,
   IE_BANGDB_TRAINING_STATE_ERROR_NER_TRAIN,
   IE_BANGDB_TRAINING_STATE_ERROR_NER_TRAIN_BRS,
   IE_BANGDB_TRAINING_STATE_ERROR_PRE_REL_TRAIN, //20
   IE_BANGDB_TRAINING_STATE_ERROR_REL_TRAIN,
   IE_BANGDB_TRAINING_STATE_ERROR_REL_TRAIN_BRS,
   IE_BANGDB_TRAINING_STATE_ERROR_REL_LIST_BRS,
   IE_FILE_TYPE_ERROR_VAL_TRAINDATA,
   IE_FILE_TYPE_ERROR_VAL_TESTDATA,
   IE_FILE_TYPE_ERROR_VAL_CLASSDATA,
   IE_FILE_TYPE_ERROR_VAL_TOTALEXDATA,

   //intermediate states
   IE_BANGDB_TRAINING_STATE_BRS_GET_PENDING,
   IE_BANGDB_TRAINING_STATE_BRS_GET_DONE,
   IE_BANGDB_TRAINING_STATE_HELPER_DONE, //30
   IE_BANGDB_TRAINING_STATE_PRE_NER_DONE,
   IE_BANGDB_TRAINING_STATE_NER_DONE,
   IE_BANGDB_TRAINING_STATE_PRE_REL_DONE,
   IE_BANGDB_TRAINING_STATE_REL_DONE,
   IE_BANGDB_TRAINING_STATE_BRS_MODEL_UPLOAD_PENDING,
   IE_BANGDB_TRAINING_STATE_BRS_RELLIST_UPLOAD_PENDING,
   //training done
   IE_BANGDB_TRAINING_HELP_DONE, //37
   IE_BANGDB_TRAINING_STATE_TRAINING_DONE, //38
   IE_BANGDB_TRAINING_STATE_DEPRICATED,
};

Please see AI section for more information.

int delModel(char *req);

This is used to delete the model by passing req parameter. Here is how req looks like:

req = {"schema-name":, "model_name": } int delTrainRequest(char *req);

This is to delete the training request. Helpful when training got stuck for some reasons and the status was not updated properly. Here is how req looks like:

req = {"schema-name":, "model_name": }
//predict request must contain the algo type as well
//void *arg is for sorted list of positions of features 
char *predict(char *req, void *arg = NULL);

The predict API is used to predict for a particular data or event. It takes req as parameter and default parameter arg which describes the sorted position of the different features. It's not required most of the time. Here is how request looks like:

{
   schema-name,
   attr_type:NUM,
   data_type:event,
   re_arrange:N,
   re_format:N,
   model_name:model_name,
   data:"1 1:1.2 2:3.2 3:1.1"
}

{
   schema-name,
   attr_type:NUM,
   data_typee:FILE,
   re_arrange:N,
   re_format:N,
   model_name:model_name,
   data:inputfile
}

{
   schema-name,
   attr_type:NUM,
   data_type:event,
   re_arrange:N,
   re_format:JSON,
   model_name:model_name,
   data:{
      k1:v1,
      k2:v2,
      k3:v3
   }
}

resultset *getTrainingRequests( 
   resultset *prev_rs,
   char *accountid
);

This returns all the training requests made so far for an account (or schema). The prev_rs should be NULL for the first call and for subsequent calls, just pass the previous rs. Upon success it returnss 0 else -1 for error.

char *getRequest(char *req);

This returns request (training) from the ML housekeeping. The request is as follows:

req = {"schema-name":, "model_name": }

It returns response with status or NULL for error or if req not found.

int setStatus(char *status);

This sets the status for a particular train request. The status is as follows:

status = {"schema-name":, "model_name":, "status": }

Upon success it returns 0 else -1 for error.

char *getModelPredStatus(char *req);

Given a request get the prediction status. The req is as follows:

req = {"schema-name":, "model_name": }
retval = {"schema-name":, "model_name":, “pred_req_state”:, “file_name”:}
int delPredRequest(char *req);

Deletes the request. The input param req is as follows:

req = {"schema-name":, "model_name": “file_name”:}

It returns 0 for success and -1 for error.

long uploadFile(char *bucket_info, char *key, char *fpath, insert_options iop);

long downloadFile(char *bucket_info, char *key, char *fname, char *fpath);

It downloads the file from the given bucket, key. It renames the file as "fname" and stores the file at "fpath" It returns 0 for success else -1 for error.

long getObject(char *bucket_info, char *key, char **data, long *datlen);

It gets the object(binary or otherwise) from the given bucket, key. It fills data with the object and sets the datlen as length or size of the object. It returns 0 for success else -1 for error.

int delFile(char *bucket_info, char *key);

This deletes the given file (key) from the given bucket (bucket_info). It returns 0 for success and -1 for error.

int delBucket(char *bucket_info);

This deletes the given bucket. It returns 0 for success and -1 for error.

LONG_T countBuckets();

This returns number of buckets else -1 for error.

int countSlices(char *bucket_info, char *key);

Since BRS (bangdb resource server) stores large files and objects in chunks, therefore we can count how many slices are there for the given file (key) by calling this function. It returns count of slices else -1 for error.

LONG_T countObjects(char *bucket_info);

This counts the number of objects in the given bucket else returns -1 for error.

char *countObjectsDetails(char *bucket_info);

This gives the details of all the objects in the given bucket(bucket_info) else returns NULL for error. Please note it may set error in the returned json value as well.

long countModels(char *accountid);

This counts the models for a given account else returns -1 for error.

int getRefCount();

This returns reference count of all the references of theml_helper held by different objects.

// this is to test if brs is local to the BE server DB
bool isBrsLocal();

This is useful to know if brs (bangdb resource server) is local or remote. Mostly used by clients.

char *listObjects(char *bucket_info, char *skey = NULL, int list_size_mb = MAX_RESULTSET_SIZE);

This returns json string with the list of objects in a given bucket for a given key or for all keys (in case of skey is NULL). It may return NULL for error as well. list_size_mb defines the max size of the list, by default it would return 2MB of data or less.

// returns json with the name of buckets, else error 
{"access_key":"akey", "secret_key":"skey"}
char *listBuckets(char *user_info);

This returns the list of all buckets for the user given by user_info which looks like following:

{"access_key":"akey", "secret_key":"skey"}

It may return NULL as well in case of error

int closeBangDBMLHelper(bool force = false);

This closes the bangdb ml helper. Since reference count is maintained within the ml_helper therefore if force is set as false and there are open references then it would not close the ml_helper.

But if force is set to be true or number of references is 0 then it will close the ml_helper. It returns 0 for success else -1 for error.