BangDB has ML natively integrated within it. This means ML is part of the system and performs work both implicitly and explicitly: within BangDB we have support for training, testing, versioning, storing, and deploying models for prediction, in a continuous or ad-hoc manner.

While BangDB implements and supports certain algorithms natively in C/C++, it also allows users to bring their own model, framework, or code to run as part of the system.

Users may leverage frameworks such as TensorFlow, PyTorch, etc. as they like or require. Since one of the major problems in ML is dealing with large files, whether training/test files or the models themselves, BangDB uses BRS to help users with these.

Let's now go through the commands that the CLI supports for ML.

// Train, predict, deploy models using cli
train model model_name
train model from model_name
show models
show models where schema = "myschema"
show status where schema = "myschema" and model = "mymodel"
select treq from bangdb_ml where schema = "myschema" and model = "mymodel"
select treq from bangdb_ml where schema = "myschema"
delete treq from bangdb_ml where schema = "myschema" and model = "mymodel"
update bangdb_ml set status = 25 where schema = "myschema" and model = "mymodel"
drop model mymodel where schema = "myschema"
pred model model_name

Train model

BangDB trains models based on training instructions (metadata) defined in JSON format. We can write the metadata in a text editor, save it as a file, and use that file directly to train, or we can start a workflow on the CLI which will eventually create the metadata and train the model. Here is the training metadata format that BangDB uses for training models.

Let's take a look at a schema for training a classification model using SVM:

training request : {
   "algo_param":{
      "termination_criteria":0.1,
      "degree":0,
      "svm_type":2,
      "kernel_type":2,
      "gamma":0.001,
      "shrinking":0
   },
   "attr_type":1,
   "tune_params":1,
   "scale":1,
   "schema-name":"myschema",
   "training_details":{
      "file_size_mb":1,
      "input_format":"SVM",
      "expected_format":"SVM",
      "train_speed":2,
      "training_source":"svmguide1",
      "training_source_type":1
   },
   "attr_list":[
      {
         "name":"a",
         "position":0
      },
      {
         "position":1,
         "name":"b"
      },
      {
         "position":2,
         "name":"c"
      },
      {
         "position":3,
         "name":"d"
      },
      {
         "name":"e",
         "position":4
      }
   ],
   "algo_type":"SVM",
   "model_name":"model1"
}
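The same metadata can also be generated programmatically instead of being written by hand. Below is a minimal, illustrative Python sketch that builds the training request shown above and saves it to a file; the field names and values are taken directly from the example, while the comments on the numeric codes are assumptions based on the CLI prompts later in this walkthrough:

```python
import json

# Attribute names in training-file column order (from the example above).
attr_names = ["a", "b", "c", "d", "e"]

train_req = {
    "schema-name": "myschema",
    "model_name": "model1",
    "algo_type": "SVM",
    "algo_param": {
        "svm_type": 2,                # matches ONE_CLASS (2) in the CLI prompt
        "kernel_type": 2,             # matches RBF (2) in the CLI prompt
        "degree": 0,
        "gamma": 0.001,
        "shrinking": 0,
        "termination_criteria": 0.1,  # stopping criteria (eps)
    },
    "attr_type": 1,                   # NUM
    "tune_params": 1,
    "scale": 1,
    "training_details": {
        "training_source": "svmguide1",
        "training_source_type": 1,    # local file
        "file_size_mb": 1,
        "input_format": "SVM",
        "expected_format": "SVM",
        "train_speed": 2,
    },
    # attr_list maps each attribute name to its position in the training file.
    "attr_list": [{"name": n, "position": i} for i, n in enumerate(attr_names)],
}

with open("train_req.json", "w") as f:
    json.dump(train_req, f, indent=2)
```

The saved file can then be used directly for training, as mentioned above.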

Let's train this model using the workflow first:

train model model1
what's the name of the schema for which you wish to train the model?: myschema
do you wish to read earlier saved ml schema for editing/adding? [ yes | no ]:

Since we are creating a new model and have no metadata saved on disk, we select 'no' (or just press Enter) and move on. The CLI now lists all the natively supported algorithms, plus a "Custom (5)" option for cases where we wish to use another framework.

We will pick Classification (1)

BangDB supports following algorithm, pls select from these
Classification (1) | Regression (2) | Lin-regression/Classification (3) |
Kmeans (4) | Custom (5) | IE - ontology (6) | IE - NER (7) | IE - Sentiment (8) |
IE - KB (9) | DL - resnet (10) | DL - lenet (11) | DL - face detection (12) |
DL - shape detection (13) | SL - object detection (14)
what's the algo would you like to use (or Enter for default (1)): 1

Based on the selected algo, the CLI asks for the relevant parameters:

svm type [ C_SVC (0) | NU_SVC (1) | ONE_CLASS (2) ] (press enter for default (0)): 2
kernel type [ LINEAR (0) | POLY (1) | RBF (2) | SIGMOID (3) ] (press enter for default (0)): 2
degree (press enter for default (3): 
enter gamma (or press enter for default (0.001)):
enable shrinking? [ yes | no ]: 
what's the stopping criteria (eps) (or press enter for default (0.001)): 0.1
what's the input (training data) source? [ local file (1) | file on BRS (2) | stream (3) ] (press enter for default (1)): 1
enter the file name for upload (along with full path): trainfiles/svmguide1
what is the input data format for the train data [ LIBSVM (0) | CSV (1) | JSON (3) ] (press Enter for default 1): 0
what's the training speed you wish to select [ Very fast (1) | fast (2) | medium (3) | slow (4) | very slow (5) ] (or Enter for default (1)): 2
what's the attribute type [ NUM (1) | STRING (2) | HYBRID (3) ] (press enter for default (1)): 1
do you wish to scale the data? [ yes | no ]: yes
do you wish to tune the params? [ yes | no ]: yes

Finally, we can also do attribute mapping here. This is useful when the format of our data differs from the format needed by the algo, so that the db can do the transformation accordingly before training or prediction.

It also helps with training and prediction on streams, since a subset of the event fields can be used for training and prediction.

We need to do the mapping so the model can be used on streams later. This means we need to provide each attribute's name and its position in the training file.

attr name: a
attr position: 0
do you wish to add more attributes? [ yes | no ]: yes
attr name: b
attr position: 1
do you wish to add more attributes? [ yes | no ]: yes
attr name: c
attr position: 2
do you wish to add more attributes? [ yes | no ]: yes
attr name: d
attr position: 3
do you wish to add more attributes? [ yes | no ]: yes
attr name: e
attr position: 4
do you wish to add more attributes? [ yes | no ]: 
do you wish to add external udf to do some computations before the training? [ yes | no ]:

Once we respond to this last question, the model starts training.

To check the status of the training, we can use either of the following:

show models
+---------------+----------+----+------------+-----------+------------------------+------------------------+ 
|key            |model name|algo|train status|schema name|train start time        |train end time          |
+---------------+----------+----+------------+-----------+------------------------+------------------------+
|myschema:model1|model1    | SVM|passed      |myschema   |Wed Feb 3 13:44:47 2021 |Wed Feb 3 13:44:59 2021 |
+---------------+----------+----+------------+-----------+------------------------+------------------------+

The above will show details for all models.

To check the status of a specific model:

show status where schema = "myschema" and model = "model1"
{
    "schema-name":"myschema",
    "model_name":"model1",
    "train_req_state":25
}

Now let's run a test prediction.

Predict for a single test event

pred model model1
what's the name of the schema for which mode was trained?: myschema
do you wish to see the train request? [ yes | no ]: no
model algo type is [ SVM ] it needs [ NUM ] data type with [ LIBSVM ] input data format
what is the input data format for the given pred file [ LIBSVM (0) | CSV (1) | JSON (3) ] (press Enter for default 0): 0
do you wish to provide attribute list? [ yes | no ]: no
do you wish to consider the target (are you also supplying target value?) [ yes | no ]: no
do you wish to pred for file? or single event? [ yes (file) | no (single event) ]: no 
enter the test data: 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
pred request = {"input_format":"SVM","expected_format":"SVM","schema-name":"myschema","model_name":"model1","algo_type":"SVM","attr_type":1,"consider_target":0,"data_type":2,"data":"1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02"}
{"predict_labels":1,"user_pred_accuracy":100,"errorcode":0}
success

We supplied the event "1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02" in LIBSVM format, hence no conversion was needed. Next, let's supply the event in CSV format and ask the db to do the conversion.

pred model model1
what's the name of the schema for which mode was trained?: myschema
do you wish to see the train request? [ yes | no ]:
model algo type is [ SVM ] it needs [ NUM ] data type with [ LIBSVM ] input data format
what is the input data format for the given pred file [ LIBSVM (0) | CSV (1) | JSON (3) ] (press Enter for default 0): 1
what is the separator (SEP) for the csv file? (press Enter for default ',' (comma) else type it):
do you wish to provide attribute list? [ yes | no ]:
do you wish to consider the target (are you also supplying target value?) [ yes | no ]:
do you wish to pred for file? or single event? [ yes (file) | no (single event) ]:
enter the test data: 26,58,-0.02,125
pred request = {"input_format":"CSV","SEP":",","expected_format":"SVM","schema-name":"myschema","model_name":"model1","algo_type":"SVM","attr_type":1,"consider_target":0,"data_type":2,"data":"26,58,-0.02,125"}
{"predict_labels":1,"user_pred_accuracy":0,"errorcode":0}
success

Here we selected 1 for the input data format and gave the event in CSV: "26,58,-0.02,125". Now let's predict using a test file.
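The conversion the db performs here follows the standard LIBSVM convention: 1-indexed "index:value" pairs, with zero-valued features conventionally omitted. A minimal Python sketch of that mapping, shown for illustration only (the actual conversion happens inside BangDB):

```python
def csv_to_libsvm(line, sep=",", target=None):
    """Convert one CSV row to LIBSVM form: '<target> i:v i:v ...'.

    Features are 1-indexed; zero values are dropped, per LIBSVM convention.
    If no target is supplied, only the feature pairs are returned.
    """
    values = [float(v) for v in line.split(sep)]
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0)
    return f"{target} {feats}" if target is not None else feats

# The CSV event from the session above:
print(csv_to_libsvm("26,58,-0.02,125"))  # -> 1:26 2:58 3:-0.02 4:125
```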

pred model model1
what's the name of the schema for which mode was trained?: myschema 
do you wish to see the train request? [ yes | no ]: 
model algo type is [ SVM ] it needs [ NUM ] data type with [ LIBSVM ] input data format 
what is the input data format for the given pred file [ LIBSVM (0) | CSV (1) | JSON (3) ] (press Enter for default 0):
do you wish to provide attribute list? [ yes | no ]: 
do you wish to consider the target (are you also supplying target value?) [ yes | no ]: yes 
do you wish to pred for file? or single event? [ yes (file) | no (single event) ]: yes 
do you wish to upload the file? [ yes | no ]: yes 
enter the test file name for upload (along with full path): trainfiles/svmguide1.t 
pred request = {"input_format":"SVM","expected_format":"SVM","schema-name":"myschema","model_name":"model1","algo_type":"SVM","attr_type":1,"consider_target":1,"data_type":1,"data":"svmguide1.t"}
{"pred_file_out":"model1__myschema__svmguide1.t.predict","errorcode":0}
do you wish to download the test file? [ yes | no ]: yes
test file [ model1__myschema__svmguide1.t.predict ] download successful, it's in the /tmp folder
success
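When scripting around the CLI, the JSON responses shown in the sessions above can be checked programmatically. A minimal Python sketch, using the response string from the single-event prediction exactly as returned:

```python
import json

# Response returned by the single-event prediction session above.
resp = json.loads('{"predict_labels":1,"user_pred_accuracy":100,"errorcode":0}')

# errorcode 0 indicates success, per the sessions shown in this section.
if resp.get("errorcode") == 0:
    print("predicted label:", resp["predict_labels"])
else:
    print("prediction failed, errorcode:", resp["errorcode"])
```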