Algorithms and Their Hyperparameters
ALGORITHMS and PARAMETERS | Description
---|---
Scaling | Scaling before applying ML algorithms is very important. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges, and to avoid numerical difficulties during the calculation. We perform linear scaling to the range [-1, 1]. |
Tune Param | Proper choice of C and gamma is critical to the SVM's performance, so the user is advised to select Tune Param. Selecting Tune Param runs cross-validation with a grid search over C and gamma; the speed setting is the step length between two consecutive candidate values of C and gamma. A combined scaling and tuning sketch follows this table. |
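As a minimal sketch of these two steps (assuming a Python environment with scikit-learn, which wraps LIBSVM, is available; the synthetic data and the powers-of-two grids below are illustrative rather than the tool's exact defaults), each attribute is linearly scaled to [-1, 1] and then C and gamma are tuned by cross-validated grid search:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Illustrative synthetic data standing in for the user's training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling: linearly rescale every attribute to the range [-1, 1].
# Tune Param: cross-validated grid search over C and gamma; the spacing of the
# exponents plays the role of the "speed" (step length) setting.
pipe = Pipeline([
    ("scale", MinMaxScaler(feature_range=(-1, 1))),
    ("svc", SVC(kernel="rbf")),
])
param_grid = {
    "svc__C": [2.0 ** k for k in range(-5, 14, 2)],      # roughly [0.03125, 8192]
    "svc__gamma": [2.0 ** k for k in range(-13, 4, 2)],  # roughly [0.000122, 8]
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```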
CLASSIFICATION | ||||
Parameters | Values | Definition | Tips
---|---|---|---
SVM_Type | 0 - C-SVC | C-SVC, Nu-SVC and One-class SVM perform binary and multi-class classification on a dataset. C-SVC and Nu-SVC are similar methods, but they accept slightly different sets of parameters and have different mathematical formulations. The One-class SVM algorithm learns a decision function for novelty detection: classifying new data as similar or different to the training set. A sketch of all three types follows this table. |
1 - Nu-SVC | ||||
2 - One-class SVM
Kernel Type | 0 - Linear | linear: u'*v | The Radial Basis Function is a general-purpose kernel, used when there is no prior knowledge about the data, because 1. the linear kernel is a special case of RBF, since the linear kernel with a penalty parameter C performs the same as the RBF kernel with some parameters (C, gamma); 2. the RBF kernel has fewer hyperparameters than the polynomial kernel, which reduces the complexity of model selection. There are some situations where the RBF kernel is not suitable; in particular, when the number of features is very large, one may just use the linear kernel.
1 - Polynomial | polynomial: (gamma*u'*v + coef0)^degree | |||
2 - RBF | radial basis function: exp(-gamma*|u-v|^2) This kernel nonlinearly maps samples into a higher-dimensional space, so it, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear.
3 - Sigmoid | sigmoid: tanh(gamma*u'*v + coef0) | |||
Gamma | [0.000122,8] | gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected. | ||
Degree | Degree of the polynomial kernel function. Ignored by all other kernels. | |||
Coef0 | Independent term in the kernel function. It is only significant in 'polynomial' and 'sigmoid'.
Cost (C) | [0.031250,8192] | The parameter C trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. As C increases, the tendency to misclassify training data decreases (which may lead to overfitting). | C is 1 by default and is a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
NU | (0,1] | A hyperparameter for Nu-SVC, One-class SVM and Nu-SVR, similar to C. Nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the number of support vectors determines the run time). Example: if we want the training error to be less than 1%, then nu is 0.01 and the number of support vectors will be more than 1% of the total records. | Nu approximates the fraction of training errors and support vectors.
Cachesize | For C-SVC, SVR, Nu-SVC and Nu-SVR, the size of the kernel cache has a strong impact on run times for larger problems. | If you have enough RAM available, it is recommended to set the cache size to a value higher than the default of 200 MB, such as 500 MB or 1000 MB.
Termination Criterion | Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model. | |||
Shrinking | The shrinking heuristic is there to save training time. It sometimes helps and sometimes does not; it is a matter of runtime rather than convergence. | We found that if the number of iterations is large, then shrinking can shorten the training time.
Probability_Estimates | Whether to enable probability estimates. | |||
nr_weight | nr_weight is the number of elements in the arrays weight_label and weight. Each weight[i] corresponds to weight_label[i], meaning that the penalty of class weight_label[i] is scaled by a factor of weight[i].
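The sketch below illustrates the three classification SVM types using scikit-learn's LIBSVM-based classes, which is an assumption about the environment rather than this tool's own interface; the dataset and parameter values are purely illustrative. The cache_size, shrinking, tol, probability and class_weight arguments correspond to Cachesize, Shrinking, Termination Criterion, Probability_Estimates and the per-class weights (weight_label / weight) described above.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, NuSVC, OneClassSVM

# Illustrative synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 0 - C-SVC: cache_size, shrinking, tol, probability and class_weight map to
# Cachesize, Shrinking, Termination Criterion, Probability_Estimates and the
# per-class weights described above.
c_svc = SVC(kernel="rbf", C=1.0, gamma=0.5, cache_size=500, shrinking=True,
            tol=1e-3, probability=True, class_weight={0: 1.0, 1: 2.0})
c_svc.fit(X, y)

# 1 - Nu-SVC: nu upper-bounds the fraction of training errors and
# lower-bounds the fraction of support vectors.
nu_svc = NuSVC(kernel="rbf", gamma=0.5, nu=0.1).fit(X, y)

# 2 - One-class SVM: novelty detection trained on data from a single class.
occ = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X[y == 0])

print(c_svc.n_support_, nu_svc.n_support_, occ.predict(X[:5]))
```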
REGRESSION | ||||
Parameters | Values | Definition | Tips
---|---|---|---
SVM_Type | 3 - Epsilon-SVR | The Nu parameter in Nu-SVR can be used to control the number of support vectors in the resulting model. In ϵ-SVR, however, you have no control over how many data vectors from the dataset become support vectors; it could be a few, it could be many. Nonetheless, you have total control over how much error you allow your model to have: anything beyond the specified ϵ is penalized in proportion to C, the regularization parameter. A sketch of both SVR types follows this table. |
4 - Nu-SVR | ||||
Kernel Type | 0 - Linear | linear: u'*v | The Radial Basis Function is a general-purpose kernel, used when there is no prior knowledge about the data, because 1. the linear kernel is a special case of RBF, since the linear kernel with a penalty parameter C performs the same as the RBF kernel with some parameters (C, gamma); 2. the RBF kernel has fewer hyperparameters than the polynomial kernel, which reduces the complexity of model selection. There are some situations where the RBF kernel is not suitable; in particular, when the number of features is very large, one may just use the linear kernel.
1 - Polynomial | polynomial: (gamma*u'*v + coef0)^degree | |||
2 - RBF | radial basis function: exp(-gamma*|u-v|^2) This kernel nonlinearly maps samples into a higher-dimensional space, so it, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear.
3 - Sigmoid | sigmoid: tanh(gamma*u'*v + coef0) | |||
Gamma | [0.000122,8] | gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected. | ||
Degree | Degree of the polynomial kernel function. Ignored by all other kernels. | |||
Coef0 | Independent term in the kernel function. It is only significant in 'polynomial' and 'sigmoid'.
Cost (C) | [0.031250,8192] | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty. | C is 1 by default and it's a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization. | |
NU | (0,1] | A hyperparameter for Nu-SVC, One-class SVM and Nu-SVR, similar to C. Nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the number of support vectors determines the run time). Example: if we want the training error to be less than 1%, then nu is 0.01 and the number of support vectors will be more than 1% of the total records. | Nu approximates the fraction of training errors and support vectors.
Epsilon_SVR (P) | Epsilon in the epsilon-SVR model. It specifies the epsilon-tube: points predicted within a distance epsilon from the actual value incur no penalty in the training loss function.
Cachesize | For C-SVC, Epsilon-SVR, Nu-SVC and Nu-SVR, the size of the kernel cache has a strong impact on run times for larger problems. | If you have enough RAM available, it is recommended to set the cache size to a value higher than the default of 200 MB, such as 500 MB or 1000 MB.
Termination Criterion | Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model. | |||
Shrinking | The shrinking heuristic is there to save training time. It sometimes helps and sometimes does not; it is a matter of runtime rather than convergence. | We found that if the number of iterations is large, then shrinking can shorten the training time.
Probability_Estimates | Whether to enable probability estimates. | |||
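A minimal sketch of the two regression SVM types, again assuming scikit-learn's LIBSVM-based classes and an illustrative synthetic dataset: epsilon fixes the tube width in Epsilon-SVR, while nu bounds the fraction of support vectors in Nu-SVR.

```python
from sklearn.datasets import make_regression
from sklearn.svm import SVR, NuSVR

# Illustrative synthetic regression data.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# 3 - Epsilon-SVR: errors inside the epsilon-tube are not penalised, and the
# number of support vectors is whatever the optimisation produces.
eps_svr = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1,
              cache_size=500, shrinking=True, tol=1e-3).fit(X, y)

# 4 - Nu-SVR: nu bounds the fraction of support vectors instead of fixing the
# tube width directly.
nu_svr = NuSVR(kernel="rbf", C=1.0, gamma=0.5, nu=0.3).fit(X, y)

print(len(eps_svr.support_), len(nu_svr.support_))
```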
LINEAR REGRESSION | ||||
Parameters | Values | Definition
---|---|---
Solver | 11 - L2-regularized L2-loss SVR primal | We have 3 linear regression solvers, obtained by combining loss functions and solution schemes: the regularization is L2, and the loss can be the L2-loss (squared epsilon-insensitive loss) or the L1-loss (epsilon-insensitive loss), solved in the primal or in the dual. The default value for type is 11. A sketch mapping these types onto code follows this table.
12 - L2-regularized L2-loss SVR dual | ||||
13 - L2-regularized L1-loss SVR dual | ||||
Cost (C) | The parameter C trades off errors on training examples against simplicity of the model. A low C keeps the regression function smooth, while a high C aims at fitting all training examples closely. As C increases, the training error decreases (which may lead to overfitting).
Epsilon_SVR (P) | Epsilon in the epsilon-SVR model. It specifies the epsilon-tube: points predicted within a distance epsilon from the actual value incur no penalty in the training loss function.
Termination Criterion | Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model. | |||
Folds | V-fold for cross-validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially, one subset is tested using the model trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once, so the cross-validation accuracy is the percentage of data which are correctly classified.
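The sketch below shows how two of these solver types might be exercised through scikit-learn's LinearSVR, which wraps LIBLINEAR; the mapping of loss/dual settings onto types 11 and 13, as well as the data and parameter values, are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVR

# Illustrative synthetic regression data.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Squared epsilon-insensitive loss solved in the primal; this should map to
# solver 11 (L2-regularized L2-loss SVR primal).
svr_11 = LinearSVR(loss="squared_epsilon_insensitive", dual=False,
                   C=1.0, epsilon=0.1, tol=1e-4).fit(X, y)

# Plain epsilon-insensitive loss, which this wrapper only solves in the dual;
# this should map to solver 13 (L2-regularized L1-loss SVR dual).
svr_13 = LinearSVR(loss="epsilon_insensitive", dual=True,
                   C=1.0, epsilon=0.1, tol=1e-4).fit(X, y)

# Folds: v-fold cross-validation, here with v = 5.
print(cross_val_score(svr_13, X, y, cv=5).mean())
```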
LINEAR CLASSIFICATION | ||||
Parameters | Values | Definition
---|---|---
Solver | 0 - L2-regularized logistic regression primal, 1 - L2-regularized L2-loss SVC dual, 2 - L2-regularized L2-loss SVC primal, 3 - L2-regularized L1-loss SVC dual, | We have 8 linear classification solvers, obtained by combining several types of loss functions and regularization schemes. The regularization can be L1 or L2, and the losses can be the L2-loss for SVM (squared hinge loss), the L1-loss for SVM (hinge loss), or the logistic loss for logistic regression. The default value for type is 0. A sketch of two of these solvers follows this table.
4 - Support Vector Classification by Crammer and Singer | ||||
5 - L1-regularized L2-loss SVC, 6 - L1-regularized logistic regression, 7 - L2-regularized logistic regression dual
Cost (C) | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty. | |||
Bias | Consider: w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + … + w_bias * x_bias = 0. Here the x are the feature values and the w are the trained "weights". The additional feature x_bias is a constant whose value is equal to the bias.
Termination Criterion | Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model. | |||
Folds | V-fold for Cross Validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially one subset is tested using the classifier trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once so the cross-validation accuracy is the percentage of data which are correctly classified. | |||
nr_weight | nr_weight is the number of elements in the arrays weight_label and weight. Each weight[i] corresponds to weight_label[i], meaning that the penalty of class weight_label[i] is scaled by a factor of weight[i].
Weight (wi) | Sets the parameter C of class i to weight*C, for C-SVC.
Weight_Label | These weights are used to change the penalty for specific labels (classes). If the weight for a label is not changed, it is set to 1.0. | |||
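A minimal sketch, assuming scikit-learn's LIBLINEAR-backed classes: LogisticRegression with the liblinear solver should correspond to solver 0, and LinearSVC with squared hinge loss in the dual to solver 1; intercept_scaling plays the role of the Bias feature and class_weight the role of Weight / Weight_Label. The data and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Illustrative synthetic data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Should correspond to solver 0 (L2-regularized logistic regression, primal);
# intercept_scaling is the constant feature x_bias from the Bias row above.
logreg = LogisticRegression(penalty="l2", dual=False, C=1.0, tol=1e-4,
                            solver="liblinear", intercept_scaling=1.0).fit(X, y)

# Should correspond to solver 1 (L2-regularized L2-loss SVC, dual); the
# per-class weights scale the penalty C of class 1 by a factor of 2.
lsvc = LinearSVC(penalty="l2", loss="squared_hinge", dual=True, C=1.0,
                 tol=1e-4, class_weight={0: 1.0, 1: 2.0}).fit(X, y)

# Folds: 5-fold cross-validation accuracy, i.e. the fraction correctly classified.
print(cross_val_score(lsvc, X, y, cv=5).mean())
```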
K-MEANS | ||||
Parameters | values | Definition | ||
Kernel Type | 0. LINEAR | linear: u'*v | ||
1. POLYNOMIAL | polynomial: (gamma*u'*v + coef0)^degree | |||
2. RBF | radial basis function: exp(-gamma*|u-v|^2) This kernel nonlinearly maps samples into a higher-dimensional space, so it, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear.
3. SIGMOID | sigmoid: tanh(gamma*u'*v + coef0) | |||
Gamma | gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected. | |||
Coef0 | Independent term in the kernel function. It is only significant in 'polynomial' and 'sigmoid'.
Degree | Degree of the polynomial kernel function. Ignored by all other kernels.
Dimension (Number of Attributes) | Number of input attributes / columns in the training data set | |||
Number of Centers | Number of clusters. A clustering sketch follows this table.
Stopping Criteria | Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model. | |||
Number of Rows | Total number of records / rows in the training data |
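Since an exact kernel k-means implementation is not assumed to be available here, the sketch below approximates it with scikit-learn: an approximate RBF kernel feature map followed by ordinary k-means. The data, gamma, number of centers and tolerance are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import Nystroem

# Illustrative data: Number of Rows = 300, Dimension (Number of Attributes) = 4.
rng = np.random.RandomState(0)
X = rng.randn(300, 4)

# Approximate an RBF kernel feature map (Gamma parameter), then run ordinary
# k-means on the mapped data as a stand-in for kernel k-means.
feature_map = Nystroem(kernel="rbf", gamma=0.5, n_components=100, random_state=0)
X_mapped = feature_map.fit_transform(X)

# Number of Centers = 3; tol plays the role of the Stopping Criteria.
km = KMeans(n_clusters=3, tol=1e-4, n_init=10, random_state=0).fit(X_mapped)
print(np.bincount(km.labels_))
```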