Some of the rarely shared trade secrets in machine learning: Original post: on linkedin
1. Bootstrap sampling & the magic number 0.63 Even though randomly sampled, bootstrapped samples account to roughly 63% of the overall instances of data !
2. Random search or Grid search 60 randomly sampled parameter checkpoints from the grid can be just as good as the computationally expensive 'grid search' !
4. Selecting Kernels for Support Vector Machines: When do linear kernels work even better than the complex polynomial kernels
5. Testing ML systems : Don't ship it, just because it compiles
6. Bias Variance Tradeoff & Debugging models: The learning curve - Art of figuring out if you need more instances or more dimensions
1. Bootstrapping & the magic number 0.63 Bootstrap sampling is performed by randomly sampling data with replacement; which can be easily modeled as binomial trials where an instance from the data being selected be treated as "success". If there are n number of instances in data, probability of 'success' is 1/n and for the failure, its (n-1)/n. for sample size b, the probability of exactly x number of successes would be,
In the specific case of a bootstrap sample, the sample size b equals the number of instances n. Letting n approach infinity, we get:
for x = 0, the probability is roughly 1/e ~ 0.368 . Thus the probability of the instance being selected atleast once is 1-1/e ~ 0.632
2. Grid Search or random search : Grid search and random search are the methods to tune hyper parameters and look for the optimal one. Grid search is computationally expensive as it checks for all the possible combinations of the parameters specified and evaluates on the same. Lets say if two parameters are A and B , and the possible ranges specified are 0-2 and 0-3 respectively ; The possible combinations in the parameter space in case of grid search would be (0,0) (0,1) (0,2) (0,3) ...........(2,2) (2,3). Although grid search can be made to run in parallel, still the technique is not computationally very efficient . Whereas, randoms search has a stark difference from grid search; instead of searching over all the possible combinations, it just searches over a random set of points on the grid. Random search was not taken very seriously at the start, it was considered sub-optimal as it does not search over all the possible grid points.
60 randomly sampled points from the grid can be just as good as the computationally expensive 'grid search' !
Lets consider the 5 % interval around the true maximum, now we sample points and check if it lands inside that maximum. Each random draw would have a probability of 0.05 to land in that interval and 0.95 to miss. If we draw n points independently, then the probability that none of them misses would be 1-(0.95^n ) If we look for a minimum of 0.95 probability of success, 1- 0.95^n > 0.95 => n >=60
Usually a larger value of k is preferred, as it ensures a larger number of performance estimates and training size in case of large k fold would be closer to the complete data set. However, with the increase in k, the overlap between training sets also increases. For example, with 5-fold cross-validation, each training set shares only 3∕4 of its instances with each of the other four training sets whereas with 10-fold cross- validation, each training set shares 8 ∕ 9 of its instances with each of the other 9 training sets.
Also , with the increasing K, the test data size decreases, which causes the performance metrics to be lesser precise. e.g if there are 10 instances in test data, one can only measure accuracy to the nearest 10% , but if the test data has 20 instances, the accuracy can be measured up to the nearest 5%.
Considering all these factors, k=10 is considered to be the ideal number of folds used for cross validation, something which is observed to perform well in most of the scenarios . Can this hypothesis be mathematically proved as well?
4. How to apply SVM on a real world example?
a.Convert categorical variables into numeric data by using binary dummies. For example, if there is a variable gender. There are two possible values for this categorical attribute. We can create binary dummies Gender_male and gender_female as two variables replacing the variable gender. Thus each of the instance where the “Gender” was male , in place of that we will Gender_male as 1 and gender_female as 0 .
b. Scale the data appropriately: it is important to scale or normalize the data before applying SVM to counter any computational challenges.Also, it may lead to issues if there is huge variance in data.
You can randomly select a kernel function and estimate the best parameters for the chosen kernel. Test the performance and further tune or select a different kernel function. Normally, radial basis function RBF is the first choice of kernel. It performs reasonably well when the decision boundary is non-linear. As it can effectively map the samples into the high dimensional space, which linear classifier can not do. Also, the number of hyper parameters that we need to optimize in order to get the best set of parameters to apply in case of RBF is lesser than the other polynomial kernel functions.
It is also important to note that, when the dimensions are very high compared to the number of instances available to train on, linear kernel would work better than RBF. When to apply a linear kernel? 1. When number of dimensions is significantly larger than number of instances: In such a case , an explicit mapping to high dimensional space does not add any value. Thus even a linear kernel performs better.
2. When the dimensions as well as instances are large in number.
3. Number of instances is significantly larger than the number of dimensions.
5. Testing ML systems : Don't ship it, just because it compiles
It is important to perform tests on ML systems before putting into production:
For a generative learner,write a generator and generate training/test data from the assumed distribution – The learner should (usually) recover the actual parameters of the generator, given enough data – Test it on the abnormal cases(eg, uniform class priors, highly skewed multinomials)
For a discriminative learner, apply the same as above, and also check for : Does taking one gradient step (on a sample task) lower the loss on the training data? Does it lower the loss as expected? (f(x)-f(x+d))/d should approximate f’(x) Does regularization work as expected? Record training set/test set loss – with and without regularization