
The Data Daily

The simplest tidy machine learning workflow

Last updated: 02-13-2020



    ml_wflow <-
      ames %>%
      workflow() %>%
      initial_split(prop = .75) %>%
      # Specify cross-validation
      vfold_cv() %>%
      # Start preprocessing
      recipe(Sale_Price ~ Longitude + Latitude + Neighborhood) %>%
      step_log(Sale_Price, base = 10) %>%
      step_other(Neighborhood, threshold = 0.05) %>%
      step_dummy(recipes::all_nominal()) %>%
      # Define model
      linear_reg(penalty = tune(), mixture = tune()) %>%
      set_engine("glmnet") %>%
      # Define grid of tuning parameters
      tune_grid(grid = 10)

    # ml_wflow shouldn't run anything -- it's just a specification
    # of all the different steps. `fit` should run everything
    ml_wflow <- fit(ml_wflow)

    ml_wflow %>%
      autoplot()

    # Automatically extract best parameters and fit to the training data
    final_model <-
      ml_wflow %>%
      fit_best_model(metrics = metric_set(rmse))

    # Predict on the test data using the last model
    # Everything is bundled into a workflow object
    # and everything can be extracted with separate
    # functions with the same verb
    final_model %>%
      holdout_error()

If you want more details on each step, continue reading :).

A Data Science Workflow

Let's recycle the operations I described above from caret::train and redefine them as general principles:

Data Preparation
- Create a separate training set which represents 75% of the initial sample

Preprocessing (or Feature Engineering, for those liking fancy CS names)
- Center and scale all predictors in the model

Model Training/Tuning
- Identify 10 alpha values (0.1 to 1) and then 10 additional lambda values
- For each parameter set (1 alpha value and another lambda value), run a cross-validation 10 times
- Effectively run 1000 models: 100 parameter combinations (10 alpha * 10 lambda), each one cross-validated 10 times
- Record the validation metrics for each model on the assessment dataset

Validation
- Save the best model in the result together with the optimized tuning parameters

Before we start, let's load the two packages and data we'll use:

    library(AmesHousing)
    # devtools::install_github("tidymodels/tidymodels")
    library(tidymodels)
    ## ── Attaching packages ────────────────────────────────────── tidymodels 0.0.4 ──
    ## ✔ broom     0.5.4          ✔ recipes   0.1.9
    ## ✔ dials     0.0.4          ✔ rsample   0.0.5.9000
    ## ✔ dplyr     0.8.4          ✔ tibble    2.1.3
    ## ✔ ggplot2   3.2.1          ✔ tune      0.0.1
    ## ✔ infer     0.5.1          ✔ workflows 0.1.0.9000
    ## ✔ parsnip   0.0.5.9000     ✔ yardstick 0.0.5
    ## ✔ purrr     0.3.3
    ## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
    ## ✖ purrr::discard()  masks scales::discard()
    ## ✖ dplyr::filter()   masks stats::filter()
    ## ✖ dplyr::lag()      masks stats::lag()
    ## ✖ ggplot2::margin() masks dials::margin()
    ## ✖ recipes::step()   masks stats::step()
    ## ✖ recipes::yj_trans() masks scales::yj_trans()

ames Specify the data is the training set -> Apply preprocessing

Previously, recipes was a bit confusing because there were steps which are not easy to remember: prep the dataset and juice or bake it depending on what you want to do (even more verbose and complex when applying this to a cross-validation set). With the workflows package, these steps have been completely eliminated from the user's mental load.

Model Training/Tuning

Model training and tuning is the step where I think tidymodels brings in too many moving parts. This has been partially ameliorated with workflows. For this step there are three to four packages: parsnip for modelling, workflows for creating modelling workflows, tune for tuning models and yardstick for validating the results. Let's see how they fit together:

    ## Define a regularized regression and explicitly leave the tuning parameters
    ## empty for later tuning.
    lm_mod <-
      parsnip::linear_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
      parsnip::set_engine("glmnet")

    ## Construct a workflow that combines your recipe and your model
    ml_wflow <-
      workflows::workflow() %>%
      workflows::add_recipe(mod_rec) %>%
      workflows::add_model(lm_mod)

The expression above adds much more flexibility, as you can swap models by just changing linear_reg to another model. However, it also adds more complexity. tune() requires you to know about parameters() to extract the parameters to create the grid. For that you have to be aware of the grid_* functions to create a grid of values.
However, these come from the dials package and not the tune package. On top of that, we know that the main functions from the tidyverse always accept and return a data frame, which makes new functions feel familiar. However, all of these moving parts return different things.

Having said that, the actual tuning is done with tune_grid, where we specify the cross-validated set from the first step. Here tune_grid is quite elegant since it allows you to specify either a grid of values or an integer, which it will use to create a grid of parameters:

    res <-
      ml_wflow %>%
      tune::tune_grid(resamples = ames_cv,
                      grid = 10,
                      metrics = yardstick::metric_set(yardstick::rmse))

And finally, you can get the summary of the metrics with collect_metrics:

    res %>% tune::collect_metrics()
    ## # A tibble: 10 x 7
    ##     penalty mixture .metric .estimator  mean     n std_err
    ##       <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
    ##  1 4.89e-10  0.269  rmse    standard   0.143    10 0.00233
    ##  2 2.73e- 9  0.616  rmse    standard   0.143    10 0.00234
    ##  3 5.19e- 8  0.967  rmse    standard   0.143    10 0.00234
    ##  4 7.11e- 7  0.0559 rmse    standard   0.143    10 0.00233
    ##  5 2.31e- 6  0.755  rmse    standard   0.143    10 0.00234
    ##  6 7.22e- 5  0.866  rmse    standard   0.143    10 0.00234
    ##  7 5.87e- 4  0.710  rmse    standard   0.143    10 0.00233
    ##  8 1.83e- 3  0.399  rmse    standard   0.143    10 0.00232
    ##  9 1.84e- 2  0.233  rmse    standard   0.144    10 0.00216
    ## 10 4.92e- 1  0.499  rmse    standard   0.180    10 0.00190

Or choose the best parameters with select_best:

    best_params <-
      res %>%
      tune::select_best(metric = "rmse", maximize = FALSE)

    best_params
    ## # A tibble: 1 x 2
    ##    penalty mixture
    ##      <dbl>   <dbl>
    ## 1 0.000587   0.710

Validation

The final step is to extract the best model and contrast the training and test error. Here workflows offers some convenience: it can replace the tuning placeholders in the model with the best parameters and then fit the result to the complete training data. In caret, this step is completely automated by train, where you can extract the best model even after exploring the results of different tuning parameters.
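Going back to the tuning grid for a moment: the explicit route that tune_grid(grid = 10) spares you can be sketched roughly as follows. This is a minimal sketch, not the code from the original post. It assumes the dials package is installed, and the object name glmnet_grid is made up for illustration. It also builds a regular grid, which is not the same set of values that tune_grid generates from an integer.

```r
library(dials)

# Regular grid: 10 candidate values per parameter, crossed with each other.
# penalty() and mixture() are the dials parameter objects that correspond to
# the two tune() placeholders in linear_reg().
# `glmnet_grid` is a hypothetical name used only in this sketch.
glmnet_grid <- grid_regular(penalty(), mixture(), levels = 10)

# One row per parameter combination: 10 penalty values x 10 mixture values
nrow(glmnet_grid)
names(glmnet_grid)
```

A data frame like this could then be passed to tune_grid through grid = glmnet_grid in place of the integer 10.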
    reg_res <-
      ml_wflow %>%
      # Attach the best tuning parameters to the model
      tune::finalize_workflow(best_params) %>%
      # Fit the final model to the training data
      parsnip::fit(data = ames_train)

    # ames_split is an assumed name for the object returned by initial_split
    ames_test <- rsample::testing(ames_split)

    reg_res %>%
      predict(new_data = ames_test) %>%
      bind_cols(ames_test, .) %>%
      mutate(Sale_Price = log10(Sale_Price)) %>%
      select(Sale_Price, .pred) %>%
      yardstick::rmse(Sale_Price, .pred)
    ## # A tibble: 1 x 3
    ##   .metric .estimator .estimate
    ##   <chr>   <chr>          <dbl>
    ## 1 rmse    standard       0.136

One of the things I don't like about fit in this scenario is that I have to think about specifying the training data again. I understand that the data specified in recipe could even be an empty data frame, as it is used only to detect the column names. However, in nearly all the applications I can think of, I will specify the training data at the beginning (in my recipe). So I find that having to specify the data again is a step that could be eliminated altogether if the data is in the workflow.

What to remember

There are many things to remember from the workflow above. Below is a kind of cheatsheet:

Data Preparation
- rsample::initial_split: splits your data into training/testing
- rsample::training: extract the training data
- rsample::vfold_cv: create a cross-validated set from the training data

Preprocessing (or Feature Engineering, for those liking fancy CS names)
- recipes::recipe: define your formula with the training data
- recipes::step_*: add any preprocessing steps to your data

Model Training/Tuning
- parsnip::linear_reg: define your model.
  This example shows a linear regression but it could be anything else (for example, a random forest)
- tune::tune: leave the tuning parameters empty for later
- parsnip::set_engine: set the engine to run the models (which package to use)
- workflows::workflow: create a workflow object to hold your model/recipe
- workflows::add_recipe: add the recipe to your workflow
- workflows::add_model: add the model to your workflow
- yardstick::metric_set: create a set of metrics
- yardstick::rmse: specify the root-mean-square-error as the loss function
- tune::tune_grid: run the workflow across all resamples with the desired tuning parameters
- tune::collect_metrics: summarize the metrics for each set of tuning parameters
- tune::select_best: select the best tuning parameters

Validation
- tune::finalize_workflow: replace the empty parameters of the model with the best tuned parameters
- parsnip::fit: fit the final model to the training data
- rsample::testing: extract the testing data from the initial split
- parsnip::predict: predict the trained model on the testing data

This is currently what I think is the simplest workflow to train models in tidymodels. This is of course a very simplified example which doesn't create tuning grids or tune parameters in the recipes. This is supposed to be the barebones workflow that is currently available in tidymodels. Having said that, I still think there are too many steps, which makes the workflow convoluted.

Thoughts on the workflow

tidymodels is currently being designed to be decoupled into several packages, and the key steps for modelling are already implemented. This offers greater flexibility for defining models, making some of the steps in modelling less obscure and more explicit. Having said that, there is too much to remember. dplyr::select is a function which is easy to remember because it can be thought of as an independent entity which I can use with a data.table or base R.
On top of that, I know it follows the general principle of the tidyverse where it only accepts a data frame and only returns a data frame. This makes it much more memorable. Due to its simplicity, it's easy to think of it like a hammer: I can apply it to so many different problems that I don't have to memorize it. It becomes a general tool that represents an abstract idea.

Some of the functions/packages from tidymodels are difficult to think of like that. I believe this is because they are supposed to be almost always used together, otherwise they have no practical applications. tune, workflows and parsnip introduce several ideas which I think are difficult to remember (mainly because you have to memorize them; they don't come naturally as an abstract concept).

workflows seems to be a package that combines some of the steps performed by parsnip and recipes, suggesting that you can build a logical workflow with it. However, workflows is introduced after you define your preprocessing and model. My intuition tells me that the workflow should come first rather than in the middle. For example, in pseudocode a logical workflow could look like this:

    ml_wflow <-
      ames %>%
      # Begin workflow
      workflow() %>%
      # No need to extract training/testing, they're already in the workflow
      # This eliminates the mental load of mixing up training/testing and
      # mistakenly predict one over the other.
      initial_split(prop = .75) %>%
      # Apply directly the cross-validation to the training set. No resaving
      # the data into different names, adding more and more objects to remember
      vfold_cv() %>%
      # Introduce preprocessing
      # No need to specify the data, the training data is already inside
      # the workflow. This simplifies having to specify your training
      # data in many different places (recipes, fit, vfold_cv). The data
      # was specified at the beginning and that's it.
      recipe(Sale_Price ~ Longitude + Latitude + Neighborhood) %>%
      step_log(Sale_Price, base = 10) %>%
      step_other(Neighborhood, threshold = 0.05) %>%
      step_dummy(recipes::all_nominal()) %>%
      # Add your model definition and include placeholders for your tuning
      # parameters
      linear_reg(penalty = tune(), mixture = tune()) %>%
      set_engine("glmnet")

I believe the code above is much more logical than the current setup for three reasons which are very much related to each other.

First, it follows the 'traditional' workflow of machine learning more clearly without intermediate steps. You begin with your data and add the key modelling steps one by one.

Second, it avoids creating too many intermediate objects, which add mental load. Whenever I'm using tidymodels I have to remember so many things: the training data, the cross-validated set, the recipe, the tuning grid, the model, etc. I often forget what I need to add to tune_grid: is it the recipe and the resample set? Is it the workflow? Did I mistakenly add the test set to the recipe and fit the data with the training set? It's very easy to get lost along the way.

And third, I think the workflow above fits the tidyverse philosophy much better, where you can read the steps from left to right, in a linear fashion.

The power of the pseudocode above is that the workflow object is treated as the holder of your whole pipeline from the beginning, meaning you can add or remove stuff from it.
For example, it would be very easy to add another model to be compared:

    ml_wflow <-
      ames %>%
      workflow() %>%
      initial_split(prop = .75) %>%
      vfold_cv() %>%
      recipe(Sale_Price ~ Longitude + Latitude + Neighborhood) %>%
      step_log(Sale_Price, base = 10) %>%
      step_other(Neighborhood, threshold = 0.05) %>%
      step_dummy(recipes::all_nominal()) %>%
      linear_reg(penalty = tune(), mixture = tune()) %>%
      set_engine("glmnet") %>%
      # Adds another model
      rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
      set_engine("rf")

The code above could also include additional steps for adding tuning grids for each model, and then a final call to fit would fit all models/tuning parameters directly to the cross-validated set. Additionally, since the original data is in the workflow, methods for fitting the best model to the complete training data could be implemented, as well as methods for running the best tuned model on the test data. No objects lying around to remember, and everything is unified into a bundle of logical steps which begin with your data.

This workflow idea doesn't introduce anything new programmatically in tidymodels: all ingredients are currently implemented. The idea is to rearrange specific methods to handle a workflow in this fashion. It is just a prototype and I'm sure that many things can be improved. I do think, however, that this is the direction which would make tidymodels a truly friendly interface. At least to me, it would make it as easy to use as the tidyverse.

Wrap-up

This post is intended to be a thought-provoking take on the current development of tidymodels. I'm a big fan of RStudio and their work and I'm looking forward to the "official release" of tidymodels. I wrote this piece with the intention of understanding the current workflow, but noticed that I'm not comfortable with it, nor did it come naturally.
I hope these ideas can help exemplify some of the bottlenecks that future tidymodels users can face, with the aim of improving the user experience of the modelling framework from tidymodels.
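For readers who want to run the barebones workflow from the cheatsheet end to end, here is a minimal sketch. It is not the code from the original post and rests on several assumptions: it uses the built-in mtcars data in place of the Ames data so that it is self-contained, all car_* object names are made up, and it is written against a recent tidymodels release (with glmnet installed), where select_best takes only the metric name.

```r
library(tidymodels)  # the glmnet package must also be installed

set.seed(2020)

# Data preparation: mtcars stands in for the Ames data (assumption)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)
car_cv    <- vfold_cv(car_train, v = 5)

# Preprocessing: formula plus centering and scaling of the predictors
car_rec <- recipe(mpg ~ wt + hp + disp, data = car_train) %>%
  step_normalize(all_predictors())

# Model: regularized regression with both parameters left for tuning
car_mod <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Workflow: bundle the recipe and the model
car_wflow <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(car_mod)

# Tuning: 10 candidate parameter sets evaluated on every resample
car_res <- tune_grid(
  car_wflow,
  resamples = car_cv,
  grid = 10,
  metrics = metric_set(rmse)
)

best_params <- select_best(car_res, metric = "rmse")

# Validation: plug in the best parameters and refit on the training data
final_fit <- car_wflow %>%
  finalize_workflow(best_params) %>%
  fit(data = car_train)

# Error on the held-out test data
test_rmse <- final_fit %>%
  predict(new_data = car_test) %>%
  bind_cols(car_test) %>%
  rmse(truth = mpg, estimate = .pred)

test_rmse
```

Even in this reduced form, the sketch creates nine intermediate objects before reaching the test error, which is the mental load discussed above.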