From Data Pre-processing to Optimizing a Regression Model Performance
 
Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.
— Andrew Ng, Stanford University (source)
 
Introduction
 
Machine learning (ML) helps in finding complex and potentially useful patterns in data. These patterns are fed to a Machine Learning model that can then be used on new data points — a process called making predictions or performing inference.
Building a Machine Learning model is a multistep process. Each step presents its own technical and conceptual challenges. In this article, we are going to focus on the process of selecting, transforming, and augmenting the source data to create powerful predictive signals for the target variable (in supervised learning). These operations combine domain knowledge with data science techniques. They are the essence of feature engineering.
This article explores the topics of data engineering and feature engineering for machine learning (ML). This first part discusses best practices for preprocessing data for a regression model. The article focuses on using Python’s pandas and scikit-learn libraries to prepare the data, train the model, and serve it for prediction.
 
Let us start with Data pre-processing…
 
1. What is Data Pre-processing and why is it needed?
 
Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Data preprocessing is a proven method of resolving such issues.
1.1) Steps in Data Preprocessing
Step 1: Import the libraries
Step 2: Import the data-set
Step 3: Check out the missing values
Step 4: Encode the Categorical data
Step 5: Splitting the dataset into Training and Test set
Step 6: Feature scaling
Let’s discuss all these steps in detail.
Step 1: Import the libraries
A library is also a collection of implementations of behavior, written in terms of a language, that has a well-defined interface by which the behavior is invoked. For instance, people who want to write a higher-level program can use a library to make system calls instead of implementing those system calls over and over again. —  Wikipedia
We need to import three essential Python libraries.
1. NumPy is the fundamental package for scientific computing with Python.
2. Pandas is for data manipulation and analysis.
3. Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Step 2: Import the data-set
Data is imported using the pandas library.
data = pd.read_csv('/path_of_your-dataset/Data.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, 3].values
Here, X represents a matrix of independent variables and y represents a vector of the dependent variable.
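To make the slicing concrete, here is a minimal sketch on a hypothetical miniature dataset (the column names and values are made up purely for illustration; the real Data.csv may differ):
import pandas as pd

# Hypothetical miniature dataset: three features plus the target in the last column
data = pd.DataFrame({
    'Country':   ['France', 'Spain', 'Germany'],
    'Age':       [44, 27, 30],
    'Salary':    [72000, 48000, 54000],
    'Purchased': ['No', 'Yes', 'No'],
})

X = data.iloc[:, :-1].values   # all rows, every column except the last -> matrix of features
y = data.iloc[:, 3].values     # all rows, column index 3 (the last column) -> target vector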
Step 3: Check out the missing values
There are two ways by which we can handle missing values in our dataset. The first method is commonly used to handle null values: we either delete a particular row if it has a null value for a particular feature, or delete a particular column if more than 75% of its values are missing. This method is advised only when there are enough samples in the data set, and one has to make sure that deleting the data does not introduce bias. A small sketch of this approach is shown below.
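As a minimal illustration of the deletion approach (this code is not part of the original article; the 75% threshold is just the rule of thumb mentioned above), pandas offers dropna:
import pandas as pd

# Drop every row that contains at least one null value
data_rows_dropped = data.dropna(axis=0, how='any')

# Drop every column in which more than 75% of the values are missing,
# i.e. keep a column only if at least 25% of its values are present
min_non_null = int(0.25 * len(data))
data_cols_dropped = data.dropna(axis=1, thresh=min_non_null)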
In the second method, we replace all the NaN values with the mean, the median, or the most frequent value of the column. This is an approximation that can add variance to the data set, but it avoids losing data and often yields better results than removing rows and columns. One caveat: if these replacement statistics are computed on the full dataset before splitting it into training and test sets, information from the test set can leak into training, which is known as data leakage.
For dealing with missing data, we will use the Imputer class from the sklearn.preprocessing package. Instead of the mean, you can also provide the median or the most frequent value in the strategy parameter.
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
The next step is to fit the imputer instance on the data stored in X (the predictors).
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
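Note that Imputer has since been deprecated and removed from scikit-learn; in recent versions the equivalent class is SimpleImputer from sklearn.impute. A minimal sketch of the same step with the newer API:
import numpy as np
from sklearn.impute import SimpleImputer

# Same mean imputation as above, using the newer scikit-learn API
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])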
Step 4: Encode the Categorical data
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
Some examples include:
A “pet” variable with the values: “dog” and “cat”.
A “color” variable with the values: “red”, “green” and “blue”.
A “place” variable with the values: “first”, “second” and “third”.
Each value represents a different category.
 
Note: What is the Problem with Categorical Data?
 
Some algorithms can work with categorical data directly. But many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.
In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves. This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
We are going to use a technique called label encoding. Label encoding is simply converting each value in a column to a number. For example, the body_style column contains 5 different values. We could choose to encode it like this:
convertible -> 0
sedan -> 3
wagon -> 4
To implement label encoding, we will import LabelEncoder from the sklearn.preprocessing package. It simply labels the categories as 0, 1, 2, 3, and so on. Note that these integer codes imply an ordering (0 < 1 < 2 < ...) that the categories may not actually have, so a model can misinterpret them.
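A minimal sketch of that step (assuming, hypothetically, that the categorical column sits at index 0 of X):
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])   # e.g. 'red'/'green'/'blue' -> 0/1/2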
Backward Elimination, which we will use to build an optimal regression model, proceeds in the following steps:
Step-1: Select a significance level (SL), here SL = 0.05.
Step-2: Fit the model with all possible predictors.
Step-3: Consider the predictor with the highest p-value. If p-value > SL, go to Step-4; otherwise the model is ready.
Step-4: Remove the predictor.
Step-5: Fit the model without this variable, then go back to Step-3.
Here, the significance level and p-value are statistical terms; just remember them for now, as we do not want to go into the details. Note that our Python libraries will provide these values for our independent variables.
Coming back to our scenario, as we know that multiple linear regression is represented as :
y = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn
we can also represent it as
y = b0X0 + b1X1 + b2X2 + b3X3 + ... + bnXn, where X0 = 1
We have to add one column with all 50 values equal to 1 (one per observation) to represent X0, so that the intercept term b0X0 is included.
import statsmodels.api as sm
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
The statsmodels Python library provides an OLS (ordinary least squares) class that we will use for implementing Backward Elimination. One thing to note: the OLS class does not add an intercept by default, so it has to be added by the user. That is why we created a column with all 50 values equal to 1 to represent X0 in the previous step.
In the first step, let us create a variable X_opt that will contain only the statistically significant variables (those with a real impact on the dependent variable). To find them, we start by considering all the independent variables and, in each step, remove the variable with the highest p-value.
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
Here, endog is the dependent variable (y) and exog is the matrix of predictors (X_opt).
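The summary() output lists each predictor's p-value. The fitted results object also exposes them programmatically, which is what we compare against the significance level (a small sketch using the regressor_OLS fitted above):
# p-values of the predictors, in the same order as the columns of X_opt
print(regressor_OLS.pvalues)

# index of the predictor with the highest p-value (the candidate for removal)
print(np.argmax(regressor_OLS.pvalues))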
Result-1
We can remove index x2, as it has the highest p-value, and then repeat the process.
Once more, index x2 has the highest p-value, so we remove it and repeat the process.
X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
Result-4
 
Now, as per the rule, we have to remove index x2 because its p-value is above the significance level, even though it is very close to the significance level of 0.05. We will go ahead strictly with the rule and remove index x2; hold on to this, as we will come back to it in the next section.
We repeat the process once more.
X_opt = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
Result-5
 
Now all the remaining variables are under the significance level of 0.05. This means that the only variable left, x1 (R&D Spend), has the highest impact on the profit and is statistically significant.
Congratulations!!! We have just created an optimal regressor model using the Backward Elimination method.
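As an aside, the same manual procedure can be automated. Below is a minimal sketch (not from the original article; the helper name backward_elimination is made up) that repeatedly drops the predictor with the highest p-value until every remaining p-value is at or below the significance level:
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    # Start with every column of X, including the column of 1's added for the intercept
    cols = list(range(X.shape[1]))
    while True:
        results = sm.OLS(endog=y, exog=X[:, cols]).fit()
        worst = int(np.argmax(results.pvalues))      # predictor with the highest p-value
        if results.pvalues[worst] > sl:
            del cols[worst]                           # Step-4: remove it, then refit (Step-5)
        else:
            return cols, results                      # all remaining p-values are <= SL

# Example usage (X must already contain the column of 1's):
# kept_columns, final_model = backward_elimination(X, y)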
Now let’s make our model more robust by considering some more metrics like R-squared and Adj. R-squared.
 
4. Fine-tune our optimal Regressor Model
 
Before we start tuning our model, let’s get familiar with two important concepts.
 
4.1) R-squared
 
It is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination or coefficient of multiple determination.
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
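In formula form, R-squared = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals of the fitted model and SS_tot is the total sum of squares of the response around its mean.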
There are two major problems with R-squared. First, if we add more predictors, R-squared will always increase, because OLS never lets R-squared decrease when a predictor is added. Second, if a model has too many predictors and higher-order polynomial terms, it begins to model the random noise in the data; this is called overfitting and produces misleadingly high R-squared values and a reduced ability to make predictions.
That is why we need Adjusted R-squared.
 
4.2) Adjusted R-squared
 
The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors.
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.
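For reference, Adjusted R-squared can be computed from R-squared as
Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1),
where n is the number of observations and p is the number of predictors (excluding the intercept).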
So, coming back to our original model: there was some confusion in Result-4 about whether to remove x2 or retain it.
Unlike R-squared, Adjusted R-squared decreases when a predictor that does not genuinely improve the model is added (and increases when one that does is added), so it is a better guide to whether our model fits better.
If we look closely at all the snapshots above, Adjusted R-squared keeps increasing up to Result-4, but it decreases when x2 is removed (see Result-5), which should not have happened.
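Both values can also be read directly from the statsmodels results object instead of the printed summary (a small sketch using the regressor_OLS fitted above):
print(regressor_OLS.rsquared)       # R-squared of the fitted model
print(regressor_OLS.rsquared_adj)   # Adjusted R-squared of the fitted model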
 
Conclusion
 
So the final takeaway is that the last step of the Backward Elimination procedure should not have been performed. We are left with R&D Spend and Marketing Spend as the final predictors.
X_opt = X[:, [0, 3, 5]]
This should be used as the matrix of independent variables instead of taking all the independent variables. Here, the index 0 represents a column of 1’s that we added.
That’s all for this article. I hope you have enjoyed reading it; let me know your views, suggestions, or questions in the comment section.
You can also reach out to me on LinkedIn with any query.
Thanks for reading !!!
 
Bio: Nagesh Singh Chauhan is a Data Science enthusiast. Interested in Big Data, Python, Machine Learning.
Original. Reposted with permission.
