In a recent interview I was asked one of those routine questions that every fresher applying for a data science role gets asked: “What kind of algorithm is logistic regression?” I answered: classification. My own answer triggered a question in my mind (or maybe my allegedly broken sense of humour did): “if it’s a classification algo, why does it have regression in its name?”
Regression and classification algorithms both belong to the supervised learning domain. To understand them in beginner-level terms —
Regression: Algorithms used to predict continuous values from a dataset. Examples: the number of car bookings for the upcoming quarter, the effect of fertiliser and water on crop yield, trends in futures and options and the movement of a stock, even the efficiency of a medical procedure. In other words, it’s the next stop after correlation: we use one or more features to predict a continuous value.
Classification: As the name suggests, it is used to classify: whether an e-mail is spam or not, a tumour is benign or malignant, a credit card transaction is fraudulent or not.
In this article we’re going to focus on logistic regression and understand it by running LogisticRegression() on a credit card fraud detection dataset acquired from Kaggle.
Linear regression, as previously stated, predicts a continuous value based on the relationship between dependent and independent variables. If Y (dependent variable) is a function of X (independent variable), then Y is represented as:

Y = β₀ + β₁X + ε

where β₀ is the intercept, β₁ the weight on X, and ε the error term. Using this mathematical expression, a “best fit line” is calculated, and we predict a real-valued output y as a weighted sum of the input variables.
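As a quick illustration, here is a minimal linear regression fit on synthetic data (the data, the true slope of 3 and intercept of 5 are made up for the example; they have nothing to do with the article’s dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
# The learned coefficients approximate the true slope (3) and intercept (5)
print(model.coef_[0], model.intercept_)
```

The fitted `coef_` and `intercept_` are exactly the β₁ and β₀ of the “best fit line” above.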
Now, using the same data, what if we were asked to determine which class each value belongs to? First, we’d have to create and define classes among the values. Our output in that case would be binary: either yes or no, either 0 or 1. Instead of predicting a numerical dependent variable, we’d predict (classify) a categorical dependent variable.
The logistic (sigmoid) function squashes the weighted sum of inputs into a probability between 0 and 1. In the above figure, let’s consider the threshold as 0.5. Hence, if the predicted probability is greater than or equal to 0.5 we classify the input as 1, otherwise as 0.
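That thresholding rule can be sketched in a few lines (the helper names `sigmoid` and `classify` are mine, for illustration only):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def classify(z, threshold=0.5):
    # Decision rule: probability >= threshold -> class 1, else class 0
    return (sigmoid(z) >= threshold).astype(int)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))   # probabilities between 0 and 1
print(classify(z))  # [0 1 1]
```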
Now, let’s build a classifier to detect if a credit card transaction is fraudulent or not.
The description of the data is available on Kaggle. Let’s move ahead —
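A minimal loading step might look like this, assuming the Kaggle file is named `creditcard.csv` (the helper name `load_transactions` is my own, not from the dataset):

```python
import pandas as pd

def load_transactions(path: str) -> pd.DataFrame:
    # The Kaggle dataset ships as a single CSV. Features V1..V28 are
    # PCA-transformed; 'Time', 'Amount' and the target 'Class' are kept
    # in their original form.
    return pd.read_csv(path)

# Assumed usage:
# df = load_transactions('creditcard.csv')
# df.head()
```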
The Class feature is binary: 0 for non-fraudulent and 1 for fraudulent. As can be seen, the data is heavily skewed, as the vast majority of cases are non-fraudulent. Class is the Y (dependent variable) in this scenario. Training on such imbalanced data will lead to misleading results from the classifier, since it can score high accuracy simply by predicting the majority class almost every time. Let’s level the playing field.
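One quick way to see the skew is to count the values of Class (the helper name `class_balance` is an assumption for illustration; run it on the loaded dataset to see the real counts):

```python
import pandas as pd

def class_balance(df: pd.DataFrame) -> pd.Series:
    # 'Class' is the target: 0 = non-fraudulent, 1 = fraudulent
    return df['Class'].value_counts()

# Assumed usage on the loaded dataset:
# class_balance(df)
```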
To level the playing field, we create a sub-sample with an equal number of fraudulent and non-fraudulent cases. In problems like fraud detection it’s also a good habit to check whether there are any correlations. Let’s see what Amount has to show us.
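A random-undersampling sketch along those lines (the function name `balanced_subsample` and the fixed seed are my choices, not from the article):

```python
import pandas as pd

def balanced_subsample(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    # Keep every fraudulent row, then draw an equal number of
    # non-fraudulent rows at random
    fraud = df[df['Class'] == 1]
    non_fraud = df[df['Class'] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so the two classes are interleaved
    return pd.concat([fraud, non_fraud]).sample(frac=1, random_state=seed)
```

Undersampling throws away data, but on a dataset this skewed it gives the classifier an even split of both classes to learn from.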
It’s different from the usual scatter plots we see, but it does tell us something: most fraudulent transactions involve amounts between 0 and 2000.
When implementing classifiers, we should also check where the model goes wrong, not just how often it is right. To understand that, we refer to the confusion matrix.
Well, it looks like our model classified 8 transactions as non-fraudulent even though they were fraudulent (false negatives). And our accuracy score is 93.4%.
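The train-and-evaluate step can be sketched as follows. Since the real Kaggle data isn’t reproduced here, the example uses synthetic two-feature data, so the counts and accuracy it prints won’t match the article’s 93.4%:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic stand-in for the balanced subsample: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows = actual class, columns = predicted class;
# cm[1, 0] counts fraud cases misclassified as non-fraud (false negatives)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
```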
Logistic regression is emphatically not a classification algorithm on its own. It becomes a classification algorithm only in combination with a decision rule that dichotomises the predicted probabilities of the outcome. Logistic regression is a regression model because it estimates the probability of class membership, modelling the log-odds of the outcome as a linear function of the features.