K Means Clustering in Python : Label the Unlabeled Data


Sometimes you have a dataset that is mostly unlabeled. The problems start when you want to structure the dataset and make it valuable by labeling it. In machine learning, there are various methods for labeling such datasets, and clustering is one of them. In this how-to tutorial, you will learn to do K Means clustering in Python.

K Means is a simple unsupervised clustering algorithm used to predict groups from an unlabeled dataset. In unsupervised machine learning, you don't supervise the model with known labels. Instead, the model finds the patterns in the dataset on its own and then assigns a cluster label to each unlabeled observation.

In K Means clustering, the predictions depend on two values.

1. The number of cluster centers, or centroids (k).

2. The nearest mean value between the observations.
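To make the idea of "nearest mean" concrete, here is a minimal NumPy sketch of the two steps K Means alternates between: assign each observation to its nearest centroid, then move each centroid to the mean of its observations. The function name kmeans_sketch and the toy points are made up for illustration; this is not how sklearn implements the algorithm.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Toy K Means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every observation to every centroid: shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations
        # (keep the old centroid if a cluster ends up empty).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated toy blobs (hypothetical points, not the Iris data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The cluster ids themselves are arbitrary: which blob gets 0 and which gets 1 depends on the random starting centroids.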

K Means clustering has many popular use cases, including price and cost modeling of a specific market, fraud detection, and portfolio or hedge fund management.

Before going into the details and the coding part of K Means clustering in Python, keep in mind that clustering should be done on scaled (standardized) variables, because K Means relies on distances and a variable with a large range would otherwise dominate. Scaled means that each variable has a mean of zero and a variance of one. The other thing to remember is to use a scatter plot or the data table to estimate the number of centroids, or cluster centers (k).
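As a quick check of what scaling does, the sketch below standardizes a small made-up array with sklearn's scale() function and compares it with the manual formula (subtract the column mean, divide by the column standard deviation). The numbers in raw are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import scale

# Two made-up variables on very different ranges.
raw = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0],
                [4.0, 800.0]])

# Standardize each column to zero mean and unit variance.
scaled = scale(raw)

# The same thing written out by hand.
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # close to 0 for each column
print(scaled.std(axis=0))   # close to 1 for each column
```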

I am using a Jupyter notebook, so to show the figures inline I call the statement %matplotlib inline.

I am loading the Iris dataset that ships with sklearn. You can also use your own dataset, but for the demonstration I am using the default one.

Here, data is the scaled data and target is the species of each observation.

Please note that data[0:10] returns the first ten rows as a plain NumPy array.
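The loading step can be sketched roughly as follows. The variable names data and target follow the text; using scale() from sklearn.preprocessing for the scaling is an assumption about how the scaled data was produced.

```python
# In a Jupyter notebook, run `%matplotlib inline` in its own cell first.
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

iris = datasets.load_iris()

# Scaled measurements (assumption: standardized with scale()).
data = scale(iris.data)
# Species of each observation, encoded as 0, 1, 2.
target = iris.target

# Slicing keeps the NumPy array type.
first_rows = data[0:10]
print(type(first_rows))
print(first_rows)
```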

In this step, you will build the K Means cluster model and call the fit() method on the dataset. After that, you will prepare the output for the data visualization.
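A minimal sketch of this step is shown here. Choosing n_clusters=3 matches the three Iris species; the values n_init=10 and random_state=5 are assumptions added only to make the run repeatable.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

iris = datasets.load_iris()
data = scale(iris.data)

# Build the model and fit it to the scaled data.
# n_init and random_state are illustrative choices, not required values.
model = KMeans(n_clusters=3, n_init=10, random_state=5)
model.fit(data)

# Cluster id assigned to each observation, and the learned centroids.
labels = model.labels_
centers = model.cluster_centers_
print(labels[:10])
print(centers.shape)
```

The labels_ array is what you would pass as the color argument of a scatter plot to visualize the clusters.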

The output of fit() shows that the KMeans() cluster method has been called. You can see that various arguments are defined inside the method, such as the type of algorithm and the number of clusters (n_clusters). You can read about them in the scikit-learn K-Means clustering documentation.
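If the notebook does not echo the arguments, one way to inspect them is the get_params() method that every sklearn estimator provides. The sketch below prints three of them; the default values you see depend on your sklearn version.

```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)

# get_params() returns the constructor arguments as a dictionary.
params = model.get_params()
for name in ("n_clusters", "algorithm", "max_iter"):
    print(name, "=", params[name])
```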

Both figures suggest that the model has predicted the clusters accurately. The only problem is that the clusters are mislabelled: the cluster ids the model assigns are arbitrary and do not line up with the species codes. To reassign the labels, we use the np.choose() method, changing the label positions from [0,1,2] to [2,0,1].
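The relabeling with np.choose() can be illustrated on a small made-up array of cluster ids. The mapping [2, 0, 1] is the one from the text; the right mapping for your own run may differ, since it depends on how the clusters happened to be numbered.

```python
import numpy as np

# Hypothetical cluster ids as a model might return them.
cluster_labels = np.array([0, 0, 1, 1, 2, 2])

# Position i in the list is the new label for cluster id i:
# 0 becomes 2, 1 becomes 0, 2 becomes 1.
relabeled = np.choose(cluster_labels, [2, 0, 1])
print(relabeled)  # [2 2 0 0 1 1]

# Equivalent lookup with plain array indexing.
lookup = np.array([2, 0, 1])
print(lookup[cluster_labels])
```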

In the last step, you will verify the accuracy of the model's results. To do so, you use the sklearn classification report.

Before verifying the results, you should know the following terms.

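Precision: Measures the relevancy of the model, that is, the fraction of observations assigned to a class that truly belong to it.

Recall: Measures the completeness of the model, that is, the fraction of observations truly in a class that the model assigned to it.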

In our case, the average precision is 83% and the average recall is 83% over the entire dataset. From these results, you can say our model is giving reasonably accurate results for a method that never saw the true labels.
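A rough sketch of the verification step follows. Instead of a hard-coded mapping, it relabels each cluster with the species that occurs most often inside it; this majority-vote mapping is a swapped-in convenience so the example works regardless of how the clusters are numbered. The n_init and random_state values are assumptions, and the exact figures in the report can vary with the sklearn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from sklearn.preprocessing import scale

iris = datasets.load_iris()
data = scale(iris.data)
target = iris.target

model = KMeans(n_clusters=3, n_init=10, random_state=5).fit(data)

# Map each cluster id to the most common species among its members.
mapping = np.array([
    np.bincount(target[model.labels_ == c]).argmax() for c in range(3)
])
predicted = mapping[model.labels_]

# Per-class and average precision and recall against the true species.
report = classification_report(target, predicted,
                               target_names=iris.target_names)
print(report)
```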

The K Means clustering model is a popular way of clustering datasets that are unlabelled. In the real world, you will get large datasets that are mostly unstructured, and you will use machine learning algorithms to turn them into structured datasets. There are also other types of clustering methods. The type of clustering algorithm you choose will depend on the dataset.

I hope you now understand the K Means clustering algorithm. If you need any help from our side, you can message us directly on the Data Science Learn Page. We are always ready to help you.
