The Data Daily

Supplier Name Standardization using Unsupervised Learning

The primary application of unsupervised learning in spend analysis is vendor name normalization, in which variant vendor names are clustered together. Large companies that account for a significant share of your spend often appear under several different names across your data systems.

Aggregating these variants under a single name shows how much spend goes to each supplier, so that you can identify your key suppliers.

[ORACLE, ORACLE AMERICA INC, ORACLE CORPORATION, ORACLE FINANCIAL SERVICES, ORACLE USA INC] becomes ORACLE

If you want to skip the hassle, you can find the full code here: https://github.com/rahulissar/ai-supply-chain

To implement this, we used standard text pre-processing techniques to clean the vendor names and remove unwanted words.

After cleaning the vendor data you will get something like this:
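A minimal sketch of such a cleaning step is given here, assuming the "unwanted words" are mostly legal-entity suffixes. The `LEGAL_SUFFIXES` list and the `clean_vendor_name` helper are hypothetical names chosen for illustration, not taken from the linked repository.

```python
import re

# Hypothetical stop-word list (an assumption); extend it for your own data.
LEGAL_SUFFIXES = {"INC", "LLC", "LTD", "CORP", "CORPORATION", "CO", "LIMITED", "GMBH"}


def clean_vendor_name(raw: str) -> str:
    """Upper-case the name, strip punctuation, and drop legal-entity suffixes."""
    # Replace anything that is not a letter, digit, or space with a space.
    text = re.sub(r"[^A-Z0-9 ]+", " ", raw.upper())
    tokens = text.split()
    kept = [t for t in tokens if t not in LEGAL_SUFFIXES]
    # If every token was a suffix, fall back to the unfiltered tokens.
    return " ".join(kept or tokens)


raw_names = ["Oracle America, Inc.", "ORACLE CORPORATION", "Oracle USA Inc"]
cleaned = [clean_vendor_name(name) for name in raw_names]
print(cleaned)  # ['ORACLE AMERICA', 'ORACLE', 'ORACLE USA']
```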

Normally, we’d use TF-IDF or a count vectorizer to generate vectors from textual data. These vectors tell the model how important a term is across our documents. We’d then use them to build a similarity matrix that captures how similar the vectors are to each other.
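For comparison, the conventional vector-based route might look roughly like the sketch below. It uses plain token counts and cosine similarity from the standard library. The helper names `count_vector` and `cosine_similarity` are assumptions made for illustration, and a TF-IDF variant would additionally down-weight tokens that appear in many names.

```python
import math
from collections import Counter


def count_vector(name: str) -> Counter:
    """Bag-of-words vector: token -> number of occurrences."""
    return Counter(name.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


names = ["ORACLE", "ORACLE AMERICA", "MICROSOFT"]
vectors = [count_vector(n) for n in names]

# similarity[i][j] is the cosine similarity between names[i] and names[j].
similarity = [[cosine_similarity(u, v) for v in vectors] for u in vectors]
for name, row in zip(names, similarity):
    print(name, [round(x, 3) for x in row])
```

Token-level vectors treat a misspelling such as ORCALE as an entirely different token from ORACLE, which is one reason a character-level metric can suit name matching better.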

For this model, we instead used Levenshtein distance as the similarity metric, because our use case requires aggregating names that are spelled similarly. This metric counts the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other, so a smaller distance means more similar names.
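To make the metric concrete, here is a small pure-Python sketch of Levenshtein distance using the standard dynamic-programming recurrence, followed by a similarity matrix built from it. Negating the distance so that larger values mean "more similar" is an assumption made here for illustration. The original code may instead rely on a library implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


names = ["ORACLE", "ORACLE USA", "ORACLE AMERICA"]

# Negated distances: 0 on the diagonal, more negative for less similar names.
similarity = [[-levenshtein(a, b) for b in names] for a in names]

print(levenshtein("kitten", "sitting"))  # 3
print(similarity[0][1])                  # -4 (append " USA" to "ORACLE")
```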

Once we have the similarity matrix and cleaned vendor names, we can feed them into a clustering model that groups similar vendor names together.

For this use case, we used the Affinity Propagation algorithm. Unlike traditional clustering methods such as k-means, Affinity Propagation does not require you to specify the number of clusters in advance.
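A sketch of this clustering step, assuming scikit-learn and NumPy are installed, could look like the following. It passes a precomputed matrix of negated Levenshtein distances to `AffinityPropagation`. The sample names, the `damping` value and the `random_state` are illustrative choices, not settings taken from the original code.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


# Illustrative cleaned names (an assumption, not the article's dataset).
names = ["ORACLE", "ORACLE USA", "ORACLE AMERICA",
         "MICROSOFT", "MICROSOFT IRELAND", "MICROSOFT CORP"]

# Affinity Propagation expects similarities, so distances are negated.
similarity = np.array(
    [[-levenshtein(a, b) for b in names] for a in names], dtype=float
)

# affinity="precomputed" tells scikit-learn the input is already a
# similarity matrix; the number of clusters is not specified anywhere.
model = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
model.fit(similarity)
labels = model.labels_  # all -1 if the algorithm failed to converge

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)

for label, members in clusters.items():
    print(label, members)
```

How many clusters come out depends on the data and on the `preference` parameter, which defaults to the median of the input similarities. Lower values tend to produce fewer clusters.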

For each cluster identified, we compute the longest common substring of every pair of cleaned vendor names. We then take the substring that occurs most often (the mode) and assign it as the standard name for that cluster. This process is repeated for every cluster in the dataset.
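The naming step described above can be sketched with the standard library alone. The function names `longest_common_substring` and `standard_name` are hypothetical. Stripping whitespace from each substring and falling back to the only member of a single-name cluster are assumptions added here so the example behaves sensibly.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous block of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size].strip()


def standard_name(cluster: list) -> str:
    """Most frequent pairwise longest common substring in the cluster."""
    substrings = [longest_common_substring(a, b)
                  for a, b in combinations(cluster, 2)]
    substrings = [s for s in substrings if s]  # ignore empty matches
    if not substrings:
        return cluster[0]  # single-name cluster or nothing in common
    return Counter(substrings).most_common(1)[0][0]


cluster = ["ORACLE", "ORACLE AMERICA", "ORACLE CORPORATION",
           "ORACLE FINANCIAL SERVICES", "ORACLE USA"]
print(standard_name(cluster))  # ORACLE
```

Calling `standard_name` once per cluster yields the mapping from each variant to its standardized name.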
