
The Data Daily

Interesting Application of the Zipf Distribution: Data Purging

The Zipf distribution is used to model situations in which a few observations have a very high value (or impact) and account for a large part of the total, while a very long tail of observations have medium, small, or very small values. It is similar in spirit to the 80/20 rule. Examples include:

Here we use it to model the distribution of files (or data) on a laptop or in the cloud (for a specific company), ranked by size. The point of computing this distribution for, say, your laptop is to identify the files that can be deleted to save as much space as possible. Many users have a very large number of files on their computer, many of them of no use, slowly eating up the gigantic amount of space available for storage. In short, this is a simple, applied, and very practical data storage optimization problem. We also discuss this problem in the context of optimizing the resources used to store user data on large social networks such as Facebook or LinkedIn.
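The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the author's procedure: the function names `rank_files_by_size` and `files_covering`, and the default 80% threshold, are assumptions chosen for the example. The first function walks a directory and ranks files by size; the second returns the shortest list of top-ranked files that together account for a given share of the total space.

```python
import os
from pathlib import Path


def rank_files_by_size(root):
    """Return (path, size_in_bytes) pairs for files under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.is_symlink():
                    continue  # skip symlinks so a target is not counted twice
                sizes.append((str(path), path.stat().st_size))
            except OSError:
                continue  # unreadable or vanished file: ignore it
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes


def files_covering(ranked, fraction=0.8):
    """Return the shortest prefix of `ranked` whose sizes reach `fraction` of the total."""
    total = sum(size for _, size in ranked)
    if total == 0:
        return []
    selected = []
    running = 0
    for path, size in ranked:
        selected.append((path, size))
        running += size
        if running >= fraction * total:
            break
    return selected
```

If file sizes do follow a Zipf-like distribution, `files_covering(rank_files_by_size(some_folder))` should return a short list relative to the total number of files, and that list is the natural place to start reviewing candidates for deletion.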

The data purging process is simple and consists of three steps:

You can back up all these files before deleting them.

For large social networks, data purging consists of identifying inactive accounts or profiles, which may represent 60% of all members. For instance, Facebook has far more US profiles than there are inhabitants in the US. It also involves identifying fake and duplicate accounts, and consolidating duplicate accounts that are otherwise valid.
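As a rough illustration of the first of these tasks, the sketch below splits accounts into active and inactive based on the time since their last recorded activity. Everything here is hypothetical: the `split_by_activity` name, the dictionary of last-activity timestamps, and the one-year default threshold are assumptions for the example, not how any real social network defines inactivity.

```python
from datetime import datetime, timedelta


def split_by_activity(last_active, now, max_idle_days=365):
    """Split account ids into (active, inactive) lists.

    last_active maps an account id to the datetime of its last activity.
    An account is inactive if it has been idle longer than max_idle_days.
    """
    cutoff = now - timedelta(days=max_idle_days)
    active, inactive = [], []
    for account_id, seen in last_active.items():
        if seen >= cutoff:
            active.append(account_id)
        else:
            inactive.append(account_id)
    return active, inactive
```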

We tend to think that the storage space (and Internet bandwidth) available to these companies is infinite, or that storage is so cheap that it does not matter. However, overloaded servers result in errors and slow web page loading. They also force these companies to put limits on user connection graphs: on Facebook, you can have only 5,000 friends; on LinkedIn, only 30,000 connections (I reached my limit).

Another way for social networks to manage these large "constellations" of people is to offer premium services to members. LinkedIn could charge $20/month once you reach 5,000 connections, and maybe $100/month when you reach 30,000. This would free up some space and make people more careful when accepting invitations to connect, creating a better platform: a win-win for both LinkedIn and its members.

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on Twitter at @GranvilleDSC or on LinkedIn.
