This article was written by Claudia Perlich. Claudia is Chief Scientist at Distillery and an Adjunct Professor at NYU. She is also a Data Scientist at Quora.
First off, let me state what I think is NOT the problem: the fact that data scientists spend 80% of their time on data preparation. That is their JOB! If you are not good at data preparation, you are NOT a good data scientist. It is not a "janitor" problem, as Steve Lohr provocatively put it. The validity of any analysis rests almost entirely on the preparation; the algorithm you end up using is close to irrelevant. Complaining about data preparation is like being a farmer who complains about having to do anything but harvesting and would rather have somebody else deal with the pesky watering, fertilizing, weeding, etc.
This being said - data preparation can be made difficult by the process of raw data collection. Designing a system that collects data in a form that is useful and easily digestible by data science is a high art. Providing full transparency to DS on how exactly the data flows through the system is another. It involves processes such as sampling, data annotation, and matching; it does not include things like replacing missing values or excessive normalization. Creating an effective data environment for DS needs to involve DS and cannot be entirely owned by engineering, because DS is often NOT able to spec such system requirements in sufficient detail to allow for a clean handover.
But in the bigger picture, there are more important things to consider. By far the biggest issue I see is data science solving irrelevant problems. This is a huge waste of time and energy. The reason is typically that whoever has the problem lacks the data science understanding to even express it, and data scientists end up solving whatever they understood the problem might be, ultimately creating a solution that is not really helpful (and often far too complicated)...