Chasing the Elusive Machine Learning Platform

The core of this insight is the realization that ML and data science projects are nothing like typical application or hardware development projects. Whereas hardware and software development has traditionally aimed to produce functionality that individuals or businesses could run and control on their own, data science and ML projects are really about managing data, continuously evolving the learning gleaned from that data, and iterating on data models as the underlying data changes. Typical development processes and platforms simply don't work from this data-centric perspective.

It should be no surprise, then, that technology vendors of all sizes are focused on developing platforms that data scientists and ML project managers will depend on to develop, run, operate, and manage their data models for the enterprise. These vendors see the ML platform of the future playing the role that the operating system, the cloud environment, or the mobile development platform has played in the past and present: whoever dominates market share for data science and ML platforms will reap rewards for decades to come. As a result, everyone with a stake in this fight is vying to own a piece of the market.

However, what does a Machine Learning platform actually look like? How is it the same as, or different from, a Data Science platform? What are the core requirements for ML platforms, who are their users, and what do those users really want? Let's dive deeper.

In our earlier newsletter piece on Data Scientists vs. Data Engineers, we talked a bit about what data scientists do and what they want to accomplish with the technology that supports their missions. In summary, data scientists are tasked with wrangling useful information from a sea of data and translating business and operational information needs into the language of data and math. Data scientists need to be masters of statistics, probability, mathematics, and the algorithms that glean useful insights from huge piles of information. A data scientist is a scientist: they form hypotheses, run tests and analyses against the data, and then translate their results so that someone else in the organization can easily view and understand them. So it follows that a pure data science platform would help craft data models, determine how well the data fits a hypothesis, test that hypothesis, facilitate collaboration among teams of data scientists, and help manage and evolve the data model as information continues to change.
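To make that hypothesize-test-translate loop concrete, here is a minimal sketch in Python; the data file, column names, and the choice of a t-test are illustrative assumptions, not a prescription from any particular platform.

```python
# Minimal sketch of a data scientist's hypothesis-testing loop.
# The file name and the "group"/"conversion" columns are hypothetical examples.
import pandas as pd
from scipy import stats

df = pd.read_csv("experiment_results.csv")  # hypothetical source data

# Hypothesis: users in the "treatment" group convert at a different rate
# than users in the "control" group.
control = df.loc[df["group"] == "control", "conversion"]
treatment = df.loc[df["group"] == "treatment", "conversion"]

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Translate the statistical result into a plain-language summary
# that someone else in the organization can act on.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The treatment group's conversion rate differs significantly from control.")
else:
    print("No statistically significant difference detected at the 5% level.")
```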

Furthermore, data scientists don't do their work in code-centric Integrated Development Environments (IDEs), but rather in notebooks. First popularized by academically oriented, math-centric platforms like Mathematica and Matlab, and now prominent in the Python, R, and SAS communities, notebooks are used to document data research and simplify reproducibility of results by allowing the same notebook to be re-run on different source data. The best notebooks are shared, collaborative environments where groups of data scientists can work together and iterate models over constantly evolving data sets. While notebooks don't make great environments for developing production code, they make great environments for collaborating on, exploring, and visualizing data, and data scientists use them to quickly explore large data sets, assuming sufficient access to clean data.
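As a rough illustration of that reproducibility, a notebook cell can be parameterized so that only the source-data path changes between runs; the file name and column names below are hypothetical.

```python
# Notebook-style cell: the analysis is parameterized by DATA_PATH so the same
# notebook can be re-run against a different source data set unchanged.
# DATA_PATH and the "region"/"revenue" columns are hypothetical examples.
import pandas as pd
import matplotlib.pyplot as plt

DATA_PATH = "sales_2023.csv"  # swap this for another file to reproduce on new data

df = pd.read_csv(DATA_PATH)

# Quick exploration: summary statistics plus a visualization for collaborators.
print(df.describe())
df.groupby("region")["revenue"].sum().plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.show()
```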

Data scientists can't perform their jobs effectively without access to large volumes of clean data, yet extracting, cleaning, and moving data is not really the role of a data scientist, but rather that of a data engineer. Data engineers are challenged with taking data from a wide range of systems, in structured and unstructured formats, data that is usually not "clean", with missing fields, mismatched data types, and other quality issues, and turning it into the consistent, usable data sets that data scientists depend on.
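A minimal sketch of the kind of cleanup a data engineer might perform before handing data off, assuming pandas and using hypothetical source files and field names, could look like this.

```python
# Minimal data-cleaning sketch; file names, column names, and rules are hypothetical.
import pandas as pd

raw = pd.read_csv("crm_export.csv")  # one of many upstream systems

cleaned = (
    raw.rename(columns=str.lower)  # normalize inconsistent field names
       .assign(signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"))
       .dropna(subset=["customer_id"])          # drop rows missing the key field
       .drop_duplicates(subset=["customer_id"]) # remove duplicate records
)

# Coerce a numeric column that arrives as text in some source systems.
cleaned["lifetime_value"] = pd.to_numeric(cleaned["lifetime_value"], errors="coerce").fillna(0.0)

cleaned.to_parquet("customers_clean.parquet")  # hand-off for data scientists
```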
