Embrace machine learning and the cloud, don’t rely on data warehouses and legacy tech to solve your problems, and other advice from a data analytics expert.
Big data is a promising investment for firms, but embracing data can also bring confusion and potential minefields — everything from where companies should be spending money to how they should be staffing their data teams.
MIT adjunct professor Michael Stonebraker, a computer scientist, database research pioneer, and Turing Award winner, said he sees several things companies should do to build their data enterprises, and, just as importantly, mistakes they should stop making or avoid altogether.
In a talk last fall as part of the 2019 MIT Citi Conference, Stonebraker borrowed a page from David Letterman to count down 10 big data blunders he has seen in the last decade or so. His (sometimes opinionated!) advice comes from discussions with tech and data executives during his decades in the field, as well as from his work with several data startups.
Blunder #1: Not moving everything to the cloud
Companies should be moving their data out of the building, either into a public cloud or onto a purchased private cloud, Stonebraker said. Why? Firms like Amazon offer cloud storage at a fraction of the cost, with better infrastructure, often tighter security, and staff who manage clouds for a living.
“They're deploying servers by the millions; you're deploying them by the tens of thousands,” Stonebraker said. “They're just way further up the cost curve and are offering huge economies of scale.”
Clouds also offer elasticity: your company can spin up a thousand servers to run end-of-the-month numbers, then scale back to a much smaller fleet for everyday tasks.
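As a concrete sketch of that elasticity, the snippet below resizes an AWS Auto Scaling group with boto3. The group name "reporting-fleet" and the fleet sizes are hypothetical; the `set_desired_capacity` call is a real boto3 API.

```python
# Sketch: burst a cloud fleet up for month-end reporting, then shrink it.
# Assumes AWS credentials are configured and an Auto Scaling group named
# "reporting-fleet" exists (the group name is hypothetical).
import boto3

autoscaling = boto3.client("autoscaling")

def set_fleet_size(group_name: str, desired: int) -> None:
    """Resize the fleet; AWS launches or terminates instances to match."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # resize immediately, ignoring cooldown timers
    )

# Burst to a thousand servers to run end-of-the-month numbers...
set_fleet_size("reporting-fleet", 1000)
# ...run the batch job...
# ...then fall back to a small everyday footprint.
set_fleet_size("reporting-fleet", 8)
```

With on-premises hardware, that thousand-server peak would have to be owned and powered year-round; in the cloud it is rented for the hours it is needed.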
Blunder #2: Not planning for artificial intelligence and machine learning to be disruptive
Machine learning is already remaking a variety of industries, Stonebraker said, and it is going to replace some workers. “The odds that it is not disruptive in financial services is zero,” he said.
In light of this, companies should avoid being disrupted and instead be the disruptor. This means paying for AI and machine learning expertise, which is in short supply. “There’s going to be an arms race,” he said of the competition to hire talent. “Get going on it as quick as you can.”
Blunder #3: Believing your data scientists are actually doing data science

Leaders often feel they are on top of data science, and things like algorithm development, because they have hired data scientists. But data scientists typically spend most of their time analyzing and cleaning data and integrating it with other sources, Stonebraker said.
For example, a machine learning expert at iRobot told Stonebraker that she spent 90% of her time on data discovery, integration, and cleaning. Of the 10% that remained, she spent 90% fixing errors from that cleaning (another 9% of the total), which left about 1% of her time for the job she was actually hired to do, Stonebraker said.
These tasks are important — “without clean data, or clean enough data, your data science is worthless,” he said.
But it’s also important to realize how data scientists are actually spending their time. “They are in the data integration business, and so you might as well admit that that’s what your data scientists do,” he said. The best way to address this, he said, is to have a clear strategy for dealing with data cleaning and integration, and to have a chief data officer on staff.
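What does that cleaning and integration work actually look like? Below is a minimal pandas sketch of the grunt work described above: reconciling the same customers from two differently formatted exports. The file names, column names, and cleaning rules are invented for illustration.

```python
# Sketch of the unglamorous cleaning/integration work data scientists do,
# using pandas. All file and column names are hypothetical.
import pandas as pd

crm = pd.read_csv("crm_export.csv")        # e.g. columns: Cust_Name, SignupDate
billing = pd.read_csv("billing_dump.csv")  # e.g. columns: customer, signup_dt

def clean(df: pd.DataFrame, name_col: str, date_col: str) -> pd.DataFrame:
    out = pd.DataFrame()
    # Normalize names so "ACME Corp." and " acme corp" match across sources.
    out["customer"] = (df[name_col].str.strip().str.lower()
                       .str.replace(r"[.,]", "", regex=True))
    # Coerce messy date strings; bad values become NaT instead of crashing.
    out["signup"] = pd.to_datetime(df[date_col], errors="coerce")
    return out

merged = (pd.concat([clean(crm, "Cust_Name", "SignupDate"),
                     clean(billing, "customer", "signup_dt")])
          .drop_duplicates(subset="customer")
          .dropna(subset=["signup"]))
```

Every new source brings its own quirks (name formats, date formats, duplicates), which is why this work swallows so much of a data scientist's week.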
Blunder #4: Believing traditional data integration approaches will solve the problem

Many companies err in believing that traditional solutions will handle data cleaning and integration, Stonebraker said, specifically ETL (extract, transform, load) and master data management processes. ETL requires intensive human effort, he said, and becomes too slow and too expensive once you have more than about 20 data sources. These processes also require a global data model to be fixed at the outset, while today's enterprises are agile and evolve quickly. The technology is brittle and is not going to scale, he said.
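To see why hand-built ETL stops scaling, consider the sketch below: each new source needs its own hand-written transform onto a global schema agreed up front, so the human effort grows with every source added. All source names and schemas here are hypothetical.

```python
# Sketch: classic ETL needs one hand-written transform per source, mapped
# onto a fixed global target schema. Every new source adds human work;
# at 20+ sources the approach buckles. All names are hypothetical.
from typing import Callable

TARGET_SCHEMA = ("customer_id", "name", "region")  # global model, fixed up front

def transform_sap(row: dict) -> dict:         # written by hand for source 1
    return {"customer_id": row["KUNNR"], "name": row["NAME1"],
            "region": row["REGIO"]}

def transform_salesforce(row: dict) -> dict:  # written by hand for source 2
    return {"customer_id": row["AccountId"], "name": row["Name"],
            "region": row["BillingState"]}

# ...and so on through transform_20: each one reverse-engineered, coded,
# tested, and maintained by a person, and each one breaks when its source
# schema changes.
TRANSFORMS: dict[str, Callable[[dict], dict]] = {
    "sap": transform_sap,
    "salesforce": transform_salesforce,
}

def load(source: str, rows: list[dict]) -> list[dict]:
    """Map raw rows from a named source onto the global schema."""
    return [TRANSFORMS[source](r) for r in rows]
```

The bottleneck is not the code's runtime but the people: every mapping is a manual artifact, and the fixed global schema has to be renegotiated whenever the business changes.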