By James Kobielus
Big Data Analytics has been one of the dominant tech trends of this decade, and it’s also been one of the most dynamic and innovative segments of the IT market.
Today’s Big Data Analytics market is quite different from the industry of even a few years ago, and it’s almost sure to be substantially different in a few years’ time. One of the chief trends in recent years has been the spread of cloud-native computing built on orchestrated containers. Likewise, Artificial Intelligence (AI)—which depends and thrives on Big Data—has become the core of many disruptive applications. These trends will almost certainly drive ongoing evolution of the Big Data Analytics ecosystem.
As we look ahead to 2019, we foresee the following dominant trends in Big Data Analytics:
The Big Data Analytics Ecosystem Will Go Deeply Cloud-native
Kubernetes is the foundation for the new generation of Big Data Analytics in the cloud. The most noteworthy trend in this market over the past year has been the recrystallization of the data ecosystem around Kubernetes. In 2019, we predict that the Open Hybrid Architecture Initiative will deliver on its plan to modularize and containerize HDFS, MapReduce, HBase, Hive, Pig, YARN, and other principal Hadoop components. We also predict that the prime sponsors—Hortonworks (soon to become part of Cloudera) and IBM/Red Hat—will deliver next-generation commercial Hadoop platforms that incorporate this architecture into their respective hybrid cloud solution portfolios in early 2019, and that other cloud solution providers will follow their lead throughout the year. Similar containerization initiatives in the Spark, TensorFlow, streaming, distributed object store, and block storage segments will take hold in 2019 as the entire Big Data stack decouples for more agile deployment and management in Kubernetes-based DevOps environments.
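To make the containerization idea concrete, here is a minimal sketch of what deploying one modularized Hadoop component on Kubernetes looks like: a Pod manifest built as a plain Python dictionary and serialized with the standard library. The image name is hypothetical; the ports are the conventional HDFS NameNode RPC and web UI ports, and a production deployment would of course use StatefulSets, persistent volumes, and so on.

```python
import json

def hdfs_namenode_manifest(image: str = "example.org/hdfs-namenode:3.1") -> dict:
    """Build a minimal Kubernetes Pod manifest for a containerized
    HDFS NameNode. The image name is illustrative, not a real registry."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "hdfs-namenode", "labels": {"app": "hdfs"}},
        "spec": {
            "containers": [{
                "name": "namenode",
                "image": image,
                "ports": [{"containerPort": 8020},   # NameNode RPC
                          {"containerPort": 9870}],  # web UI (Hadoop 3.x)
            }]
        },
    }

manifest = hdfs_namenode_manifest()
print(json.dumps(manifest, indent=2))
```

The point of the exercise is the decoupling itself: once each component is described declaratively like this, Kubernetes can schedule, scale, and upgrade it independently of the rest of the stack.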
Every Big Data Analytics Platform Provider Will Invest Heavily in Data Science Toolchains
Big Data Analytics solution providers are in a war to win the hearts and minds of the new generation of developers working on AI projects. In 2019, Big Data Analytics solution providers will invest heavily in tools to accelerate deployment of trained AI, Deep Learning (DL), and Machine Learning (ML) applications into production. More vendors will emphasize their offerings’ ability to automate such traditionally manual tasks as feature engineering, hyperparameter optimization, and data labeling. As the Big Data Analytics ecosystem shifts toward cloud-native architectures, more Data Science workbenches will incorporate the ability to automate tasks over Kubernetes orchestration fabrics and to containerize models for deployment into public and private clouds. This trend will bring emerging standards, such as Kubeflow, into the burgeoning Data Science DevOps toolchain ecosystem.
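As a rough illustration of what tools like these automate, here is the brute-force baseline for hyperparameter optimization, a grid search, in plain Python. The objective function is a toy stand-in for cross-validated model accuracy; automated systems such as Kubeflow’s tuning components improve on this loop with smarter search strategies and parallel trials.

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate every hyperparameter combination in the grid and
    return the best-scoring one (higher score is better)."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for a model's validation accuracy:
# peaks at lr=0.1 and depth=4.
def toy_objective(params):
    return -(params["lr"] - 0.1) ** 2 - (params["depth"] - 4) ** 2

grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best, score = grid_search(toy_objective, grid)
print(best)  # -> {'lr': 0.1, 'depth': 4}
```

Each trial is independent, which is exactly why this workload maps so naturally onto a Kubernetes orchestration fabric: every parameter combination can run as its own container.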
Hadoop and Spark Will Become Legacy Technologies in the Big Data Analytics Market
Hadoop has seen its role in the Big Data Analytics arena steadily diminish over the past several years. Increasingly, its core use cases have narrowed to a distributed file system for unstructured data, a platform for batch data transformation, a Big Data Governance repository, and a queryable Big Data archive. In 2019, Hadoop will struggle to expand its application reach to online analytical processing, Business Intelligence, Data Warehousing, and other niches that are addressed by other open-source platforms. By the end of the year, Hadoop will start to be phased out of many enterprise Big Data environments, even in its core Data Lake roles, in favor of distributed object stores, stream-computing platforms, and massively scalable in-memory clusters. Even Apache Spark—which emerged in the middle of this decade as a Hadoop alternative—is feeling increasingly like a legacy technology in many TensorFlow-centric AI shops. One sign of this trend is the ETL niche into which Spark is increasingly being relegated, a niche that may itself shrink as schema-on-read architectures come to the forefront.
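The schema-on-read pattern mentioned above is easy to sketch: raw records are stored exactly as they arrive, and a schema is applied only at query time, so records written before the schema existed still load cleanly. The field names and defaults below are illustrative only.

```python
import json

RAW_EVENTS = [                                    # stored as-is, no upfront schema
    '{"user": "a", "amount": "19.99", "ts": 1546300800}',
    '{"user": "b", "amount": "5.00"}',            # older record, missing a field
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: cast known fields to their target
    types and fill defaults for fields the writer never emitted."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: cast(record.get(field, default))
               for field, (cast, default) in schema.items()}

# Schema maps each field to a (cast function, default value) pair.
SCHEMA = {"user": (str, ""), "amount": (float, "0"), "ts": (int, 0)}

rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
print(rows[1])  # -> {'user': 'b', 'amount': 5.0, 'ts': 0}
```

Because interpretation is deferred to the reader, no cluster-wide transformation job is needed when the schema evolves, which is precisely what erodes the classic batch-ETL role.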
Public Cloud Providers Will Absorb Most New Big Data Analytics Growth Opportunities
Enterprises are moving more of their Big Data Analytics workloads to public clouds and developing more greenfield applications for these environments. In 2019, the three principal public cloud providers—Amazon Web Services, Microsoft Azure, and Google Cloud Platform—will step up their efforts to help enterprise accounts migrate their Data Lakes away from on-premises platforms. Other public cloud providers—such as IBM Cloud and Oracle Cloud—will struggle to hold onto their Big Data Analytics market shares. As a defensive strategy to retain enterprise accounts, IBM and Oracle will emphasize hybrid cloud solutions that help customers centralize management of Big Data assets distributed between private and public clouds. In addition, more Big Data public cloud providers—ceding the IaaS and PaaS segments to AWS, Microsoft, and Google—will shift to offering premium SaaS analytic applications for line-of-business and industry-specific opportunities.
Big Data Catalogs Will Become Central to AI DevOps
Users’ ability to rapidly search, discover, curate, and govern data assets is now the foundation of digital business success. In 2019, we expect to see more enterprises repurpose their Data Lakes into Big Data Catalogs within application infrastructures that drive the productivity of knowledge workers, support the new generation of developers who are building and training production AI applications, and facilitate algorithmic transparency and e-discovery. We expect vendors such as IBM, Cloudera/Hortonworks, Informatica, Collibra, and others to deepen their existing Big Data Catalog platforms’ ability to manage more of the metadata, models, images, containers, and other artifacts that are the lifeblood of the AI DevOps workflow. More Big Data Catalogs will be deployed across multi-clouds, leveraging the new generation of virtualization tools that present a single control plane for managing disparate data assets across public and private clouds. And we predict that the principal public cloud providers—AWS, Microsoft, and Google—will roll out their own Big Data Catalogs for customers who choose to deploy those services in hybrid public/private clouds.
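To show what a catalog actually manages, here is a minimal sketch of a catalog entry and tag-based discovery in plain Python. The asset names, teams, and tags are hypothetical; real platforms add access control, lineage graphs, and federation across clouds on top of this basic shape.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One governed asset in a Big Data Catalog. The catalog tracks
    metadata about the asset (owner, tags, lineage), not the asset itself."""
    name: str
    asset_type: str          # e.g. "table", "model", "container image"
    owner: str
    tags: set = field(default_factory=set)
    lineage: list = field(default_factory=list)  # upstream asset names

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry):
        self._entries[entry.name] = entry

    def search(self, tag: str):
        """Discovery: return the names of assets carrying the given tag."""
        return sorted(e.name for e in self._entries.values()
                      if tag in e.tags)

catalog = DataCatalog()
catalog.register(CatalogEntry("clickstream_raw", "table", "web-team",
                              tags={"pii", "raw"}))
catalog.register(CatalogEntry("churn_model_v3", "model", "ds-team",
                              tags={"production"},
                              lineage=["clickstream_raw"]))
print(catalog.search("production"))  # -> ['churn_model_v3']
```

Note that models and container images sit alongside tables as first-class entries, which is what makes the catalog useful to AI DevOps rather than to data governance alone.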
Data Lakes Will Evolve Toward Cloud Object Storage and Stream Computing
In 2019, cloud object storage platforms—such as AWS S3 and Microsoft Azure Data Lake Storage—will continue to supplant Hadoop in enterprise Data Lakes. But even that approach is in the process of being eclipsed by stream computing. Low-latency streaming platforms—such as Kafka, Flink, and Spark Structured Streaming—are becoming as fundamental to enterprise data infrastructures as relational data architectures have been since the 1970s. As the year progresses, enterprises will deploy streaming platforms to drive low-latency DevOps pipelines that continuously infuse mobile, IoT, robotics, and other edge applications with trained, best-fit machine-learning models. Online transactional analytic processing, data transformation, and Data Governance workloads are also increasingly moving toward low-latency, stateful streaming backbones.
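“Stateful” is the operative word in that last sentence. Here is a minimal sketch of a stateful stream operator, a sliding-window average, in plain Python: the kind of per-key state that platforms such as Flink and Kafka Streams manage, partition, and checkpoint at scale.

```python
from collections import deque

class WindowedMean:
    """Minimal stateful stream operator: emits the average of the
    last N events. The deque is the operator's state; production
    streaming engines persist and restore such state on failure."""
    def __init__(self, window_size: int):
        self.window = deque(maxlen=window_size)   # operator state

    def process(self, value: float) -> float:
        """Ingest one event, emit the current windowed aggregate."""
        self.window.append(value)
        return sum(self.window) / len(self.window)

op = WindowedMean(window_size=3)
outputs = [op.process(v) for v in [1.0, 2.0, 3.0, 4.0]]
print(outputs)  # -> [1.0, 1.5, 2.0, 3.0]
```

Unlike a batch job over an object store, the operator produces an updated answer on every arriving event, which is what makes streaming backbones suitable for continuously refreshing edge applications with current results.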
Databases as we’ve known them are being deconstructed and reassembled for edge-facing deployments. In 2019, disruptive new data platforms will come to market that combine mesh, streaming, in-memory, and Blockchain capabilities. Many of these new distributed data platforms will be optimized for continuous AI DevOps pipelines that require low-latency, scalable, and automated data ingest, modeling, training, and serving to edge devices. Serverless interfaces to these analytic-pipeline capabilities will be standard, supplemented by stateful streaming fabrics that support in-line recommendation engines, next best action, and other transactional workloads in edge devices on the emerging 5G broadband wireless networks.
GPUs have made inroads in recent years into commercial database architectures. However, it appears unlikely that they’ll break out of the application-specific coprocessor bucket and address the core transaction processing workloads for which most organizations use database management systems. That’s because GPUs are ill-suited for accelerating any database operations that can’t be parallelized, or that don’t involve floating point and other numeric processing, or that require a lot of data to be moved back and forth across the system bus from the main central processing units. Nevertheless, GPU-optimized databases have established themselves as a promising niche, mostly coming from startups focused on acceleration of AI and other Data Analytics applications. In 2019, it’s likely that the leading enterprise DBMS vendors will jump into this market, integrating GPUs into their on-premises and cloud data platforms to accelerate augmented, virtual, and mixed reality applications that require fast, interactive image processing with embedded AI. And it’s very likely that they’ll do so through acquisition of promising GPU database startups such as Brytlyt, BlazingDB, Kinetica, OmniSci, and SQream.
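The bus-transfer argument above can be reduced to a back-of-envelope model: offloading a query to a GPU only pays off when the compute time saved exceeds the time spent shuttling data over the system bus. All of the numbers below are illustrative assumptions for the sketch, not benchmarks of any real system.

```python
def gpu_worthwhile(data_gb: float, cpu_secs: float,
                   gpu_speedup: float, bus_gb_per_s: float = 16.0) -> bool:
    """Crude model of GPU offload economics: assume the data must be
    copied to the GPU and the results copied back over a bus with the
    given bandwidth, while compute runs gpu_speedup times faster."""
    transfer = 2 * data_gb / bus_gb_per_s        # copy over, copy back
    gpu_total = transfer + cpu_secs / gpu_speedup
    return gpu_total < cpu_secs

# A scan-heavy transactional query: lots of data, modest compute.
print(gpu_worthwhile(data_gb=64, cpu_secs=2.0, gpu_speedup=50))   # False
# A numeric, highly parallel workload: little data, heavy compute.
print(gpu_worthwhile(data_gb=1, cpu_secs=60.0, gpu_speedup=50))   # True
```

The asymmetry in the two cases is the whole story: workloads dominated by data movement stay on the CPU, while compute-dense, parallel, numeric workloads (analytics, AI, image processing) are where GPU databases earn their keep.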
What am I overlooking in Big Data Analytics trends? I would like to hear my readers’ predictions on this topic.