Data 2022 Outlook Part I: Will data clouds get easier and streaming get off its own island?
With the pandemic nearing its two-year anniversary, there is little doubt that growth of cloud adoption has continued accelerating. Although dated last March, the most recent state of the cloud report from Flexera showed significant acceleration in cloud spending for large enterprises, with the proportion shelling out over $1 million/month doubling over the previous year.
As reported by Larry Dignan last summer, a backlash to cloud migration may be starting to brew based on the notion that a lot of cheap eventually becomes expensive. We’ve heard anecdotes from technology providers like Vertica that some of their largest clients were actually repatriating workloads from the cloud back to their own data center or colocation facilities.
Looking back on 2021
Last year saw some of the last on-premises database holdouts, such as Vertica and Couchbase, unveil their own cloud managed services: it reflected the reality that, while not all customers are going to deploy in the public cloud, offering an as-a-service option is now a required addition to the portfolio.
Despite the growth in cloud adoption, the database and analytics world did not see a lot of dramatic product or cloud service introductions; instead, we saw a rounding out of portfolios, with the addition of serverless options for analytics, and a growing move toward pushdown processing in the database or storage tier. Excluding HPE, which unveiled a significant expansion of its GreenLake hybrid cloud platform in midyear, the same was largely true on the hybrid cloud front.
With most providers having planted their stakes in the cloud, the past year was about cloud providers building bridges to make it easier to lift and shift or lift and transform on-premises database deployments. For lift and shift, Microsoft already offered Azure SQL Database Managed Instance to SQL Server customers, and last year, added Azure Managed Instance for Apache Cassandra.
Meanwhile, AWS introduced its answer to Managed Instance: a new RDS Custom option for SQL Server and Oracle customers who have special configurations that wouldn’t otherwise be supported in RDS. This could be especially useful for instances that support, for example, legacy ERP applications. And what if you want to continue using your existing SQL skills on a new target? Last year, AWS released Babelfish, an open source utility that can automatically convert most SQL Server T-SQL calls into PostgreSQL’s PL/pgSQL dialect. And then there’s Datometry, whose pitch is to simply virtualize your database.
Also in the spirit of lift and shift, last year saw each of the major clouds adding or expanding database migration services designed to make the process simpler. AWS and Azure already have services that provide guided approaches to migrating from Oracle or SQL Server to MySQL or PostgreSQL. Last year, Google introduced a database migration service that is about like for like: making transfer of on-premises MySQL or PostgreSQL to Cloud SQL into an almost fully automated process.
So what’s on tap for 2022? We’re dividing our year-ahead outlook over two posts. Here we’ll train our eye on trends with cloud data platforms, and tomorrow, we’ll share our thoughts on what will happen when the spotlight shines on data mesh in the coming year.
The cloud might start getting easier
Cloud providers are not going to suddenly stop expanding their portfolios with new products and services. But we expect that in the coming year, they will start paying more attention to identifying synergies across their portfolios from which they could deliver new blended solutions. The driver? Offering solutions blending some of their services should move at least some of the burden of integrating capabilities off the shoulders of cloud customers.
The backdrop to all this is that the cloud was supposed to simplify, not only IT budgeting, but also operations. In the data world, when customers adopt managed DBaaS services such as Amazon Aurora, Azure SQL Database, Google Cloud Spanner, IBM Db2 Warehouse Cloud, or Oracle Autonomous Database, compute and storage instances are typically predetermined and the DBaaS provider handles the software housekeeping. Serverless, in turn, takes simplification up another notch by dispensing with the need for customers to capacity plan their deployments.
The problem then becomes, are we getting too much of a good thing?
AWS alone has well over 250 services, of which, for instance, you have 11 different container services, 16 databases, and over 30 machine learning (ML) services. It’s not much different with Google Cloud or Azure either. Google Cloud offers a dozen analytic services, 10 container services, and at least a dozen or more AI and ML services; while Azure offers nearly a dozen DevOps services, 10 hybrid and multi-cloud services, and almost a dozen IoT services. With tongue in cheek, we were privately relieved when AWS did not introduce a 17th database at the last re:Invent conference.
The breadth of managed offerings in the cloud reflects a growing maturity: cloud providers are expanding the reach of their platform-, database-, and software-as-a-service offerings, serving a wider swath of enterprise compute needs.
So, what happens when you want to integrate a BI tool with a database or add a customer experience chatbot, video recognition system, or some event alerting capability for a manufacturing, supply chain, or maintenance process? Or containerize and deploy it as microservices? With such a wealth of choices, the burden has been on the customer to integrate or piece them together.
The next step for cloud providers is to tap the diversity of their portfolios, identify the synergies, and start bundling solutions that at least lift part of the burden of integration off the customer’s shoulders. We’re seeing some early stirrings. For instance, AWS and Google Cloud have made strides to unify their ML development services. As we’ll note below, we’re seeing some progress in the analytics stack where cloud data warehousing services are starting to either morph into end-to-end solutions, or push down more processing into the database. And we’re seeing integration of conversational AI (a.k.a., chatbots) into prescriptive offerings such as Google Contact Center AI.
Our wish list includes embedding some data fabric, cataloging, and federated query capabilities into analytic tools, for end users and data scientists alike, so they don’t have to integrate a toolchain to get a coherent view of data. There is an excellent opportunity to embed ML capabilities that learn and optimize to an end user’s or organization’s querying patterns based on SLA and cost requirements. We’d also like to see prescriptive solutions that tie in different AI services to business applications, such as video recognition for manufacturing quality applications. As we note below, we expect to see streaming integrated more tightly with data warehouses/data lakes and operational database services.
We expect that in 2022, cloud providers will ramp up efforts to tap the synergies hiding in plain sight in their portfolios – an initiative that should also heavily involve horizontal and vertical solution partners.
Streaming will start converging with analytics and operational databases
A long elusive goal for operational systems and analytics is unifying data in motion (streaming) with data at rest (data sitting in a database or data lake).
In the coming year, we expect to see streaming and operational systems come closer together. The benefit would be to improve operational decision support by embedding some lightweight analytics or predictive capability. There would be clear benefits for use cases as diverse as Customer 360 and Supply Chain Optimization; Maintenance, Repair, and Overhaul (MRO); capital markets trading; and smart grid balancing. It could also make analytics more current and provide real-time feedback loops for ML models. In a world where business is getting digitized, having that predictive loop to support data-driven operational decisions is morphing from luxury to necessity.
The idea of bringing streaming and data at rest together is hardly new; it was spelled out years ago as the Kappa architecture, and there have been isolated implementations on big data platforms – the former MapR’s “converged platform” (now HPE Ezmeral Unified Analytics) comes to mind.
Streaming workloads traditionally ran on their own dedicated platforms because of their extreme resource demands. The show stopper keeping streaming on its own island of infrastructure has long been resource contention.
Streaming applications, such as parsing real-time capital market feeds, detecting anomalies in the flow of data from physical machines, troubleshooting the operation of networks, or monitoring clinical data, have typically operated standalone. And because of the need to maintain a light footprint, analytics and queries tended to be much simpler than what you could run in a data warehouse or data lake. Specifically, streaming analytics often involves filtering, parsing, and increasingly, predictive trending.
When there was a handoff to data warehouses or data lakes, in most cases the data would be limited to result sets. For instance, you can run a SQL query on Amazon Kinesis Data Analytics that identifies outliers, persist the results to Redshift, and then perform a query on the combined data for more complex analytics. But it’s a multistep operation, involving two services, that’s not strictly real-time.
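To make the two-step handoff concrete, here is a minimal sketch in plain Python. It is not the Kinesis Data Analytics or Redshift API: the function name `flag_outliers`, the window size, the threshold, and the sample readings are all assumptions chosen for illustration, and a simple list stands in for the warehouse table. The point is that the streaming tier runs only a lightweight filter, and only the result set is persisted for heavier queries later.

```python
from collections import deque
from statistics import mean, pstdev


def flag_outliers(readings, window=5, threshold=3.0):
    """Yield (position, value) for readings far from the recent window's mean.

    Hypothetical helper: a rolling z-score check, standing in for the kind of
    lightweight filter a streaming tier typically runs.
    """
    recent = deque(maxlen=window)
    for position, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield position, value
        recent.append(value)


# Step 1: the stream is filtered as it arrives (sample data is made up).
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8]

# Step 2: only the flagged rows are handed off; this list stands in for the
# warehouse table that would later be joined with historical data.
warehouse_rows = list(flag_outliers(stream))
print(warehouse_rows)
```

Everything the filter discards never reaches the second system, which is why the richer analysis can only happen after the handoff rather than in the flow of the stream.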
Admittedly, in-memory operational databases like Redis support near-instant persistence of streaming data with append-only log data formats, but that is not the same as adding a predictive feedback loop to operational applications.
Over the past couple of years, we’ve seen some hints that streaming is about to become part of operational and analytic data clouds. Confluent kicked open the doors when it released ksqlDB on Confluent Cloud back in 2020, while last year, DataStax introduced the beta for Astra Streaming, built on Apache Pulsar (not Kafka); it’s currently a separate service but we expect over time that it will be blended in with Astra DB. In the Spark universe, Delta Lake can act as a streaming source or sink for Spark Structured Streaming.
The game changer is cloud-native architecture. The elasticity of the cloud eliminates issues of resource contention, while microservices provide more resilient alternatives to classic design patterns involving a central orchestrator or state machine. In turn, Kubernetes (K8s) enables analytic platforms to support elasticity without having to reinvent the wheel for orchestrating compute resources. Converged streaming and operational or analytic systems can run on distributed clusters that can be partitioned and orchestrated for performing real-time stream analytics, merging results, and correlating with complex operational models.
Such convergence won’t replace dedicated streaming services, but there are clear opportunities for cloud incumbents: Amazon Kinesis Data Analytics paired with Redshift or DynamoDB; Azure Stream Analytics with Cosmos DB or Synapse Analytics; Google Cloud Dataflow with BigQuery or Firestore come to mind. But there are also opportunities for real-time in-memory data stores. We’re talking to you, Redis, not to mention any of the dozens of time series databases out there.
Data share and share, alike
In hindsight, this looks like a no-brainer. With cloud storage being the de facto data lake, promoting wider access to data should be a win-win for everybody: data providers get more mileage (and potentially, monetization) out of their data; data customers gain access to more diverse data sets; cloud platform providers can sell more utilization (e.g., storage and compute); while cloud data warehouses transform themselves into data destinations. From that perspective, it’s surprising that it’s taken each of the major cloud providers almost five years to catch on to an idea that Snowflake hatched.
Snowflake, followed by AWS, has been the most active in promoting data exchanges, although the two approached it from opposite directions. Snowflake began with a data sharing capability aimed across internal departments and later opened a data exchange for third parties; AWS went in reverse order, opening a data exchange on AWS Marketplace a couple of years back, but only over the past year adding capabilities for internal sharing of data for Redshift customers (that required AWS to develop the RA3 instance that finally separated Redshift data into its own pool). Snowflake has taken the added step of opening vertical industry sections of its marketplace, making it easier for customers to connect to the right data sets; on the other hand, AWS beat Snowflake to the punch in commercializing its data marketplace by utilizing the existing AWS Marketplace mechanism.
Google followed suit this year with Analytics Hub for sharing BigQuery data sets, a capability that they will subsequently extend to other assets such as Looker Blocks and Connected Sheets. Microsoft Azure has also gotten into the act.
Over the next year, we expect each of the cloud providers to flesh out their internal and external data exchanges and marketplaces, especially where it comes to commercialization.
Database platforms turn to ML to run themselves
This is the flip side of in-database ML, which we forecast last year would become a checkbox item for cloud data warehouses and data lakes. What we’re talking about here is the use of ML under the covers to help run or optimize a database.
Oracle fired the first shot with the Autonomous Database. Oracle went full-bore with ML by designing a database that literally runs itself. That’s only possible with the breadth of database automation that is largely unique to Oracle database. But for Oracle’s rivals, we’re taking a more modest view: applying ML to assist, not replace the DBA in optimizing specific database operations.
As any experienced DBA will testify, running a database involves lots of figurative “knobs.” Examples include physical data placement and storage tiering, the sequence of joins in a complex query, and identifying the right indexes. In the cloud, that could also encompass identifying the most optimal hardware instances. Typically, configurations are set by formal rules or based on the DBA’s informal knowledge.
Optimizing a database is well-suited for ML. The processes are data rich, as databases generate huge troves of log data. The problem is also well-bounded, as the features are well-defined. And there is significant potential for cost savings, especially where it comes to factoring how to best lay out data or design a query. Cloud DBaaS providers are well-situated to apply ML to optimize the running of their database services as they control the infrastructure and have rich pools of anonymized operational data on which to build and continually improve models.
So we’ve been surprised that so far, there have been few takers to Oracle’s challenge. Just about the only formally productized use of ML (aside from Oracle) is with Azure SQL Database and SQL Managed Instance. Microsoft offers autotuning of indexes and queries. That’s a classical problem of trade-offs: the faster speed of retrieval with an index vs. the cost and overhead of writes when you have too many indexes. Azure’s automated tuning can automatically create indexes when it senses query hot spots; drop indexes that go unused after 90 days; and reinstate previous versions of query plans if newer ones prove slower.
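The three tuning actions described above can be pictured as a small set of decision rules. The sketch below is purely illustrative and is not Microsoft’s implementation: the class names, the `recommend` function, and the hot-spot threshold are hypothetical, and only the 90-day limit comes from the text. A real service would learn such thresholds from telemetry rather than hard-code them.

```python
from dataclasses import dataclass

UNUSED_DAYS_LIMIT = 90      # from the text: indexes unused after 90 days are dropped
HOT_SCAN_THRESHOLD = 1000   # assumption: full scans per day that mark a hot spot


@dataclass
class IndexStats:
    name: str
    days_since_last_use: int


@dataclass
class ColumnStats:
    column: str
    full_scans_per_day: int
    indexed: bool


@dataclass
class PlanStats:
    query_id: str
    previous_ms: float
    current_ms: float


def recommend(indexes, columns, plans):
    """Return a list of (action, target) tuples from simple, hypothetical rules."""
    actions = []
    for ix in indexes:
        # An unused index still adds write overhead, so drop it.
        if ix.days_since_last_use >= UNUSED_DAYS_LIMIT:
            actions.append(("drop_index", ix.name))
    for col in columns:
        # A heavily scanned, unindexed column is a query hot spot.
        if not col.indexed and col.full_scans_per_day >= HOT_SCAN_THRESHOLD:
            actions.append(("create_index", col.column))
    for plan in plans:
        # A newer plan that runs slower is rolled back to the previous one.
        if plan.current_ms > plan.previous_ms:
            actions.append(("revert_plan", plan.query_id))
    return actions


# Made-up sample statistics for illustration only.
actions = recommend(
    indexes=[IndexStats("ix_orders_legacy", 120), IndexStats("ix_orders_date", 2)],
    columns=[ColumnStats("customer_id", 5000, indexed=False)],
    plans=[PlanStats("q42", previous_ms=80.0, current_ms=140.0)],
)
print(actions)
```

Each rule encodes one side of the trade-off: dropping idle indexes cuts write overhead, creating indexes on hot spots speeds retrieval, and reverting plans guards against regressions.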
Others have experimented with techniques such as reinforcement learning to varying degrees of success. UC Berkeley’s RISELab has experimented with reinforcement learning to boost performance vs. Spark’s existing Catalyst query optimizer. As noted above, cloud managed database service providers have huge troves of data for training ML models. For cost- or performance-conscious customers, ML could provide tactical competitive advantages that, unlike the autonomous database, won’t make their potential market of DBAs perceive their jobs to be threatened.
Over the coming year, we expect to see more cloud DBaaS services introduce options incorporating ML to optimize the database, and promote to enterprises how they can save money.
Disclosure: AWS, DataStax, Google Cloud, HPE, IBM, and Oracle are dbInsight clients.