The Current State of Data Processing Systems
For the last five years, enterprises have been scrambling to centralize their analytics processing in the cloud (hence the current $20B+ valuations of Databricks and Snowflake). Most recently, the major data platform vendors have converged their messaging around a “lakehouse” architecture, which takes the best attributes of traditional data warehouses and runs them on platforms built over data lake storage. For near-real-time scenarios, several streaming platforms have emerged as well (e.g. Storm, Spark Streaming, Pulsar, and Flink). These systems also adopt a cloud-based, centralized architecture and assume that data ingestion will direct edge streams to cloud message brokers such as Kafka, Kinesis, or Event Hubs. These systems have scaled to handle petabytes of data, but often at great cost.
Pressures on the current model
As organizations realize value through data analytics, the pressure to process greater volumes of data grows. This leads to increasing complexity and non-linear costs as volumes rise. For instance, reprocessing a full dataset to add a calculated column or correct a bug is easy with a few gigabytes of data, but extremely expensive over a petabyte.
Moore’s law allows 41% more data to be processed each year, compounded annually (Cross, 2016), while data volume is growing at a compound annual growth rate of 61% (Patrizio, 2019). Playing these growth rates forward, if we assume that there is enough power to process all of the useful data in the world today (which likely isn’t…
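The widening gap implied by these two growth rates can be sketched with a few lines of arithmetic. This is an illustrative model only: it assumes processing capacity exactly matches data volume in year zero and that both rates (41% and 61%, from the citations above) hold constant.

```python
# Illustrative model of the gap between processing capability
# (~41% annual growth, per Moore's law as cited) and data volume
# (~61% annual growth). Both are normalized to 1.0 in year zero.
PROCESSING_GROWTH = 1.41
DATA_GROWTH = 1.61

def unprocessable_fraction(years: int) -> float:
    """Fraction of data exceeding processing capacity after `years`,
    assuming capacity and data volume start equal in year 0."""
    capacity = PROCESSING_GROWTH ** years
    data = DATA_GROWTH ** years
    return max(0.0, 1.0 - capacity / data)

for y in (1, 5, 10):
    print(f"year {y}: {unprocessable_fraction(y):.0%} of data unprocessable")
```

Under these assumptions, roughly an eighth of the world's data falls outside processing capacity after one year, and the shortfall compounds toward a large majority within a decade.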