April 2022
Implementing Data Fabric, Mesh, and Lakehouse
If you joined the data management profession only in the past twenty years, and especially since 2010, it may feel natural to begin all considerations of data architecture and governance from “big data.” After all, isn’t big data the lifeblood of digital business and the beating heart of artificial intelligence? Aren’t the volume, velocity, and variety of big data the main challenges that IT must face in designing and delivering digital transformation solutions?
As an old-timer (since 1985!) in the data management industry, I contend that significant lessons were learned about managing data prior to the millennium and the subsequent flooding of the data lake. Many of these lessons have been forgotten or, at least, minimised since the rise of big data and the rush to the Cloud.
The quote marks I put around big data above give some clue to what I believe has been forgotten or minimised. I do know, of course, that big data no longer has the prestige it once held. It’s not about big data, we have been told for the past few years. It’s just data. There is some truth in this statement, although the reasoning behind it may often be mistaken. We’ll see those misinterpretations as we discuss implementing a data lakehouse, fabric, or mesh.
From Lake to Lakehouse, from water to bricks
The belief underpinning the data lakehouse, according to its promoters at Databricks, is that “today, the vast majority of enterprise data lands in data lakes, low-cost storage systems that can manage any type of data,” data that is increasingly stored as objects on Cloud platforms.
This assertion is undeniable for any enterprise that operates, processes, and manages its business entirely in the Cloud. The data lake, first conceived when big data was central, was proposed as a way to get a handle on the size and uncertainty of externally sourced data. For companies that had no history of traditional data warehousing or on-premises operational systems, the data lake was thus the obvious place to begin building BI and analytics systems.
The apparent and heavily promoted cost-saving opportunities of Hadoop and the Cloud then led “bricks-and-mortar” companies with existing traditional data warehouses to adopt the data lake paradigm. They quickly discovered the data management and quality issues that arose, as did their Cloud-centric counterparts. Data lakes became data swamps, through which data had to be half-drowned and dragged to the sunlit uplands of Cloud-based data warehouses and BI tools.
The data lakehouse is a technology-driven approach to preserving the cost benefits of cheap object data stores by re-creating the foundational software of data warehouses (full-function relational databases) in the Cloud environment. Big data technologies and thinking are being retrofitted to problems that were previously solved in traditional environments. I often observe this approach where finance-driven platform decisions are made with no reference to the actual technological capabilities of the new platform or the consequent escalating migration costs. Furthermore, the software needed is still emergent.
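To make this concrete, here is a minimal sketch of the lakehouse idea: warehouse-style, ACID-transactional tables layered over cheap object storage. It assumes PySpark with the open-source Delta Lake format promoted by Databricks; the bucket paths and table contents are purely illustrative.

    from pyspark.sql import SparkSession

    # Register Delta Lake's SQL extensions and catalog with Spark.
    spark = (
        SparkSession.builder.appName("lakehouse-sketch")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Land raw files in the lake as-is, then persist them as a managed,
    # schema-enforced table: relational behaviour on object storage.
    raw = spark.read.json("s3://example-bucket/landing/orders/")
    raw.write.format("delta").mode("append") \
       .save("s3://example-bucket/tables/orders")

    # The same store now supports warehouse-style SQL, including ACID
    # updates that a plain file dump on object storage could not offer.
    spark.sql("""
        UPDATE delta.`s3://example-bucket/tables/orders`
        SET status = 'cancelled'
        WHERE order_id = 42
    """)

The point of the sketch is what it hides: everything a mature relational database adds beyond this (workload management, fine-grained security, decades of optimiser engineering) is precisely the still-emergent software referred to above.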
Another fine Mesh you’ve got me into
The founder of the data mesh pattern, Thoughtworks’ Zhamak Dehghani, explicitly credits modern software development thinking in her seminal article, saying “that the next enterprise data platform architecture is in the convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.” While apparently platform-agnostic, much of this work has emerged from the experience of developing new operational systems in a highly distributed “big data,” hybrid, multi-Cloud environment, and from the evolution of the microservices and DevOps approaches needed to succeed there.
There is much valuable thinking in the data mesh pattern, particularly its consideration of how to approach data and development in a highly distributed—both physically and organisationally—environment. However, the lessons of the past—that is to say, pre-big data—are seldom mentioned. Microservices is a child of the Service Oriented Architecture (SOA) of the 1990s, when its strengths and weaknesses were widely explored. Both SOA / microservices and domain-driven / product-centric thinking emerged specifically in operational systems. As data warehousers of old have long known, the direct application of the principles and lessons of operational systems development to informational systems is far from straightforward. In particular, domain-driven thinking presents significant challenges for the design and delivery of reconciled / consistent cross-functional data resources, such as enterprise data warehouses.
Data mesh implementation is thus challenging. The big data-like thinking of its architectural foundation sits uncomfortably with the data reconciliation approach of data warehousing. In addition, the software required to provide some of the key components of a data mesh is still emerging. So, substantial levels of in-house development and dependence on new open-source projects will be required.
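To illustrate the kind of in-house scaffolding this implies, the sketch below shows one way a domain team might publish a “data product” with an explicit contract (owner, schema, freshness) that consumers can discover and validate against. Every name in it is hypothetical; there is no standard data mesh API to draw on.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DataProductContract:
        """Hypothetical contract a domain team publishes with its data product."""
        domain: str            # owning business domain
        name: str              # product name within the domain
        owner: str             # accountable product owner
        output_location: str   # addressable output port, e.g. a table URI
        schema: dict           # column name -> declared type
        freshness_hours: int   # maximum tolerated staleness

    def contract_violations(contract: DataProductContract, row: dict) -> list:
        """Return the contract violations found in one consumed record."""
        missing = [c for c in contract.schema if c not in row]
        extra = [c for c in row if c not in contract.schema]
        return ([f"missing column: {c}" for c in missing]
                + [f"undeclared column: {c}" for c in extra])

    orders = DataProductContract(
        domain="sales",
        name="orders",
        owner="sales-data-team@example.com",
        output_location="s3://example-bucket/products/sales/orders",
        schema={"order_id": "bigint", "status": "string", "amount": "decimal"},
        freshness_hours=24,
    )
    # A consumer checks an incoming record against the published contract.
    print(contract_violations(orders, {"order_id": 42, "status": "open"}))
    # -> ['missing column: amount']

Multiply this by discovery, lineage, access control, and federated governance across every domain, and the scale of the in-house platform work becomes clear.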
Could implementing a Fabric be called fabrication?
Of the three patterns discussed here, data fabric stands farthest from big-data thinking. Its origins in large analyst companies, such as Forrester and, more recently, Gartner, assure a more broadly based foundation. However, these same origins lead to a more product-focused pattern, based on the existing and foreseen capabilities in the data management vendor ecosystem.
Data fabric is an extension of logical data warehousing, focusing on the metadata and tools required to manage and automate information delivery to businesspeople from whatever source. Of course, neither of these focus areas is new; both date to the earliest days of data warehousing. What is new is an emphasis on semantics and ontologies in metadata—or, better, context-setting information (CSI), given its hugely expanded scope—and the recognition that such CSI must be active, reflecting the live and ever-changing state of the information environment. Data fabric further proposes a strong role for machine learning in activating CSI and automating many aspects of data delivery.
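What “active” CSI might mean in practice can be sketched in a few lines: metadata that is re-scanned from the live environment, timestamped, and used immediately to drive automation. The naive name-matching rule below flags likely personal data for masking; a production fabric would put the machine learning mentioned above in its place, and every name here is illustrative.

    import re
    from datetime import datetime, timezone

    PII_PATTERN = re.compile(r"email|phone|ssn|birth|address", re.IGNORECASE)

    def scan_catalog(tables: dict) -> list:
        """Re-scan a (mock) catalog and emit fresh, timestamped CSI records."""
        now = datetime.now(timezone.utc).isoformat()
        return [
            {
                "table": table,
                "column": column,
                "scanned_at": now,  # the CSI reflects the live state
                "pii_suspected": bool(PII_PATTERN.search(column)),
            }
            for table, columns in tables.items()
            for column in columns
        ]

    # The "active" part: a finding immediately triggers an action (here a
    # print; in practice, a masking policy or a steward's review task).
    for rec in scan_catalog({"customers": ["customer_id", "email_address", "city"]}):
        if rec["pii_suspected"]:
            print(f"apply masking policy: {rec['table']}.{rec['column']}")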
Although many vendors support the data fabric pattern and many commercial products and open-source projects can contribute to its implementation, the tools remain fragmented and deliver basic functions rather than full-fledged data fabrics. In-house development will still be needed for the foreseeable future to integrate these offerings.
Conclusion
In summary, all three patterns offer interesting but divergent thinking about how to evolve your analytics and BI systems in an increasingly diverse and distributed environment for digital transformation. Depending on your starting point and skillset, one may be more relevant than the others. However, the thinking underlying all three deserves close study by all analytics / BI / data warehouse / data lake departments; each offers valuable considerations that may prove enlightening as you evolve your current environment.