Data mesh has been steadily gaining popularity as an approach to enterprise data architecture. While it embodies some great ideas, there are a few crucial principles that could easily be lost when attempting to apply the approach in real life.
The core idea of data mesh is to organize an enterprise data architecture by logical, business-oriented domains, implemented primarily by distributing responsibilities among the “owners” of each domain. This contrasts with the traditional approach that organizes teams by layers within the technical architecture (ingestion, data management, access, etc.) and is driven primarily by centralized coordination.
As enterprises and their data assets and processes become increasingly complex, establishing a method to manage that complexity makes a lot of sense. And putting as much responsibility as possible into the hands of people most familiar with each domain (business processes, data subjects, etc.) is the best way to match the right expertise and experience to the solution components associated with those domains.
Sounds good so far.
But the problem with data mesh – or, rather, the problem with how it has been initially implemented in multiple organizations – is that in a well-meaning effort to reduce complexity and distribute responsibilities, many of the positive aspects of centralization are lost. That is, the baby (highly coordinated, cross-functional capability) is being thrown out with the bathwater (excessive centralization and unmanageable complexity).
To avoid this troubling trend, data leaders must preserve a few important, time-tested principles that remain relevant and require careful attention when following the data mesh philosophy.
Aligning to funded business initiatives is (still) the most critical success factor. I’ve now encountered multiple instances where, in an attempt to follow the data mesh approach, a central data governance organization divided the data landscape into named domains, assigned data owners to each domain to lead planning and implementation for the assigned domain, and essentially said, “Go!” You may be able to guess the results.
Without adequately aligning to business drivers to justify the effort, scope the work across domains, and prioritize elements of the implementation, each data owner prioritized implementation in seemingly random ways, based on the data they considered “important”. Any data element within each domain was arguably within scope, and any data issue encountered was a candidate for deep analysis and resolution. Without proactive and rigorous coordination, there was no way to ensure the domains could support cross-functional business initiatives on any reasonable schedule anyone could count on. The result was long and costly projects with limited use of the deployed data – and a lot of frustrated application projects left on their own to gather and manage the data they needed.
Any approach to architecture must start with business initiatives, especially funded, cross-functional business initiatives that are, by definition, important to the enterprise. Here, I’m not referring to “data initiatives” with names like “data marketplace” or “single version of the truth” (as useful as those concepts are in support of business initiatives). Instead, the drivers should be initiatives like omni-channel customer experience, supply chain optimization, manufacturing automation, and the like. With real business initiatives as drivers, each data owner can commit to delivering the elements within their domain to support the near-term business need. This approach dramatically narrows the focus of each incremental delivery and provides a clear scoping mechanism across domains. Every element delivered has strategic value while contributing to coherent data at the same time.
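As a rough illustration of that scoping mechanism, the sketch below derives each domain owner's near-term scope from the data elements that funded initiatives actually need. The initiative names echo the examples above; the domains, element names, and the `scope_by_domain` helper are hypothetical and stand in for whatever inventory a real governance program would maintain.

```python
from collections import defaultdict

# Hypothetical inventory: each funded initiative lists the
# (domain, data element) pairs it needs for its next delivery.
INITIATIVES = {
    "omni-channel customer experience": [
        ("customer", "email"),
        ("customer", "loyalty_id"),
        ("sales", "order_channel"),
    ],
    "supply chain optimization": [
        ("product", "sku"),
        ("inventory", "on_hand_qty"),
        ("sales", "order_channel"),
    ],
}


def scope_by_domain(initiatives):
    """Group required elements by domain, recording which initiatives need each."""
    scope = defaultdict(lambda: defaultdict(set))
    for initiative, elements in initiatives.items():
        for domain, element in elements:
            scope[domain][element].add(initiative)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {domain: dict(elements) for domain, elements in scope.items()}


scope = scope_by_domain(INITIATIVES)
for domain in sorted(scope):
    for element in sorted(scope[domain]):
        needed_by = ", ".join(sorted(scope[domain][element]))
        print(f"{domain}.{element}: needed by {needed_by}")
```

Anything that does not appear in a domain's scope is, for this increment, out of scope. Elements needed by more than one initiative (here, the hypothetical `sales.order_channel`) are natural candidates for early delivery and cross-domain coordination.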
High-volume, complex, frequently joined, and heavily reused data should be stored in the same database. Let’s acknowledge that the concept of data warehousing is, at least in part, a response to technical limitations. If it were possible to retain all current and historical data in transactional systems (such as point-of-sale systems, purchasing systems, and so on) and drape a “virtualization layer” on top of these systems so that any end user or application could easily and seamlessly access data in and across those systems, and get good performance for both transactional and analytic workloads, then data warehousing would not exist. But it does exist, and it’s growing. In fact, purpose-built data warehousing technologies are more widely implemented than ever. Why is that? Because there’s still no way to get good performance for complex, frequently joined, and heavily reused data (which is growing at least as fast as technical capability) linked across multiple data domains without physically storing that data together in a database. (Note: Here I’m considering one or more compute engines accessing standard-format data in object storage as a type of database, in addition to databases that store data in proprietary structures. Some data warehouse databases can do both.)
While virtually linking data on-the-fly across distributed and highly granular domains may work for some applications, it doesn’t enable effective performance and scalability for data warehouse workloads.
This doesn’t mean that all core enterprise data must be stored in one, single, giant data warehouse. For example, in a company with multiple mostly-independent business units it might make sense for each business unit to have its own data warehouse, linking or combining data across data warehouses for specific use cases only when needed. The number of data warehouses needed by a large, modern organization may be more than one, but it certainly isn’t zero.
The role of the data warehouse(s) within an organization requires thorough consideration based on the nature of the organization and the number and type of cross-functional business initiatives to be supported.
Data must be semantically linked across domains. Integrating core, enterprise data has always been essential to meet the needs of business initiatives and to enable data sharing across initiatives. And modern business initiatives are driving the need for even more integration across domains than in the past, not less. Providing the coherent, cross-functional view of data required by modern businesses demands the hard work of ensuring that data is not just technically linked, but also semantically linked.
While integration is partly enabled by technology, simply hiding the technical complexities inside each domain while enabling easy communication across domain interfaces does nothing to ensure that the data itself is consistent across domains. For example, any omni-channel initiative must, of course, be able to link information about each individual customer across touchpoints and sales channels. Supply chain visibility requires consistent product data linking various points in the supply chain. This kind of consistency across domains is simply impossible without a proactive, highly coordinated, business-driven data governance process. As anyone with experience in enterprise data management knows, the same data domain can exist in multiple systems with different structures, different data values, different definitions, different hierarchies and other groupings, and so on. A few guidelines and a data catalog with forums for data owners to communicate across domains are not sufficient.
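To make the distinction between technical and semantic linking concrete, here is a minimal sketch in which two systems describe the same customer with different identifiers and different channel codes. The system names, the crosswalk, the code mappings, and the `conform` and `customer_view` helpers are all hypothetical; in practice the mappings would be the output of a governance process, not something a tool infers on its own.

```python
from dataclasses import dataclass

# Hypothetical crosswalk agreed through governance:
# (source system, local customer id) -> enterprise customer key.
CUSTOMER_CROSSWALK = {
    ("ecommerce", "web-1001"): "CUST-1",
    ("pos", "000451"): "CUST-1",
    ("pos", "000452"): "CUST-2",
}

# Hypothetical shared definitions for each system's channel codes.
CHANNEL_CODES = {
    ("ecommerce", "W"): "online",
    ("pos", "01"): "store",
}


@dataclass(frozen=True)
class Touchpoint:
    customer_key: str
    channel: str
    amount: float


def conform(system, record):
    """Translate a system-local record into the shared definitions, or return None."""
    key = CUSTOMER_CROSSWALK.get((system, record["customer_id"]))
    channel = CHANNEL_CODES.get((system, record["channel"]))
    if key is None or channel is None:
        return None  # no agreed mapping yet: a stewardship issue, not a technical one
    return Touchpoint(key, channel, float(record["amount"]))


def customer_view(records):
    """Build a cross-channel view per customer and collect records that cannot be linked."""
    view, unmapped = {}, []
    for system, record in records:
        touchpoint = conform(system, record)
        if touchpoint is None:
            unmapped.append((system, record))
        else:
            view.setdefault(touchpoint.customer_key, []).append(touchpoint)
    return view, unmapped


records = [
    ("ecommerce", {"customer_id": "web-1001", "channel": "W", "amount": "25.00"}),
    ("pos", {"customer_id": "000451", "channel": "01", "amount": "40.00"}),
    ("pos", {"customer_id": "999999", "channel": "01", "amount": "5.00"}),
]
view, unmapped = customer_view(records)
```

Both systems are easy to reach technically, yet the online and in-store purchases only roll up to one customer because someone agreed on the crosswalk and the code definitions. The third record, which has no agreed mapping, lands in `unmapped` and needs a steward's decision rather than more plumbing.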
Specialized expertise is needed to successfully leverage data management tools and effectively implement cross-domain data governance. There may come a day when data management tools are so easy to use and so well supported by artificial intelligence that non-specialists need only point the tools at raw data sets and let them loose to infer meaning, define structures, and develop workflows for data transformation and integration. But that day has not yet arrived. And even when it does, no tool will ever be able to finesse the organization to agree to use the same data for a variety of purposes, even when it’s in the wider interest of the organization to do so. Experts in data management must work with data governance and stewardship to help resolve the technical and non-technical issues as they arise during implementation.
It’s not enough to provide guidance and tools to the various domain teams. Qualified professionals must go behind the curtain of each domain to assist in developing data quality processes, data structure design, data reconciliation, and other elements of the architecture in a way that not only works for the near-term application and analytics use cases, but also enables scalability and extensibility, so that new data needs enhance the underlying foundation rather than forcing dramatic rework and redesign as requirements emerge. Doing that well takes specialized training and more than a little experience.
Rather than attempt to compare these principles with what data mesh does or doesn’t promote, I’ll simply point out that, because of its emphasis on decentralization, there is a high risk of abandoning these concepts. Data mesh does, in fact, advocate a highly coordinated, cross-domain governance program, but that is easy to overlook for a variety of reasons. Not least of these is that many organizations see data mesh as an opportunity to avoid the hard work of cross-domain coordination, much as agile methods were often misunderstood and misused to inappropriately jettison timeless program and project management principles.
So if you’re among the many enterprise data professionals intrigued by data mesh, or are beginning to apply data mesh ideas in your organization, be careful to retain the professionalism required to coordinate activities across domains, applying the stitching you’ll need to hold the mesh together.