Many large organizations are discovering (or in most cases rediscovering) the need for a “single source of truth” to provide consistent and trustworthy data across the enterprise. As an enabler of this goal, the term “data warehousing” has made a comeback. But there is some skepticism. Many experienced practitioners believe it’s impractical to organize all the core data for a large, complex, modern enterprise into one big database.
They’re usually right.
There is still no better way to get good performance for read-intensive data access than by integrating high-volume, frequently joined data in the same physical database, where the optimizer can use statistics about the data, relationships, and physical storage to make efficient data access and manipulation decisions. But that doesn’t mean all the core data in an enterprise must be physically stored together. Instead, the level of integration should match the needs of the business.
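To make the statistics point concrete: most relational engines gather statistics on colocated tables and feed them to the query planner. A minimal sketch using Python’s built-in sqlite3 module (the table and column names are hypothetical):

```python
import sqlite3

# In-memory database holding two colocated, frequently joined tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER);
    CREATE INDEX idx_sales_product ON sales(product_id);
""")
con.executemany("INSERT INTO products VALUES (?, ?)",
                [(i, f"product-{i}") for i in range(100)])
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(i, i % 100) for i in range(1000)])

# ANALYZE collects statistics about tables and indexes. Because both
# tables live in the same physical database, the optimizer can use
# those statistics to pick an efficient join order and access path.
con.execute("ANALYZE")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT p.name, COUNT(*)
    FROM sales s JOIN products p ON s.product_id = p.product_id
    GROUP BY p.name
""").fetchall()
for row in plan:
    print(row)
```

When the tables live in separate databases, no single optimizer sees statistics for both sides of the join, which is exactly why colocating frequently joined data performs so well.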
At the retailer where I was responsible for enterprise data management for a decade or so, there were essentially two major business units – grocery and pharmacy. Although each retail location for both units was physically located within the same larger supermarket, they were managed as semi-independent entities. There was a different management structure, a different set of products, different inventory management systems and processes, different job roles, and so on. And, most important, almost all the analyses that were done or would be needed for the foreseeable future were focused within the boundaries of each business unit.
I was fortunate at that time to have the advice of a trusted consultant who suggested that we intentionally build out separate data warehouses for each business unit. As a data architect with a tendency to want to normalize everything I saw, I initially resisted this advice. Sure, grocery and pharmacy may have had a lot of differences, but both had products, customers, sales, inventory, shipments, receipts – drawing two separate data models just didn’t seem right. But I eventually came to accept that it’s the applications and analytics that drive the need for physical data integration, not the logical similarity of the data domains. There were a few analytic needs here and there that crossed boundaries – like understanding the total profitability of baskets that contain pharmacy items, and of course summarized financial reporting – but these were exceptions, and the requirements could be satisfied by linking data across data stores for targeted sets of use cases.
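The cross-store linking described above can be as simple as joining extracts from each warehouse on a shared key. A sketch in Python for the basket-profitability example, with hypothetical schemas and a shared basket_id as the link:

```python
import sqlite3

# Two separate physical stores, one per business unit (schemas are
# invented for illustration; each unit keeps its own warehouse).
grocery = sqlite3.connect(":memory:")
grocery.execute("CREATE TABLE basket_lines (basket_id INT, profit REAL)")
grocery.executemany("INSERT INTO basket_lines VALUES (?, ?)",
                    [(1, 2.50), (1, 1.25), (2, 3.00)])

pharmacy = sqlite3.connect(":memory:")
pharmacy.execute("CREATE TABLE rx_lines (basket_id INT, profit REAL)")
pharmacy.executemany("INSERT INTO rx_lines VALUES (?, ?)", [(1, 8.00)])

# Targeted use case: total profitability of baskets that contain a
# pharmacy item. Link the two stores on the shared basket_id key.
rx_baskets = {b for (b,) in pharmacy.execute(
    "SELECT DISTINCT basket_id FROM rx_lines")}

totals: dict[int, float] = {}
for basket_id, profit in grocery.execute(
        "SELECT basket_id, profit FROM basket_lines"):
    if basket_id in rx_baskets:  # only baskets with a pharmacy item
        totals[basket_id] = totals.get(basket_id, 0.0) + profit
for basket_id, profit in pharmacy.execute(
        "SELECT basket_id, profit FROM rx_lines"):
    totals[basket_id] = totals.get(basket_id, 0.0) + profit

print(totals)  # basket 1 combines grocery and pharmacy profit
```

The point is architectural, not the mechanics: a handful of targeted links like this serve the exceptional cross-unit analyses without forcing the two warehouses into one physical model.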
Since that time, I’ve offered the same advice to clients and encountered the same initial resistance. I carefully explain what will happen if they attempt to create one big database for all core data across business units.
First, the business units will each have to wait in a single-file line for the data they want. If you have only one team focused on all core data, the team will have to unnecessarily balance the priorities across units. Second, you’ll create excessive complexity in data structures and processes – more data partitioning, more conditional logic, more parameterization – simply to have a deeply integrated model that isn’t needed by the business. The resulting complexity makes both development and maintenance more difficult.
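The conditional-logic creep mentioned above is easy to picture. A contrived sketch of one “integrated” pipeline forced to branch on business unit (unit names and fields are hypothetical):

```python
# Anti-pattern sketch: a single loader for a deeply integrated model
# ends up branching on business unit at every step.
def load_inventory_row(unit: str, row: dict) -> dict:
    if unit == "grocery":
        key = ("store", "upc")            # grocery keys items on UPC
        qty = row["case_count"] * row["units_per_case"]
    elif unit == "pharmacy":
        key = ("store", "ndc")            # pharmacy keys items on NDC
        qty = row["unit_count"]           # no case packs in pharmacy
    else:
        raise ValueError(f"unknown unit: {unit}")
    return {"key_fields": key, "quantity": qty, "unit": unit}

print(load_inventory_row("grocery",
                         {"case_count": 2, "units_per_case": 12}))
```

Every new unit-specific rule lands another branch in code like this; two separate, simpler loaders would have no conditional logic at all.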
Here are some questions to ask when determining the boundaries of individual databases within a data and analytics ecosystem:
- Which, if any, upcoming business initiatives will require data to be integrated across business units and data domains? (Think of initiatives like omni-channel customer experience or end-to-end supply chain optimization.)
- Which diverse business initiatives and applications will need access to the same (not just logically similar) data, and therefore benefit from data reuse?
- If the company has made or is likely to make acquisitions, will the goal be to seamlessly integrate the acquisitions into the acquiring enterprise, or will the acquired companies remain semi-independent brands?
- Are there other natural divisions such as country, product line, or customer segment? Do these divisions drive independent technology roadmaps with little or no use for data sharing?
- Which select data domains, if integrated globally, would provide maximum flexibility for changes to the organizational structure over time, while still quickly supporting short- and medium-term structures?
By asking these questions, you’ll envision and plan a much more pragmatic architecture than by simply assuming you should implement one tightly integrated database for all needs.
But be careful not to over-correct. Accepting that one giant database might not make sense for your organization isn’t an excuse to abandon thoughtful architecture altogether. There are always political challenges to address when promoting the use of integrated data at any level, and it can be tempting to succumb to those challenges by allowing countless data weeds to grow wherever they will, with all the familiar issues that result – inconsistent metrics, excessive time spent curating the same data over and over again, difficulty protecting sensitive data, and poor performance when attempting to join data across domains. Besides, if you promote data integration at the appropriate level by explaining how it will benefit multiple, specific business initiatives, rather than pitching the idea of enterprise data integration for its own sake, you’ll find the political resistance much easier to overcome.
And, after examining the drivers, you may in fact decide that one global data warehouse – built incrementally in small scope projects, driven by targeted business initiatives – is the right approach. But that decision should be made after careful analysis, not by default.
It’s always been a good idea to consider the natural boundaries within an enterprise and plan data deployment according to those boundaries. Now, with the dramatic increase in data sources, volume, data types, business complexity, and so on, it’s more important than ever.