By Barry Devlin

June 2023

Deduplicate Your Data for Better Governance

What is your biggest data-related problem?  If there were one single behaviour within business or IT that you could change overnight, which one would bring the greatest data management/governance benefits?  How could you immediately reduce ongoing data delivery and maintenance costs?

In my experience, business often has a somewhat dysfunctional approach to “data quality”. A key problem lies in the use of the term data when what’s really meant is information. A focus on data—I often define data as information largely stripped of context—is simply too narrow to adequately address the challenges that business faces as it adopts a digital transformation strategy. However, data is the term that is almost always used. And because data consists of raw numbers and bare text, new copies are quickly and easily made, without any consideration of how that data was originally created, what it actually meant or was designed for, or how the copy will be maintained in the future.

At its most pervasive, we see this dysfunction in the wonderful world of spreadsheets.  Perfectly adequate data in the company’s business intelligence (BI) system is copied into a spreadsheet, manipulated and mangled, pivoted and prodded until new insights emerge.  Of course, this is valid and often valuable and innovative business behaviour.  The problem is what happens next.  The spreadsheet data and calculations are saved for future use.  The copy of the data has become hardened, in terms of structure and often content as well.  Future changes in the BI system, especially in structure and meaning, can instantly invalidate this spreadsheet, downstream copies built upon it and the entire decision-making edifice constructed around them.  And let’s not even mention the effect of an invisible calculation error in a spreadsheet…

Let’s move up a level.  Marketing wants to do the latest gee-whiz analysis of every click-through pattern on the company’s website since 2010.  Vendor X has the solution—a new cloud data warehouse app offering faster query speeds and financed via operational expenditure.  It’s a no-brainer. Marketing is happy with its innovative campaigns, and even finance signs off on the clear return on investment delivered by the new approach.  Except, of course, that this bright, shiny app requires all the existing clickstream data to be copied into the new database and maintained there on an ongoing basis.  Who’s counting the cost of this additional, ongoing management effort?

And did I hear mention of the data lake? How much of the data here has been copied from elsewhere? Let’s not even ask how many (near-)copies or corrupted copies of the same data exist within the data lake… or should that be swamp? By the way, if you’re thinking that the data lakehouse is the solution to that data quagmire, just consider the balance between technology smarts and data management methods in the marketing materials you saw.

It’s easy to blame businesspeople who, driven by passion for business results and unaware of data management implications, simply want to have the information they need in the most useful form possible… now.  IT, of course, would never be guilty of such short-sighted behaviour.  Really?

The truth is that IT departments often behave in exactly the same way.  New applications are built with their own independent databases—to reduce inter-project dependencies, shorten delivery times, and so on—irrespective of whether the same information already exists elsewhere in the IT environment.

Even the widely accepted data warehouse architecture explicitly sanctions data duplication between the enterprise data warehouse (EDW) and dependent data marts, and implicitly assumes that copying (and transforming) data from the operational to the informational environment is the only way to provide decision support. However, allowing duplication is not the same as demanding it. Technology advances since the last millennium may remove the need to make a copy in many cases.

The new data mesh architecture exacerbates the problem. It maintains that data should be managed within business domains and that centralised stores and governance structures get in the way of innovation and change. The result: data is duplicated across business domains with little regard for the inconsistencies that may arise or the costs thus incurred.

In most businesses and IT departments, it doesn’t take much analysis to get a rough estimate of the costs of creating and maintaining these copies of data.  The hardware and software costs, especially in the cloud, may be relatively small in comparison to traditional solutions (although many CFOs are beginning to question the growth curve of these expenses).  In addition, and often neglected or underestimated, the staff costs of finding and analysing data duplicates, tracking down inconsistencies, and firefighting when a discrepancy becomes evident grow far faster than the number of copies, because every new copy can diverge from every existing one.  On the business side there are similar ongoing costs of trying to govern these copies of data, but by far the most telling are the costs of lost opportunities or mistaken decisions when the duplicated data has diverged from the truth of the properly managed and controlled centralised data warehouse.
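As a rough illustration of that arithmetic, here is a minimal back-of-envelope sketch in Python. Every figure in it (number of copies, per-copy storage cost, reconciliation hours, hourly rate) is a hypothetical placeholder to be replaced with your own numbers, not a benchmark; the point is only that reconciliation effort scales with the number of copy pairs rather than the number of copies.

```python
# Back-of-envelope sketch of the annual cost of maintaining data copies.
# All figures are hypothetical placeholders; substitute your own.

copies = 6                       # independent copies of the same core data
storage_cost_per_copy = 2_000    # annual storage/compute per copy
recon_hours_per_pair = 40        # annual hours reconciling any two diverging copies
hourly_rate = 75                 # loaded staff cost per hour

# Each copy can diverge from every other copy, so reconciliation
# effort grows with the number of copy pairs, not the copy count.
pairs = copies * (copies - 1) // 2

storage_cost = copies * storage_cost_per_copy
reconciliation_cost = pairs * recon_hours_per_pair * hourly_rate

print(f"Copies: {copies}, copy pairs to reconcile: {pairs}")
print(f"Annual storage/compute cost: {storage_cost:,}")
print(f"Annual reconciliation cost:  {reconciliation_cost:,}")
print(f"Total:                       {storage_cost + reconciliation_cost:,}")
```

Even with these modest placeholder inputs, the staff-driven reconciliation term dwarfs the storage term, which is precisely why the cloud bill alone understates the real cost of duplication.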

So, if you’d like to reduce some of these costs, here are five behavioural changes you could implement to improve data management/governance and reduce data duplication in your organisation:

  1. Instigate a “lean data” policy across the organisation and educate both business users and IT personnel in its benefits. Although some data duplication is unavoidable, such a policy ensures that the starting point of every solution is the existing data resource.
  2. Revisit your existing data marts with a view to either combining marts with similar content or absorbing marts back into the EDW. Improvements in database performance since the marts were originally defined may enable the same solutions without duplicating data.
  3. Define and implement a new policy regarding ongoing use or reuse of spreadsheets. When the same spreadsheet has been used in a management meeting three times in succession, for example, it should be evaluated by IT for possible incorporation of its function into the standard BI system.
  4. Evaluate new database technologies to see if the additional power they offer could allow a significant decrease in the level of data duplication in the data warehouse environment.
  5. Apply formal governance and management techniques to your data lake, on-premises and/or in the cloud, to discover how the savings from cheap data storage are being more than consumed in subsequent analysis and correction of avoidable data consistency problems. A minimal duplicate-scan sketch follows this list.
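To make the final point a little more concrete, here is a minimal sketch of what a first pass at discovering exact duplicates in a data lake might look like, assuming the lake (or a slice of it) is visible as files under a local or mounted path. The path /data/lake is a hypothetical example, and content hashing only catches byte-identical copies; a real programme would also need catalogue metadata and near-duplicate detection.

```python
# Minimal sketch: group files under a data lake path by content digest,
# so that byte-identical copies can be reported for review.
# The path and the approach are illustrative, not a prescribed tool.

import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: str) -> dict:
    """Map each content digest seen more than once to the list of matching files."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    for digest, paths in find_exact_duplicates("/data/lake").items():
        print(f"{len(paths)} identical copies ({digest[:12]}):")
        for p in paths:
            print(f"  {p}")
```

Exact matching is only the easy end of the problem; the near-copies produced by small transformations are where most of the governance effort, and most of the cost described above, actually goes.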

Deduplicating data is a first and necessary step toward quality information with clear benefits for the organisation. It’s surprising how reluctant many businesses are to take this first step while embarking on a digital transformation strategy that demands information of the highest quality.