By Mike Ferguson

October 2022

Upcoming events by this speaker:

October 17-18, 2022 Online live streaming:
Creating Data Products in a Data Mesh, Data Lake or Lakehouse for use in Analytics

October 19-20, 2022 Online live streaming:
Centralised Data Governance of a Distributed Data Landscape

November 16-17, 2022 Online live streaming:
Data Warehouse Modernisation

November 18, 2022 Online live streaming:
Embedded Analytics, Intelligent Apps & AI Automation

 

Governing Data across a Distributed Data Estate

For many companies, the last decade has been seismic when it comes to digital transformation. We have seen changes both in operational applications and processes and in the world of analytics. In terms of operational systems, there has been a gear change in on-line customer engagement and interaction, with self-service, customer-facing mobile apps becoming the new user interface to core transaction systems. Many companies have also migrated to software-as-a-service (SaaS) transaction processing applications running outside the corporate firewall, and the Internet of Things has emerged in operations and in manufactured products to capture more data. In terms of analytics, it has been a frenzy of activity: new database technologies, the big data era, cloud data warehousing, data science with machine learning and now AI-driven business automation. In that time we have seen a tsunami of data emerge from an ever increasing number of new data sources that businesses want to analyse. This includes human-generated data such as web chat, inbound emails, voice data, images, video and social network data. Machine-generated data is also in demand, such as on-line clickstream data generated as people browse your website, IoT data and infrastructure log data. So, it has gone well beyond traditional structured data in transaction databases. However, even there, the introduction of customer-facing self-service mobile apps and on-line transaction processing has caused transaction rates to skyrocket, and the number of users wanting data has grown rapidly.

The ‘side effect’ of all this is that companies are dealing with a much more complex data estate. Data is being stored in multiple types of data store on-premises, in multiple clouds and at the edge. It could be in Excel spreadsheets, flat files, relational databases, NoSQL databases, Hadoop systems, cloud object storage such as AWS S3, cloud-based relational databases, cloud NoSQL databases, online disk drives (e.g., OneDrive, Google Drive), SharePoint, etc. Transaction data may be held in SaaS applications in many different locations, and data can also be streaming in from edge devices and/or stored in edge databases. The problem with this is that companies now have to manage and govern data across a distributed data estate, which is a huge challenge, especially when the number of regulations and laws around data is on the increase. Figure 1 shows the scale of the challenge.

In addition, the scope has broadened. For many years, data governance was just about data quality. Today it is much more than that. It now includes:

  • Data ownership and stewardship
  • Data security – including data access, data usage, and data loss prevention
  • Data privacy
  • Data quality – including MDM & RDM
  • Data lifecycle management – including data retention
  • Data sharing – including data sovereignty & cross-border data sharing

Also, please don’t underestimate this challenge. The European Union, for example, has already handed out eye-watering financial penalties to organisations failing to comply with data privacy legislation. In Europe, data governance has constantly gone up in priority year after year, to the point that in many companies it is now a higher priority than analytics, and not just for compliance reasons. C-level executives do not want a data breach; the brand damage can be enormous. In addition, the impact of ungoverned data on business performance can be significant, e.g. stopping or delaying a process, delaying decisions, unplanned operational cost and much more. And in a world where data and analytics are now at the heart of business, what happens to machine learning and business intelligence if the data is poor quality? Predictions and BI are impacted; we have all heard of garbage in, garbage out. IT professionals need to understand the business impact of ungoverned data if they are to convince the business why it needs to buy into data governance and change the culture. It is well documented that data culture is a major problem, so if you are struggling to convince business people to participate, find the problems caused by ungoverned data and rank the impact of those problems on business performance. Also ask yourself how many of the following questions you can answer; a short sketch of the kind of catalogue metadata needed to answer some of them follows the list.

  • What are the main systems and data stores in use by your organisation?
  • What data sources exist and are planned?
  • What data exists across your data landscape and where is it stored?
  • What data needs to be governed?
  • How should data be classified to know how to govern it?
  • For structured data, what names is it known by and what names should it be known by?
  • Is the same data stored in different data stores with different data names?
  • How good or bad is data quality and who is responsible for cleaning data?
  • What data is considered sensitive and subject to compliance in each of the countries that you operate in?
  • Where is all the sensitive data located?
  • Is all sensitive data protected across all data and content stores?
  • Do business users know what data is available, and where the data they need is located?
  • Do business users know if data is sensitive?
  • Do business users know if data is poor quality?
  • Do business users know who to ask to gain access to certain data?
  • Who is allowed to access and maintain sensitive data, how is this policed and is access audited?
  • How do you currently prevent data loss through accidental oversharing?
  • What user groups exist in your organisation and who are the users?
  • How long does data have to be kept for?
  • What policies and rules should be applied to what data?
  • How are policies consistently enforced across all the data stores that are in use in your organisation?
  • What sensitive data is currently at risk and where is it located?
  • Who can change policies, who needs to approve changes and are changes audited?
  • How is data usage governed?
  • Do you know who or what is currently using data and for what purpose?
  • Where does data originate and what data is being shared?
  • What transformations have been applied since capture?
  • What tools do you have in place to govern data and do these tools integrate and share metadata?
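
To make some of these questions concrete, here is a minimal sketch, in Python and entirely of my own devising, of the kind of metadata record a data catalogue would need to hold to answer several of them: what a data set should be known as, the physical names and stores it actually lives in, where it is located, who owns it, how sensitive it is and how long it must be kept. All class and field names below are hypothetical, not taken from any particular product.

```python
# Illustrative sketch only: a catalogue-style metadata record for one data set.
# Field names are hypothetical; real catalogues hold far richer metadata.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetRecord:
    business_name: str                 # the name the data *should* be known by
    physical_names: List[str]          # the names it is actually stored under
    data_store: str                    # e.g. "AWS S3", "SaaS CRM", "on-prem Oracle"
    location: str                      # region/country, for sovereignty questions
    owner: Optional[str] = None        # accountable data owner / steward
    sensitivity: str = "unclassified"  # e.g. "public", "internal", "PII"
    retention_class: str = "unknown"   # e.g. "keep-7-years", "delete-on-request"
    known_quality_issues: List[str] = field(default_factory=list)

# Example: the same customer data held under two different physical names
customer = DatasetRecord(
    business_name="Customer",
    physical_names=["CUST_MASTER", "crm_contacts"],
    data_store="on-prem Oracle + SaaS CRM",
    location="EU (Ireland)",
    owner="Head of Customer Operations",
    sensitivity="PII",
    retention_class="keep-7-years",
)
print(customer.business_name, customer.sensitivity, customer.retention_class)
```

If you cannot populate records like this for your main data stores, you cannot yet answer most of the questions above.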

Also, what exactly do you need to do to govern data across a distributed data estate? In my opinion, we need the following capabilities:

  • Automated data discovery and cataloguing
  • Data classification to automatically classify data and content so you know how to govern it (a brief sketch of this follows the list)
  • Centralised definition of policies to specify how classified data and content should be governed
  • Enforce policies across the distributed data landscape
  • Continually monitor, report and act to keep data governed
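
To illustrate the classification capability, here is a minimal sketch, assuming a simple pattern-based approach, of how automated classification might tag a column as sensitive by sampling its values. Commercial tools use machine learning and far richer, country-specific detectors; the patterns, labels and threshold below are illustrative assumptions only.

```python
# Sketch of pattern-based column classification over sampled values.
# Patterns, labels and the 80% threshold are illustrative assumptions.
import re
from typing import Dict, List

CLASSIFIERS = {
    "EMAIL":       re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "IBAN":        re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{10,30}$"),
    "UK_NINO":     re.compile(r"^[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]$"),
    "CREDIT_CARD": re.compile(r"^\d{13,19}$"),
}

def classify_column(sample_values: List[str], threshold: float = 0.8) -> List[str]:
    """Return the labels whose pattern matches at least `threshold` of the sampled values."""
    labels = []
    non_empty = [v for v in sample_values if v]
    if not non_empty:
        return labels
    for label, pattern in CLASSIFIERS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels

# Example: sampled values from two columns found during automated discovery
columns: Dict[str, List[str]] = {
    "contact_email": ["jo@example.com", "amy@example.org", ""],
    "order_total":   ["19.99", "250.00", "7.50"],
}
for name, sample in columns.items():
    print(name, "->", classify_column(sample) or ["unclassified"])
```

The point of automating this is scale: a scanner can apply detectors like these across millions of columns and files, which no team of stewards can do by hand.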

One thing is clear: you can no longer do this manually. There is too much data, there are too many data stores and files (often millions of them), and people are not prepared to take this challenge on without the help of AI automation built into the tools they are using. I would also say that data classification and data policy enforcement are the hardest to get right across so many types of data store and application in a distributed, multi-cloud data estate. Please note also that there is a difference between automated data discovery and automated data classification; they are not the same. Buying a data catalog that only automates data discovery is not enough. You need automatic classification of the different types of country-specific sensitive data. You also need to define and use a data confidentiality classification scheme and a data retention classification scheme, and you need a number of key technologies that must integrate and work together.
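
To show what centralised policy definition over classified data can look like, here is a minimal sketch, under my own assumptions, of policies defined once against confidentiality and retention classifications and then applied the same way wherever the data lives. In practice, enforcement is pushed down to each platform (masking views, IAM rules, lifecycle rules); here it is only simulated, and all names and values are hypothetical.

```python
# Sketch: policies keyed on classification tags, defined centrally.
# Roles, masking actions and retention periods are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Policy:
    classification: str   # tag produced by automated classification
    allowed_roles: set    # who may see the raw values
    masking: str          # action applied for everyone else
    retention_days: int   # how long the data may be kept

# One central policy definition, applied the same way everywhere
POLICIES = {
    "PII":    Policy("PII",    {"data_steward", "dpo"}, "mask", 365 * 7),
    "PUBLIC": Policy("PUBLIC", {"*"},                   "none", 365 * 2),
}

def enforce(value: str, classification: str, role: str) -> str:
    """Apply the central policy for this classification to one value."""
    policy = POLICIES.get(classification)
    if policy is None:
        return "<blocked: unclassified data>"
    if "*" in policy.allowed_roles or role in policy.allowed_roles:
        return value
    return "*****" if policy.masking == "mask" else value

print(enforce("jo@example.com", "PII", "analyst"))       # masked
print(enforce("jo@example.com", "PII", "data_steward"))  # visible
print(enforce("weekly revenue", "PUBLIC", "analyst"))    # visible
```

The design point is that the policy is keyed on the classification, not on the data store, so adding another store to the estate does not mean rewriting the rules.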

If you are tasked with undertaking this challenge, please join me in my 2-day Centralised Data Governance of a Distributed Data Landscape course with Technology Transfer in October to find out how to tackle and solve this problem.