By Mike Ferguson

April 2023

Upcoming events by this speaker:

April 26-27, 2023 Online live streaming:
Practical Guidelines for Implementing a Data Mesh

April 28, 2023 Online live streaming:
Migrating to a Cloud Data Warehouse

May 15-16, 2023 Online live streaming:
Data Warehouse Modernisation

May 17, 2023 Online live streaming:
Embedded Analytics, Intelligent Apps & AI Automation

June 22-23, 2023 Online live streaming:
Centralised Data Governance of a Distributed Data Landscape


Data Observability – What Is It and Why Should You Care?

Over the last year or so, many of you have probably noticed the term data observability emerging, with a wave of new start-ups appearing in the marketplace to sell software that does it. Before I explain what data observability is and why it is needed, let me first take you back in time to a real-life situation that happened to me many years ago with a data pipeline in a centralised computing environment, and contrast it with the data landscape and practices of today. The purpose is to highlight why data observability is needed.

Back in the 1980s I was running a batch data pipeline on a mainframe. The pipeline loaded data into a database every night, processed it and updated hundreds of tables in an application database. As soon as that finished, it took the data just updated and applied further changes to a second application database. The pipeline had been running in production for about 10 months with no problems whatsoever. Then one morning I came into the office early to find the helpdesk phones alive with calls from users across different outlets in the company complaining that the data was all wrong. So I looked at the batch run, and everything showed that it had run successfully. Strange!

Further investigation revealed that two databases were corrupted, but nothing to explain why. Data had passed through a perfectly good working pipeline that had been in production for almost a year. There were no errors, yet two databases were compromised along the way. The impact was that multiple systems were down for almost a week (data downtime), causing a major unplanned outage, with business users having to switch to manual processes while we tried to fix the problem. With huge pressure on us, it took a team of several people three days just to find the root cause, and the recovery process took a further 72 hours of several people working non-stop, day and night, to fix it. In the end, it was all recovered and it never happened again. Bear in mind that this happened with all data and all applications running on one central mainframe computer.

Now contrast that with today where most companies are running their business in a hybrid distributed computing environment. Data and applications are not in one place. Data typically resides in multiple types of data store on-premises, in multiple clouds, in software-as-a-service (SaaS) applications and may also be streaming from edge devices.

Therefore, just to enable operational business processes to execute, data could be flowing from SaaS applications to applications on public cloud, from applications on public cloud to SaaS applications, from SaaS applications to on-premises applications, from on-premises applications to SaaS applications, cloud to on-premises, on-premises to cloud, cloud to cloud and on-premises to on-premises. It is fair to say, I think, that the complexity involved in data flow pipelines today is much greater than it was 30+ years ago. And that is just for operational systems.

In the world of analytics, complexity has also grown. Many companies now have multiple types of analytical system, some on-premises and some in the cloud, including data warehouses, data lakes in cloud storage or Hadoop, NoSQL graph databases and streaming analytics. We have data engineering pipelines running in all these environments, ingesting data from source systems and data stores across a complex, distributed data landscape and integrating it for use in different analytical systems. And many new data sources are still emerging.

In addition, data mesh has now emerged. It champions the democratisation of data engineering pipeline development to overcome the bottleneck in IT, where professional data engineers are being outpaced by business demand for data. The momentum behind data mesh is significant, and those adopting it will see an increase in the number of data engineers around the enterprise and in the number of data pipelines being built to clean and integrate data for analytics. That is because data mesh supports the idea that domain-based ‘citizen data engineers’ develop data pipelines to create data products that can be published, consumed, and used by others. Those consumers may in turn develop data pipelines that combine this data with other data in their part of the business to create new data products to add to the data mesh. Therefore, for those choosing to build a data mesh, it is highly likely that many pipeline dependencies will emerge, where several data pipelines ultimately depend on other data pipelines. This potentially adds further complexity.
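
As a simple illustration of why those dependencies matter, the sketch below (in Python, with entirely invented data product names) represents data products and the pipelines that consume them as a small dependency graph and works out everything downstream of a single failure. It is an assumption-laden illustration, not a prescribed data mesh implementation; a real data mesh would derive this from its catalogue and lineage metadata.

    # A minimal sketch: data products and the pipelines that consume them
    # represented as a dependency graph, so the downstream impact of one
    # failing pipeline can be computed. Product names are illustrative only.
    from collections import deque

    # data product -> the data products it consumes
    DEPENDS_ON = {
        "crm_customers": [],
        "web_events": [],
        "customer_360": ["crm_customers", "web_events"],
        "churn_features": ["customer_360"],
        "churn_scores": ["churn_features"],
    }

    def downstream_of(failed_product: str) -> set:
        """Return every data product that directly or indirectly consumes
        the failed product, i.e. everything impacted by its failure."""
        # Invert the graph: product -> products that consume it.
        consumers = {name: [] for name in DEPENDS_ON}
        for product, inputs in DEPENDS_ON.items():
            for upstream in inputs:
                consumers[upstream].append(product)

        impacted, queue = set(), deque([failed_product])
        while queue:
            for consumer in consumers.get(queue.popleft(), []):
                if consumer not in impacted:
                    impacted.add(consumer)
                    queue.append(consumer)
        return impacted

    if __name__ == "__main__":
        # A failure in customer_360 impacts churn_features and churn_scores.
        print(downstream_of("customer_360"))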

Furthermore, decentralising data engineering means that different citizen data engineers could potentially choose to develop data pipelines using many different tools including:

  • ETL tools
  • Self-service data preparation tools
  • Drag and drop data mining tools
  • Data virtualisation tools
  • Data transformation tools
  • Data fabric software
  • DataOps orchestration tools that orchestrate pipelines across multiple tools
  • Data wrangling tools in data science workbenches
  • Data Science Notebooks e.g., Jupyter notebooks
  • Custom-built SQL scripts
  • Custom code in other programming languages

Also, the emergence of DataOps has driven a trend towards collaborative, component-based development, where data pipelines today can involve multiple technologies and invoke component pipelines built in other best-of-breed tools.

What has all this got to do with data observability? To summarise, the situation today is that we are dealing with an increasingly distributed data landscape, with data redundancy growing across different data stores. There is growth in the number of data engineers and in the number of pipelines. Collaborative, component-based pipeline development is now happening, with multiple ‘citizen’ and IT data engineers potentially building pipelines and pipeline components using multiple tools. We also now have data products, data marketplaces and data sharing, together with other pipelines consuming data products and creating pipeline dependencies.

Now, thinking back to the problem that happened to me in the 1980s on a single system, ask yourself: “What happens if a pipeline built collaboratively using multiple tools and running across multiple execution engines works, but the data is corrupted?” or “What happens if a pipeline fails because of the data it is processing?” How do you know what the problem is or where to look in the pipeline? How would you find it? What components were run? What code was used? Where is the data? How was the data created? Is there any intermediate data? If so, where is it? How do you explain the root cause of the problem?

If you have to do this manually in today’s significantly more complex environment, it could take weeks to look at the code and track back through the pipeline step by step, examining the data going into and out of each step. You have to understand the pipelines, the pipeline components and who built them, possibly multiple tools, multiple execution engines and pipeline dependencies, then look beyond that at data ingestion, the data flowing through the pipeline, and finally at raw data across a distributed data estate. You would have to keep tracking back until you find the problem. And the more data you have, the greater the risk of bad data breaking otherwise perfectly good data pipelines.

What you really need is software that constantly observes what is happening as data pipelines execute: software that can automatically collect metadata as your pipelines run, produce metrics, automatically detect or predict pipeline incidents, use, report on and visualise lineage to pinpoint problems, set rules, monitor them, and automatically act when needed. You would also like it to be self-learning.
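
To make that a little more concrete, below is a minimal Python sketch of the kind of execution signals such software collects for each pipeline task. The names observe_step, emit_event and load_customer_orders are hypothetical; commercial data observability tools typically gather this sort of metadata automatically from logs, query history and orchestrator APIs rather than requiring hand-written instrumentation.

    # A minimal sketch of instrumenting one pipeline step to emit observability
    # signals (duration, row count, success/failure). The names observe_step,
    # emit_event and load_customer_orders are hypothetical.
    import functools
    import json
    import time
    from datetime import datetime, timezone

    def emit_event(event: dict) -> None:
        """Ship an execution signal to wherever observability data is kept.
        Here it is just printed; a real system would send it to a metadata store."""
        print(json.dumps(event))

    def observe_step(step_name: str):
        """Decorator that records duration, output row count and failures
        for a single pipeline task."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                started = time.time()
                event = {
                    "step": step_name,
                    "started_at": datetime.now(timezone.utc).isoformat(),
                }
                try:
                    result = func(*args, **kwargs)
                    event["status"] = "success"
                    if isinstance(result, list):  # simple volume metric
                        event["row_count"] = len(result)
                    return result
                except Exception as exc:
                    event["status"] = "failed"
                    event["error"] = str(exc)
                    raise
                finally:
                    event["duration_seconds"] = round(time.time() - started, 3)
                    emit_event(event)
            return wrapper
        return decorator

    @observe_step("load_customer_orders")
    def load_customer_orders() -> list:
        # Stand-in for a real extract/transform task.
        return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 75.5}]

    if __name__ == "__main__":
        load_customer_orders()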

Data observability is the ability to continuously monitor the health and usage of your data, and the health of your data pipelines, by collecting, consolidating, and analysing signals from multiple layers of technology during pipeline execution. It enables changes and incidents to be detected, tickets to be raised, alerts issued, and actions to be taken to ensure that pipelines become more resilient in design, improve in performance, run at optimal cost, and produce data that is trustworthy, compliant and timely, with good provenance. It is part of data governance and is vital to data mesh implementations.
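
Building on the kind of execution events sketched above, here is a minimal illustration of how consolidated signals might be evaluated against monitoring rules so that alerts or tickets can be raised. The rules, thresholds and event fields are assumptions for illustration only, not the behaviour of any particular product.

    # A minimal sketch of evaluating monitoring rules against pipeline-run
    # events and raising an alert when a rule is violated. The rules,
    # thresholds and event fields are assumptions for illustration.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        name: str
        violated: Callable[[dict], bool]  # True when the event breaks the rule

    RULES = [
        Rule("pipeline run must succeed", lambda e: e.get("status") == "failed"),
        Rule("run must finish within 30 minutes", lambda e: e.get("duration_seconds", 0) > 1800),
        Rule("output must not be empty", lambda e: e.get("row_count", 1) == 0),
    ]

    def raise_alert(rule: Rule, event: dict) -> None:
        # Stand-in for creating a ticket, paging someone or pausing downstream runs.
        print(f"ALERT: '{rule.name}' violated by step '{event.get('step')}'")

    def evaluate(event: dict) -> None:
        for rule in RULES:
            if rule.violated(event):
                raise_alert(rule, event)

    if __name__ == "__main__":
        evaluate({"step": "load_customer_orders", "status": "failed", "duration_seconds": 2100})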

By collecting pipeline execution signals and looking at the data itself, data observability software can observe all pipelines, pipeline runs, pipeline tasks, pipeline performance, job latency and success/failure trends. It can look at pipeline lineage, pipeline failures, the monitoring rules created and the rules violated. It can show the datasets and other pipelines affected by failures. It can monitor data freshness, schema changes, row counts and data quality, including data completeness (e.g., nulls), data duplication, data values versus expected values, anomalies, unprotected sensitive data processing and other issues.
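
As a rough illustration of the data-level checks listed above, the following sketch applies simple volume, schema-drift, completeness and duplication tests to one dataset produced by a pipeline run. The column names, thresholds and sample data are purely illustrative assumptions.

    # A minimal sketch of the data-level checks mentioned above: row counts,
    # schema drift, nulls and duplicates applied to one dataset produced by a
    # pipeline run. Column names, thresholds and sample data are illustrative.
    from collections import Counter

    EXPECTED_SCHEMA = {"order_id", "customer_id", "amount"}
    EXPECTED_MIN_ROWS = 2  # e.g. learned from previous runs

    def check_dataset(rows: list) -> list:
        """Return a list of human-readable data quality violations."""
        violations = []

        # Volume: did this run produce far fewer rows than usual?
        if len(rows) < EXPECTED_MIN_ROWS:
            violations.append(f"row count {len(rows)} below expected minimum {EXPECTED_MIN_ROWS}")

        # Schema drift: columns added or removed versus what consumers expect.
        observed = set().union(*(row.keys() for row in rows)) if rows else set()
        if observed != EXPECTED_SCHEMA:
            violations.append(f"schema changed: expected {sorted(EXPECTED_SCHEMA)}, got {sorted(observed)}")

        # Completeness: null (None) values in required columns.
        for column in EXPECTED_SCHEMA:
            nulls = sum(1 for row in rows if row.get(column) is None)
            if nulls:
                violations.append(f"{nulls} null value(s) in column '{column}'")

        # Duplication: the same business key appearing more than once.
        duplicates = [k for k, c in Counter(r.get("order_id") for r in rows).items() if c > 1]
        if duplicates:
            violations.append(f"duplicate order_id values: {duplicates}")

        return violations

    if __name__ == "__main__":
        sample = [
            {"order_id": 1, "customer_id": "C-17", "amount": 120.0},
            {"order_id": 1, "customer_id": None, "amount": 75.5},  # duplicate key, null customer
        ]
        for violation in check_dataset(sample):
            print("VIOLATION:", violation)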

I will be discussing this and much more in my classes for Technology Transfer on “Centralised Data Governance of a Distributed Data Landscape” and “Practical Guidelines for Implementing a Data Mesh”. Please join me if you can.