By Barry Devlin

April 2021

The Dilemma of Data Digestion

Most of us are familiar with claims made by vendors of business intelligence (BI) and analytics tools that some disappointingly low percentage—often 10% or less—of businesspeople are actually using data in their decision making. The solution proposed is, unsurprisingly, to be found in product X, and more generally in the adoption of self-service BI. The more experienced (read “older”) of us recall that BI vendors have been using the same argument and figures for some thirty years.

Why has the problem persisted? New analytic techniques and improved BI tools haven’t really helped. The problem, I believe, lies between the data and the people: very few people can truly digest data and extract valid, nutritious information from it. Many examples exist, but the use of data in the coronavirus pandemic illustrates the problem well.

Using Data in the Pandemic

Since the beginning of the pandemic, we have been inundated with data of all shapes and sizes, upon which decisions are based at every level, from personal to societal. The consequences are of enormous significance: economic, health, life or death.

The diagram below shows the daily number of COVID-19 infections recorded in Italy since February 2020. This picture, as well as the raw and summary data behind it, is widely used by the media and government.

The basic story it tells is that cases in October and November significantly exceeded the peaks seen in March. For many, that spring period is remembered as the crowning horror of the pandemic, indelibly associated with death tolls peaking at some 800 per day. Through August and September, rising case numbers were often presented as a warning to the populace to take precautions against a rising tide of infection. Only in early October, when cases began to rapidly overtake the spring peak, did reporting begin to emphasize that early testing had been largely limited to strongly symptomatic people in hospital, whereas autumn testing was much more widespread. In short, the case numbers pre- and post-summer cannot be directly compared. Thousands of such graphs from across the world demonstrate the point: data may not lie (although it may be erroneous), but the information may be misinterpreted, both deliberately and inadvertently, and often is.
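To see why raw counts mislead when testing volume changes, consider test positivity: the share of tests that come back positive. A minimal Python sketch, using hypothetical figures rather than real Italian data, shows how two very different case counts can reflect similar underlying infection pressure:

```python
# Hypothetical figures for illustration only; not real Italian data.
# Raw case counts are incomparable when testing volume changes;
# test positivity gives a fairer like-for-like view.

daily_cases = {"2020-03-21": 6_500, "2020-11-13": 40_000}
daily_tests = {"2020-03-21": 26_000, "2020-11-13": 250_000}

for day, cases in daily_cases.items():
    positivity = cases / daily_tests[day]
    print(f"{day}: {cases:>6,} cases, {positivity:.0%} of tests positive")
```

In this sketch, the autumn day records six times more cases, yet positivity is actually lower, because nearly ten times as many tests were performed.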

As vaccines have rolled out and testing has become more varied, a new array of numbers and statistical concepts has appeared in the news. In some countries, much governmental faith has been placed in lateral flow tests, which are less costly, avoid laboratory analysis, and return results rapidly. However, experts have raised concerns about their effectiveness in comparison to the standard polymerase chain reaction (PCR) tests. To understand the argument, you need to distinguish between false negative and false positive rates, as well as sensitivity and specificity. The fear is that these tests give people who test negative false reassurance: infectious people who should be self-isolating instead circulate because they misinterpret the data. Indeed, the British government stands accused of misinterpreting some of the data for political ends.
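The distinction matters more than it might seem. A short sketch, with assumed figures (not official statistics for any particular test), shows how sensitivity, specificity, and the false negative rate relate, and why a negative result can be weaker reassurance than it appears:

```python
# Assumed illustrative figures; not official statistics for any test.
# Sensitivity = P(test positive | infected)
# Specificity = P(test negative | not infected)

sensitivity = 0.70   # assumed: lateral flow tests miss some infections
specificity = 0.997  # assumed: false positives are rare
prevalence  = 0.02   # assumed share of the tested population infected

false_negative_rate = 1 - sensitivity  # infected people the test clears

# Of all negative results, what fraction are actually infected?
p_negative = (prevalence * false_negative_rate
              + (1 - prevalence) * specificity)
infected_among_negatives = prevalence * false_negative_rate / p_negative

print(f"False negative rate: {false_negative_rate:.0%}")
print(f"Infected share among negatives: {infected_among_negatives:.2%}")
```

Under these assumptions, three in ten infected people receive a negative result and walk away reassured, which is precisely the misinterpretation that worries the experts.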

The dilemma is that most “ordinary” people cannot digest data and end up with misinformation. Is the situation any better in business? Two examples suggest it is not.

Using Data in Business

A decade-old white paper from data analytics firm Mu Sigma identifies “5 Common Mistakes People Make in the Name of Statistical Analysis”:

  1. Sophistication in statistics compensates for lack of data and/or business understanding
  2. Extracting meaning out of randomness
  3. Correlation versus causation—modelling will help uncover causal relationships
  4. Extrapolating the models way beyond the permissible limits
  5. Imputing missing values with mean or median is the best way of treating missing values

Important and relevant though they remain, these are rather sophisticated mistakes related to the application of statistics to business. As the authors point out, “Some of these mistakes stem from incomplete understanding of statistics, some from the incomplete understanding of underlying business and the rest from the inability to marry the two together. With the advent of data analytics and decision sciences, our decisions are being increasingly impacted by these errors, which can result in major implications for our business and therefore the need for the business executives to appreciate, sense and avoid these common pitfalls.” The immense growth in the use of analytic methods since 2011 has made these and similar errors even more consequential. The second mistake in particular is easy to demonstrate, as the sketch below shows.
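Here is a minimal sketch of extracting meaning out of randomness: compare enough unrelated random series and an impressive correlation appears by chance alone. The number of series, their length, and the seed are arbitrary choices for illustration.

```python
# Mistake 2: "extracting meaning out of randomness".
# Scan many pairs of pure-noise series; the best-looking
# correlation can easily pass for a genuine relationship.

import random

random.seed(42)
n_series, length = 50, 20
series = [[random.gauss(0, 1) for _ in range(length)]
          for _ in range(n_series)]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Find the most strongly correlated pair among all 1,225 pairs.
best = max(
    ((i, j, correlation(series[i], series[j]))
     for i in range(n_series) for j in range(i + 1, n_series)),
    key=lambda t: abs(t[2]),
)
print(f"Series {best[0]} and {best[1]} correlate at r = {best[2]:.2f}, "
      "despite both being pure noise")
```

With fifty short noise series, the strongest pairwise correlation typically lands well above 0.6, a value many would be tempted to read as a real relationship.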

The Business Application Research Center’s (BARC) 2017 article “Self-Service BI: An Overview” describes the benefit of self-service as being “to improve agility and flexibility in business departments by increasing user independence from IT departments. At the same time, IT’s workload for simple tasks is reduced, enabling resource-drained IT departments to focus on tasks with a higher value add for their organization.” On the other hand, the article notes significant potential impacts on data governance as data inconsistencies, analytic errors, and data silos proliferate.

Data Intolerance Is the Problem

While BI vendors and experts contend that more people should be using data, a more fundamental question is how many businesspeople could (that is, are able to) use data to support decision making via analytic tools or self-service BI. In my experience, the number is smaller than most of us might think or hope.

We hear a lot today about food intolerances. Some people cannot consume wheat, others dairy. A similar concept applies to data. For many, data leads to mental confusion, resulting in loss of information or misinterpretation of meaning. So, is there a way to reduce data intolerance, to improve data digestion?

The standard remedy proposed is metadata. Data catalogs have seen a recent revival in interest as data lakes have proliferated and poorly defined, poorly managed data has become common. Context-setting information (CSI) is a more meaningful name for metadata, as I described in Business unIntelligence. Putting a strong emphasis on creating CSI and providing it to businesspeople is certainly a first step toward tackling data indigestion. However, it is not a complete cure.

Martijn ten Napel, a Dutch BI architect, believes that BI project failures result from a lack of coherence between people, process, information, and technology. His answer is the connected architecture, a framework for the organisation of DW and BI projects. He contends that we must clearly distinguish between information consumers and producers. Information consumers are businesspeople who turn information, not data, into activities, decisions, and ultimately value. Information producers, from ETL specialists to data scientists, prepare data and refine it into information. Sitting between business and IT, information producers must provide a two-way bridge that maximises contextual information transfer and minimises loss or misunderstanding of information.

Information producers must write the prescription that cures data indigestion: a descriptive and understandable story or commentary that accompanies every piece of information used by consumers and explains how and why data and information can be interconverted and interpreted only in certain logical and meaningful ways. Creating such information commentary is neither simple nor cost-free. However, it is a necessary investment to improve data digestion and increase information value across the business.
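What might that prescription look like in practice? Here is a minimal sketch (the class and field names are hypothetical, not drawn from the connected architecture itself) of information that always travels with its context-setting commentary, so that consumers never receive bare data:

```python
# A sketch of information packaged with mandatory commentary.
# All names here are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass(frozen=True)
class ContextSettingInfo:
    source: str             # where the data came from
    collection_notes: str   # how it was gathered, known biases
    valid_comparisons: str  # what it may and may not be compared with

@dataclass(frozen=True)
class InformationPackage:
    name: str
    data: list[float]
    context: ContextSettingInfo  # commentary is mandatory, not optional

cases = InformationPackage(
    name="Daily COVID-19 cases, Italy",
    data=[6_500.0, 40_000.0],  # hypothetical values
    context=ContextSettingInfo(
        source="National health ministry feed",
        collection_notes="Spring figures reflect hospital-only testing; "
                         "autumn figures reflect mass community testing.",
        valid_comparisons="Compare test positivity across periods, "
                          "not raw counts.",
    ),
)
```

The design choice worth noting is the mandatory context field: an information package without its commentary simply cannot be constructed, which is the structural equivalent of insisting that the story always sits with the information.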