By Barry Devlin

May 2020

AI Requires Data Perfection

AI requires “perfect” data to lessen the chances of faulty AI outcomes. Perfect data is, of course, impossible. But improved data quality via good data management is achievable.

Over the past few years, artificial intelligence (AI)—also known as machine learning (ML), cognitive computing, etc.—has become the Holy Grail of almost every business. According to the hype, it will revolutionise sales with advanced segmentation for customer acquisition, retention, and up-selling. It will optimise production systems and supply chains by anticipating problems and opportunities. It will drive trucks and fly planes. It will converse meaningfully with people, replacing many support staff. It will diagnose disease and provide personalised treatments and drugs.

The list goes on and on and, in many cases, early applications of AI in these areas are already in the market. The race is on. According to industry experts, if you don’t adopt AI aggressively, you will be left behind in a digitally transformed market where the winners take all.

I beg to differ. It is vital to take a few steps back to see the bigger picture, to see where AI has come from, how it truly works and where it doesn’t, and why it depends entirely on a topic that motivates all builders of business intelligence (BI), data warehouses and lakes—data management.

AI is the evolutionary end game of BI

Meredith Broussard’s “Artificial Unintelligence: How Computers Misunderstand the World” offers a primer on “real” AI. Somewhat like my similarly named “Business unIntelligence,” her book offers a more nuanced alternative to the view found in standard texts. Her message is that “technochauvinism” (“the belief that technology is always the solution”) leads many otherwise thoughtful people to “assume that computers always get it right.” Of particular interest here is her oft-stated conclusion that AI is completely data-driven. The outcome of any AI process is wholly dependent on the relevance, completeness and cleanliness of the data used to train it and on which it subsequently operates.

Such a conclusion should not surprise anyone who comes from a data warehousing / data lake / BI background. We have known since the earliest days of BI that data quality is key to valid outcomes.

AI is simply the last phase in the evolution of BI. This relationship is less obvious because the hyped edge cases seem very far from traditional BI. However, consider autonomous vehicles. AI’s role is entirely to support decision making. Which route to take? What route changes are needed due to traffic? What speed to drive at? Which obstacles are static, and which are moving? How to avoid them? In case of conflict, which obstacles (people and bicycles, for example) are more important to avoid? Of course, these are not business decisions and many of them are operational in nature. However, the parallels should be clear. AI is all about automating decision making.

The evolution of decision-making support can be neatly divided into four stages:

  • Descriptive: what happened?
  • Diagnostic: why did it happen?
  • Predictive: what might happen (increasingly an operational consideration)?
  • Prescriptive: make some outcome happen (automatically)

As shown in the accompanying figure, this approach further suggests that as business follows this evolutionary path, value increases, as does the complexity of the endeavour. While the earlier stages are delivered by traditional reporting and BI tools, the later stages demand first data mining / analytic tools and then AI and ML approaches. However, the boundaries between these categories of tools are very vague. The function offered by modern decision-making support tools is more a result of historical product development than well-defined category boundaries.

What this diagram (and similar constructs) doesn’t explicitly show is how the data needed in the different stages also changes and grows. A similar curve can be drawn, with corresponding stages:

  • Prearranged: IT delivers the data it has or can easily obtain—basic ETL (extract, transform, load)
  • Reactive: With growing business needs, data delivery struggles and increasingly complex, ad hoc delivery systems are built—early data mart and data lake projects
  • Integrative: IT integrates a broad selection of enterprise data to ease and speed delivery of decision-making support—advanced data warehousing
  • Adaptive: A flexible delivery infrastructure offers both integrated and fresh data from many sources, including the Internet of Things, which requires a fully modernised data architecture

As IT pursues this path, its focus must turn increasingly toward data management and governance.

AI increasingly demands perfect data

As AI is embedded ever deeper in decision making, the decisions being undertaken become more complex and their outcomes have a greater impact on the individuals involved.

As an example, machine learning is already used to implement so-called price optimisation. The optimisation is really of profit. Customer segmentation allows different prices for the same item to be offered to different subsets of the population to test price sensitivity. Studies show that the results often discriminate against minorities or specific subsets of the population. While it may be argued that the impact of small price differences is minimal, applying similar techniques to vetting job applications, accepting or rejecting insurance proposals, or estimating criminals’ likelihood to reoffend (all current uses of AI) has lifetime implications. Getting such decisions wrong is unethical and indefensible.
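One simple way to look for the kind of disparity described above is to compare the average price offered to each customer group. The following is a minimal sketch, not a full fairness audit. The `group` and `price` fields, the sample offers, and the 5% tolerance are all hypothetical and invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_price_by_group(offers):
    """Return the average offered price for each customer group."""
    prices = defaultdict(list)
    for offer in offers:
        prices[offer["group"]].append(offer["price"])
    return {group: mean(values) for group, values in prices.items()}


def price_disparity(offers):
    """Ratio of the highest group average to the lowest (1.0 means parity)."""
    averages = mean_price_by_group(offers)
    return max(averages.values()) / min(averages.values())


# Hypothetical offers for the same item; all values are invented.
offers = [
    {"group": "A", "price": 100.0},
    {"group": "A", "price": 102.0},
    {"group": "B", "price": 110.0},
    {"group": "B", "price": 112.0},
]

TOLERANCE = 1.05  # assumed threshold: flag gaps above 5% for human review
disparity = price_disparity(offers)
needs_review = disparity > TOLERANCE
```

A ratio above the tolerance does not prove discrimination, and a ratio below it does not rule it out. It only indicates where a person should look more closely at the data and the model.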

The underlying problem is the quality of the data used in training and running the machine learning models. A wide variety of issues of bias (unintended or otherwise), missing data, or erroneous values may arise. The fundamental fact is that data quality levels are application dependent. Data collected for one purpose may work well for that use case but may be completely inadequate for training an AI algorithm. The important determining variables may not exist in the data set, but ML will find some set of included variables that correlate, even where there is no rational causal effect.
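The point about correlation without cause can be illustrated with pure noise. The sketch below generates a random target and a number of random, unrelated features, then reports the strongest correlation found by chance. The function names, row counts, and seed are assumptions chosen for illustration; nothing here comes from a real data set.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_rows, n_features, seed=0):
    """Largest |correlation| between a random target and random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is independent noise with no link to the target.
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


few_features = strongest_chance_correlation(n_rows=20, n_features=5)
many_features = strongest_chance_correlation(n_rows=20, n_features=500)
```

Because both runs use the same seed, the second run includes the first run's features, so `many_features` can never be lower than `few_features`. In practice it tends to be noticeably higher, although the exact values depend on the seed. The more variables a model is allowed to search, the more likely it is to find one that appears predictive by accident.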

In effect, AI requires “perfect” data to minimise, let alone eliminate, the chances of faulty predictions or prescriptions. Perfect data is, of course, an impossible ask of any IT system.

The necessity of data management

Although data perfection is impossible, we can certainly improve data quality dramatically in both existing and new systems. The path is through a new and maniacal focus on data management and governance across the entire IT environment. This focus will not come cheaply. It demands significant investment, both in capital and operations, and particularly in staffing, both in numbers and skills. It requires a degree of support from executive management beyond that seen in most organisations today.

Are you prepared to make such an infrastructural investment in data management and governance in parallel with your uptake of AI? Or are you willing to risk the reputational and financial damage resulting from some foreseeable, avoidable, and unethical outcomes of your AI algorithms?