By Barry Devlin

January 2017

Data Drives Everything in the Algorithmic Business of the Future

Data—old and new

2016 marks the thirtieth birthday of the data warehouse. Congratulations!

Way back in 1986, colleagues and I at IBM defined the first data warehouse architecture to manage sales and delivery of System/370 mainframes and System/38 minicomputers in Europe. If you recognise those names, you are probably quite a bit past your thirtieth birthday! We subsequently described the architecture in the IBM Systems Journal in 1988, so if you want to hold the Asti until 2018, that would work too.

The world has changed dramatically in the intervening period. Recall the cost and size of the above technology. Consider that PCs running DOS 3.3 on Intel 80286-based machines with 20MB disks were state-of-the-art. Today’s smartphones are more powerful than those mainframes. And yet, the data warehouse architecture has changed very little over the same timeframe, as can be seen in Figure 1, from the 1988 Systems Journal article.

[Figure 1: Original data warehouse architecture, 1988]

Of more interest here than the changing technology is the emerging data landscape. The data warehouse was designed for an environment where all information arose from the processes of running the business. These processes were designed within the business to offer the most accurate and reliable view of the legal basis and transactions of the business. The process-mediated data thus delivered from operational systems and integrated in data warehouses continues to be central to business management and operations. For such data, the principles and practices of data warehousing remain just as valid today as they were thirty years ago.

The emergence of new types of data in recent years has undermined faith in data warehousing and led many to focus on new technologies, such as Hadoop, and to attempt to define new architectures, such as data lakes and reservoirs. These new information types fall into two broad categories. Human-sourced information refers to a range of socially generated and socially used materials, including tweets, text documents, images and video, mostly from outside the enterprise. Machine-generated data originates from all types of sensors and machines, both internally sourced and, increasingly, from the Internet of Things.

These two types of information, often called big data, differ significantly from process-mediated data, and from one another, in terms of structure, volumes, reliability, time sensitivity and other characteristics. They also offer new or enhanced business opportunities to digitalize many aspects of business, with algorithmic approaches increasingly seen as essential to profitability and success across a wide range of industries. Such algorithms are, or will be, based on recent advances in artificial intelligence and deep learning enabled by the explosive growth of big data. This transition has wide-ranging implications for business and IT, not least in the area of data architecture.

Architecture—old and new

The novelty of big data, as well as its oft-mentioned characteristics of volume, velocity and variety, has led consultants and vendors to focus their development and sales efforts there. This focus has been so intense that many customers have come to believe that data warehousing is outdated and even redundant. In reality, nothing could be further from the truth. The role of process-mediated data is distinct from these new data types. Big data provides context and predictive richness around the formal business transactions of process-mediated data, which continue to depict the legal reality of the business.

Another, less common, belief is that some or all of this new data should be merged into the existing data warehouse environment. This is erroneous because, in most instances, the volumes are too large and the data and its meanings change too rapidly and too frequently to be accommodated in traditional technologies. Furthermore, these new data types may not require the same type of reconciliation and quality management demanded of core process-mediated data.

However, experience already shows that the business benefits most when both old and new data types are used together. Human-sourced information from social media sources delivers more value when correlated with customer data from internally sourced process-mediated data. Machine-generated data from click-streams allows deeper and more meaningful analysis of why and how the sales recorded in the process-mediated data warehouse dropped last quarter.

This leads us to a hybrid type of architecture, shown in Figure 2, consisting of multiple pillars, each optimized for the characteristics of a particular class of information/data and interconnected by shared context-setting information and an assimilation function. The original data warehouse architecture lives on, although in a modernized form, within the process-mediated data pillar. This new architecture is described in detail in my book “Business unIntelligence”.

[Figure 2: Modern pillared information architecture]

In contrast to the data warehouse, where all data comes from the business transactions, this new architecture recognizes that data/information actually originates outside the enterprise systems that run and manage the business. The business is, in fact, driven by information arriving from the real world in the form of events, measures and messages of human or machine origin. All decisions and actions of the business rest on such information. Decision making, as envisaged by data warehousing, is but one component of the modern digitalized, data-driven business.
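
The idea of correlating social media information with internal customer data through shared context-setting information can be made concrete with a small sketch. The Python below is an illustration only, not taken from the book or from any product: the record layouts, the `handle_to_customer` mapping and the `correlate` function are all hypothetical names, and real pillars would hold far larger and messier data.

```python
from collections import defaultdict

# Process-mediated data: hypothetical order records from an operational system.
orders = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "C1", "amount": 80.0},
    {"customer_id": "C2", "amount": 45.0},
]

# Human-sourced information: hypothetical social media posts.
posts = [
    {"handle": "@ann", "text": "Delivery was late again"},
    {"handle": "@bob", "text": "Great service"},
    {"handle": "@ann", "text": "Still waiting for a refund"},
    {"handle": "@zed", "text": "Never heard of them"},
]

# Context-setting information: hypothetical mapping that links the two pillars.
handle_to_customer = {"@ann": "C1", "@bob": "C2"}


def correlate(orders, posts, handle_to_customer):
    """Return per-customer order totals alongside their social post counts."""
    summary = defaultdict(lambda: {"total_spend": 0.0, "post_count": 0})
    for order in orders:
        summary[order["customer_id"]]["total_spend"] += order["amount"]
    for post in posts:
        customer_id = handle_to_customer.get(post["handle"])
        if customer_id is None:
            continue  # Post cannot be tied to a known customer; skip it.
        summary[customer_id]["post_count"] += 1
    return dict(summary)


result = correlate(orders, posts, handle_to_customer)
```

In this sketch the two data sets stay in their own structures and are joined only through the mapping, which loosely mirrors the pillared approach of keeping each class of data separate while sharing context between them. Posts that cannot be linked to a known customer are simply ignored here; a real assimilation function would need a more considered policy.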

Decision making beyond the data warehouse

The original goal of data warehousing and business intelligence was to support decision making by management and analysts on tactical and strategic questions. As business intelligence evolved, the focus shifted to more operational, shorter-timeframe decisions. While both these needs will continue to exist, they are now seen as only one aspect of decision-making support.

As the world is increasingly instrumented, with enormous volumes of messages, events and measures flowing in and out of the enterprise, many tactical and operational decisions move from the center to the edge of the organization, from management and staff to algorithms and machines. Such agents, driven by artificial intelligence and deep learning, will make detailed decisions that were once the preserve of humans. Their abilities are already impressive (watch AlphaGo beat the world Go champion or autonomous vehicles navigate city roads) and they are improving exponentially, but it is important to recognize that all algorithms are only as good as the data on which they base their interpretations and decisions.

As we have already learned in thirty years of data warehousing, high-quality, consistent data is vital to decision making. For these thirty years we have worked hard to guarantee the integrity of our data. But if we are honest, we have a poor track record. As the modern world of digitalized business takes shape, our ability to manage the quality and integrity of all three types of data will be vital to ensure that algorithmic decision making across the span of business actually works as intended.