Establishing A Foundation For The Data-Driven Enterprise
In today’s world we hear a great deal about two major business requirements: the need to ‘go digital’ and the need to become data-driven.
In the case of digitalization, it is the introduction of digital channels and the internet of things (IoT) that dominate the agenda. Digital channel adoption has seen web, mobile and social commerce introduced into the enterprise as more and more people prefer to transact online from desktop and mobile devices and through corporate social network pages. This is creating huge amounts of so-called ‘digital exhaust’ data, for example click stream data in web server logs and non-transactional shopping cart data, often stored in NoSQL databases.

Click stream data records everything we click on with a mouse or via a touch of a mobile device screen. There is a record for every click, all of it held in web log files. That means a company can precisely track every visitor’s navigational path through its website: what pages they looked at, where they went next and much more. As you can imagine, with tens of thousands or more visitors browsing a website, it is not long before the volume of click stream data being generated becomes enormous. Similarly, shopping cart data records a history of what you put into your shopping cart, what you took out, what you put back in and so on, all before you buy. That means we can see what products and services a visitor or customer might be interested in, even if they didn’t actually buy them.

With respect to IoT, more and more products now carry sensors that emit data. Good examples include mobile phones with GPS sensors, watches, fitness wristbands, cars, industrial equipment and fridges. The list goes on, and here too the volumes of data being generated are huge. However, companies want this data because, while we already capture transaction activity, this kind of data can tell us much more than we already know. We can learn new things from it, especially when we combine it with customer data.
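Coming back to click stream data for a moment, the following is a minimal sketch, in Python, of how raw records in a web server log might be turned into per-visitor navigational paths. It assumes the widely used Apache ‘combined’ log format and, for simplicity, treats the client IP address as the visitor identifier; production implementations would typically key on a session cookie instead.

```python
import re
from collections import defaultdict

# Pattern for the Apache 'combined' log format (an assumption; adjust it
# to whatever format your web servers are configured to write).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def navigation_paths(log_lines):
    """Group successfully served page requests by visitor, in arrival order."""
    paths = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and match.group('status').startswith('2'):
            paths[match.group('ip')].append(match.group('path'))
    return paths

sample = [
    '203.0.113.7 - - [01/Mar/2016:10:01:02 +0000] "GET /home HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [01/Mar/2016:10:01:30 +0000] "GET /products/tv HTTP/1.1" 200 734 "/home" "Mozilla/5.0"',
]

for visitor, pages in navigation_paths(sample).items():
    print(visitor, '->', ' -> '.join(pages))
```

Even this toy example shows why volumes explode: every page view of every visitor becomes a record to be parsed, stored and analysed.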
This brings us to the second major requirement – the data-driven enterprise.
What does ‘data-driven’ mean? Wikipedia defines it as:
“…progress in an activity is compelled by data, rather than by intuition or personal experience. It is often labeled as the business jargon for what scientists call evidence based decision making”
It means that the data, when analysed, produces evidence-based insights that allow companies to see new opportunities to disrupt existing and new markets. Becoming data-driven, therefore, is about being led by data and analytics. The key question is how you achieve this. What do you need to do to become data-driven? How do you deal with the deluge of data now pouring into the enterprise? What kind of analytical platform, or platforms, do you need? Should you use the cloud, on-premises systems or both? How do you maximize the potential of predictive and advanced analytics? How do you deal with ‘the internet of things’? How do you integrate Big Data into your existing analytical architecture? How do you stay agile in a world where data is becoming increasingly distributed and therefore harder to access? How do you handle data governance when data is scattered across OLTP systems, NoSQL databases, analytical RDBMSs, Hadoop clusters and other file systems? How do you overcome the potential chaos of business-led, self-service BI and self-service data preparation? How do you harness shadow IT and turn it into citizen data science?

It is a major challenge, and it is not difficult to be overwhelmed by it. Yet amid the chaos, common sense must prevail. This is not about replacing what you have. It is about extending existing analytical environments to accommodate new data and new analytical workloads in order to produce new insights that add to what you already know. In the past, the data warehouse was the analytical platform. Today that analytical platform needs to be extended to include:
- Real-time analysis of high velocity live streaming data
- High-volume ingest technologies
- New analytical data stores like Hadoop HDFS, Amazon S3 and NoSQL graph databases
- Technologies for scalable exploratory analysis of large volumes of internal and external multi-structured data, e.g. Hadoop and Apache Spark (see the sketch after this list)
- Advanced analytics, e.g. machine learning, text analysis and graph analytics
- End-to-end data management, scalable ETL and data governance
- A combination of self-service and IT-based development
- Simplified access to multiple analytical data stores
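As an illustration of the exploratory analysis item above, the following PySpark sketch loads multi-structured click stream events from a file store and profiles them with Spark SQL. The path and field names (`visitor_id`, `page`) are assumptions for the example; the point is that Spark infers a schema from semi-structured JSON and then lets analysts explore the data with ordinary SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('clickstream-exploration').getOrCreate()

# Hypothetical location; this could equally be an s3a:// or local path.
events = spark.read.json('hdfs:///data/raw/clickstream/2016/*.json')

events.printSchema()                      # inspect the schema Spark inferred
events.createOrReplaceTempView('clicks')

# An exploratory question: which pages attract the most distinct visitors?
spark.sql("""
    SELECT page, COUNT(DISTINCT visitor_id) AS visitors
    FROM clicks
    GROUP BY page
    ORDER BY visitors DESC
    LIMIT 10
""").show()
```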
This new ‘extended analytical platform’ is architecturally much more complex because it now includes all of the above in addition to a data warehouse. However, it has to function as if it were fully integrated. We need to:
- Manage scalable ingestion of data
- Create an organized data reservoir and manage it as if it were centralized, even though it may be distributed
- Automate the cataloging and profiling of new data coming into the reservoir
- Introduce collaboration into data management to help classify data in the reservoir in terms of quality, sensitivity and business value
- Be able to refine data at scale via ELT processing by defining how we want to transform and integrate data independently of where we want those jobs to execute (e.g. in Hadoop, in data warehouse staging tables, or in the cloud), an approach sketched after this list
- Encourage agility through exploratory data science sandboxes, self-service data preparation and self-service analysis
- Provide transparent access to data and to insights produced by data scientists, irrespective of whether that data is in a data warehouse, in Hadoop, in a live data stream, or a combination of all of these, and irrespective of whether it is on-premises, in the cloud or both. This can be achieved via SQL and data virtualization, creating a virtual ‘Logical Data Warehouse’ layer across multiple underlying analytical data stores, whether they are relational or Hadoop-based; a sketch of the idea follows this list
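The data refinement item above argues for defining transformations once and choosing where they execute separately. One way to sketch that idea with Spark is to express the ELT logic as a plain function over a DataFrame, so the same code runs against local files, a Hadoop cluster or cloud storage; only the input and output locations (and the cluster the job is submitted to) change. The table and column names here are illustrative assumptions.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def refine_orders(raw: DataFrame) -> DataFrame:
    """Transformation logic defined once, independently of where it executes."""
    return (raw
            .dropDuplicates(['order_id'])                   # basic cleansing
            .withColumn('order_date', F.to_date('order_ts'))
            .filter(F.col('amount') > 0))                   # drop bad records

spark = SparkSession.builder.appName('elt-refinery').getOrCreate()

# Only these locations differ per environment: a local path for development,
# hdfs:///staging/orders on a Hadoop cluster, s3a://bucket/staging in the cloud.
raw = spark.read.parquet('hdfs:///staging/orders')
refine_orders(raw).write.mode('overwrite').parquet('hdfs:///refined/orders')
```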
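Similarly, the ‘Logical Data Warehouse’ idea in the last item can be sketched with Spark SQL: a relational warehouse table reached over JDBC and a Hadoop-resident file are exposed as two views and joined in a single SQL statement. The connection details, table and column names are placeholders, and a dedicated data virtualization product would add far more (query optimization, security, lineage), but the principle of one SQL layer over multiple stores is the same.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('logical-dw').getOrCreate()

# A warehouse dimension table via JDBC (URL, table and credentials are placeholders).
customers = (spark.read.format('jdbc')
             .option('url', 'jdbc:postgresql://dwhost:5432/dw')
             .option('dbtable', 'dim_customer')
             .option('user', 'analyst')
             .option('password', 'secret')
             .load())
customers.createOrReplaceTempView('dim_customer')

# Click stream detail held in Hadoop as Parquet files.
spark.read.parquet('hdfs:///refined/clickstream').createOrReplaceTempView('clicks')

# One SQL statement spans both stores, as if they were a single warehouse.
spark.sql("""
    SELECT c.segment, COUNT(*) AS page_views
    FROM clicks k
    JOIN dim_customer c ON k.customer_id = c.customer_id
    GROUP BY c.segment
""").show()
```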
At present we are in the middle of seeing this new extended analytical platform come together. Even though some software components are still incomplete, many companies are already using technologies like Hadoop and Spark to build new analytical applications. Critical success factors include business alignment (i.e. making sure candidate projects are aligned with strategic business goals), increasing automation so that you don’t have to write code to prepare and analyse data, an information catalog to govern what data is available for reuse, and organizing for success so that citizen data science can rapidly produce new insights.