Mike Ferguson

November 2014

Big Data and The Extended Analytical Ecosystem

When you look at competition today, there is no doubt that additional insight is now needed on customers and business operations to remain competitive. New forms of data being created now hold the key to competitive advantage.

These new forms of data contain high-value insights that every company must learn to discover if it is to survive in a market where the web has had a profound impact on customer behaviour. On the web, the customer is king. Customers can surf around comparing products, services and prices at any time, from anywhere, all from a mobile device. In addition, with new web-based companies springing up everywhere, the customer has much more choice. Loyalty is easily forfeited in favour of a better deal with a few clicks of a mouse. As a result, transaction data alone is no longer enough to provide comprehensive customer insight. Understanding on-line customer behaviour is now also mission critical.

The web is also a place where the customer has a voice. Access to social networks is now ubiquitous and companies would be foolish not to recognize that they are a rich source of customer insight. In addition, when it comes to buying decisions, most people trust their network of friends and contacts more than anyone else. Understanding social relationships, interactions and influencers can therefore help provide insight into who or what is influencing purchasing behaviour. People are also quick to compare products and prices and to share this information with others across these networks, making potential buyers better informed.

Within the enterprise, web logs are growing at staggering rates as customers switch to online channels as their preferred way to transact business and interact with companies. In addition, an increasing number of sensor networks is being deployed to instrument and optimise business operations. The result is an abundance of new “big data” sources, rapidly increasing data volumes and a flurry of new data streams that all need to be analysed.

The characteristics of these new data sources differ from those of traditional structured data. For example, the variety of data types being captured now includes:
• Structured data
• Semi-structured data, e.g. XML, HTML
• Unstructured data, e.g. text, audio, video
• Machine-generated data, e.g. web logs, system logs, sensor data

The arrival of big data and big data analytics has taken us beyond the traditional analytical workloads seen in data warehouses. Examples of new analytical workloads include:
• Analysis of data in motion
• Complex analysis of structured data
• Exploratory analysis of un-modeled multi-structured data
• Graph analysis e.g. social networks
• Accelerating ETL processing of structured and multi-structured data to enrich data in a data warehouse or analytical appliance
• The long term storage and reprocessing of archived data warehouse data for rapid selective retrieval
These new analytical workloads are more likely to be processed outside of traditional data warehouses on platforms more suited to these kinds of workloads.

Looking at the new types of data that businesses want to capture, together with the types of analytical workload they now need to implement to remain competitive, it is clear that a new architecture is needed. The architecture required is shown in Figure 1. It represents an enterprise analytical ecosystem that supports traditional data warehouse ad hoc query processing, analysis and reporting, as well as the new big data analytical workloads now needed.

The enterprise analytical ecosystem includes a number of new analytical platforms that integrate with and expand beyond the traditional data warehouse environment. The technology components required in this new architecture are as follows (in bottom-up order of appearance in Figure 1):
• An enterprise information management (EIM) tool suite.

• Multiple analytical platforms that are integrated to manage big data and traditional analytical workloads. These are listed in the following table:
| Analytical Platform | Purpose |
|---|---|
| A NoSQL Graph DBMS | Graph analytics, e.g. social network influencer analysis, fraud pattern analysis |
| Hadoop platform | Exploratory analysis of multi-structured data; data landing zone, staging area and data refinery for processing raw multi-structured data en route to other analytical data stores; support for batch, interactive, streaming and graph analysis |
| An Analytical Relational DBMS | Complex analysis of structured data, e.g. for data mining to build predictive and statistical models |
| A Data Warehouse | Traditional query, analysis and reporting; integrated with other analytical platforms in the analytical ecosystem, e.g. Hadoop and the NoSQL graph DBMS |
| A Stream Processing engine | Analysing data in motion and real-time decision management on high velocity data streams such as sensor data, market data and multi-structured data |

• New analytical tools and techniques that cater for new analytical workload requirements. These include:
    • Custom analytic applications written to exploit the Hadoop MapReduce framework or the Apache Spark in-memory framework running on Hadoop to analyse multi-structured data in batch
    • BI tools that generate MapReduce or Spark application code to retrieve and analyse data typically stored in Hadoop
    • Search-based BI tools that index data, typically from Hadoop, in support of exploratory analysis of multi-structured data
    • Graph analysis tools that visualise data from NoSQL graph databases in support of exploratory analysis

• Existing BI platform tools that can access both SQL-based and NoSQL based analytical platforms (e.g. access Hadoop data via a SQL on Hadoop initiative) in support of different types of visual discovery and reporting needs

• It should also be possible to develop predictive and statistical models and deploy them in a Hadoop system, an analytical RDBMS and event stream processing workflows for real-time predictive analytics
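To make the idea of a custom analytic application in the list above more concrete, the following is a minimal sketch, in plain Python, of the map, shuffle and reduce phases that a MapReduce job applies to multi-structured data such as web logs. It does not use Hadoop itself. The log layout, the sample lines and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical web-log lines in the form "<timestamp> <customer_id> <url_path>".
SAMPLE_LOG = [
    "2014-11-01T10:00:00 c1 /products/42",
    "2014-11-01T10:00:05 c2 /products/42",
    "2014-11-01T10:00:09 c1 /basket",
    "malformed line",
    "2014-11-01T10:01:00 c3 /products/7",
]


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (url_path, 1) for each well-formed log line; skip anything else."""
    parts = line.split()
    if len(parts) != 3:
        return
    _timestamp, _customer_id, path = parts
    yield path, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single page-view count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    for path, views in sorted(run_job(SAMPLE_LOG).items()):
        print(path, views)
```

On a real cluster the framework distributes the map and reduce functions across nodes and performs the shuffle itself; the sketch only shows the division of work that such an application has to express.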

Within this new analytical ecosystem there is also an integration requirement to facilitate cross-platform analysis. For example, when variations occur in streaming data, the business impact can be analysed and action taken if required. Filtered events of interest can also be passed to EIM software and loaded into Hadoop for subsequent historical analysis. If any further insight is produced on Hadoop, that insight may then be fed into a data warehouse to enrich what is already known.

Un-modelled multi-structured data can be loaded directly into Hadoop, where it can be cleaned, transformed and integrated using EIM data integration software on the Hadoop platform in preparation for exploratory analysis by data scientists. They can then analyse the data using custom analytic applications, or MapReduce tools that generate Java or Pig code. In-Hadoop analytics can be used as needed. Alternatively, search-based BI tools can be used to analyse big data via search indexes built in Hadoop. If data scientists produce any valuable insight, it can be accessed via BI tools using SQL on Hadoop or be loaded into the data warehouse to make this new insight available to traditional BI tool users.

Traditional data warehouse workloads also continue as normal.
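As an illustration of the first step in the flow described above, detecting variations in streaming data and filtering out events of interest, here is a minimal Python sketch. It is not tied to any particular stream processing engine. The rolling-mean rule, the `window` and `threshold` parameters and the function name `events_of_interest` are assumptions chosen for illustration; a production engine would apply its own rules.

```python
from collections import deque
from statistics import mean
from typing import Deque, Iterable, Iterator, Tuple


def events_of_interest(
    readings: Iterable[float],
    window: int = 3,
    threshold: float = 0.2,
) -> Iterator[Tuple[int, float, float]]:
    """Yield (index, value, baseline) for readings that vary from the norm.

    A reading is flagged when it deviates from the mean of the previous
    `window` readings by more than `threshold` (expressed as a fraction).
    """
    recent: Deque[float] = deque(maxlen=window)
    for index, value in enumerate(readings):
        if len(recent) == window:
            baseline = mean(recent)
            if baseline and abs(value - baseline) / abs(baseline) > threshold:
                yield index, value, baseline
        # The deque drops its oldest reading automatically once it is full.
        recent.append(value)


if __name__ == "__main__":
    sensor_stream = [10.0, 10.0, 10.0, 15.0, 10.0, 10.0]
    # In the architecture described, flagged events would be handed to EIM
    # software and loaded into Hadoop; here they are simply collected.
    landed_events = list(events_of_interest(sensor_stream))
    for index, value, baseline in landed_events:
        print(f"reading {index}: {value} vs baseline {baseline:.2f}")
```

Only the flagged readings leave the stream, which keeps the volume passed on for historical analysis far smaller than the raw stream.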

A key emerging role for Hadoop is that of an Enterprise Data Hub as shown in Figure 2.
Figure 2
An enterprise data hub is a managed and governed Hadoop environment in which to land raw data, refine it and publish new insights that can be delivered to authorised users throughout the enterprise, either on-demand or on a subscription basis. These users may want to add the new insights to existing data warehouses and data marts to enrich what they already know and/or conduct further analyses for competitive advantage.
The Enterprise Data Hub consists of:
• A managed data landing zone (data reservoir)
• A governed data refinery
• Published, protected and secure high value insights
• Long-term storage of archived data from data warehouses
All of this is made available in a secure, well-governed environment. Within the enterprise data hub, a data reservoir is where raw data is landed, collected and organised before it enters the data refinery. In the refinery, data and data relationships are discovered, and data is parsed, profiled, cleansed, transformed and integrated. It is then made available to data scientists, who may combine it with other trusted data such as master data or historical data from a data warehouse before conducting exploratory analyses in a sandbox environment to identify and produce new business insights. These insights are the output of the data refining process. They are made available to other authorised users in the enterprise by first describing them using common vocabulary data definitions and then publishing them into a new insights zone, where they become available for distribution to other platforms and analytical projects.
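The refining steps described above (parse, profile, cleanse, integrate with trusted master data, then publish under common vocabulary names) can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the record layout, the sample data, the `MASTER_DATA` lookup, the published field names and every function name are hypothetical, and a real data hub would run equivalent steps with EIM tooling on Hadoop.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical raw records landed in the data reservoir: "customer_id,email,spend".
RAW_LANDING_ZONE = [
    "c1, Alice@Example.com ,42.50",
    "c2,bob@example.com,not-a-number",
    "c1, Alice@Example.com ,42.50",  # duplicate of the first record
    "c3,,17.00",
]

# Hypothetical trusted master data: customer_id -> customer segment.
MASTER_DATA = {"c1": "Gold", "c2": "Silver", "c3": "Bronze"}


def parse(line: str) -> Optional[Dict[str, str]]:
    """Split a raw line into named fields, or return None if it is malformed."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        return None
    return {"customer_id": fields[0], "email": fields[1], "spend": fields[2]}


def is_number(text: str) -> bool:
    try:
        float(text)
    except ValueError:
        return False
    return True


def profile(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Summarise basic data-quality problems before cleansing."""
    return {
        "rows": len(records),
        "missing_email": sum(1 for r in records if not r["email"]),
        "bad_spend": sum(1 for r in records if not is_number(r["spend"])),
    }


def cleanse(records: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Drop rows with invalid spend, normalise emails and remove duplicates."""
    seen = set()
    clean: List[Dict[str, object]] = []
    for r in records:
        if not is_number(r["spend"]):
            continue
        row = (r["customer_id"], r["email"].lower() or None, float(r["spend"]))
        if row in seen:
            continue
        seen.add(row)
        clean.append({"customer_id": row[0], "email": row[1], "spend": row[2]})
    return clean


def integrate(records: List[Dict[str, object]], master: Dict[str, str]) -> List[Dict[str, object]]:
    """Enrich each record with its segment from trusted master data."""
    return [dict(r, segment=master.get(str(r["customer_id"]), "Unknown")) for r in records]


def publish(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Rename fields to (hypothetical) common vocabulary definitions."""
    names = {"customer_id": "CustomerId", "email": "EmailAddress",
             "spend": "TotalSpend", "segment": "CustomerSegment"}
    return [{names[key]: value for key, value in r.items()} for r in records]


def refine(raw_lines: List[str]) -> Tuple[Dict[str, int], List[Dict[str, object]]]:
    """Run the refinery end to end and return (profile, published insights)."""
    parsed = [record for record in map(parse, raw_lines) if record is not None]
    return profile(parsed), publish(integrate(cleanse(parsed), MASTER_DATA))


if __name__ == "__main__":
    stats, insights = refine(RAW_LANDING_ZONE)
    print(stats)
    for row in insights:
        print(row)
```

Profiling runs before cleansing so that the quality problems found in the raw data remain visible even after the offending rows have been removed.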