Mike Ferguson

September 2016

The Challenge Of Big Data

From recent research I have conducted across Europe into vertical industry business priorities, three things dominate almost every type of business. They are:

  1. The need to become customer centric and improve customer engagement across all channels
  2. The need to optimise business operations
  3. The need to invest in data privacy and security in order to comply with the new European Union General Data Protection Regulation (GDPR), with which companies must be compliant by May 2018.


The first two should not be a surprise. After all, they are clearly about profit margin. The focus on the customer is to protect and grow revenue, while the need to optimise operations is about cost reduction and efficiency. The surprise, perhaps, is the very rapid rise in importance of data privacy, no doubt caused by the introduction of new legislation from the European Parliament. In that sense it is imposed on organisations, as opposed to the first two, which are purely business driven. Nevertheless, all three require deep investment in data management, big data and analytics. Let’s look at each of these in more detail to understand why.


With respect to customer centricity, the objective is to retain and grow your customer base by offering them competitive, personalised products and services along with the highest quality customer service. To do this, companies need to gain as deep an understanding of their customers as possible, certainly way beyond the insights offered by analysing historical transaction activity in data warehouses and data marts. It requires the capture and analysis of much more data from both internal and external data sources. An example would be social media data offering up customer opinions and networks of friends and influencers for each and every customer. Other examples would be personal location data, weather data and open government data, all of which can provide a deeper understanding of purchasing behaviour and of the factors that can drive up sales.


There is a lot of potential data out there, some of which can vary in data type, be very large in volume and/or be generated at very high velocity. To gain an understanding of customer opinion about your brand and products requires the collection and analysis of text data from social media and review web sites. To understand customer web and mobile on-line navigation behaviour requires clickstream data from web logs. However, capturing every click and every touch of a mobile phone screen can be a challenge, as this kind of data can be very large in volume. This is even more so if you also plan to capture intermediate shopping cart data leading up to a purchase, to identify products of potential interest that were looked at but not bought, or that were placed in a cart but taken out again. Also, if you want to offer personalised recommendations while a customer or prospect is online, then web log data needs to be captured, prepared and analysed in real-time as people browse your website, which is again a challenge as scalability is needed. Finally, if you want to monitor customer location continuously, you will need to collect high velocity GPS sensor data from customer smart phones as it is generated.


The point here is that for companies to produce a more comprehensive single view of a customer, new data is required, but that can throw up new challenges. It may require the use of multiple analytical platforms, only one of which is a data warehouse. More advanced analytics are needed, but they need to run at scale to deal with data volume and data velocity. Data may also need to be integrated at scale to prepare it for analysis, and if new insights are produced on multiple analytical platforms, then those insights need to be integrated for each and every customer across these platforms to provide the single customer view needed. In addition, customer master data management is central to this strategy. This, by default, brings several other data management requirements into the spotlight, including identity management, data profiling, standardisation, cleansing, enrichment and matching, all of which are critical to successfully delivering a single customer view. Presenting insight from different underlying analytical platforms as a single integrated customer view is made possible by the use of data virtualization software to enable what some are calling ‘The Logical Data Warehouse’.
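
To make the idea of integrating insights per customer a little more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each analytical platform exposes records carrying an email address, and that a standardised email is good enough as the matching key. The function names, platform names and fields are all hypothetical. Real matching is considerably more involved, and data virtualization software would typically do this kind of integration at query time rather than by copying data into memory.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def standardise_email(email: str) -> str:
    """Standardise the matching key: trim whitespace and lower-case it."""
    return email.strip().lower()


def single_customer_view(
    sources: Dict[str, Iterable[Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Merge records from several platforms into one entry per customer.

    Each output field is prefixed with the platform it came from, so the
    origin of every insight remains visible in the integrated view.
    """
    view: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for platform, records in sources.items():
        for record in records:
            key = standardise_email(record["email"])
            for field, value in record.items():
                if field != "email":
                    view[key][f"{platform}.{field}"] = value
    return dict(view)


# Hypothetical insights produced on two different analytical platforms.
sources = {
    "warehouse": [{"email": "Ada@Example.com ", "lifetime_value": 1200}],
    "hadoop": [{"email": "ada@example.com", "sentiment": "positive"}],
}
customer_view = single_customer_view(sources)
```

Both records end up under the same customer key even though the email was captured differently on each platform, which is the role standardisation and matching play in delivering a single customer view.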


Optimisation of business operations also has its challenges. Companies are instrumenting their business operations to gain a deeper understanding of what is going on in an area where there has often been very little insight. The deployment of sensors in manufacturing production lines and in logistics operations means that new data is now available to see what is happening at every stage of production, all the way through to product delivery. Equally, in oil and gas, sensors allow us to see live drilling operations and to monitor well integrity and pipeline flow as it happens. However, this type of data is typically generated at very high rates. Furthermore, it may be stored in the cloud, which means data management has to span both on-premises and cloud storage. Also, as OLTP applications are upgraded to be web and mobile enabled, they enable the capturing of customer on-line behaviour. However, they need to scale to handle high levels of concurrent usage, which means that non-transaction data such as shopping cart data and on-line session data need to be captured and retrieved rapidly. This often warrants the use of NoSQL data stores to read and write data quickly, but it opens up new challenges, such as how best to design a schema in these relatively new kinds of systems to capture the data needed efficiently, and how to handle low latency data to monitor events in real-time. If this involves sensor data, then the velocity at which data is being generated can be very high. Therefore data needs to be captured and ingested at scale. It also highlights the need to analyse streaming data before it is stored anywhere. This requirement makes performance and scalability critical in order to analyse, detect and act when patterns indicating specific business conditions occur. Machine learning analytical models need to run at scale to do this. Operational applications that can capture data quickly may also be required in business operations.

The question, however, is whether all this data should be brought to the centre for analysis, or whether some of the analysis should be done much closer to the point where the data is being generated. I think the latter, for several reasons. Firstly, we don’t want to wait for all the data to be centralised before we analyse it; we may need to analyse it as close to real-time as possible. Spotting patterns in real time and acting on them as they happen allows companies to respond rapidly to keep operations optimised. Secondly, we do not want to pump all that data over the network when sensors, for example, might be emitting the same reading at every interval with a variance only on occasion. Therefore only a small percentage of that data (which could still be sizable) is likely to be used in analysis. Thirdly, we need to be able to scale in these kinds of environments, and the way to do that is to distribute analytics so that they happen at the edge.
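
The second reason can be illustrated with a small sketch of one possible edge-side technique, often called a deadband filter: a reading is forwarded to the centre only when it differs from the last forwarded value by more than a tolerance. This is an illustration under assumptions rather than a recommendation of a specific product. The function name, the `(timestamp, value)` reading format and the tolerance are all hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Assumed reading format: (timestamp in seconds, sensor value).
Reading = Tuple[float, float]


def deadband_filter(
    readings: Iterable[Reading], tolerance: float
) -> Iterator[Reading]:
    """Yield a reading only if it moved more than `tolerance` away from
    the last value that was forwarded. The first reading is always sent."""
    last_sent: Optional[float] = None
    for timestamp, value in readings:
        if last_sent is None or abs(value - last_sent) > tolerance:
            last_sent = value
            yield (timestamp, value)


# A sensor that mostly repeats itself, with one excursion.
readings = [
    (0.0, 20.0), (1.0, 20.0), (2.0, 20.1), (3.0, 20.0),
    (4.0, 23.5), (5.0, 23.5), (6.0, 20.0),
]
forwarded = list(deadband_filter(readings, tolerance=0.5))
```

With these sample readings, only three of the seven values would leave the edge device: the initial reading, the jump to 23.5 and the return to 20.0. Everything in between is a repeat that adds network traffic without adding information.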


The final high priority is the issue of data privacy, which for all of us here in Europe is not just a data governance issue but now a legal requirement since the introduction of GDPR by the European Parliament earlier this year. This is a monumental challenge for most organisations for many reasons. Probably the biggest is that we are now dealing with multiple different types of data, held in multiple different types of data store both inside the enterprise and on the cloud. The data lake is not centralised, as Hadoop vendors would like us to believe. It is in fact distributed, with redundant copies of data in multiple locations and data stores. The challenge therefore is to identify where sensitive data is located, to classify it with the level of sensitivity required, and then to apply the policies and rules appropriate to that level of sensitivity. More specifically, we need to uphold privacy policies when sensitive data is maintained, irrespective of what kind of data store (e.g. NoSQL DBMS, relational DBMS, Hadoop, cloud storage etc.) it resides in. We also need to apply the same policies to data that is duplicated in multiple data stores, and uphold those policies if data is moved between different types of data store. Lastly, we need to ensure that people cannot be identified when data is being integrated and analysed. It is a tough challenge.
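
As a rough illustration of the identify, classify and apply-policy sequence described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the two detection patterns, the three sensitivity levels, the policy attached to each level and the placeholder secret key. It is deliberately simplistic and is not a statement of what GDPR requires; real sensitive data discovery covers far more than email addresses and card numbers.

```python
import hashlib
import hmac
import re
from typing import Callable, Dict

# Hypothetical detection rules used to identify sensitive values.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
CARD_PATTERN = re.compile(r"^\d{13,19}$")

# Placeholder only: a real key would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def classify(value: str) -> str:
    """Assign an assumed sensitivity level to a single value."""
    digits = value.replace(" ", "").replace("-", "")
    if CARD_PATTERN.match(digits):
        return "high"
    if EMAIL_PATTERN.match(value):
        return "medium"
    return "low"


def mask(value: str) -> str:
    """Replace every character so the value cannot be read back."""
    return "*" * len(value)


def pseudonymise(value: str) -> str:
    """Replace the value with a keyed hash so records can still be
    joined on it without exposing the original."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# One policy per sensitivity level, defined once and reused everywhere.
POLICIES: Dict[str, Callable[[str], str]] = {
    "high": mask,
    "medium": pseudonymise,
    "low": lambda value: value,
}


def protect(record: Dict[str, str]) -> Dict[str, str]:
    """Apply the policy for each value's sensitivity level."""
    return {
        field: POLICIES[classify(value)](value)
        for field, value in record.items()
    }


record = {
    "country": "Italy",
    "email": "ada@example.com",
    "card": "4111 1111 1111 1111",
}
protected = protect(record)
```

The design point is that `protect` knows nothing about where the record came from. The same function could be called on a row read from a relational DBMS, a document from a NoSQL store or a file in Hadoop, so that duplicated or moved data is treated identically.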


All of these topics will be addressed at the upcoming Big Data Summit in Rome in November, when we will be looking at data modeling for NoSQL databases, fast data analytics, predictive & advanced analytics, big data on the cloud, data privacy and how to create a Logical Data Warehouse.