From Analytical Silos To An Integrated Analytical Ecosystem
Over the last five years or so, many companies have gone beyond the traditional data warehouse to pursue new kinds of analytical workloads. These include exploratory analytics on unstructured data, natural language processing, streaming analytics, graph analytics, and deep learning.
Typically, these workloads have been built on separate analytical systems, each optimised for a specific type of analysis. Graph analytics, for example, has been developed on top of a graph database, and streaming analytics on a streaming platform.
While each of these initiatives may have been successful, the problem that has arisen is that development on each workload-specific analytical system has resulted in analytical silos (see Figure 1), with each silo using a different approach to data processing and analytical development. For example, data processing and the development of analytics on a platform like Hadoop is done differently from data processing on a streaming analytics platform or a NoSQL graph database. In addition, all of the processing and development in each of these silos is different from what has been happening in traditional data warehousing. It is almost as if the data architecture is dictated by the analytical platform in use.
Looking at this situation, several questions can be asked. For example, is it possible to adopt a more common approach to producing data for use in multiple workload-specific analytical systems? Can data management be simplified so that data is prepared once and re-used everywhere? Can analytics be developed using a common, extensible platform that accommodates multiple libraries and types of analysis (e.g. scikit-learn, Apache Spark, TensorFlow, H2O, …) while at the same time accommodating R and Python development via RStudio and Jupyter notebooks respectively? And is there any way to develop analytical models on such a common, extensible analytical platform and then manage model deployment across different environments, e.g. in-database, in-Spark, in-stream, or at the edge?
To answer these questions we need to look at what can be done to start simplifying the data architecture while introducing agility and driving integration across the silos. There are several things that can be done. The first is to adopt a common, collaborative approach to data management and data governance. This involves:
• Incrementally establishing a common business vocabulary for all trusted logical data entities (e.g. customer, product, orders, etc.) and insights you intend to share. The purpose of this is to ensure that the meaning of widely shared data is documented.
• Rationalising the individual data management tools shown in each of the silos in Figure 1 and making use of common data fabric software with an accompanying enterprise data catalog instead. Examples of data fabric software include Talend Data Fabric, Google Cask Data Application Platform, IBM Cloud Private for Data and Informatica Intelligent Data Platform. In most cases the enterprise data catalog is included as part of the data fabric software, although some vendors may offer it as a separately purchasable product. If you do purchase common data fabric software with an accompanying enterprise data catalog, it should be possible to connect to data sources both on-premises and in the cloud and so create a ‘logical’ data lake. From there you can start with the data you have and put in place a programme to produce trusted, commonly understood data that is governed.
• Using the enterprise data catalog to automatically discover and catalog data, and to map discovered data back to your common vocabulary, which is held in a business glossary within the catalog.
• Organising yourselves to become a data-driven company. This can be done by treating data as an asset and enabling business and IT information producers to work together in the same data project team, preparing and integrating data using the data fabric software. The purpose of these data projects is to produce trusted, commonly understood, re-usable ‘data products’ – for example customer data, orders data, product data, etc.
• Creating an information supply chain within the enterprise. This is a publish-and-subscribe ‘production line’ where the aforementioned teams of information producers (business and IT) take raw data from the logical data lake and create ‘ready-made’ data products for use and re-use across the enterprise.
• Publishing the ready-made data products in an enterprise data marketplace – a data catalog containing trusted data – for information consumers to find and use.
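To make the publish-and-subscribe supply chain above concrete, here is a minimal, purely illustrative Python sketch. The `DataProduct` and `DataMarketplace` classes and the team names are hypothetical – they are not any vendor's API – but they show the pattern: producer teams publish trusted data products under business vocabulary terms, and consumers find them by those terms.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataProduct:
    """A trusted, re-usable data asset produced by an information producer team."""
    name: str         # term from the common business vocabulary, e.g. "customer"
    owner: str        # the data project team that curates it
    description: str

@dataclass
class DataMarketplace:
    """Hypothetical enterprise data marketplace: producers publish, consumers search."""
    _catalog: dict = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        # Publishing registers the product under its shared vocabulary term
        self._catalog[product.name] = product

    def find(self, term: str) -> Optional[DataProduct]:
        # Information consumers look products up by business term
        return self._catalog.get(term)

# A producer team publishes a ready-made "customer" data product...
marketplace = DataMarketplace()
marketplace.publish(DataProduct("customer", "crm-data-team",
                                "Trusted, governed customer master data"))

# ...and a consumer elsewhere in the enterprise finds and re-uses it
product = marketplace.find("customer")
print(product.owner)  # → crm-data-team
```

The key design point is that consumers never go back to raw lake data: they subscribe to the curated product, which is what enables re-use instead of re-invention.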
This approach should contribute significantly to speeding up data preparation, enabling data governance, and improving productivity through re-use instead of re-invention.
The second major thing that can be done is to establish a collaborative approach to developing machine learning models using a common analytical platform, irrespective of where the models are going to be deployed. One of the biggest problems in data science today is that a ‘wild west’ situation has emerged, where every data science team across the business takes its own approach to model development using whatever tools it chooses, without considering model development and model management from an enterprise perspective. Many companies are also failing to deploy any models at all. The result is a ‘cottage industry’ of many different technologies being used for the same thing, skills spread too thinly across all these tools, a maintenance nightmare and no re-use.
In order to deal with this, one option that many companies are looking at is to invest in an extensible integrated development environment (IDE) for analytics. What do we mean by that? We want data science software that allows you to bring your own code (e.g. R, Python, Scala), integrates with RStudio and Jupyter notebooks, integrates with GitHub, allows you to add analytical libraries like TensorFlow, Spark MLlib and H2O, and even supports pipelines for drag-and-drop development as opposed to coding. Tools like Amazon SageMaker, Cloudera Data Science Workbench, IBM Watson Studio and Tibco Statistica fall into this category. Also, machine learning automation tools like DataRobot, H2O Driverless AI and SAS Factory Miner are being considered to speed up model development. We also need the ability to deploy models anywhere. For example, in the case of fraud, multiple different kinds of analytics could be needed, including real-time streaming analytics to stop fraudulent transactions in-flight, batch analytics to spot recurrent fraudulent activity, and graph analytics to identify fraud rings. The point is that it should be possible to find data via the catalog (including ready-made trusted data assets), develop all of these analytics from a common platform, deploy any models built to the environments in which they need to run, manage model versions, monitor model accuracy, and schedule re-training and re-deployment if accuracy drops below user-defined thresholds. Simplifying model deployment in particular is sorely needed in many companies, as analytical projects seem to stall at the deployment phase.
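The monitor-and-retrain loop described above can be sketched in a few lines of Python. This is a toy, stdlib-only illustration, not any product's API: the threshold value, the fraud "model" (a simple amount cut-off) and the `monitor_and_retrain` function are all invented for the example.

```python
from typing import Callable, List, Tuple

ACCURACY_THRESHOLD = 0.9  # illustrative user-defined threshold

Model = Callable[[float], int]            # transaction amount -> fraud flag (1/0)
Labelled = List[Tuple[float, int]]        # (amount, true label) pairs

def accuracy(model: Model, holdout: Labelled) -> float:
    """Fraction of fresh labelled examples the deployed model scores correctly."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    return correct / len(holdout)

def monitor_and_retrain(model: Model, holdout: Labelled, retrain) -> Model:
    """If monitored accuracy drops below the threshold, retrain and redeploy."""
    if accuracy(model, holdout) < ACCURACY_THRESHOLD:
        return retrain(holdout)  # schedule re-training and re-deployment
    return model                 # otherwise keep the current model

# Toy deployed model: flags transactions over 100 as fraud (1)
deployed: Model = lambda amount: 1 if amount > 100 else 0

# Fresh labelled data shows the fraud pattern has drifted: fraud now starts above 50
holdout: Labelled = [(60, 1), (70, 1), (80, 1), (120, 1), (30, 0), (40, 0)]

def retrain(data: Labelled) -> Model:
    # Toy "retraining": pick a new cut-off midway between the two classes
    fraud = [x for x, y in data if y == 1]
    legit = [x for x, y in data if y == 0]
    cut = (min(fraud) + max(legit)) / 2
    return lambda amount: 1 if amount > cut else 0

new_model = monitor_and_retrain(deployed, holdout, retrain)
print(accuracy(new_model, holdout))  # → 1.0
```

In a real platform the retrain step would kick off a training pipeline and a versioned redeployment, but the control flow – score on fresh labelled data, compare against a threshold, trigger retraining – is the same.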
In addition to managing and integrating multiple analytical silos, many traditional data warehouses also need to be modernised as part of the process of creating an integrated analytical ecosystem. Adoption of agile data modelling techniques like Data Vault, offloading ETL processing to the data lake, data warehouse migration to the cloud, virtual data marts and a logical data warehouse are all part of a modernisation programme.
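For readers unfamiliar with Data Vault, the technique separates stable business keys (hubs), relationships between them (links) and time-varying descriptive attributes (satellites). The following Python sketch is illustrative only – real Data Vault models are built as warehouse tables, and the entity names here are invented – but it shows why the separation makes change agile: new or changed attributes only add satellite rows, leaving hubs and links untouched.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class HubCustomer:
    customer_key: str       # stable business key, e.g. a CRM customer number

@dataclass(frozen=True)
class HubOrder:
    order_key: str

@dataclass(frozen=True)
class LinkCustomerOrder:
    customer_key: str       # relationship between the two hubs
    order_key: str

@dataclass
class SatCustomerDetails:
    customer_key: str
    load_date: date         # each change loads a new satellite row
    name: str
    segment: str

# A descriptive change (segment upgrade) is just a new satellite row;
# the hub and link records never need to be modified.
rows = [
    SatCustomerDetails("C001", date(2019, 1, 1), "Acme Ltd", "SMB"),
    SatCustomerDetails("C001", date(2019, 6, 1), "Acme Ltd", "Enterprise"),
]
latest = max(rows, key=lambda r: r.load_date)
print(latest.segment)  # → Enterprise
```

The full history is retained in the satellite, which is what makes Data Vault well suited to incremental, auditable warehouse modernisation.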
We shall be talking about all of these topics at the Italian Analytics for Enterprise Conference in Rome on 27–28 June 2019. I hope you can join us.