Self-Service Data Integration – Improved Productivity or Total Chaos?
In the last two years, we have seen many companies go beyond traditional data warehouses and begin to adopt Big Data technologies. For most, the reason is simple: they need to remain competitive by capturing more data to produce new insights, for example about customers. Examples of such data include government open data, inbound customer email, social media data and clickstream data. It could also be more data about products, product usage or infrastructure, e.g. sensor data and machine-generated data.
All of these are examples of data that you typically would not find in a data warehouse. Much of this new data is also different from the structured data in a data warehouse. It is often multi-structured in terms of data types, large in volume and may be generated at very high rates. That makes it much more challenging to capture, prepare and analyse. With more and more data sources, there is also a huge need to speed up data preparation and not to rely on IT alone to do everything.
To that end, new self-service data integration technology has emerged aimed at data scientists and business analysts to help them prepare and integrate their own data without any need for IT to get involved.
Self-service data integration technology has emerged in three main ways:
• As separate stand-alone self-service data integration tools
• As self-service data integration functionality embedded in self-service visual discovery BI tools and in Microsoft Excel
• As a new business user option that is part of an Enterprise Information Management tool suite
Examples of products in the first category are Paxata, Tamr and Trifacta. Vendors with self-service data integration as part of a visual discovery tool include Datameer, MicroStrategy and Tableau. Microsoft Excel 2013 also has self-service data integration capability with Power Query. Examples of self-service data integration as part of an established EIM platform include IBM DataWorks, Informatica Rev and, to some extent, SAS Data Loader for Hadoop.
Let’s look at these in more detail.
Stand-alone self-service data integration tools have emerged on top of Hadoop. They have simplified user interfaces that make them easier to use and more interactive. For example, Trifacta uses a mechanism called predictive interaction, where the user does not need to specify the detail of a data transformation. Instead, users highlight features of interest in a data visualisation (e.g. text they want to extract from a document) and, based on what is selected, predictive methods suggest a variety of possible next-step transformations for the data. The suggestions are ranked by how likely each one is to be the transformation the user wants to apply next. The user then chooses the best next step, and the selected transformation is compiled down into a language that can execute in parallel on Hadoop. To speed up and guide data cleansing, self-service data integration tools also support automatic data profiling. Figure 1 shows an example of this in Paxata, with graphical profile results at the top of the screen above each column.
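To make the idea of column profiling concrete, here is a minimal sketch in Python. It is not how Paxata or any other product implements profiling; the function names (profile_column, _is_number) and the chosen statistics are assumptions made purely for illustration.

```python
from collections import Counter


def _is_number(text):
    """Return True if the text parses as a number (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_column(values):
    """Summarise one column: row count, missing, distinct, numeric, top values."""
    total = len(values)
    # Treat None and blank strings as missing; trim everything else.
    non_empty = [v.strip() for v in values if v is not None and v.strip()]
    return {
        "rows": total,
        "missing": total - len(non_empty),
        "distinct": len(set(non_empty)),
        "numeric": sum(1 for v in non_empty if _is_number(v)),
        "top_values": Counter(non_empty).most_common(3),
    }


# Illustrative data only: an "age" column with blanks and a bad value.
ages = ["34", "", "41", "n/a", None, "34"]
print(profile_column(ages))
```

A profile like this is what lets a user see at a glance that a supposedly numeric column contains missing or non-numeric entries before choosing a cleansing step.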
In addition, users can click on a drop-down list associated with each column and select the appropriate data cleansing transformation to improve the data quality profile. In this case, Paxata compiles transformations to run as code on Spark, which executes in parallel across all the Hadoop data nodes where the dataset resides. All the steps taken by users are also recorded, so that metadata lineage is available to show how data was transformed. This also allows users to undo transformations easily by going as far back through the steps as they wish.
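The recording of steps described above can be sketched as a simple list of named transformations that is replayed against the untouched source data. This is only an illustration of the idea under stated assumptions: the Recipe and Step classes and their methods are hypothetical and do not represent any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Step:
    description: str                        # human-readable lineage entry
    apply: Callable[[List[Row]], List[Row]]  # the transformation itself


@dataclass
class Recipe:
    source: List[Row]                       # original data, never modified
    steps: List[Step] = field(default_factory=list)

    def add(self, description, fn):
        """Record a transformation step."""
        self.steps.append(Step(description, fn))

    def undo(self, count=1):
        """Drop the most recent `count` steps."""
        if count > 0:
            del self.steps[-count:]

    def lineage(self):
        """Return the ordered list of steps applied so far."""
        return [step.description for step in self.steps]

    def result(self):
        """Replay all recorded steps against a copy of the source."""
        rows = [dict(row) for row in self.source]
        for step in self.steps:
            rows = step.apply(rows)
        return rows


# Illustrative data only.
recipe = Recipe([{"name": "  alice "}, {"name": "BOB"}])
recipe.add("trim name", lambda rows: [{**r, "name": r["name"].strip()} for r in rows])
recipe.add("title-case name", lambda rows: [{**r, "name": r["name"].title()} for r in rows])
print(recipe.lineage())
recipe.undo()            # go back one step
print(recipe.result())
```

Because the source is never changed and the result is always recomputed from the step list, undoing a step is just removing it from the list, and the list itself doubles as the lineage record.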
The second approach is self-service data integration functionality embedded in visual discovery tools. This allows business analysts to connect to and ‘blend’ data from multiple sources to answer specific business questions. More recently, further functionality has been added to self-service visual discovery tools to help business users clean and integrate data from multiple data sources.
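As a rough illustration of what ‘blending’ involves, the Python sketch below cleans a join key and then combines records from two sources. The source names (crm_rows, web_rows), the field names and the blend function are all hypothetical and are not taken from any particular tool.

```python
from collections import defaultdict


def normalise_key(value):
    """Clean a join key so that both sources agree on its form."""
    return value.strip().lower()


def blend(crm_rows, web_rows, key):
    """Left-join CRM rows with page views summed per key from web rows."""
    totals = defaultdict(int)
    for row in web_rows:
        totals[normalise_key(row[key])] += int(row["page_views"])

    blended = []
    for row in crm_rows:
        clean = normalise_key(row[key])
        # Customers with no web activity get a count of zero.
        blended.append({**row, key: clean, "page_views": totals.get(clean, 0)})
    return blended


# Illustrative data only: the same customer appears with inconsistent keys.
crm = [
    {"email": "Ann@Example.com ", "segment": "gold"},
    {"email": "bob@example.com", "segment": "silver"},
]
web = [
    {"email": "ann@example.com", "page_views": "12"},
    {"email": " ANN@example.com", "page_views": "3"},
]
for row in blend(crm, web, "email"):
    print(row)
```

The point of the sketch is that blending is rarely a plain join: the keys usually need cleaning first, which is why cleansing features have found their way into visual discovery tools.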
The last category is EIM platforms that have been extended to accommodate business user self-service data integration in addition to being used by IT. In this case, business users are presented with a new simplified user interface offering a broad range of data services to manage, refine, optionally analyse and provision data. Available data services include data loading, data profiling, validation, standardisation, data cleansing, data transformation, data integration, data enrichment, data masking and data encryption. Both IT and business users can define processes that clean and transform data in this case. In addition, analytics may be included to analyse data automatically. Scalable execution of data services is achieved by exploiting in-memory transformation of streaming data, big data ELT (extract, load and transform) processing on Hadoop, and ELT processing on MPP relational DBMSs.
The emergence of self-service data integration raises one obvious question. If business users are ‘doing their own thing’, what is the impact on enterprise data governance?
Up until now, most enterprise data governance initiatives have been run by centralised IT organisations with the help of part-time data stewards scattered around the business. Central IT organisations typically use an EIM tool suite to do this. What then is the impact of self-service DI if business users integrate data on their own using a completely different type of tool from those used by IT professionals? It is pretty clear that, even if central IT do a great job of enterprise data governance, self-service DI could easily lead to data chaos, with every user doing their own personal data cleansing and integration. Inconsistency could reign and destroy everything an enterprise data governance initiative has worked for.
So what can be done? Are we about to descend into data chaos? Self-service DI tools log users' actions on data so that it is possible to see exactly how data has been manipulated. That is of course a good thing. However, for stand-alone self-service DI tools, the metadata lineage is held in the repository of the self-service DI tool and not in that of an EIM platform used by IT. There are also no standards for metadata import and export to and from existing EIM platform repositories that would allow data definitions and transformations to be re-used across both EIM tools and self-service DI tools.
If an EIM platform supports both IT and business users, then we have the best of both worlds. If they are separate tools, then we need interfaces to plug self-service DI tools into EIM platforms to support enterprise data governance initiatives. Paxata is one example of a vendor that has opened up its product so that it can be invoked by EIM tools. However, self-service DI tools also need to be able to invoke data integration jobs on EIM platforms to re-use what has been created by IT.
Self-service DI is here to stay. IT need to provide templates and services for users to reuse when creating newly integrated datasets.
They also need to monitor which data business users access in order to understand which data is in demand, and to encourage both business and IT participation in the data governance process.