Is EII Virtual Data Warehousing Revisited?
Data warehousing has frequently been the subject of heated debate. Some controversies appear and quickly fade, while others continue to dominate press articles and industry conferences. Examples in this latter category of course include the topics of data marts and multidimensional design.
Recently, a new controversy has erupted over the subject of Enterprise Information Integration (EII). Some people (particularly vendor marketing and salespeople) are arguing that EII can eliminate the need to build a data warehouse. Data warehousing purists, on the other hand, are saying that EII is just another form of virtual data warehousing, which has been proved in the past to be a failure.
To understand the pros and cons of this debate, we first need to categorize the different types of integration used by organizations, and then identify how EII and data warehousing fit into and support this integration taxonomy.
There are four broad categories of integration used in IT systems: user interface, business process, application, and data. Many products fit neatly into one of these categories, but as we will see, there is a trend in the industry toward products supporting multiple integration technologies, and as a result the dividing line between these four types of integration is sometimes a little fuzzy.
User interface integration provides a single view of operational and decision support data and applications at the presentation logic layer of an IT system. An enterprise portal is an example of a product that supports user interface integration. A key issue with integration at the user interface level is that although the user is given a single view of multiple disparate systems and applications, this view highlights the lack of data and application integration between those systems. This is why some portal vendors are now adding the ability to construct composite applications that add a business semantic layer between the user interface and back-end corporate systems. This semantic layer adds a basic form of business process integration.
Business process integration enables developers to separate application design from application deployment. Business process design tools allow developers to analyze and model business processes. Business process automation and integration tools then implement these process models using underlying application integration technologies. The key benefit here is that the design process is isolated from the physical implementation by the business semantic layer built into the process models.
It is also important to point out that business process automation tools not only manage the implementation of distinct applications, but also monitor the flow of information between those applications. Many business process tools are adding monitoring capabilities into this process flow for analyzing business performance. This is a form of business activity monitoring, or BAM.
Application integration technology supports the flow of business transactions between application systems that may reside within or outside of an organization. The trend of the industry here is toward a service-oriented architecture that employs XML-based Web services for defining and moving business transactions across systems. If the systems involved share a common definition (i.e., a common business metamodel) for the transactions that flow between them, then little or no information transformation is required. If a common definition does not exist, then the application integration software must transform the information to match the different business metamodels of the applications involved.
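The transformation step described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the two message shapes (`SourceOrder`, `TargetOrder`), their field names, and the mapping rules are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class SourceOrder:
    """Hypothetical order transaction as emitted by the sending application."""
    cust_no: str        # e.g. "C00042"
    amount_cents: int   # amount held as integer cents
    currency: str       # lower-case currency code


@dataclass
class TargetOrder:
    """Hypothetical order transaction as expected by the receiving application."""
    customer_id: int
    amount: float
    currency_code: str


def transform(order: SourceOrder) -> TargetOrder:
    """Map one transaction from the source metamodel to the target metamodel."""
    return TargetOrder(
        customer_id=int(order.cust_no.lstrip("C")),  # strip prefix, drop leading zeros
        amount=order.amount_cents / 100,             # cents to currency units
        currency_code=order.currency.upper(),        # normalize the code
    )


result = transform(SourceOrder(cust_no="C00042", amount_cents=129900, currency="usd"))
```

If both applications shared the same definition of an order, `transform` would reduce to a pass-through, which is the "little or no transformation" case.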
Although application integration technology was initially designed for moving business transactions between systems, it is now also being used to transfer data between applications. In the data warehousing world, for example, many ETL tools work with application integration software to extract data from an application workflow, and transform and load it into a data warehouse. To highlight this trend, many ETL tool vendors now market their products as a data integration platform.
Before we move on to discuss data integration, it is important to highlight some key aspects of the discussion so far. The first thing to note is that user interface, business process, and application integration technologies are being used not only for operational processing, but also for decision support processing. Another is that the three types of integration can interact with each other, and can be used together. This is why the marketplace is moving toward portal, business process, and application integration software being bundled into an application server suite platform, such as IBM’s WebSphere. A final thing to note about the three integration technologies is that business level metadata and metamodels play a key role in the integration process.
For data integration, several different technologies can be used, including data replication and transformation servers, ETL tools, and now EII middleware. The technology used depends on several factors such as the type of application processing, the data volumes involved, data currency requirements, and the amount of data transformation needed.
Traditionally, from a processing perspective, data integration technologies have been separated into those techniques used for operational processing, and those used for decision support processing. For operational processing, data replication and data transformation servers are often used (instead of application integration) where large amounts of data need to be copied (and possibly transformed) in batch mode between different applications. In some cases, data replication is used to trickle feed data changes between systems. The focus in these cases is on performance and transformation power, and little attention is given to the business semantics and processes involved, i.e., business metadata and process models rarely play a role with this type of processing.
For decision support processing, ETL tools dominate the marketplace for extracting and transforming operational data for loading into a data warehouse. Data warehouse data is typically used for strategic and tactical reporting and analysis. More recently, performance management tools have extended this processing to scorecarding, which compares the business intelligence produced by decision support processing to actual business plans and goals, and informs the appropriate business user when out-of-line situations occur.
When a data warehouse is built, ETL tools give little attention to the business semantics and processes involved. Outbound from the warehouse, some analysis and performance management tools employ a business semantic or metadata layer to isolate analyses and reports from the data warehouse structures being used, but it is still up to the business user to map the results back to the business processes involved.
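As a rough illustration of the scorecarding idea, the sketch below compares measured values against goals and reports the metrics that are out of line. The metric names, the 10 percent tolerance, and the function `out_of_line` are assumptions made for this example only.

```python
def out_of_line(actuals, goals, tolerance=0.10):
    """Return {metric: relative deviation} for metrics that miss their goal
    by more than `tolerance` (an assumed threshold, 10% by default)."""
    alerts = {}
    for metric, goal in goals.items():
        actual = actuals.get(metric)
        if actual is None or goal == 0:
            continue  # nothing to compare against
        deviation = (actual - goal) / goal
        if abs(deviation) > tolerance:
            alerts[metric] = round(deviation, 3)
    return alerts


# Hypothetical plan figures and the values produced by decision support processing.
goals = {"revenue": 1000.0, "new_customers": 50.0}
actuals = {"revenue": 820.0, "new_customers": 52.0}

alerts = out_of_line(actuals, goals)
```

Here only `revenue` is flagged, because it falls 18 percent short of its goal, while `new_customers` is within tolerance. A real performance management tool would route such an alert to the responsible business user.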
There is increasing need in organizations for solutions that can exploit decision support processing for day-to-day business decision making, i.e., for operational reporting and analysis. Operational decision support processing is about making organizations more responsive. This may involve the querying and analysis of current or low-latency operational data, business intelligence-driven alerts and automated decision making, or rules-driven recommendations and predictive analysis.
From a data perspective, operational decision support processing may be performed directly against operational data, or using a low-latency data store such as an ODS. Operational data can be used directly when limited data transformation is required, and when data query and analysis volumes and complexity are low. When significant data transformation and analysis are required, and where some level of data latency can be tolerated, the use of a low-latency store is the better approach.
The use of a low-latency store causes considerable confusion because there are multiple uses of such a store. In all cases, however, the motivation for creating a low-latency store is to integrate and clean operational data. If the operational source data were integrated and consistent, then such a store would not be required. The same is true of data warehousing; if source systems maintained integrated, consistent, current, and historical data, then a data warehouse would not be required.
A low-latency data store has several uses. Some of these are associated with operational processing, and some are related to decision support. In reality, however, the distinction between these two types of processing is often fuzzy and overlapping, and is often only relevant from a political and organizational perspective.
In operational processing, a low-latency data store can be used for integrating operational and master data. This data can be used as a base for new operational applications, for the staged migration of older legacy applications, and for propagating data to downstream applications. For decision support processing, a low-latency store can be used for staging data into a data warehouse, and for reporting and analysis.
The controversy surrounding a low-latency store concerns whether such a store can be justified purely for decision support processing, i.e., for operational reporting and analysis. In many applications, the business benefits obtained can justify the building of the store. In other cases, they cannot. This is why some organizations are looking for ways of doing operational reporting and analysis without the need to build a low-latency store, and why business activity monitoring (BAM) solutions from application integration and independent BAM vendors are attractive. These solutions can report on and analyze business processes in memory during operational processing without the need to create a separate data store. This type of BAM software is also attractive because, unlike many data integration techniques, it is event-driven and tied to a business process. Note, however, that BAM is not suitable for tactical or strategic decision support.
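The event-driven, in-memory nature of this style of monitoring can be sketched as follows. This is a minimal illustration, not a description of any vendor's product: the order process, the event fields (`type`, `order_id`, `ts`), and the 300-second threshold are all assumed for the example.

```python
class OrderMonitor:
    """Keeps running in-memory metrics for a hypothetical order process."""

    def __init__(self, max_open_seconds):
        self.max_open_seconds = max_open_seconds
        self.opened_at = {}   # order_id -> timestamp of its 'created' event
        self.completed = 0    # orders that have been shipped
        self.alerts = []      # (order_id, elapsed) for orders that took too long

    def on_event(self, event):
        """Handle one process event as it occurs; nothing is written to a data store."""
        kind, order_id, ts = event["type"], event["order_id"], event["ts"]
        if kind == "created":
            self.opened_at[order_id] = ts
        elif kind == "shipped" and order_id in self.opened_at:
            elapsed = ts - self.opened_at.pop(order_id)
            self.completed += 1
            if elapsed > self.max_open_seconds:
                self.alerts.append((order_id, elapsed))


# Hypothetical events emitted by the business process, in arrival order.
events = [
    {"type": "created", "order_id": "A", "ts": 0},
    {"type": "created", "order_id": "B", "ts": 10},
    {"type": "shipped", "order_id": "A", "ts": 50},
    {"type": "shipped", "order_id": "B", "ts": 400},
]

monitor = OrderMonitor(max_open_seconds=300)
for event in events:
    monitor.on_event(event)
```

Because the monitor holds only current process state, it has no history to support tactical or strategic analysis, which is consistent with the limitation noted above.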
Another approach that supports operational reporting and analysis without the need to create a separate data store is Enterprise Information Integration (EII). This technology provides a federated query server that can retrieve and integrate data from several non-integrated data sources. Such a server is used primarily for querying and accessing structured operational data. Some products, however, also support unstructured data stores. The results of a federated query can be processed by operational applications or by decision support query and analysis tools. These results could also be used to feed an ETL tool for building a data warehouse. Given that most EII products are not event-driven, they are not suitable for building a low-latency data store where the latency has to be as close to real-time as possible.
When reviewing EII products, several questions arise. A key one is how EII differs from the traditional distributed query processing of relational DBMS products. The answer is that EII is an evolution of the facilities provided by these products. The emphasis of EII is on optimizing access to heterogeneous data, and on providing a single view of this data. Products, however, vary in their capabilities. IBM’s DB2 Information Integrator, for example, places strong emphasis on the ability to access unstructured data (it also provides a data replication capability). Other, non-DBMS products, such as MetaMatrix and Certive, emphasize metadata and analysis power. Application integration vendors, such as BEA with Liquid Data, are also targeting the EII space.
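To make the federated query idea concrete, here is a small Python sketch in which a mediator function queries two separate sources at request time and combines the results, without copying the data into a new store. The two in-memory SQLite databases, their table layouts, and the function `federated_totals` are stand-ins assumed for illustration; real EII products add query optimization, source adapters, and metadata management that this sketch omits.

```python
import sqlite3

# Two independent, non-integrated sources (in-memory stand-ins).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 250.0), (2, 75.5), (1, 100.0)],
)

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm_db.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)


def federated_totals(orders_conn, crm_conn):
    """Query each source when asked and join the results in the mediator."""
    totals = dict(orders_conn.execute(
        "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"
    ))
    names = crm_conn.execute("SELECT id, name FROM customers ORDER BY id")
    return [(name, totals.get(cid, 0.0)) for cid, name in names]


view = federated_totals(orders_db, crm_db)
```

The caller sees a single combined view, but every request goes back to the live sources, so the result is only as current, and only as clean, as those sources are.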
At the heart of the EII controversy is whether EII negates the need for a data warehouse, i.e., can EII be used for virtual data warehousing. Although EII provides capabilities over and beyond those supplied by traditional relational DBMS products (which have been used for virtual data warehousing in the past), it still cannot solve the data quality problems that exist in many operational systems. Adding EII on the front of operational data is like adding a portal on the front of multiple corporate systems. The technology may provide a single access point to disparate information, but it cannot hide the integration problems that may exist. EII is therefore suitable for accessing live operational data when zero data latency is required, and where significant data integration problems do not exist, i.e., where complex data transformation and analysis is not required.
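A short sketch shows why a single access point does not hide integration problems. The two record sets, the customer names, and the helper `naive_join` are hypothetical; the point is only that a query-time join cannot match records whose identifying values were never reconciled.

```python
# Hypothetical records from two operational systems that name the same
# customer differently.
billing = [
    {"customer": "ACME Corp.", "balance": 1200.0},
    {"customer": "Globex", "balance": 300.0},
]
support = [
    {"customer": "Acme Corporation", "open_tickets": 4},
    {"customer": "Globex", "open_tickets": 1},
]


def naive_join(left, right, key):
    """Join two record lists on exact equality of `key`, as a simple federated view would."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


joined = naive_join(billing, support, "customer")

# Customers that silently drop out of the combined view.
support_names = {row["customer"] for row in support}
unmatched = [row["customer"] for row in billing if row["customer"] not in support_names]
```

Only Globex appears in `joined`; the Acme records are lost because the two systems disagree on the customer's name. Resolving that mismatch is the kind of cleansing and transformation work traditionally done when loading a data warehouse or low-latency store.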
One final point that should be made is that EII is still a data-driven technology and, like most data integration technologies, suffers from a complete disregard for business semantics and business processes. To succeed, EII vendors need to place as much emphasis on solving metadata issues as they do on promoting how many data sources they can support. If they don’t solve this issue, then it will be difficult to integrate this technology into the overall enterprise integration stack.