by Rick van der Lans

December 2012

Data Virtualization for Agile Business Intelligence Systems

The biggest challenge facing the business intelligence industry today is how to develop business intelligence systems that have an agility level that matches the speed with which the business evolves.
If the industry fails in this, current business intelligence systems will slowly become obsolete and will weaken the organization’s decision-making strength. Now that it is clear the economic recession will not pass soon and businesses have to operate more competitively, increasing the agility of business intelligence systems should be number one on every organization’s list of business intelligence requirements. Agility is becoming a crucial property of every business intelligence system.
A recent study by TDWI showed plainly that many business intelligence systems are not agile. Here are some figures from that study:
• The average time needed to add a new data source to an existing business intelligence system was 8.4 weeks in 2009, 7.4 weeks in 2010, and 7.8 weeks in 2011.
• 33% of the organizations needed more than 3 months to add a new data source.
• Developing a complex report or dashboard with about 20 dimensions, 12 measures, and 6 user access rules took on average 6.3 weeks in 2009, 6.6 weeks in 2010, and 7 weeks in 2011, which shows that the situation is not improving over the years.
• 30% of the respondents indicated they needed 3 months or more for such a development exercise.
It’s not simple to pinpoint why most current business intelligence systems are not that agile; no single aspect makes them static. But undoubtedly one of the dominant reasons is the database-centric solution that forms the heart of so many business intelligence systems. The architectures of most business intelligence systems are based on a chain of data stores. Examples of such data stores are production databases, a data staging area, a data warehouse, data marts, and some personal data stores (PDS). The latter can be a small file or a spreadsheet used by one or two business users. In some systems an operational data store (ODS) is included as well. These data stores are chained by transformation logic that copies data from one data store to another; ETL and replication are the technologies commonly used for this copying. In this article we call systems with this architecture classic business intelligence systems.
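To make this chain concrete, the following sketch shows a single link in such a chain: an ETL step that extracts rows from a production database, applies a transformation, and loads the result into a staging area. It is only a minimal Python illustration; the table names and the cleansing rule are hypothetical, and SQLite stands in for real production and staging databases. Comparable steps would copy the data onward to the data warehouse and the data marts.

    import sqlite3

    # Hypothetical production database and staging area, kept in memory here so
    # the sketch is self-contained; in practice these are separate servers.
    production = sqlite3.connect(":memory:")
    staging = sqlite3.connect(":memory:")

    # Stand-in for an existing production table.
    production.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
        "amount REAL, order_date TEXT)")
    production.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(1, 101, 250.0, "2012-11-30"), (2, 102, -10.0, "2012-12-01")])

    # Extract: read the raw rows from the production data store.
    rows = production.execute(
        "SELECT order_id, customer_id, amount, order_date FROM orders").fetchall()

    # Transform: a simple cleansing rule (drop rows with a negative amount).
    cleansed = [r for r in rows if r[2] >= 0]

    # Load: copy the result into the staging area; later steps copy it onward
    # to the data warehouse, the data marts, and possibly an ODS or PDS.
    staging.execute(
        "CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER, "
        "amount REAL, order_date TEXT)")
    staging.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", cleansed)
    staging.commit()

Every extra data store in the chain adds another copy of the data and another set of transformation specifications to develop and maintain, which is exactly what limits agility.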
The reason why so many business intelligence systems have been designed and developed this way has to do with the state of software and hardware technology over the last twenty years. These technologies had their limitations with respect to performance and scalability, and therefore, on the one hand, the reporting and analytical workload had to be distributed over multiple data stores, and on the other hand, transformation and cleansing processing had to be broken down into multiple steps.
Data virtualization is a technology that can help make business intelligence systems more agile. It simplifies the development of reports through aspects such as unified data access; data store independence; centralized data integration, transformation, and cleansing; consistent reporting results; data language translation; minimal data store interference; simplified data structures; and efficient distributed data access.
Data virtualization accomplishes this by decoupling reports from data structures, by integrating data on demand, and by managing metadata specifications centrally without having to replicate them. This decoupling is the primary reason for the increased agility, and it is what makes data virtualization such a suitable technology for developing agile business intelligence systems.
When data virtualization is applied, an abstraction and encapsulation layer is provided that hides from applications most of the technical aspects of how and where data is stored. Because of that layer, applications don’t need to know where all the data is physically stored, how the data should be integrated, where the database servers run, what the required APIs are, which database language to use, and so on. When data virtualization technology is deployed, every application feels as if it is accessing one large database.
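The following sketch illustrates that “one large database” feeling. It is only an emulation: SQLite files stand in for two separate data stores, a single connection with attached databases plays the role of the data virtualization server, and all names are hypothetical. The point is that the data consumer issues one query against one virtual view and never sees where the underlying data physically lives.

    import os, sqlite3, tempfile

    # Two separate data stores (hypothetical names); SQLite files are used here
    # so the sketch runs standalone, but they could be different servers.
    workdir = tempfile.mkdtemp()
    crm_path = os.path.join(workdir, "crm.db")
    dwh_path = os.path.join(workdir, "dwh.db")

    with sqlite3.connect(crm_path) as crm:
        crm.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
        crm.execute("INSERT INTO customers VALUES (101, 'Acme Corp')")
    with sqlite3.connect(dwh_path) as dwh:
        dwh.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
        dwh.executemany("INSERT INTO sales VALUES (?, ?)", [(101, 250.0), (101, 75.0)])

    # The "virtualization layer": one connection that reaches both stores and
    # publishes a single virtual view; only specifications live here, no data.
    layer = sqlite3.connect(":memory:")
    layer.execute("ATTACH DATABASE ? AS crm", (crm_path,))
    layer.execute("ATTACH DATABASE ? AS dwh", (dwh_path,))
    layer.execute("""CREATE TEMP VIEW customer_sales AS
                     SELECT c.name, s.amount
                     FROM crm.customers c JOIN dwh.sales s
                       ON c.customer_id = s.customer_id""")

    # The data consumer (a report) queries what feels like one large database.
    print(layer.execute(
        "SELECT name, SUM(amount) FROM customer_sales GROUP BY name").fetchall())

A real data virtualization server adds much more (optimization, caching, security, many source types), but the essential idea is this separation between the consumer’s view and the physical data stores.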
The neutral term data consumer refers to any application that retrieves, enters, or manipulates data. For example, a data consumer can be an online data entry application, a reporting application, a statistical model, an internet application, a batch application, or an RFID sensor. Likewise, the term data store refers to any source of data. This data source can be anything: a table in a SQL database, a simple text file, an XML document, a spreadsheet, a web service, a sequential file, an HTML page, and so on. In some cases a data store is just a passive file; in others it’s a data source that includes the software to access the data. The latter applies, for example, to database servers and web services.
The concepts of data consumer and data store are key to the definition of data virtualization:
Data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.
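To illustrate the “heterogeneous set of data stores” part of this definition, the short sketch below shows what it means for a passive file to act as a data store. The CSV content and table name are hypothetical, and SQLite merely emulates the wrapper a data virtualization server would define; in a real product the file would be read on demand rather than imported.

    import csv, io, sqlite3

    # A passive data store: a spreadsheet-style CSV (here kept in memory).
    # It holds data but contains no software of its own to query it.
    targets_csv = io.StringIO("region,target\nEMEA,500000\nAMER,750000\nAPAC,300000\n")

    # A data virtualization server would define a wrapper that makes the file
    # look like a relational table; here that wrapper is emulated with SQLite.
    wrapper = sqlite3.connect(":memory:")
    wrapper.execute("CREATE TABLE targets (region TEXT, target REAL)")
    wrapper.executemany("INSERT INTO targets VALUES (?, ?)",
                        [(row["region"], float(row["target"]))
                         for row in csv.DictReader(targets_csv)])

    # The data consumer (for example a reporting tool) can now query the
    # passive file with ordinary SQL, exactly as it would query a table.
    print(wrapper.execute(
        "SELECT region, target FROM targets WHERE target >= 500000").fetchall())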
Typical business intelligence application areas that experience an increased level of agility are virtual data marts, self-service reporting and analytics, operational reporting and analytics, interactive prototyping, virtual sandboxing, collaborative development, and disposable reports.
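As an illustration of the first area in this list, a virtual data mart consists of view definitions only; creating, changing, or dropping one is a metadata operation, and no data is copied. The sketch below uses a hypothetical schema, with SQLite standing in for the data warehouse server, to show such a disposable, virtual mart.

    import sqlite3

    # Stand-in for a data warehouse with a detailed fact table (hypothetical schema).
    dwh = sqlite3.connect(":memory:")
    dwh.execute("CREATE TABLE fact_sales (sale_date TEXT, region TEXT, "
                "product TEXT, amount REAL)")
    dwh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
        ("2012-11-30", "EMEA", "widget", 120.0),
        ("2012-12-01", "EMEA", "gadget", 80.0),
        ("2012-12-01", "AMER", "widget", 200.0),
    ])

    # A "virtual data mart": view definitions only, no copied data. Creating or
    # dropping it changes metadata, not the data footprint of the system.
    dwh.execute("""CREATE VIEW mart_sales_by_region AS
                   SELECT region, SUM(amount) AS total_amount
                   FROM fact_sales GROUP BY region""")

    # A report or self-service user queries the virtual mart directly.
    print(dwh.execute("SELECT * FROM mart_sales_by_region").fetchall())

    # When the analysis is finished, the disposable mart is simply dropped.
    dwh.execute("DROP VIEW mart_sales_by_region")

Because nothing is physically materialized, such a mart can be set up for a prototype, a sandbox, or a one-off report and removed again without any unloading or cleanup of copied data.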
To summarize, deploying data virtualization in business intelligence systems leads to more lightweight architectures with a smaller data footprint, resulting in more agile systems.
Note: For more information on data virtualization, we refer to Rick van der Lans’ book entitled Data Virtualization for Business Intelligence Systems, released in the summer of 2012.