by Derek Strauss

November 2010

Paying attention to metadata in DW2.0

To succeed with second generation data warehousing (DW 2.0), organizations need to pay attention to metadata. Metadata is the key to reusability of data and analysis. With metadata the analyst can find out what has already been built. Without metadata, the analyst has a hard time finding out what data structures and infrastructure have already been built.

METADATA IN DW 2.0

Metadata is loosely defined as data about data. Though this definition is easy to remember, it is not very precise. The strength of this definition is in recognizing that metadata is data. As such, metadata can be stored and managed in a database, often called a registry or repository. Metadata is a concept that applies mainly to electronic data and is used to describe the definition, structure and administration of data files and their contents and context. In a broader sense, metadata provides meaning to all enterprise artifacts, including business processes, technology platforms, and so forth.
Metadata is one of the essential ingredients of the DW 2.0 environment. Unlike first generation data warehouses, where metadata either was not present or was an afterthought, metadata is one of the cornerstones of the DW 2.0 data warehouse.
There are many reasons why metadata is so important. Metadata is important to the developer, who must align new efforts with work that has been done previously. Metadata is important to the maintenance technician, who must deal with the day-to-day issues of keeping the data warehouse in order. But metadata is perhaps most important to the end user, who needs to find out what the possibilities are for new analysis.

REUSABILITY OF DATA AND ANALYSIS

Consider the end user. The end user feels the need for information. That need may come from a management directive, from a corporate mandate, or simply from the end user's own curiosity. However it arises, the end user must decide how to approach the analysis, and metadata is the logical place to turn. With metadata the analyst can determine what data is available. Once the analyst has identified the most likely place to start, the analyst can proceed to access the data.
Without metadata the analyst has a really hard time determining what the possible sources of data are. The analyst may be familiar with some sources of data. But it is questionable whether the analyst is aware of all of the possibilities. In this case the existence of metadata may save huge amounts of unnecessary work.
In the same vein, the end user needs to use metadata to determine if an analysis has already been done. Answering a question may be as simple as merely looking at what someone else has done. But without metadata, the end user analyst will never know what has already been done.
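As a rough illustration of the lookup described above, the following sketch models a tiny in-memory metadata registry and searches it for existing data sets and analyses. The `MetadataEntry` class, the `find_entries` function, and the sample entries are all hypothetical names invented for this example; a real DW 2.0 repository would be a managed database, not a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    """One record in a hypothetical metadata registry."""
    name: str
    kind: str  # "dataset" or "analysis" (illustrative categories only)
    description: str
    keywords: set = field(default_factory=set)


def find_entries(registry, term, kind=None):
    """Return entries whose name, description, or keywords mention the term."""
    term = term.lower()
    matches = []
    for entry in registry:
        if kind is not None and entry.kind != kind:
            continue
        haystack = [entry.name.lower(), entry.description.lower()]
        haystack.extend(k.lower() for k in entry.keywords)
        if any(term in text for text in haystack):
            matches.append(entry)
    return matches


# Sample entries, invented for illustration.
registry = [
    MetadataEntry("customer_master", "dataset",
                  "Integrated customer records", {"customer"}),
    MetadataEntry("churn_by_region", "analysis",
                  "Quarterly customer churn report", {"churn", "region"}),
    MetadataEntry("shipment_history", "dataset",
                  "Archived shipment events", {"logistics"}),
]

# Has a churn analysis already been done?
existing = find_entries(registry, "churn", kind="analysis")
# Which data sets describe customers?
sources = find_entries(registry, "customer", kind="dataset")
```

The point of the sketch is only that both questions, "what data is available?" and "has this analysis been done?", become simple queries once the metadata is stored as data.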
For these reasons then (and plenty more!), metadata is a very important component of the DW 2.0 environment.

THE LOCATION OF METADATA IN DW 2.0

Metadata has a special place in the DW 2.0 environment. There is separate metadata for each sector of DW 2.0: the interactive sector, the integrated sector, the near line sector, and the archival sector.
The metadata for the archival sector is different from the other metadata in that the metadata for the archival sector is placed directly in the archival data. The reason for this is that over time the archival metadata must not be separated from the archival data that is being described.
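One way to picture metadata that travels with the archival data is a self-describing file in which the description and the rows are written together. The sketch below assumes a JSON file layout, and the function names `write_archive` and `read_archive`, the table name, and the sample values are hypothetical; DW 2.0 does not prescribe this format.

```python
import json
import os
import tempfile


def write_archive(path, metadata, rows):
    """Write metadata and data into one file so they cannot be separated."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"metadata": metadata, "rows": rows}, f, indent=2)


def read_archive(path):
    """Read the file back and return (metadata, rows)."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return doc["metadata"], doc["rows"]


# Illustrative metadata and data, invented for this example.
metadata = {
    "table": "orders_2005",
    "columns": [
        {"name": "order_id", "type": "integer", "meaning": "order identifier"},
        {"name": "total", "type": "decimal", "meaning": "order value"},
    ],
}
rows = [[1001, 250.0], [1002, 99.5]]

path = os.path.join(tempfile.mkdtemp(), "orders_2005.archive.json")
write_archive(path, metadata, rows)
restored_metadata, restored_rows = read_archive(path)
```

Whoever opens the file years later gets the column meanings along with the values, even if the original repository no longer exists.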
There is a general structure for metadata as it exists in DW 2.0. There really are two parallel structures – one metadata structure for the unstructured environment and one metadata structure for the structured environment.
For unstructured data, there are really two types of metadata – enterprise metadata and local metadata. The enterprise metadata is also referred to as general metadata, and the local metadata is also referred to as specific metadata.
For structured data, there are three levels – enterprise metadata, local metadata, and business and technical metadata. There is an important relationship between these different kinds of metadata. The best place to start to explain that relationship is at the local metadata level.
Local metadata is a good place to start because most people have the most familiarity with that type of metadata. Local metadata exists in many places and in many forms. Local metadata exists inside ETL processes. Local metadata exists inside a DBMS directory. Local metadata exists inside a business intelligence universe.
Local metadata is metadata that exists inside a tool and describes the data immediate to that tool. ETL metadata is metadata about sources and targets and the transformations that take place as data is passed from source to target. DBMS directory metadata is metadata about tables, attributes, indexes, and the like. BI universe metadata is metadata about data used in analytical processing. There are many more forms of local metadata beyond these common ones.
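DBMS directory metadata is the easiest form of local metadata to see first hand. The sketch below uses SQLite, chosen here only because it ships with Python, and reads table, column, and index names from the database's own catalog. The `customer` table, the index name, and the `describe_tables` helper are made up for the example.

```python
import sqlite3

# A throwaway in-memory database with one table and one index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")


def describe_tables(conn):
    """Collect table, column, and index names from SQLite's catalog."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [
            (row[1], row[2])
            for row in conn.execute(f"PRAGMA table_info({table})")
        ]
        # index_list rows start with (seq, name, ...).
        indexes = [
            row[1] for row in conn.execute(f"PRAGMA index_list({table})")
        ]
        catalog[table] = {"columns": columns, "indexes": indexes}
    return catalog


catalog = describe_tables(conn)
```

Note that the table names interpolated into the `PRAGMA` statements come from the catalog itself, not from user input. Other database systems expose the same kind of information through their own catalogs.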
Local metadata is stored in a tool or technology that is central to the usage of the local metadata. Enterprise metadata, on the other hand, is stored in a location that is central to all of the tools and all of the processes that exist within DW 2.0.
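The relationship between local and enterprise metadata can be sketched as a central store that each tool registers its local metadata with. Everything in the sketch below is an assumption made for illustration: the `EnterpriseRepository` class, its `register` and `where_used` methods, the tool labels, and the sample element names are hypothetical and do not describe any particular product.

```python
from collections import defaultdict


class EnterpriseRepository:
    """Hypothetical central store that indexes local metadata by data element."""

    def __init__(self):
        self._by_element = defaultdict(list)

    def register(self, tool, element, details):
        """Record what one tool knows locally about a data element."""
        self._by_element[element].append({"tool": tool, **details})

    def where_used(self, element):
        """List the tools holding local metadata about the element."""
        return [entry["tool"] for entry in self._by_element.get(element, [])]


repo = EnterpriseRepository()
# Each call stands in for one tool publishing its local metadata.
repo.register("etl", "customer_id",
              {"source": "crm.cust_no", "target": "dw.customer.customer_id"})
repo.register("dbms", "customer_id", {"table": "customer", "type": "INTEGER"})
repo.register("bi", "customer_id", {"object": "Customer Id"})
repo.register("dbms", "order_total", {"table": "orders", "type": "REAL"})

tools = repo.where_used("customer_id")
```

Each tool keeps its own local metadata, while the central store gives one place to ask where a data element appears across the environment.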

IN SUMMARY

Metadata is the key to reusability of data and analysis. With metadata the analyst can find out what has already been built. Without metadata, the analyst has a hard time finding out what data structures and outputs have already been built.
Metadata comes in four kinds: enterprise, local, business, and technical. There is metadata for both the structured environment and the unstructured environment. Archival metadata is stored directly in the archival environment. By storing the metadata directly in the physical storage of archival data, a time capsule of data can be created.

WHAT IS DW 2.0?

In the two decades that data warehousing has been around, there has been much change. Older technologies have matured, there is new technology, and organizations have accepted Business Intelligence as a standard part of the infrastructure. Today there are many different renditions of what a data warehouse is: an active data warehouse, a federated data warehouse, a star schema data warehouse, and so forth. Unfortunately, no two of these types of data warehouse are the same. There is no integrity in the definition of what a data warehouse is. In addition, first generation data warehouses have failed to take into account many important requirements that are now recognized as legitimate aspects of data warehousing. Now there is DW 2.0, which is the definition of data warehouse architecture for the future of data warehousing.
Some of the more prominent features of DW 2.0 include the recognition of the life cycle of data within the data warehouse; inclusion of unstructured data along with structured data inside the data warehouse; inclusion of metadata as a tightly integrated part of the data warehouse; matching of unstructured data to structured data; and the ability to seamlessly handle massive amounts of data.