Choosing An Enterprise Information Catalog
Over recent years, the demand to analyse new types of data has caused significant changes to the analytical landscape. Many companies today are way beyond just having a data warehouse. The demand now is to also capture, process and analyse new structured, semi-structured and unstructured data from internal and external sources that is not held in a traditional data warehouse. As a result, new types of analytical workloads are needed to derive insight from these new types of data, and this has resulted in new data stores and analytical platforms being used in addition to the data warehouse. These include cloud storage, NoSQL column family databases and key value stores capable of rapidly ingesting data, NoSQL graph DBMSs for graph analysis, Hadoop, and streaming data analytics platforms. All of these are now in play. Companies are now creating analytical ecosystems consisting of multiple data stores, including the traditional data warehouse.
The problem with multiple analytical data stores on-premises and in the cloud, however, is that complexity has increased. Also, different types of data are being ingested into all of these data stores. As a result, many companies are facing the reality that they do not have a centralised data lake with all data in one data store, but instead have a distributed data lake with multiple data stores that may include multiple Hadoop systems, relational DBMSs, NoSQL data stores and cloud storage.
In this kind of setup, it is hard to know what data is located where. Data relationships across multiple data stores are also often unknown. We often don’t know what kind of data preparation is going on where, or what analytical models exist to analyse prepared data. The emergence of self-service data preparation tools has made this even more challenging, with both IT and business users now preparing and integrating data.
The problem is that there is no common place to tell you what data is available, what data preparation jobs exist and what analytical models exist that you can potentially reuse rather than reinvent. Business users have no place to go to find out if trusted, prepared and integrated data is already in existence that could satisfy their needs and save them time.
The answer to managing all these issues is to establish an enterprise information catalog. Information catalog technology enables you to see what data and artefacts exist across multiple data stores both on-premises and in the cloud. It is now central to data governance as well as analytics.
When looking to buy an information catalog some key capabilities to look for include the ability to:
- Nominate / bookmark and register data sources
- Automatically discover data to understand what data exists in sources, data lakes and analytical data stores that may contain both raw ingested data and trusted data already cleaned and integrated in data warehouses, data marts and master data management systems. This would include automated discovery of data in RDBMSs, Hadoop, cloud storage and NoSQL databases. During automatic discovery it should be possible to:
o Use built-in machine learning to automatically tag/label (name) and annotate individual data fields to indicate the meaning of the data
o Use built-in machine learning to automatically recognise data that matches out-of-the-box or user-defined patterns to instantly determine what data means
o Automatically discover the same, similar and related data across multiple data stores, regardless of whether the names used for that data are different
o Automatically profile data to understand the quality of every item
o Automatically derive data lineage to understand where data came from
o Automatically discover personally identifiable information (PII)
o Automatically detect change (a critical requirement)
- Allow users to manually tag data to introduce it into the catalog
- Create roles within the catalog e.g. data owners, data experts, data curators/producers, data stewards, approvers, catalog editors, consumers
- Allow virtual communities to be created and maintained to allow people to:
o Curate, collaborate over, and manually override tags automatically generated by the software during automatic discovery
o Collaborate over other artefacts published in the catalog e.g. ETL jobs, self-service data preparation jobs, analytical models, dashboards, BI reports, etc.
- Define a set of common business terms in a catalog business glossary and/or import business glossary terms into a catalog business glossary, so that they can be used to tag data published in the catalog and make clear what the data means
- Automatically tag data at field level to know what it means
- Tag data at the dataset, folder, database and collection level.
- Support multiple pre-defined ‘out-of-the-box’ data governance classification (tagging) schemes that indicate levels of data confidentiality, data retention, and data trustworthiness (quality). The purpose of these schemes is to be able to label data with a specific level of confidentiality and with a specific level of retention in order to know how to govern it in terms of data protection and data retention.
- Add user-defined data governance classification schemes to allow data to be tagged/labelled in accordance with these schemes in order to know how to organise and govern it.
- Automate data classification by making use of pre-defined patterns and user-defined patterns (e.g. regular expressions or reference lists) to automatically identify and classify specific kinds of data in a data lake e.g. to recognise a social security number, an email address, a company name, a credit card number, etc.
- Automate data classification using artificial intelligence to observe, learn and predict the meaning of data in a data lake
- Allow manual tagging of data and other artefacts in the catalog to specify data meaning and to allow the data to be correctly governed.
- Allow multiple governance and use tags to be placed on data including:
o A level of confidentiality tag e.g. to classify it as PII
o A level of quality tag
o A level of data retention tag
o A business use tag e.g. customer engagement, risk management, etc.
o A file-level tag to indicate a file's retention level or which processing zone it resides in within a data lake e.g. ingestion zone, approved raw data zone, data refinery zone, trusted data zone
- Automatically propagate tags by using machine learning to recognise similar data across multiple data stores
- Define, manage and attach policies and rules to specific tags (e.g. a Personally Identifiable Information tag) to know how to consistently govern all data in the catalog that has been labelled with that same tag
- Import the metadata from 3rd party tools to automatically discover, classify and publish the following in the catalog:
o IT developed ETL jobs
o Self-service data preparation jobs (also known as ‘recipes’)
o BI tool artefacts (queries, reports, dashboards)
o Analytical models
o Data science notebooks
o Virtual tables in a data virtualization server
to understand what is available across the distributed data lake to prepare, query, report and analyse data held within it
- Manually classify (tag) and publish IT developed ETL jobs, self-service data preparation ‘recipes’, virtual tables, BI queries, reports, dashboards, analytical models and data science notebooks to the catalog
- Create a ‘data marketplace’ within the catalog to offer up data and insights as a service to subscribing consumers
- Support faceted search to zoom in on and find ‘ready to use’ data and other analytical assets like reports, dashboards, and models published in the catalog that a user is authorised to see
- Understand relationships across data and artefacts in the catalog to make recommendations on related data
- Allow users to easily see lineage from end-to-end in technical and business terms and navigate relationships to explore related data
- Integrate the catalog with other tools and applications via REST APIs
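To make the pattern-based classification and tag-driven policy items in the list above more concrete, here is a minimal Python sketch. It is not how any particular catalog product works. The pattern names, the simplified regular expressions, the 80% match threshold and the policy actions are all assumptions made for illustration, and real products ship far more robust patterns and validation.

```python
import re

# Hypothetical, deliberately simplified user-defined patterns (assumptions,
# not taken from any product). Each key is the tag applied on a match.
PATTERNS = {
    "US_SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "CREDIT_CARD": re.compile(r"(?:\d{4}[- ]?){3}\d{4}"),
}

# Hypothetical governance policies attached to tags, so that all data
# carrying the same tag is governed in the same way.
POLICIES = {
    "US_SSN": "mask",
    "EMAIL": "mask",
    "CREDIT_CARD": "encrypt",
}


def classify_column(values, threshold=0.8):
    """Return the first tag whose pattern fully matches at least
    `threshold` of the non-empty sample values, or None if no tag does."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            return tag
    return None


def policy_for(tag):
    """Look up the policy attached to a tag; untagged data has none."""
    return POLICIES.get(tag, "no policy")


if __name__ == "__main__":
    sample = ["123-45-6789", "987-65-4321", ""]
    tag = classify_column(sample)
    print(tag, "->", policy_for(tag))
```

Matching on a share of sampled values rather than on every value is one way to tolerate dirty data such as blanks or placeholders. The right threshold is a judgment call that depends on the data.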
Information catalogs have a critical role in helping companies find data to develop analytics to maximise business value. They also help shorten the time to value and allow you to govern data across multiple data stores. All of this and more will be discussed at the International Big Data Conference in Rome on 3rd-4th December, 2018. I hope you can join us.