What Is Big Data and Why Do We Need It?
Global digital content created will increase some 30 times over the next ten years – to 35 zettabytes
Big data is a popular but poorly defined marketing buzzword. One way of looking at big data is that it represents the large and rapidly growing volume of information that is mostly untapped by existing analytical applications and data warehousing systems. Examples of this data include high-volume sensor data and social networking information from web sites such as Facebook and Twitter. Organizations are interested in capturing and analyzing this data because it can add significant value to the decision-making process. Such processing, however, may involve complex workloads that push the boundaries of what is possible using traditional data warehousing and data management techniques and technologies.
This article looks at the benefits that analyzing big data brings to the business. It examines different types of big data and offers suggestions on how to optimize systems to handle different workloads and integrate them into a single infrastructure.
What Is Big Data?
It is important to realize that big data comes in many shapes and sizes. It also has many different uses – real-time fraud detection, web display advertising and competitive analysis, call center optimization, social media and sentiment analysis, intelligent traffic management and smart power grids, to name just a few. All of these analytical solutions involve significant (and growing) volumes of both multi-structured and structured data.
Many of these analytical solutions were not possible previously because they were too costly to implement, or because analytical processing technologies were not capable of handling the large volumes of data involved in a timely manner. In some cases, the required data simply did not exist in an electronic form.
New and evolving analytical processing technologies now make possible what was not possible before. Examples include:
- New data management systems that handle a wide variety of data from sensor data to web and social media data.
- Improved analytical capabilities (sometimes called advanced or big analytics) including event, predictive and text analytics.
- Faster hardware ranging from faster multi-core processors and large memory spaces, to solid-state drives and tiered data storage for handling hot and cold data.
Supporting big data involves combining these technologies to enable new solutions that can bring significant benefits to the business.
Managing and Analyzing Big Data
For the past two decades most business analytics have been created using structured data extracted from operational systems and consolidated into a data warehouse. Big data dramatically increases both the number of data sources and the variety and volume of data that is useful for analysis. A high percentage of this data is often described as multi-structured to distinguish it from the structured operational data used to populate a data warehouse. In most organizations, multi-structured data is growing at a considerably faster rate than structured data.
Two important data management trends for processing big data are relational DBMS products optimized for analytical workloads (often called analytic RDBMSs, or ADBMSs) and non-relational systems (sometimes called NoSQL systems) for processing multi-structured data. A non-relational system can be used to produce analytics from big data, or to preprocess big data before it is consolidated into a data warehouse.
Analytic RDBMSs (ADBMSs)
An analytic RDBMS is an integrated solution for managing data and generating analytics that offers better price/performance, simpler management and administration, and faster time to value than more generalized RDBMS offerings. Performance improvements are achieved through the use of massively parallel processing, enhanced data structures, data compression, and the ability to push analytical processing into the DBMS.
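As a rough illustration of what pushing analytical processing into the DBMS means, the sketch below computes the same totals two ways: by pulling every row into the application, and by letting the database aggregate and return only the results. It uses SQLite purely because it ships with Python; SQLite is not an analytic RDBMS, and the `sales` table, its columns, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0), ("west", 25.0)],
)

def totals_client_side(conn):
    """Pull every row out of the database and aggregate in the application."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def totals_in_database(conn):
    """Let the DBMS aggregate; only one row per region crosses the wire."""
    rows = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return dict(rows)

client = totals_client_side(conn)
pushed = totals_in_database(conn)
```

Both functions produce the same totals. The difference is where the work happens: with large tables, the second form moves far less data out of the database, which is the effect the in-database approach aims for.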
ADBMSs can be categorized into two broad groups: packaged hardware and software appliances, and software-only platforms.
Packaged hardware and software appliances fall into two sub-groups: purpose-built appliances and optimized hardware/software platforms. The objective in both cases is to provide an integrated package that can be installed and maintained as a single system. Depending on the vendor, the dividing line between the two sub-groups is not always clear, and this is why in this article they are both categorized simply as appliances.
A purpose-built appliance is an integrated system built from the ground up to provide good price/performance for analytical workloads. This type of appliance enables the complete configuration, from the application workload to the storage system used to manage the data, to be optimized for analytical processing. It also allows the solution provider to deliver customized tools for installing, managing and administering the integrated hardware and software system.
Many of these products were developed initially by small vendors and targeted at specific high-volume business area projects that are independent of the enterprise data warehouse. As these appliances have matured and added workload management capabilities, their use has expanded to handle mixed workloads and in some cases support smaller enterprise data warehouses.
The success of these purpose-built appliances led to more traditional RDBMS vendors building packaged offerings by combining existing products. This involved improving the analytical processing capabilities of the software and then building integrated and optimized hardware and software solutions. These solutions consist of optimized hardware/software platforms designed for specific analytical workloads. The level of integration and optimization achieved varies by vendor. In some cases, the vendor may offer a choice of hardware platform.
A software-only platform is a set of integrated software components for handling analytical workloads. As with purpose-built appliances, many of these products were developed initially by small vendors and targeted at specific high-volume business area projects. They often make use of underlying open source software products and are designed for deployment on low-cost commodity hardware. The tradeoff for hardware portability is the inability of the product to exploit the performance and management capabilities of a specific hardware platform. Some software platforms are available as virtual images, which are useful for evaluation and development purposes, and also for use in cloud-based environments.
The development of ADBMS products by smaller innovative companies has led to established vendors improving the analytical processing capabilities of their database management offerings. Not only has this involved enhancing and adding new features to existing RDBMS products, but it has also led to certain vendors acquiring these smaller companies. Examples of acquisitions include EMC (Greenplum), HP (Vertica), IBM (Netezza), Microsoft (DATAllegro), and Teradata (Aster Data).
A single database model or technology cannot satisfy the needs of every organization or workload. This is true even for RDBMS technology, despite its success and universal adoption, and it is especially true when processing large amounts of multi-structured data. This is why several organizations with big data problems have developed their own non-relational systems to deal with extreme data volumes. Web-focused companies such as Google and Yahoo that have significant volumes of web information to index and analyze are examples of organizations that have built their own optimized solutions. Several of these companies have released these systems as open source software.
Non-relational systems are useful for processing big data where most of the data is multi-structured. They are particularly popular with developers who prefer to use a procedural programming language, rather than a declarative language such as SQL, to process data. These systems support several different types of data structures including document data, graphical information, and key-value pairs.
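To make the procedural style concrete, here is a minimal sketch that walks document-style records step by step. The records, field names, and the `count_tags` helper are invented for illustration and are not taken from any particular non-relational product. Note that the documents do not share a schema, which is the situation the term multi-structured describes.

```python
import json

# Hypothetical JSON documents with differing fields (no fixed schema).
raw_events = [
    '{"source": "twitter", "text": "loving the new phone", "tags": ["phone"]}',
    '{"source": "sensor", "reading": {"temp_c": 21.5}}',
    '{"source": "twitter", "text": "battery died again", "tags": ["phone", "battery"]}',
]

def count_tags(lines):
    """Procedurally parse each document and count tags where they exist."""
    counts = {}
    for line in lines:
        doc = json.loads(line)
        # Documents without a "tags" field are simply skipped.
        for tag in doc.get("tags", []):
            counts[tag] = counts.get(tag, 0) + 1
    return counts

tag_counts = count_tags(raw_events)
```

The code spells out how to traverse the data, record by record. A declarative SQL query would instead state only the desired result and would typically require the data to be loaded into tables with a defined schema first.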
One leading non-relational system is the Hadoop distributed processing system from the open source Apache Software Foundation. Apache defines Hadoop as “a framework for running applications on a large hardware cluster built of commodity hardware.” This framework includes a distributed file system (HDFS) that can distribute and manage huge volumes of data across the nodes of a hardware cluster to provide high data throughput. Hadoop uses the MapReduce programming model to divide application processing into small fragments of work that can be executed on multiple nodes of the cluster to provide massively parallel processing. Hadoop also includes the Pig and Hive languages for developing and generating MapReduce programs. Hive includes HiveQL, which provides a subset of SQL.
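The MapReduce model described above can be sketched in a few lines of plain Python. This is a single-process simulation for illustration only: it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. On a real cluster, the framework would run the map and reduce steps on different nodes and handle the grouping between them.

```python
from collections import defaultdict

def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one fragment of input."""
    return [(word.lower(), 1) for word in fragment.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(fragments):
    mapped = []
    # Each fragment is independent, so on a cluster each could be
    # mapped on a different node; here they run one after another.
    for fragment in fragments:
        mapped.extend(map_phase(fragment))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data is growing"])
```

Because each fragment is mapped independently and each key is reduced independently, both phases can be spread across many nodes, which is what makes the model suitable for massively parallel batch processing.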
Hadoop MapReduce is intended for the batch processing of large volumes of multi-structured data. It is not suitable for low-latency data processing, many small files, or the random updating of data. These latter capabilities are provided by database products such as HBase, which runs on top of Hadoop, and Cassandra, which can be integrated with it. Several companies offer commercialized open source or open core versions of Hadoop for handling big data projects.
Which DBMS To Use When?
It is important to realize that generalized relational DBMSs, analytic RDBMSs, and non-relational data systems are not mutually exclusive. Each approach has its benefits, and it is likely that most organizations will employ some combination of all three of them.
The actual system chosen will depend on three factors: i) data volume – data storage size and rate of change, ii) data variety – structured or multi-structured, and iii) complexity of the analytical processing involved – query complexity, mixed query workload, data schema complexity, need for concurrent data loading, need for near-real-time results, batch or interactive processing, and so forth. Figure 1 positions the different approaches with respect to these factors. To provide an integrated analytical infrastructure, these approaches must coexist and interoperate with each other. This is why vendors are delivering connectors that allow data to flow between the different systems.
Many people view “big data” as an over-hyped buzzword. It is, however, a useful term because it highlights new data management and data analysis technologies that enable organizations to analyze certain types of data and handle certain types of workload that were not previously possible. The actual technologies used will depend on the volume of data, the variety of data, the complexity of the analytical processing workloads involved, and the responsiveness required by the business. It will also depend on the capabilities provided by vendors for managing, administering, and governing the enhanced environment. These capabilities are important selection criteria for product evaluation.
Big data, however, involves more than simply implementing new technologies. It requires senior management to understand the benefits of smarter and more timely decision making. It also requires business users to make pragmatic decisions about agility requirements for analyzing data and producing analytics, given tight IT budgets. The good news is that many of the technologies outlined in this article not only support smarter decision making, but also provide faster time to value.