Database Technology for the Web: The MapReduce Debate
Over the course of the nearly forty years I have been working on database systems there have been many debates and arguments about which database technology to use for any given application. These arguments have become heated especially when a new database technology appears that claims to be superior to anything that came before.
When relational systems were first introduced, the hierarchical (IMS) and network (IDMS) database system camps argued that relational systems were inferior and could not provide good performance. Over time this argument proved false, and relational products now provide the database management underpinnings for a vast number of operational and analytical applications. Relational database products have survived similar battles with object-oriented database technology and multidimensional database systems.
Although relational technology survived these various skirmishes, the debates that took place did demonstrate that one size does not fit all and that some applications can benefit by using an alternative approach. The debates also often led to relational product enhancements that incorporated features (e.g., complex data types, XML and XQuery support) from competitive approaches. Some experts argue that many of these features have corrupted the purity and simplicity of the relational model.
Just when I thought the main relational products had become a commodity, a new technology known as MapReduce appeared that caused the debates to start again. In this article, I want to look at the pros and cons of MapReduce, which Michael Stonebraker (together with David DeWitt), one of the original relational database technology researchers, recently described as a “a giant step backwards.”1
What is MapReduce?
MapReduce has been popularized by Google who use it to process many petabytes of data every day. A landmark paper2 by Jeffrey Dean and Sanjay Ghemawat of Google states that:
“MapReduce is a programming model and an associated implementation for processing and generating large data sets…. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.”
Michael Stonebraker’s comments on MapReduce explain MapReduce in more detail:
“The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.
The map program reads a set of records from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a “split” function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the answer to a MapReduce computation.”
The key/value pairs produced by the map program can contain any type of arbitrary data in the value field. Google, for example, uses this approach to index large volumes of unstructured data. Although Google uses its own version of MapReduce there is also an open source version called Hadoop3 from the Apache project. IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
MapReduce is not a new concept. It is based on the list processing capabilities in declarative functional programming languages such as LISP (LISt Processing). Today’s systems implement MapReduce in imperative languages such as Java, C++, Python, Perl, Ruby, etc.
The key/value pairs used in MapReduce processing may be stored in a file or a database system. Google uses its BigTable database system (which is built on top of the Google distributed file system, GFS) to manage the data.
Key/value pair databases have existed for many years. For example, Berkeley DB is an embedded database system that stores data in a key/value pair data structure. It was originally developed in the 1980s at Berkeley, but it is now owned by Oracle. Berkeley DB can also act as a backend storage engine for the MySQL open source relational DBMS.
Why the Controversy?
Given that MapReduce is not a database model, but a programming model for building powerful distributed and parallel processing applications, why is there such a controversy with respect to relational systems? To answer this question we need to examine the relational model of data in more detail.
In a relational model, data is conceptually stored in a set of relations or tables. These tables are manipulated using relational operators such as selection, projection and join. Today, these relational operators are implemented primarily using the structured query language (SQL).
How the table data is physically stored and managed in a relational database management system (RDBMS) is up to the vendor. The mapping of relational operators (SQL statements) to the backend storage engine is handled by the relational optimizer whose job it is to find the optimal way of physically accessing the data. This physical data independence is a key benefit of the relational model. When using SQL, users define what data they want, not how it is to be accessed. Techniques such as indexing and parallel and distributed computing are handled by the underlying RDBMS. SQL is a declarative language, and not an imperative/procedural language like Java and C++, which require a detailed description of any data access algorithms that need to be run. Of course SQL statements can be embedded in procedural languages. The reverse is also true; SQL can invoke stored procedures and user-defined functions written in a procedural language.
The concern of Michael Stonebraker is that the use and teaching of MapReduce will take the industry back to the pre-relational times when there was a lack of formalized database schemas and application data independence. MapReduce advocates argue that much of the data processed by MapReduce involves unstructured data that lacks a data schema. They also argue that today’s programmers vastly outnumber SQL experts, don’t know or don’t want to know SQL, find MapReduce much simpler, and prefer to access and analyze data using their own procedural programming.
Both camps are correct and both approaches have their benefits and uses. As I said at the beginning of this article, one size does not fit all. The challenge is to understand where each approach fits.
(To be continued in December)