UPDATED 15:33 EDT / MARCH 19 2014

Is MapReduce merely a new implementation of old ideas writ to scale?

Big Data as an industry buzzword doesn’t appear to be fading in popularity any time soon. Whatever definition of Big Data one prefers, Hadoop and MapReduce have become synonymous with nearly anything Big Data related.

Organizations are moving away from simple aggregation of facts and statistics toward understanding real behavior (via patterns) and modeling what is happening in the present. Big Data is used to glean new business insights about customers and markets, and MapReduce is widely used in such work to better understand the true patterns in the data.

MapReduce is a programming model and execution environment for distributed data processing that runs on large clusters of commodity machines. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. MapReduce originated in a 2004 Google research paper and has since become the heart of Hadoop.
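As a sketch, the classic word-count example shows the shape of that split–map–shuffle–reduce pipeline. The function names here are illustrative stand-ins, not part of any framework API; a real MapReduce runtime distributes these phases across a cluster.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Stand-in for the framework's sort/shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: sum all counts emitted for one word."""
    return key, sum(values)

# Two "independent chunks" of a hypothetical input data set.
chunks = ["MapReduce splits the input", "the framework sorts the outputs"]
intermediate = [pair for c in chunks for pair in map_phase(c)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result["the"])  # "the" occurs three times across both chunks
```

Each call to `map_phase` depends only on its own chunk, which is what lets the framework run the map tasks in parallel before the sorted hand-off to the reducers.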

The programming model is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation. Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments.

The advantage of MapReduce is that it distributes both the preprocessing (map) and the aggregation (reduce) operations. Map operations run independently of each other and can be performed in parallel, although in practice parallelism is limited by the input source and the number of processors. Likewise, many nodes can perform reduction; the only requirement is that all intermediate results sharing a given key be processed by a single node. Although this approach may be less efficient than more specialized algorithms, MapReduce can be applied to enormous amounts of data spread across a large number of servers. MapReduce can, for example, sort a petabyte of data in only a few hours.
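The requirement that every value for a given key reach the same node is typically met by hash partitioning. A minimal sketch, assuming a hypothetical four-reducer cluster (the names and sizes here are invented for illustration):

```python
NUM_REDUCERS = 4  # hypothetical cluster size

def partition(key, num_reducers=NUM_REDUCERS):
    """Route a key to one reducer; the same key always maps to the same bucket."""
    return hash(key) % num_reducers

# Intermediate (key, value) pairs emitted by map tasks.
pairs = [("apple", 1), ("banana", 2), ("apple", 5)]

buckets = {i: [] for i in range(NUM_REDUCERS)}
for key, value in pairs:
    buckets[partition(key)].append((key, value))

# Both "apple" records land in the same bucket, so one reducer
# sees every value for that key and can aggregate them safely.
apple_bucket = buckets[partition("apple")]
```

Because the hash of a key is stable within a run, no coordination between map tasks is needed to uphold the one-node-per-key guarantee.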

No new ground in the theory of computation?

MapReduce is good for single-iteration, embarrassingly parallel tasks like feature processing, but it is a poor fit for iterative algorithms with computational dependencies or complex asynchronous schedules. Some developers hold the strong opinion that MapReduce is nothing new in computation, and that distributed computing certainly was not discovered by MapReduce. It does not show a new way of decomposing a problem into simpler operations.

Asked what is novel in MapReduce, developers on the Computer Science Stack Exchange forum said that MapReduce does not break new ground in the theory of computation. The MapReduce paper’s contribution was that these well-understood operators, with a specific set of optimizations, had been successfully used to solve real problems more easily and fault-tolerantly than one-off solutions. And not every computation decomposes easily into map and reduce operations.

MapReduce is also a poor replacement for a relational database. MapReduce has no indexes, leaving brute-force scanning as its only processing option.

If a programmer wants to write a new application against a data set, he or she must first discover the record structure. In a modern DBMS, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover that structure.

In contrast, when the schema exists only inside an application, the programmer must discover the structure by examining the code. Not only is this a tedious exercise, but the programmer must also find the source code for that application. Writing MapReduce applications on top of Hadoop does not change the situation significantly; BigTable and HBase do not provide logical data independence.
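A small sketch makes the point concrete. In the hypothetical parser below, the record layout (a user ID, a URL, a timestamp, in that order) lives only in the code; a new programmer can recover it only by reading this function, whereas a DBMS would expose it through queryable system catalogs:

```python
def parse_record(line):
    """Parse one log record. The field order and types here ARE the schema:
    nothing outside this function documents the record structure."""
    user_id, url, timestamp = line.split(",")
    return {"user_id": int(user_id), "url": url, "timestamp": timestamp}

record = parse_record("42,https://example.com,2014-03-19")
print(record["user_id"])  # 42
```

If the producer of these records ever reorders or adds a field, every such parser scattered across MapReduce jobs must be found and updated by hand.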

One developer said Map/Reduce was originally known in parallel computing as a data-flow programming model. From a practical point of view, though, Map/Reduce as proposed by Google, along with the subsequent open-source implementations, has also fueled cloud computing and is now quite popular for very simple parallel decompositions and processing. It is not well suited for anything requiring complex domain or functional decompositions. MapReduce also lacks important DBMS features such as a bulk loader, indexing, updates, transactions, integrity constraints, referential integrity and views.

“I should say that I have since worked more on algorithms for MapReduce-type models of computation, and I feel like I was being overly negative. The Divide-Compress-Conquer technique I talk about below is surprisingly versatile, and can be the basis of algorithms which I think are non-trivial and interesting,” said Sasho Nikolov, another developer in the discussion.

Other developers say that when you need to process variables or data sets jointly, MapReduce offers no benefit over non-distributed architectures; one must come up with a more sophisticated solution.

photo credit: “Punctuation marks made of puzzle pieces”, Horia Varlan via photopin cc
