Hadoop Summit 2013 kicks off tomorrow, and you can expect YARN to be a major topic of conversation. Three years in the making, YARN is essentially a new operating system for Hadoop that will allow the open source Big Data framework to break free from the shackles of MapReduce.
Perhaps that was a bit harsh towards MapReduce. As anyone following Big Data knows, Hadoop was originally developed at Yahoo! to search and index the web. It is an extremely powerful framework, without which it would be a lot harder to find what you’re looking for online today. But Hadoop essentially was, and still is, a “one application platform” supported by a single computing paradigm – you guessed it – MapReduce.
MapReduce is the main mechanism for manipulating data in HDFS. This is great if you’re trying to process and analyze huge volumes of data – think years’ worth of log files or other semi-structured data – but less than ideal for other types of data analysis.
To evolve Hadoop into a more versatile Big Data platform, Arun Murthy, then of Yahoo, set about re-architecting Hadoop three years back. The result, making its debut this summer, is Apache YARN. Murthy, who went on to co-found Hortonworks, describes YARN this way:
When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop to be able to run multiple applications against relevant data sets. And do so in a way where multiple types of applications can operate efficiently and predictably within the same cluster – this is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single application system to a multi-application operating system.
So what are some of the other types of applications Murthy is referring to? Among them are machine learning, graph analysis, streaming analysis and interactive query capabilities. Once YARN is fully operational, developers will be able to manipulate data stored in HDFS with these types of applications via the YARN ‘operating system.’
Now you may be thinking, can’t Hadoop already support these types of applications? Yes and no. Hive was developed by Facebook to serve as a SQL-style data warehouse layer on top of HDFS, but under the covers it still processes data via MapReduce. It also consumes a lot of resources, potentially impacting other jobs running (or at least trying to run) at the same time. Other Hadoop-related sub-projects for analyzing data operate in a similar way.
Which brings us to why YARN is so important. YARN acts as a true Hadoop resource manager, allowing multiple applications – MapReduce, SQL, streaming analysis, etc. – to run on a single cluster of machines simultaneously while maintaining high performance levels. With YARN, Hadoop is a true multi-application platform that can serve an entire enterprise.
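To make the resource manager idea concrete, here is a toy sketch in Python of one component arbitrating memory requests from several kinds of applications on a shared cluster. It is purely illustrative: the class `ToyResourceManager`, its methods, the application names and the memory figures are all hypothetical, and none of this is YARN's actual API or scheduling logic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster resource manager (not YARN's API)."""
    total_mb: int                                # memory available cluster-wide
    used_mb: int = 0                             # memory currently handed out
    grants: dict = field(default_factory=dict)   # app name -> MB granted

    def request(self, app: str, mb: int) -> bool:
        """Grant the request only if it fits in the remaining capacity."""
        if mb <= 0:
            raise ValueError("request must be positive")
        if self.used_mb + mb > self.total_mb:
            return False  # the app has to wait; running apps are unaffected
        self.used_mb += mb
        self.grants[app] = self.grants.get(app, 0) + mb
        return True

    def release(self, app: str) -> None:
        """Return everything an app holds to the shared pool."""
        self.used_mb -= self.grants.pop(app, 0)


# Three different kinds of applications share one 8 GB cluster.
rm = ToyResourceManager(total_mb=8192)
results = {
    "mapreduce-batch": rm.request("mapreduce-batch", 4096),
    "sql-query": rm.request("sql-query", 2048),
    "stream-analysis": rm.request("stream-analysis", 4096),  # does not fit yet
}
print(results)

# Once the batch job finishes, the streaming job can be admitted.
rm.release("mapreduce-batch")
retry_granted = rm.request("stream-analysis", 4096)
print(retry_granted)
```

The point of the sketch is only that every application, whatever its processing model, asks the same arbiter for capacity, so one heavy job cannot silently starve the others.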
This means Hadoop can be used as the foundation of an enterprise data management architecture, storing all of an enterprise’s data and being utilized as a shared data service. With YARN, the marketing team can run SQL-style applications while the Data Science team churns through petabytes of data, all on a single Hadoop deployment.
There’s still a ways to go before YARN is ready for production deployment, but it will certainly be the topic of many conversations tomorrow in San Jose. theCUBE kicks off live coverage of Hadoop Summit 2013 at 10:30am PT on Wednesday, and rest assured we’ll be covering the developments as they relate to YARN. Murthy himself joins us at 11:20am PT to provide us the latest details. Catch all the action at SiliconANGLE.com.