By James Kobielus, Contributor, InfoWorld

YARN unwinds MapReduce's grip on Hadoop

Hadoop has been known as MapReduce running on HDFS, but with YARN, Hadoop 2.0 broadens the pool of potential applications

Hadoop has always been a catch-all for disparate open source initiatives that combine for a more or less unified big data architecture. Some would claim that Hadoop has always been, at its very heart, simply a distributed file system (HDFS), but the range of HDFS-alternative databases, including HBase and Cassandra, undermines that assertion.

Until recently, Hadoop has been, down deep, a specific job-execution layer -- MapReduce -- that executes on one or more alternative, massively parallel data-persistence layers, one of which happens to be HDFS. But the recent introduction of the next-generation execution layer for Hadoop -- known as YARN (Yet Another Resource Negotiator) -- eliminates the strict dependency of Hadoop environments on MapReduce.

Just as critical, YARN eliminates a job-execution bottleneck that has bedeviled MapReduce from the start: the fact that all MapReduce jobs (pre-YARN) have had to run as batch processes through a single daemon (JobTracker), a constraint that limits scalability and dampens processing speed. These MapReduce constraints have spurred many vendors to implement their own speedups, such as IBM's Adaptive MapReduce, to get around the bottleneck of native MapReduce.
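
To make the bottleneck concrete, here is a toy Python sketch. It is not Hadoop code. It assumes a simplified model in which every task costs the central daemon one scheduling decision and, in the pre-YARN case, one status update as well. The split version reflects the YARN design, in which a central ResourceManager grants resources while a per-application master tracks that application's own tasks. The names `Job`, `central_load_single_daemon`, and `central_load_split`, along with the task counts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    tasks: int  # number of tasks the job runs (made-up figure)


def central_load_single_daemon(jobs):
    """Pre-YARN style: one daemon schedules AND tracks every task of every job."""
    load = 0
    for job in jobs:
        load += job.tasks  # one scheduling decision per task
        load += job.tasks  # one status update per task
    return load


def central_load_split(jobs):
    """YARN style: the central daemon only grants resources;
    each job's own master absorbs that job's status updates."""
    central = 0
    per_master = {}
    for job in jobs:
        central += job.tasks              # one resource grant per task
        per_master[job.name] = job.tasks  # status updates stay with the job's master
    return central, per_master


jobs = [Job("etl", 100), Job("graph", 50)]
print(central_load_single_daemon(jobs))  # 300
print(central_load_split(jobs))          # (150, {'etl': 100, 'graph': 50})
```

In this simplified model the central daemon's work is halved, and the tracking work is spread across one master per job rather than accumulating in a single process. Real clusters differ in many details, so read the numbers only as an illustration of the structural change.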

All of this might make one wonder what, specifically, "Hadoop" means anymore, in terms of an identifiable "stack" distinct from other big data and analytics platforms and tools. That's a definitional quibble -- YARN is a foundational component of the evolving big data mosaic. YARN puts traditional Hadoop into a larger context of composable, fit-to-purpose platforms for processing the full gamut of data management, analytics, and transactional computing jobs.

YARN transforms Hadoop (however defined) into a general-purpose, distributed job-execution layer of the sort that the open source initiative's original definition (still on the Apache website) alludes to. Though it retains backward compatibility with the MapReduce API and continues to execute MapReduce jobs, a YARN engine is capable of executing a wide range of jobs developed in other languages.

Just as important, YARN can become a unifying thread for diverse Apache open source initiatives around big data. As InfoWorld recently noted: "The biggest win of all here is how MapReduce itself becomes just one possible way of many to mine data through Hadoop."

That's the YARN promise, but seeing it realized requires that the industry retool its Hadoop stacks and tools to work with it. Per the article, "Apache claims that any distributed application can run on YARN, albeit with some porting. To that end, Apache's maintained a list of YARN-compatible applications, such as the social-graph analysis system Apache Giraph (which Facebook uses). More are on the way from other parties, too."

This is good, but notice that disclaimer: "albeit with some porting." The article says YARN's true test will be in the extent to which vendors port their analytic development tools to output jobs that are conformant with YARN. As the author states, porting development languages to YARN "isn't a trivial effort."

Will this take place consistently throughout the industry and diverse Apache and other open source communities? If so, to what extent? Those factors will determine the degree to which YARN, the defining feature of what some call "Hadoop 2.0," truly takes hold.

Hadoop 2.0 preserves backward compatibility with MapReduce, so existing jobs keep running, while bringing applications up to speed on YARN requires some porting. That nontrivial effort may significantly slow developer adoption of the new framework.

Also, in light of the range of alternative languages (such as R) and alternative platforms (any NoSQL approach) with which big data applications are being developed, it's not even clear that Hadoop, 1.0 or 2.0, can maintain its current marketplace momentum indefinitely.

This story was originally published at InfoWorld.com.

James Kobielus is principal analyst at Franconia Research.