Spark: Toward A More-Complete, Albeit Imperfect, Open-Source Big-Data Analytics Stack

Spark’s not perfect. Neither is Hadoop. Nor R. Or any other open-source platform or tool, for that matter. But, then again, the open-source process recognizes that and, in fact, is designed to close the gap between the codebase’s ideal “should” state (i.e., the working definition of “perfect”) and its current warts-and-all status.

Open-source platforms are successful not because they’re perfect. They’re successful because, as I stated in this blog, they boost productivity throughout the economy by accelerating reuse, sharing, collaboration and innovation within entire industry ecosystems.

Industry Convergence on a Core Open-Source Big Data Stack 

That dynamic has certainly driven the amazing adoption of open-source big-data analytics tools and infrastructure up and down the stack. So it’s no surprise that the core components of this stack—Spark, R, and Hadoop—are converging into what amounts to a dominant platform for the full range of analytic and data management challenges in this emerging cloud-centric global economy. The year before last, I blogged about this trend in the larger context of open-source initiatives, including platforms, ecosystems, languages, tools, APIs, expertise, and data. And several weeks ago, I blogged on it in the context of Spark and the Hadoop Open Data Platform.

One key milestone in this trend came the week before Spark Summit, with the release of Apache Spark 1.4. Here’s the overview of that announcement in Computerworld, noting high up that its chief new feature is SparkR, a language binding for R programming in Spark projects. And here’s a related blog, written by yours truly at Spark Summit, summarizing the Databricks presenters discussing the Apache Spark community’s plans to roll out additional language bindings beyond R, Python, Java, and Scala.

The new SparkR binding, based on the DataFrame API, is a very significant addition to the core Spark codebase. It lets R developers access the environment’s scale-out parallel runtime, leverage Spark’s input and output formats, and call directly into Spark SQL. In this way, R, which was designed to run on a single computer, will be able to spread large jobs across multiple cores on a single machine and/or across massively parallel server clusters. In the process, R can become a full-blown big-data analytics development tool for the era of Spark-accelerated machine learning, in-memory, streaming, and graph analytics.
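To make that concrete, here is a minimal sketch of the SparkR workflow as it shipped in Spark 1.4. This is illustrative rather than definitive: it assumes a local Spark 1.4 installation with the bundled SparkR package on the R library path, and it uses R’s built-in `faithful` dataset purely for demonstration.

```r
# Sketch only: assumes a Spark 1.4 install with SparkR on the library path
library(SparkR)

# Initialize a Spark context and a SQL context
sc <- sparkR.init(master = "local[*]", appName = "SparkRSketch")
sqlContext <- sparkRSQL.init(sc)

# Promote a local R data.frame to a distributed Spark DataFrame
df <- createDataFrame(sqlContext, faithful)

# Familiar R-style operations now execute on Spark's parallel runtime
head(filter(df, df$waiting < 50))

# Call directly into Spark SQL against the same DataFrame
registerTempTable(df, "faithful")
head(sql(sqlContext, "SELECT eruptions FROM faithful WHERE waiting < 50"))

sparkR.stop()
```

The key design point is that `df` here is not an ordinary R data.frame but a handle to a distributed dataset, so the same `filter`/`head` idioms an R user already knows are transparently pushed down to Spark’s engine.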

The Open-Source Stack's Discontents

Of course, if you’re not a big fan of R and are simply a begrudging user, the new SparkR API doesn’t address any issues you may have with that specific language. And, indeed, R has its discontents, as discussed in this recent blog from Revolution Analytics. In the piece, author David Smith lists and discusses some of R’s more irksome “quirks,” while also pointing out how these were the consequence of design decisions that enabled many of the features that R’s users find valuable.

O’Reilly characterized this double-edged view of R succinctly in the newsletter that linked me to Smith’s post. What hooked me into reading the piece was their rhetorical question: “Why do we love to hate R so much?”

Shifting the focus only slightly, many big-data analytics professionals love Hadoop, but also love to hate its limitations and imperfections. Hence, a substantial portion of that community has spun off complementary projects such as Spark to address many of these issues while, at the same time, broadening the range of use cases into which some subset of Hadoop (e.g., HDFS, but not MapReduce) might be deployed.

As Spark evolves and matures, and as the current mania abates, its own imperfections will become more obvious. No less than Turing Award winner Michael Stonebraker has already weighed in on the topic. During the week of Spark Summit, Stonebraker spoke to assembled IBM-ers at the new Spark Technology Center and gave us his lowdown. To quote the exact bullet points that he spoke from in front of the packed crowd:

  • “Spark is on top of RDDs (files!)
  • Spark is 80% SparkSQL (according to Matei)
  • Spark is written in Java (think “slow”)
  • Spark is a batch processing system—even Spark Streaming
  • Faster Hadoop with all the same issues
  • Likely to follow Hadoop into Gartner’s [trough of disillusionment] with a delay of a couple of years”

Heeding Stonebraker's Reality Check

For me personally, that was exactly the reality check I needed to close out the week of Spark Summit. I believe my colleagues also appreciated hearing it from an acknowledged uber-expert who didn’t candy-coat it or tell us what we wanted to hear.

That said, we’re still quite enthusiastic about Spark, and about open-source platforms generally. All of them are moving targets, so any criticisms by Stonebraker or anyone else in June 2015 may have declining validity as we in the industry collectively address them going forward.

Nothing’s perfect, but we’re engaged with the communities—and our global ecosystems—to evolve all of these ubiquitous platforms in the right directions.

Joe Osborne

Sales & Cloud GTM Leader

A good read. Did you see the WSJ article on AirBnB's use of a spark-enabled tool? They're using it to drive bookings and revenues by "taking pricing signals from five billion data points each day"... http://blogs.wsj.com/cio/2015/07/01/airbnb-tops-challenges-of-spark-implementation/#?mod=wsj_valettop_email

Bill Schmarzo

Dean of Big Data, CDO Chief AI Officer Whisperer, recognized global innovator, educator, and practitioner in Big Data, Data Science, & Design Thinking

Nice review of SparkR and its potential and issues. Thanks!
