A Comprehensive Analysis - Data Processing Part Deux: Apache Spark vs Apache Storm

Disclaimer: This post is a combination of original content and facts gathered from the reputable sources cited below. I've been compelled to write these posts because so many tech writers are putting out articles that are not technically sound; these posts are meant to serve as a factual, "one-stop" reference. Also, please keep in mind that many of these topics are so new that they are evolving as I type this post, so your input is greatly appreciated and welcomed.

Those of you who follow my Big Data posts on LinkedIn may have read my post on data processing engines, "A Comprehensive Analysis Apache Flink and How It Compares to Apache Spark."

The reason I am revisiting data processing is to help clarify some misunderstandings about Apache Storm, another popular (and also fast) data processing engine that deserves its time in the limelight. Big Data enthusiasts' interest has been piqued by the news that the Yahoo! Big Data team recently performed some benchmarking tests comparing Apache Flink, Storm, and Spark, which you can read about here (please keep in mind that many of the Apache Flink results are pending and updated results are in process): Link to Yahoo data processing benchmarks test

Many Big Data experts evaluating these data processing tools look not only at performance but also at other very important elements such as security and overall integration capabilities.

In this post, I would like to establish the following: 

  • How did Apache Storm come about, and who created it?
  • Where does it fall in the Big Data ecosystem?
  • What are the different components of Apache Storm?
  • How does Apache Storm compare to Apache Spark Streaming?

So, to get a little background on Apache Storm and how it came about, let's get the information straight from the horse's mouth, in this case the creator, Nathan Marz, who writes on his blog:

"Apache Storm recently (2014) became a top-level project, marking a huge milestone for the project and for me personally. It's crazy to think that four years ago Storm was nothing more than an idea in my head, and now it's a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.

 The topics I will cover through Storm's history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm's history.

Before Storm

Storm originated out of my work at BackType. At BackType we built analytics products to help businesses understand their impact on social media both historically and in realtime. Before Storm, the realtime portions of our implementation were done using a standard queues and workers approach. For example, we would write the Twitter firehose to a set of queues, and then Python workers would read those tweets and process them. Oftentimes these workers would send messages through another set of queues to another set of workers for further processing.

We were very unsatisfied with this approach. It was brittle – we had to make sure the queues and workers all stayed up – and it was very cumbersome to build apps. Most of the logic we were writing had to do with where to send/receive messages, how to serialize/deserialize messages, and so on. The actual business logic was a small portion of the codebase. Plus, it didn't feel right – the logic for one application would be spread across many workers, all of which were deployed separately. It felt like all that logic should be self-contained in one application.

The first insight

In December of 2010, I had my first big realization. That's when I came up with the idea of a "stream" as a distributed abstraction. Streams would be produced and processed in parallel, but they could be represented in a single program as a single abstraction. That led me to the idea of "spouts" and "bolts" – a spout produces brand new streams, and a bolt takes in streams as input and produces streams as output. The key insight was that spouts and bolts were inherently parallel, similar to how mappers and reducers are inherently parallel in Hadoop. Bolts would simply subscribe to whatever streams they need to process and indicate how the incoming stream should be partitioned to the bolt. Finally, the top-level abstraction I came up with was the "topology", a network of spouts and bolts.

I tested these abstractions against our use cases at BackType and everything fit together very nicely. I especially liked the fact that all the grunt work we were dealing with before – sending/receiving messages, serialization, deployment, etc. would be automated by these new abstractions.

Before embarking on building Storm, I wanted to validate my ideas against a wider set of use cases. So I sent out this tweet:"

The initial stable release was on September 17, 2011.

Wikipedia does a great job describing the topology here:

"A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real-time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.[5]

Storm became an Apache Top-Level Project in September 2014[6] and was previously in incubation since September 2013.[7][8]"
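To make the spout/bolt/topology vocabulary concrete, here is a minimal word-count topology sketched in Scala against Storm's Java API. This is my own illustration, not code from any of the sources quoted above; the package names assume Storm 1.x (where the classes moved from backtype.storm to org.apache.storm), and RandomWordSpout and WordCountBolt are names I made up for the sketch.

```scala
import java.util.{Map => JMap}
import scala.util.Random

import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.TopologyContext
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Spout: produces a brand-new stream of single-word tuples.
class RandomWordSpout extends BaseRichSpout {
  private var collector: SpoutOutputCollector = _
  private val words = Array("storm", "spark", "flink", "kafka")

  override def open(conf: JMap[_, _], context: TopologyContext,
                    collector: SpoutOutputCollector): Unit = {
    this.collector = collector
  }

  override def nextTuple(): Unit = {
    Thread.sleep(100)
    collector.emit(new Values(words(Random.nextInt(words.length))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

// Bolt: subscribes to the spout's stream and keeps a running count per word.
class WordCountBolt extends BaseBasicBolt {
  private val counts = scala.collection.mutable.Map[String, Long]().withDefaultValue(0L)

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    val word = tuple.getStringByField("word")
    counts(word) += 1
    collector.emit(new Values(word, Long.box(counts(word))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word", "count"))
}

object WordCountTopology {
  def main(args: Array[String]): Unit = {
    // The topology is the DAG: spout vertex -> bolt vertex, connected by a named stream.
    val builder = new TopologyBuilder()
    builder.setSpout("word-spout", new RandomWordSpout(), 1)
    builder.setBolt("count-bolt", new WordCountBolt(), 2)
      .fieldsGrouping("word-spout", new Fields("word")) // partition the stream by word

    val cluster = new LocalCluster()
    cluster.submitTopology("word-count", new Config(), builder.createTopology())
  }
}
```

The topology runs indefinitely once submitted, exactly as the Wikipedia description notes, until it is explicitly killed.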

When comparing Apache Storm to other data processing engines, it's important to note that Apache Storm comes in two "flavors": Core Storm and "Trident," which runs on top of Storm. Trident is essentially a high-level abstraction over Core Storm. Here are some of the differences I've gathered that Trident offers versus Core Storm (a minimal Trident sketch follows the list below):

  • Trident adds complexity to a Storm topology and lowers raw performance, but it adds managed state.
  • Trident processes messages in micro-batches, so end-to-end latency can be higher.
  • Trident is not yet able to process loops (cycles) in topologies (this may have changed with newer releases).
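
For contrast with Core Storm, here is the same word count sketched against Trident's higher-level, batch-oriented API (again in Scala over the Java API, with Storm 1.x package names assumed). FixedBatchSpout and MemoryMapState are test helpers that ship with Storm; the sentences are just sample data for the sketch.

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.trident.TridentTopology
import org.apache.storm.trident.operation.{BaseFunction, TridentCollector}
import org.apache.storm.trident.operation.builtin.Count
import org.apache.storm.trident.testing.{FixedBatchSpout, MemoryMapState}
import org.apache.storm.trident.tuple.TridentTuple
import org.apache.storm.tuple.{Fields, Values}

// A Trident "function": splits a sentence tuple into one word tuple per word.
class Split extends BaseFunction {
  override def execute(tuple: TridentTuple, collector: TridentCollector): Unit =
    tuple.getString(0).split(" ").foreach(w => collector.emit(new Values(w)))
}

object TridentWordCount {
  def main(args: Array[String]): Unit = {
    // Test spout that replays small batches of sentences.
    val spout = new FixedBatchSpout(new Fields("sentence"), 3,
      new Values("the cow jumped over the moon"),
      new Values("four score and seven years ago"))
    spout.setCycle(true)

    val topology = new TridentTopology()
    topology.newStream("sentences", spout)
      .each(new Fields("sentence"), new Split(), new Fields("word"))
      .groupBy(new Fields("word"))
      // Trident manages state for the counts; here it is kept in an in-memory map.
      .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))

    new LocalCluster().submitTopology("trident-word-count", new Config(), topology.build())
  }
}
```

Note how Trident reads like a declarative pipeline and handles the word-count state for you, which is exactly the extra state management (and extra per-batch overhead) described in the list above.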

The diagram below (by Trivadis) does a great job comparing Core Storm and Trident with Apache Spark Streaming.

Stream Processing Architectures – The Old and the New

At a high level, modern distributed stream processing pipelines follow the data lifecycle we discussed in a previous post, "A Comprehensive Analysis - Apache Kafka."

The data stream life cycle consists of three key stages:

1. Create - creating data from a multitude of event sources (machines, social media, sensors, logs, databases, clickstreams, email, HTML, images, location data, etc.)

2. Collect - make data streams available for consumption (ActiveMQ, RabbitMQ, Apache Kafka, etc.)

3. Process - processing streams and possibly creating derived streams (Apache MapReduce, Tez, Flink, Spark, Storm, etc.)

The illustration below, from a DataArtisans Apache Flink presentation, demonstrates where these data processing engines sit in relation to each other and to other technologies. With this note and overview, I would like to dive into Apache Spark Streaming (directly from the Databricks blog) and preview its various components. Most of the specifics of Apache Spark 1.5 were covered in a previous post here; the changes released in Apache Spark 1.6 will be examined in a future post.

Source: DataArtisans presentation

The following is an excerpt from the Databricks blog about Apache Spark Streaming:

"To process the data, most traditional stream processing systems are designed with a continuous operator model, which works as follows:

  • There is a set of worker nodes, each of which run one or more continuous operators.
  • Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.
  • There are “source” operators for receiving data from ingestion systems, and “sink” operators that output to downstream systems.

Continuous operators are a simple and natural model. However, with today’s trend towards larger scale and more complex real-time analytics, this traditional architecture has also met some challenges. We designed Spark Streaming to satisfy the following requirements:

  • Fast failure and straggler recovery – With greater scale, there is a higher likelihood of a cluster node failing or unpredictably slowing down (i.e. stragglers). The system must be able to automatically recover from failures and stragglers to provide results in real time. Unfortunately, the static allocation of continuous operators to worker nodes makes it challenging for traditional systems to recover quickly from faults and stragglers.
  • Load balancing – Uneven allocation of the processing load between the workers can cause bottlenecks in a continuous operator system. This is more likely to occur in large clusters and dynamically varying workloads. The system needs to be able to dynamically adapt the resource allocation based on the workload.
  • Unification of streaming, batch and interactive workloads – In many use cases, it is also attractive to query the streaming data interactively (after all, the streaming system has it all in memory), or to combine it with static datasets (e.g. pre-computed models). This is hard in continuous operator systems as they are not designed to dynamically introduce new operators for ad-hoc queries. This requires a single engine that can combine batch, streaming and interactive queries.
  • Advanced analytics like machine learning and SQL queries – More complex workloads require continuously learning and updating data models, or even querying the “latest” view of streaming data with SQL queries. Again, having a common abstraction across these analytic tasks makes the developer’s job much easier.

To address these requirements, Spark Streaming uses a new architecture called discretized streams that directly leverages the rich libraries and fault tolerance of the Spark engine.

Architecture of Spark Streaming: Discretized Streams

Instead of processing the streaming data one record at a time, Spark Streaming discretizes the streaming data into tiny, sub-second micro-batches. In other words, Spark Streaming’s Receivers accept data in parallel and buffer it in the memory of Spark’s worker nodes. Then the latency-optimized Spark engine runs short tasks (tens of milliseconds) to process the batches and output the results to other systems. Note that unlike the traditional continuous operator model, where the computation is statically allocated to a node, Spark tasks are assigned dynamically to the workers based on the locality of the data and available resources. This enables both better load balancing and faster fault recovery, as we will illustrate next.

In addition, each batch of data is a Resilient Distributed Dataset (RDD), which is the basic abstraction of a fault-tolerant dataset in Spark. This allows the streaming data to be processed using any Spark code or library.
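
To ground the excerpt so far, here is a minimal Spark Streaming word count in Scala using the Spark 1.x DStream API. This is my own sketch rather than code from the Databricks post; the socket source and the 2-second batch interval are arbitrary choices for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each 2-second window of input becomes one micro-batch, i.e. one RDD.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Receiver reads lines from a TCP socket and buffers them on the workers.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary RDD-style transformations, applied to every micro-batch.
    val wordCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // launch the receiver and the batch scheduler
    ssc.awaitTermination()  // run until the job is killed
  }
}
```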

 Benefits of Discretized Stream Processing

Let’s see how this architecture allows Spark Streaming to achieve the goals we set earlier.

Dynamic load balancing

Dividing the data into small micro-batches allows for fine-grained allocation of computations to resources. For example, consider a simple workload where the input data stream needs to be partitioned by a key and processed. In the traditional record-at-a-time approach taken by most other systems, if one of the partitions is more computationally intensive than the others, the node statically assigned to process that partition will become a bottleneck and slow down the pipeline. In Spark Streaming, the job’s tasks will be naturally load balanced across the workers — some workers will process a few longer tasks, others will process more of the shorter tasks.

 Fast failure and straggler recovery

In case of node failures, traditional systems have to restart the failed continuous operator on another node and replay some part of the data stream to recompute the lost information. Note that only one node is handling the recomputation, and the pipeline cannot proceed until the new node has caught up after the replay. In Spark, the computation is already discretized into small, deterministic tasks that can run anywhere without affecting correctness. So failed tasks can be relaunched in parallel on all the other nodes in the cluster, thus evenly distributing all the recomputations across many nodes, and recovering from the failure faster than the traditional approach.

 Unification of batch, streaming and interactive analytics

The key programming abstraction in Spark Streaming is a DStream, or distributed stream. Each batch of streaming data is represented by an RDD, which is Spark’s concept for a distributed dataset. Therefore a DStream is just a series of RDDs. This common representation allows batch and streaming workloads to interoperate seamlessly. Users can apply arbitrary Spark functions on each batch of streaming data: for example, it’s easy to join a DStream with a precomputed static dataset (as an RDD).
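
As a quick illustration of the "a DStream is just a series of RDDs" point, the sketch below (my own, not from the Databricks post) uses transform to join each micro-batch against a small, precomputed static RDD; the blacklist data is made up for the example.

```scala
// Continuing from the StreamingContext (ssc) in the sketch above.
// A static, precomputed dataset: e.g. a small blacklist of words loaded once at startup.
val blacklist = ssc.sparkContext.parallelize(Seq(("spam", true), ("junk", true)))

// transform exposes each micro-batch as a plain RDD, so any RDD operation works,
// including joining the stream against the static RDD.
val wordCounts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val cleaned = wordCounts.transform { batchRdd =>
  batchRdd.leftOuterJoin(blacklist)                       // (word, (count, Option[flag]))
          .filter { case (_, (_, flag)) => flag.isEmpty } // drop blacklisted words
          .map { case (word, (count, _)) => (word, count) }
}
cleaned.print()
```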

Since the batches of streaming data are stored in Spark’s worker memory, they can be interactively queried on demand. For example, you can expose all the streaming state through the Spark SQL JDBC server, as we will show in the next section. This kind of unification of batch, streaming and interactive workloads is very simple in Spark, but hard to achieve in systems without a common abstraction for these workloads.

Advanced analytics like machine learning and interactive SQL

Spark interoperability extends to rich libraries like MLlib (machine learning), SQL, DataFrames, and GraphX. Let’s explore a few use cases:

Streaming + SQL and DataFrames

RDDs generated by DStreams can be converted to DataFrames (the programmatic interface to Spark SQL), and queried with SQL. For example, using Spark SQL’s JDBC server, you can expose the state of the stream to any external application that talks SQL.
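
The code that registers that table did not carry over into this excerpt, so here is a minimal sketch of how it could look with the Spark 1.x APIs (foreachRDD plus SQLContext). The WordCount case class and the assumption of a (word, count) DStream like the one in the earlier sketch are mine, chosen to match the "word_counts" table named in the next paragraph.

```scala
import org.apache.spark.sql.SQLContext

// A case class gives the DataFrame its schema (word: String, count: Int).
case class WordCount(word: String, count: Int)

// wordCounts is a DStream[(String, Int)] like the one built in the earlier sketch.
wordCounts.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Convert the micro-batch RDD to a DataFrame and (re)register the temp table,
  // so the JDBC server, beeline, or Tableau always sees the latest counts.
  rdd.map { case (word, count) => WordCount(word, count) }
     .toDF()
     .registerTempTable("word_counts")
}
```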

 Then you can interactively query the continuously updated “word_counts” table through the JDBC server, using the beeline client that ships with Spark, or tools like Tableau.

Streaming + MLlib

Machine learning models generated offline with MLlib can be applied to streaming data. For example, the following code trains a KMeans clustering model with some static data and then uses the model to classify events in a Kafka data stream.
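
The code referred to here did not carry over into this excerpt either; the sketch below reconstructs the idea with the Spark 1.x MLlib API. The featurize function, the HDFS path, and the socket stream standing in for the Kafka source are all assumptions of mine, not details from the original post.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical featurizer: turn a raw event line into a numeric feature vector.
def featurize(event: String): Vector =
  Vectors.dense(event.length.toDouble, event.count(_ == ' ').toDouble)

// 1. Train offline: build a KMeans model from a static, historical dataset.
//    The HDFS path is a placeholder for wherever the static data lives.
val training = ssc.sparkContext.textFile("hdfs:///events/historical").map(featurize)
val model = KMeans.train(training, 10, 20)  // k = 10 clusters, 20 iterations

// 2. Apply online: classify each event arriving on the stream.
//    A socket stream stands in here for the Kafka stream in the original post.
val events = ssc.socketTextStream("localhost", 9999)
val clustered = events.map(event => (event, model.predict(featurize(event))))
clustered.print()
```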

We demonstrated this offline-learning-online-prediction at our Spark Summit 2014 Databricks demo. Since then, we have also added streaming machine learning algorithms in MLlib that can continuously train from a labelled data stream. Other Spark libraries can also easily be called from Spark Streaming.

Performance

Given the unique design of Spark Streaming, how fast does it run? In practice, Spark Streaming’s ability to batch data and leverage the Spark engine leads to throughput comparable to or higher than other streaming systems. In terms of latency, Spark Streaming can achieve latencies as low as a few hundred milliseconds. Developers sometimes ask whether the micro-batching inherently adds too much latency. In practice, batching latency is only a small component of end-to-end pipeline latency. For example, many applications compute results over a sliding window, and even in continuous operator systems, this window is only updated periodically (e.g. a 20 second window that slides every 2 seconds). Many pipelines collect records from multiple sources and wait for a short period to process delayed or out-of-order data. Finally, any automatic triggering algorithm tends to wait for some time period to fire a trigger. Therefore, compared to the end-to-end latency, batching rarely adds significant overheads. In fact, the throughput gains from DStreams often mean that you need fewer machines to handle the same workload.

 

Future Directions for Spark Streaming

Spark Streaming is one of the most widely used components in Spark, and there is a lot more coming for streaming users down the road. Some of the highest priority items our team is working on are discussed below. You can expect these in the next few releases of Spark:

  • Backpressure – Streaming workloads can often have bursts of data (e.g. a sudden spike in tweets during the Oscars) and the processing system must be able to handle them gracefully. In the upcoming Spark 1.5 release (next month), Spark will be adding better backpressure mechanisms that allow Spark Streaming to dynamically control the ingestion rate for such bursts. This feature represents joint work between us at Databricks and engineers at Typesafe.
  • Dynamic scaling – Controlling the ingestion rate may not be sufficient to handle longer-term variations in data rates (e.g. a sustained higher tweet rate during the day than at night). Such variations can be handled by dynamically scaling the cluster resources based on the processing demands. This is very easy to do within the Spark Streaming architecture – since the computation is already divided into small tasks, they can be dynamically redistributed to a larger cluster if more nodes are acquired from the cluster manager (YARN, Mesos, Amazon EC2, etc). We plan to add support for automatic dynamic scaling.
  • Event time and out-of-order data – In practice, users sometimes have records that are delivered out of order, or with a timestamp that differs from the time of ingestion. Spark Streaming will support “event time” by allowing a user-defined time extraction function. This will include a slack duration for late or out-of-order data.
  • UI enhancements – Finally, we want to make it easy for developers to debug their streaming applications. For this purpose, in Spark 1.4, we added new visualizations to the streaming Spark UI that let developers closely monitor the performance of their application. In Spark 1.5, we are further improving this by showing more input information, such as the Kafka offsets processed in each batch.

To learn more about Spark Streaming, read the official programming guide, or the Spark Streaming research paper that introduces its execution and fault tolerance model."

I hope this post has further highlighted two popular data processing engines that many Big Data professionals are comparing. It is important to note that these data processing engines are not replacements for each other; rather, they should be used in conjunction with each other. Feel free to post your comments below; these posts are meant to be as much collaborative as informative, and I would love to hear from everyone.

Rassul Fazelat (follow me here @BigDataVision) is Managing Partner - Founder of Data Talent Advisors, a boutique Data & Analytics Talent Advisory & Headhunting firm, Organizer of the NYC Big Data Visionaries Meetup, Co-Organizer of the NYC Marketing Analytics Forum, and Co-Organizer of the NYC Advanced Analytics Meetup.

Other posts in the Comprehensive Analysis (Big Data) series:

 Big Data Career series:
