Spark Closes in on Real-Time Processing with Redis Pairing

Feb 2nd, 2016 11:00am by Joab Jackson

Redis Labs has released a connector that would allow the Spark data processing platform to use the Redis in-memory data store.

Using Redis for Spark will allow users to “store a huge amount of data without paying a significant amount of money for infrastructure,” explained Yiftach Shoolman, co-founder and Chief Technology Officer of Redis Labs, noting that Redis can be a lower cost alternative to a full-fledged in-memory database system. “Today we want the big data performance to be as close to real-time as possible. That is what we try to do.”

Specifically, the open source Spark-Redis connector package provides an easy way to run SparkSQL queries against data stored on Redis.

Running Spark against a Redis data store can speed processing by 135 times, compared to using HDFS (Hadoop File System) and is even 45 times faster than using the Tachyon in-memory data store, according to benchmarks from Redis Labs.

Redis

Redis Labs is eager to make Redis the de-facto data store for Spark, Shoolman asserted.

The package is a library that provides a library for writing to and reading from a Redis cluster. It exposes all of Redis’ data structures – string, hash, list, set, sorted set, bitmaps, hyperloglogs – as Spark RDDs (Resilient Data Sets) or through the Spark DataSet API.

The library minimizes the overhead that occurs with serialization and deserialization of large amounts of data.

Spark itself has emerged as the chief successor to the Hadoop data processing platform thanks in no small part to an ability to process data in near-real time, rather than the batch processing of ‘big data’ that Hadoop originally offered.

“Apache Spark is becoming a default in-memory engine for high-performance data integration and analytics,” said Matt Aslett, research director, data platforms and analytics at 451 Research, in a statement. “The combination of Redis and Spark should enable high-performance, real-time analytics with extremely large and variable datasets.”

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 25 years, including stints at IDG and Government Computer News. Before that, he...