H2O.ai Melds Machine Learning with Spark, Via Sparkling Water 2.0

by Ostatic Staff - Jul. 01, 2016

In recent interviews here on OStatic, we have explored the efforts of H2O.ai, formerly known as 0xdata, which has steadily been carving out a niche with its open source software for big data analysis and machine learning. You can get the main H2O platform and Sparkling Water, a package that works with Apache Spark, by simply downloading them. You can run them on clusters powered by Amazon Web Services (AWS) and others for just a few hundred dollars, putting powerful artificial intelligence muscle within reach of everyone.

Now, H2O.ai has announced the availability of Sparkling Water 2.0, which builds on the popularity of Sparkling Water, H2O.ai's API for Apache Spark, with additional features and functionality. New features include the ability to interface with Apache Spark, Scala and MLlib via H2O.ai's Flow UI and the ability to build ensembles using algorithms from both H2O and MLlib. The release also gives Spark users the power of H2O's visual intelligence capabilities.

According to H2O.ai:

"Sparkling Water was designed to allow users to get the best of Apache Spark -- its elegant APIs, RDDs and multi-tenant Context -- along with H2O's speed, columnar-compression and fully-featured machine learning algorithms. Sparkling Water also allows for greater flexibility when it comes to finding the best algorithm for a given use case. Apache Spark's MLlib offers a library of efficient implementations of popular algorithms directly built using Spark. Sparkling Water empowers enterprise customers to use H2O algorithms in conjunction with, or instead of, MLlib algorithms on Apache Spark."

Spark users may want to take a close look. "Enterprises are looking to take advantage of a variety of machine learning algorithms to address an increasingly complex set of use cases when determining how to best serve their customers," said Matt Aslett, Research Director, Data Platforms and Analytics at 451 Research. "Sparkling Water is likely to be attractive to H2O and Spark users alike, enabling them to mix and match algorithms as required."

Sparkling Water 2.0 includes many improvements:

Support for Apache Spark 2.0 and backwards compatibility with all previous versions.

The ability to run Apache Spark and Scala through H2O's Flow UI.

Support for the Apache Zeppelin notebook.

H2O feature improvements and visualizations for MLlib algorithms, including the ability to score feature importance.

Visual intelligence for Apache Spark.

The ability to build Ensembles using H2O plus MLlib algorithms.

The power to export MLlib models as POJOs (Plain Old Java Objects), which can be easily run on commodity hardware.

A toolchain for building machine learning pipelines on Apache Spark.
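To make the ensemble idea in the list above concrete, here is a minimal sketch in plain Python of averaging the scores of two models. It does not use the Sparkling Water or MLlib APIs: the two model functions, their coefficients and the feature values are hypothetical stand-ins invented for illustration, and simple averaging is only one of several ways an ensemble can combine models.

```python
from statistics import fmean
from typing import Callable, Sequence

# A "model" here is any function that maps a feature vector to a score in [0, 1].
Model = Callable[[Sequence[float]], float]


def clamp(value: float) -> float:
    """Keep a score inside the [0, 1] range."""
    return min(1.0, max(0.0, value))


def h2o_like_model(features: Sequence[float]) -> float:
    # Hypothetical stand-in for a model trained with H2O (made-up coefficients).
    return clamp(0.2 + 0.1 * features[0])


def mllib_like_model(features: Sequence[float]) -> float:
    # Hypothetical stand-in for a model trained with MLlib (made-up coefficients).
    return clamp(0.4 + 0.05 * features[1])


def ensemble_predict(models: Sequence[Model], features: Sequence[float]) -> float:
    """Average the scores of all models for one feature vector."""
    if not models:
        raise ValueError("an ensemble needs at least one model")
    return fmean(model(features) for model in models)


# Score one example row with both models and with the ensemble of the two.
row = [3.0, 2.0]
models = [h2o_like_model, mllib_like_model]
score = ensemble_predict(models, row)
print(f"ensemble score: {score:.3f}")
```

Because each model is just a callable, the same averaging function works no matter which library produced a given model, which is the "mix and match" idea in miniature.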

H2O.ai's Director of Marketing, Vinod Iyengar, told us the following in our interview with him:

"Being open source is the core of who we are at H2O.ai. Open source doesn’t just describe our product offerings, but also our philosophy towards our customers and community. Our machine learning platform H2O and Sparkling Water, our package for Spark, are completely open source and fully available for download at http://www.h2o.ai/download."

"Code is truly getting commoditized and the only defensible asset is community. The relationships we have with our customers are also deepened due to the open source nature of our products. Because H2O and Sparkling Water are open source, our customers are also our community. They take part in H2O not just as consumers, but as developers as well."

Notably, H2O.ai is also working on a data science hub called Steam, which will eliminate all the DevOps work required to build and deploy artificial intelligence models. With Steam, developers and data scientists will be encouraged to compare models across teams and take them into production without the need for heavy engineering work on the backend. We will follow up on Steam in an upcoming post.