By Matt Asay, InfoWorld

MongoDB, Cassandra, and HBase -- the three NoSQL databases to watch

With so many NoSQL choices, how do you decide on one? Here’s a handy guide for narrowing your choice to three

Hadoop gets much of the big data credit, but the reality is that NoSQL databases are far more broadly deployed -- and far more broadly developed. In fact, while shopping for a Hadoop vendor is relatively straightforward, picking a NoSQL database is anything but. There are, after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows.

Which should you choose?

Spoiled for choice

Because choose you must. As nice as it might be to live in a happy utopia of so-called polyglot persistence, “where any decent-sized enterprise will have a variety of different data storage technologies for different kinds of data,” as Martin Fowler argues, the reality is you can’t afford to invest in learning more than a few.

Fortunately, the choice is getting easier as the market coalesces around three dominant NoSQL databases: MongoDB (backed by my former employer), Cassandra (primarily developed by DataStax, though hatched at Facebook), and HBase (closely aligned with Hadoop and developed by the same community).

Note that I purposefully exclude Redis from this list. While a great data store, it’s primarily used for caching data and isn’t well suited for a wide array of workloads.

LinkedIn data from 451 Research shows how the market is gravitating to MongoDB, Cassandra, and HBase.

That’s LinkedIn profile data. A more complete view is DB-Engines', which aggregates jobs, search, and other data to understand database popularity. While Oracle, SQL Server, and MySQL reign supreme, MongoDB (no. 5), Cassandra (no. 9), and HBase (no. 15) are giving them a run for their money.

While it’s too soon to call every other NoSQL database a rounding error, we’re rapidly reaching that point, exactly as happened in the relational database market.

To better understand why these three databases shine, I asked representatives from each to identify key attributes for their success: Kelly Stirman, director of products at MongoDB; Patrick McFadin, chief Cassandra evangelist at DataStax; and Justin Kestelyn, senior director of developer relations at Cloudera.

But first, we need to understand why NoSQL matters.

A world built with unstructured data

We increasingly live in a world where data doesn’t fit nicely into the tidy rows and columns of an RDBMS. Mobile, social, and cloud computing have spawned a massive flood of data. According to a variety of estimates, 90 percent of the world’s data was created in the last two years, with Gartner pegging 80 percent of all enterprise data as unstructured. What's more, unstructured data is growing at twice the rate of structured data.

As the world changes, data management requirements go beyond the effective scope of traditional relational databases. The first organizations to observe the need for alternative solutions were Web pioneers, government agencies, and companies that specialize in information services.

Increasingly, companies of all stripes are looking to capitalize on the advantages of alternatives like NoSQL and Hadoop: NoSQL to build operational applications that drive their business through systems of engagement, and Hadoop to build applications that analyze their data retrospectively and help deliver powerful insights.

MongoDB: Of the developers, for the developers

Among the NoSQL options, MongoDB's Stirman points out, MongoDB has aimed for a balanced approach suited to a wide variety of applications. While the functionality is close to that of a traditional relational database, MongoDB allows users to capitalize on the benefits of cloud infrastructure with its horizontal scalability and to easily work with the diverse data sets in use today thanks to its flexible data model.

MongoDB is often the first NoSQL database developers will try because it’s so easy to learn. Will Shulman, CEO of MongoLab (a MongoDB-as-a-service provider), says it this way:

The disproportionate success of MongoDB is largely based on its innovation as a data structure store that lets us more easily and expressively model the "things" at the heart of our applications….

Having the same basic data model in our code and in the database is the superior method for most use cases, as it dramatically simplifies the task of application development, and eliminates the layers of complex mapping code that are otherwise required.
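Shulman's point about mapping code can be made concrete with a small sketch. This is not MongoDB's API; it uses only Python's standard json module, and the blog-post object and the function names are invented for illustration. It contrasts storing a nested object as a single document with splitting the same object across flat, relational-style rows.

```python
import json

# Hypothetical application object: a blog post with nested tags and comments.
post = {
    "title": "Choosing a NoSQL database",
    "tags": ["nosql", "databases"],
    "comments": [
        {"author": "ana", "text": "Helpful overview."},
        {"author": "raj", "text": "What about Redis?"},
    ],
}


def to_document(obj):
    """Document-store style: persist the nested structure as one record."""
    return json.dumps(obj)


def to_rows(obj, post_id=1):
    """Relational style: split the same object across three flat tables."""
    post_row = {"post_id": post_id, "title": obj["title"]}
    tag_rows = [{"post_id": post_id, "tag": tag} for tag in obj["tags"]]
    comment_rows = [
        {"post_id": post_id, "author": c["author"], "text": c["text"]}
        for c in obj["comments"]
    ]
    return post_row, tag_rows, comment_rows


document = to_document(post)
post_row, tag_rows, comment_rows = to_rows(post)

# The document round-trips to the original object with no mapping layer.
restored = json.loads(document)
```

In the document version the application reads back exactly the shape it wrote. In the row version, reassembling the post means joining three sets of rows, which is the mapping code the quote refers to.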

Notably, MongoDB, like the other databases on this list, is not a one-trick pony. Enterprises that learn MongoDB “can amortize their investments in MongoDB across many, many projects, making it one of [a] short list of standards they rely upon for all data management,” as Stirman told me.

Of course, like any technology, MongoDB has its strengths and weaknesses. MongoDB is designed for OLTP workloads. It can do complex queries, but it’s not necessarily the best fit for reporting-style workloads, and if you need complex transactions, it’s not going to be a good choice. However, MongoDB’s simplicity makes it a great place to start.

Cassandra: Safely run at scale

There are at least two kinds of database simplicity: development simplicity and operational simplicity. While MongoDB rightly gets credit for an easy out-of-the-box experience, Cassandra earns full marks for being easy to manage at scale.

As DataStax's McFadin told me, users tend to gravitate to Cassandra the more they butt their heads against the difficulty of making relational databases faster and more reliable, particularly at scale. A former Oracle DBA, McFadin was elated to discover that “replication and linear scaling are primitives” with Cassandra, and the features were “the primary design goal from the beginning.”

In the RDBMS world, database features like scaling and replication are the hard parts left to the user. This worked fine in yesterday’s enterprise when scale wasn’t a big issue. Today it’s quickly becoming the issue.

As I heard from McFadin and others, Cassandra particularly shines in scale-out deployments. Cassandra comes with baked-in support for multiple data centers. As for adding capacity to a cluster, “You simply boot up a new machine and tell Cassandra where the other nodes are," McFadin said, "and it takes care of the rest.”
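Why can a new machine join without a painful reshuffle? Cassandra spreads data across nodes using consistent hashing over a ring of tokens. The sketch below is a toy version of that idea, not Cassandra's code: the Ring class, the node names, the use of MD5, and the 64 tokens per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 chosen only for simplicity)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring; each node owns several tokens."""

    def __init__(self, nodes, tokens_per_node=64):
        self.tokens_per_node = tokens_per_node
        self._tokens = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.tokens_per_node):
            bisect.insort(self._tokens, (token(f"{node}#{i}"), node))

    def owner(self, key):
        # The first token at or after the key's position owns the key;
        # wrap around to the start of the ring if we run off the end.
        idx = bisect.bisect_right(self._tokens, (token(key), ""))
        return self._tokens[idx % len(self._tokens)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # "boot up a new machine"
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that now fall under the new node's tokens change owner, and every one of them moves to the new node. The rest of the cluster keeps its data where it was, which is what makes adding capacity a routine operation in this kind of design.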

This ease of scaling, coupled with exceptional write performance (“All you’re doing is appending to the end of a log file”) and predictable query performance, adds up to a high-performance workhorse in Cassandra.
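The "appending to the end of a log file" remark describes a log-structured write path. Here is a minimal sketch of the general technique, assuming a JSON-lines log file and an in-memory table. The AppendOnlyStore class and the file layout are hypothetical and far simpler than anything Cassandra actually does.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Toy log-structured store: writes append to a log, reads hit memory."""

    def __init__(self, path):
        self.path = path
        self.memtable = {}
        if os.path.exists(path):
            self._replay()

    def _replay(self):
        # Rebuild the in-memory state by replaying the log from the start;
        # later records for a key overwrite earlier ones.
        with open(self.path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def put(self, key, value):
        # A write never seeks or rewrites: it appends one line, then
        # updates the in-memory table.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.memtable[key] = value

    def get(self, key):
        return self.memtable.get(key)


path = os.path.join(tempfile.mkdtemp(), "commit.log")
store = AppendOnlyStore(path)
store.put("user:1", "ana")
store.put("user:1", "ana-updated")  # an update is just another append

recovered = AppendOnlyStore(path)  # simulate a restart
```

Because every write is a sequential append, write cost stays roughly constant as the data grows. The trade-off in this toy is that the log grows without bound; real systems compact it in the background.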

One article of NoSQL faith I’ve long held is that Cassandra may be powerful at scale, but it requires a doctorate to get started. Not so, McFadin insisted:

The replication and read and write paths are purposefully simple. You can learn the core internals of Cassandra in a few hours. That can bring a lot of confidence as you deploy new technology because there are less “black box” details that introduce complex failure modes.

This means that the price of admission to effective Cassandra development is understanding the data model and how it will work with your application. Given the familiarity of Cassandra’s CQL query language (intended to be “exactly like SQL except when it’s not”), McFadin said, it’s not a steep learning curve.

More important, he told me, “Cassandra rewards you with the one thing you want from a database: no drama. This is why users love to use Cassandra.”

HBase: Bosom buddies with Hadoop

HBase, like Cassandra a column-oriented key-value store, gets a lot of use in large part because of its common pedigree with Hadoop. Indeed, as Cloudera's Kestelyn put it, “HBase provides a record-based storage layer that enables fast, random reads and writes to data, complementing Hadoop by emphasizing high throughput at the expense of low-latency I/O.”

Kestelyn goes on:

Changes are efficiently cataloged in memory to achieve maximum access while the data is persisted to HDFS. This design enables a Hadoop-based EDH [enterprise data hub] to serve random reads and writes to users and applications in real time, yet still enjoy the fault-tolerance and durability of HDFS.

Affinity with Hadoop isn’t the only reason HBase keeps rising in the database popularity ranks, though that might be enough. Similar to Cassandra, HBase’s roots as an open source implementation of Google’s Bigtable translate into the database being highly scalable by design.

Because it can use the storage, memory, and CPU resources of any number of servers, and because it has scale-out features like automatic sharding, HBase can scale out as load and performance demands increase simply by adding server nodes. HBase was designed from the ground up to provide optimal performance when consistency is critical.
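Automatic sharding in HBase works on ranges of row keys: a table is divided into regions, each covering a contiguous key range, and a region splits when it grows too large. The following is a toy model of that behavior under simplifying assumptions. The RangeShardedTable class, the row-count threshold, and the split-at-the-median rule are invented for illustration and do not reflect HBase's actual split policy.

```python
import bisect


class RangeShardedTable:
    """Toy range-sharded table: regions split when they exceed a row limit."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]  # sorted start key of each region
        self.regions = [{}]     # rows of each region, parallel to start_keys

    def _region_index(self, row_key):
        # The owning region is the last one whose start key <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self._region_index(row_key)
        self.regions[i][row_key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].get(row_key)

    def _split(self, i):
        # Split at the median key: rows below it stay, the rest move
        # to a new region that starts at the median.
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        rows = self.regions[i]
        self.regions[i] = {k: v for k, v in rows.items() if k < mid}
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, {k: v for k, v in rows.items() if k >= mid})


table = RangeShardedTable(max_rows_per_region=4)
for n in range(20):
    table.put(f"row{n:03d}", n)
```

The caller only ever calls put and get; the table decides when to split and where each key lives. In a real cluster the resulting regions would be spread across servers, so adding servers adds capacity without changes to application code.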

But scale isn’t its only utility. As Kestelyn noted, “Thanks to its tight integration with the rest of the Hadoop ecosystem, data is readily available to users and applications via SQL queries (using Cloudera Impala, Apache Phoenix, or Apache Hive) or even faceted free-text search (using Cloudera Search).” Thus, HBase gives developers a way to leverage existing expertise with SQL while building on a more modern, distributed database.

Each database comes with its own strengths and shortcomings, but each of the three profiled here has filled a major hole in the big data landscape. While it’s possible that a new database will come along to claim a spot in the NoSQL top three (DynamoDB?), the reality is that developers and the enterprises they serve are already standardizing on a few strong options: MongoDB, Cassandra, and HBase.

Now VP of mobile at Adobe, Matt Asay was previously vice president of community at MongoDB, Inc. He is an emeritus board member of the Open Source Initiative (OSI) and earned his juris doctorate at Stanford, where he focused on open source and other intellectual property licensing issues. He holds a master's from the University of Kent at Canterbury and a bachelor's from Brigham Young University. Asay was one of InfoWorld's first bloggers.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
