Apache Parquet paves the way for better Hadoop data storage

Newly graduated from the Apache Incubator, the Parquet project allows column-oriented data to be stored and processed at high speed

Apache Parquet, which provides columnar storage in Hadoop, is now a top-level Apache Software Foundation (ASF)-sponsored project, paving the way for broader use throughout the Hadoop ecosystem.

Already adopted by Netflix and Twitter, Parquet began in 2013 as a joint effort between engineers at Twitter and Cloudera to encode complex data efficiently and in bulk.

Databases traditionally store information in rows and are optimized for working with one record at a time. Columnar storage systems instead serialize and store data by column, so queries that scan large data sets but touch only a handful of columns can skip the rest, making big reads and searches far faster.
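The difference is easy to see in miniature. Here's a rough sketch in plain Python (the field names are invented for illustration) holding the same three records row by row and column by column; a query that totals a single column only touches that column's list, while a row store would have to read every field of every record to answer it.

    # Row-oriented: each record is kept together (good for fetching one record).
    rows = [
        {"user_id": 1, "country": "US", "minutes_watched": 42},
        {"user_id": 2, "country": "BR", "minutes_watched": 17},
        {"user_id": 3, "country": "JP", "minutes_watched": 63},
    ]

    # Column-oriented: each column is kept together (good for scanning one column).
    columns = {
        "user_id": [1, 2, 3],
        "country": ["US", "BR", "JP"],
        "minutes_watched": [42, 17, 63],
    }

    # A query like "total minutes watched" reads one list here, not every record.
    total_minutes = sum(columns["minutes_watched"])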

Hadoop was built for managing large sets of data, so a columnar format is a natural complement. Most Hadoop projects can read and write Parquet data; Hive, Pig, and Drill already do, as do conventional MapReduce jobs.
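The format is also approachable outside those tools. The minimal sketch below uses the pyarrow Python library (not one of the projects mentioned here; the file and column names are invented) to write a small Parquet file and then read back a single column, the access pattern a columnar format is built for.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small in-memory table and write it out as a Parquet file.
    table = pa.table({
        "user_id": [1, 2, 3],
        "country": ["US", "BR", "JP"],
    })
    pq.write_table(table, "events.parquet")

    # Read back only the "country" column; the other columns are never loaded.
    countries = pq.read_table("events.parquet", columns=["country"])
    print(countries.to_pydict())  # {'country': ['US', 'BR', 'JP']}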

As another benefit, per-column compression further accelerates performance in Parquet. A column of text can be compressed differently from a column holding only integers, and encoding and compressing each column on its own terms provides a performance boost of its own. Parquet's column compression is also designed to be "future-proofed to allow adding more encodings as they are invented and implemented."
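Dictionary encoding and delta encoding are two of the column encodings Parquet supports. The toy versions below (plain Python, not Parquet's actual on-disk format) show why per-column encoding pays off: a low-cardinality text column shrinks to small integer codes, and a steadily increasing integer column shrinks to small deltas.

    def dictionary_encode(values):
        # Replace repeated strings with small integer codes plus a lookup table.
        dictionary, codes = {}, []
        for v in values:
            if v not in dictionary:
                dictionary[v] = len(dictionary)
            codes.append(dictionary[v])
        return dictionary, codes

    def delta_encode(values):
        # Store the first value, then only the differences between neighbors.
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    countries = ["US", "US", "BR", "US", "JP", "BR"]
    timestamps = [1000000, 1000007, 1000009, 1000015, 1000020, 1000031]

    print(dictionary_encode(countries))
    # ({'US': 0, 'BR': 1, 'JP': 2}, [0, 0, 1, 0, 2, 1])
    print(delta_encode(timestamps))
    # [1000000, 7, 2, 6, 5, 11]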

Early adopters and project leads have used Parquet for some time and built functionality around it. Cloudera, the project's co-progenitor, uses Parquet as a native data storage format for its Impala analytics database project, and MapR has added data self-description functions to Parquet. Netflix -- never one to shy away from a forward-looking technology (such as Cassandra) -- has 7 petabytes of warehoused data in Parquet format, according to the ASF.

Parquet isn't the only way to store columnar data in Hadoop, but it's shaping up as the leader. Hive has its own columnar format, ORC, although ORC is intended mainly as an extension to Hive rather than as a general-purpose storage format for the rest of Hadoop.

Hortonworks, a Cloudera competitor (in more ways than one), claimed earlier in Parquet's life that ORC compresses data more efficiently than Parquet does. IBM ran its own performance comparisons in September 2014 and found that while ORC used the least HDFS storage, Parquet delivered the best overall query and analysis times, the metrics that typically matter most to Hadoop users.

Copyright © 2015 IDG Communications, Inc.