The Elephant in the Data Lake and Snowflake

Let’s talk about the elephant in the data lake, Hadoop, and the constant evolution of technology.

Hadoop (symbolized by an elephant) was created to handle massive amounts of raw data beyond the capabilities of existing database technologies. At its core, Hadoop is simply a distributed file system. There are no restrictions on the types of files that can be stored, but the primary file contents are structured and semi-structured text. “Data lake” and Hadoop have been largely synonymous, but, as we’ll discuss, Snowflake’s cloud data warehouse technology means it’s time to break that connection.

Hadoop’s infrastructure requires a great deal of system administration, even in cloud-managed systems. Administration tasks include replication, adding nodes, creating directories and partitions, performance tuning, workload management, and data (re-)distribution. Core security tools are minimal, often requiring add-ons. Disaster recovery is another major headache. And although Hadoop is considered a “shared nothing” architecture, all users compete for resources across the cluster.

MapReduce (MR) is a fairly simple distributed, disk-based computing solution that requires writing complex, procedural code; the developer needs to be intimately familiar with the contents of the files being processed. MR is batch-oriented, and the number of simultaneous queries is limited. This led to the development of other technologies, such as Spark, which improve performance by using memory instead of disk.

For reporting and analytics purposes, it became apparent that representing data in relational form, with a dictionary, was still very desirable. Hive was born. With a relational model and dictionary, SQL itself was once again back in favor. Many people were surprised to discover that SQL was not dependent on the underlying storage technique, but could be mapped onto delimited and semi-structured data. “Schema on read” became “schema on demand”.
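
As a hedged sketch of this idea (the table name, columns, and path are hypothetical), a HiveQL external table maps a relational schema onto delimited files already sitting in HDFS:

  -- Map a SQL schema onto existing delimited files; nothing is copied or converted.
  CREATE EXTERNAL TABLE web_logs (
    event_time STRING,
    user_id    BIGINT,
    url        STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/raw/web_logs/';

  -- Ordinary SQL now runs against the raw files ("schema on read").
  SELECT user_id, COUNT(*) AS hits
  FROM web_logs
  GROUP BY user_id;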

A plethora of new SQL tools were born. Some ran entirely within the Hadoop cluster; others grafted on external servers to improve query performance and concurrency. This also made it much easier for existing Business Intelligence tools to work against the Hadoop data lake.

ETL from Hadoop into traditional SQL RDBMSs has also become commonplace.

But Hadoop’s underlying administration issues still exist, even with cloud-managed services. Feeding and caring for an elephant is an expensive, non-trivial exercise.

Hadoop, in its time, was a sea change; Snowflake’s technology is the latest sea change in database technology.

First, let’s determine whether Snowflake is suitable for use as a data lake. The key features of a data lake are:

  1. Support for a wide variety of data types, e.g. structured data, semi-structured data (JSON, Avro, Parquet, etc.), log files, and IoT events.
  2. Unlimited storage.
  3. Wide range of ingestion techniques and technologies for ETL, ELT and streaming.

Snowflake meets all three of the above criteria and offers substantial improvements over Hadoop.

For #1, Snowflake supports all of the standard RDBMS data types in native format, i.e. NUMBER is stored as binary, not text. Semi-structured data is stored as VARIANT. Unlike most other semi-structured technologies, Snowflake performs the same optimizations on VARIANT data as it does on structured data, i.e. metadata collection, statistics, compression, and encryption. This provides a tremendous improvement in performance, allowing Snowflake to apply its micro-partition pruning technology to semi-structured data as well as strongly typed data.
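
As a minimal sketch (the table and field names are hypothetical), storing and querying JSON as VARIANT looks like ordinary SQL:

  -- Store raw JSON documents in a VARIANT column.
  CREATE TABLE raw_events (v VARIANT);

  -- Path notation extracts fields, cast to standard SQL types; pruning,
  -- statistics, and compression apply just as they do for typed columns.
  SELECT v:device_id::STRING    AS device_id,
         v:reading.temp::FLOAT  AS temperature
  FROM raw_events
  WHERE v:event_type::STRING = 'sensor';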

For #2, a simple “yes” would suffice; there are no limits to Snowflake’s storage capacity. But Snowflake also eliminates the storage management headache. There are no nodes to set up and manage, no file systems or directories to define, no data distribution policies to implement and revise. Replication, backup, and disaster recovery are all part of Snowflake’s managed services. Data protection policies are set by simple SQL or UI actions.
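
As one hedged example (the table name is hypothetical), the Time Travel retention window behind point-in-time recovery is a single statement:

  -- Keep 30 days of point-in-time history for this table (Time Travel).
  -- Retention beyond one day requires the Enterprise edition or higher.
  ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 30;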

For #3, Snowflake and its ecosystem offer a tremendous number of partners and connectors for ETL, ELT, and streaming ingestion.
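
Even without partner tooling, ingestion itself is plain SQL. As a hedged sketch (the stage name is hypothetical, and the table is the VARIANT table from the earlier example), a bulk load of staged JSON files is one COPY statement; the same statement can be wrapped in a Snowpipe for continuous loading:

  -- Bulk-load staged JSON files into a VARIANT column.
  COPY INTO raw_events
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');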

Let’s look at other advantages:

Security – In addition to role-based access control, Snowflake offers:

  • Data is always encrypted, both at rest and in transit, with managed hierarchical keys and periodic key rotation. Data uploaded to staging may be encrypted by the user. Snowflake offers “triple key security” and customer-managed keys in its Enterprise for Sensitive Data (ESD) edition.
  • All editions are SOC 1 Type II and SOC 2 Type II compliant. ESD is PCI DSS certified, FedRAMP Ready, and HIPAA compliant.
  • The “Virtual Private Snowflake” edition offers additional security, providing a single-tenant Snowflake cloud.

Performance and Concurrency – Snowflake’s architecture offers both high performance and virtually unlimited concurrency.

First, let’s look at performance:

  • Snowflake’s unique micro-partition architecture provides both partition pruning and column pruning within a micro-partition.
  • “Compute” is separate from storage. Compute is provided by “Virtual Data Warehouses (VDW)”. A VDW consists of one or more clusters, and each cluster contains 1 to 128 servers, where a server is a virtual machine in the underlying cloud.
  • Query performance depends on the cluster size of a VDW. Cluster size can be increased or decreased dynamically by simple SQL or UI commands without any service interruption; in-flight queries continue to completion, and only new queries are affected by the change (see the sketch after this list).
  • Snowflake also maintains the results of all queries for 24 hours; identical queries retrieve results from this result cache.
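
As a hedged sketch of such a resize (the warehouse name is hypothetical):

  -- Resize a running warehouse; in-flight queries finish on the old size,
  -- and new queries pick up the new size.
  ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';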

Now let’s look at concurrency:

  • Concurrency, i.e. the number of simultaneous queries, may be scaled in two ways:
    1. A VDW may be configured with multiple clusters, again via a simple SQL or UI command. As of this writing, the maximum number of clusters is 10. A VDW may also be configured to autoscale, with clusters being started and stopped based on dynamic load balancing (see the sketch after this list).
    2. For even higher levels of concurrency, multiple VDWs can be created for similar workloads.
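
A minimal sketch of a multi-cluster, auto-scaling configuration (the warehouse name is hypothetical):

  -- Let Snowflake start and stop clusters between 1 and 4 based on load.
  ALTER WAREHOUSE analytics_wh SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'STANDARD';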

As we’ve seen, Snowflake clearly meets the requirements of a data lake, in addition to offering its more traditional data warehouse capabilities.

It’s time to return the elephant to the wild!

Copyright © Jeffrey Jacobs, 2019, permission to redistribute with copyright notice
