By Andrew C. Oliver, Columnist, InfoWorld |

The 10 worst big data practices

It's a new world full of shiny toys, but some have sharp edges. Don't hurt yourself or others. Learn to play nice with them.


Page 2 of 2

You can be that kid again. You can learn Pig, at least. It won't hurt ... much. Think of it as PL/SQL on steroids with maybe a touch of acid. You can do this! I believe in you! To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, Oozie, and more. Never say, "Hive can't do it, so we can't do it." The whole point of big data is to expand beyond what you could do with one technology.

6. Treating HBase like an RDBMS. You went nosing around Hadoop and realized indeed there was a database. Maybe you found Cassandra, but most likely you found HBase. Phew, a database -- now I don't have to try so hard! Trust me, HDFS-plus-Hive will drain less glucose from your head muscle (IANAD).

The only real commonality between HBase and your RDBMS is that both have something resembling a table. You can do things with HBase that would make your RDBMS's head spin, but the reverse is also true. HBase is good for what HBase is good for, and it is terrible at nearly everything else. If you try to represent your whole RDBMS schema as-is in HBase, you will experience a searing hot migraine that will make your head explode.
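To make that concrete, here is a minimal Python sketch of how an HBase-style table behaves. It involves no HBase client at all: it only models the idea that rows live in one map sorted by row key, so the cheap operations are "get this key" and "scan this key range." The `SortedTable` class, the `user|reversed-timestamp` key layout, and the sample events are hypothetical names chosen for illustration, not anything from HBase itself.

```python
import bisect

MAX_TS = 10**13  # hypothetical upper bound for millisecond timestamps


def row_key(user_id: str, ts_millis: int) -> bytes:
    """Composite key: user id, then a reversed timestamp so newest sorts first."""
    return f"{user_id}|{MAX_TS - ts_millis:013d}".encode()


class SortedTable:
    """Toy stand-in for an HBase-style table: rows kept sorted by key bytes."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, key: bytes, columns: dict) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows.setdefault(key, {}).update(columns)

    def get(self, key: bytes):
        return self._rows.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, columns) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


def events_for_user(table: SortedTable, user_id: str) -> list:
    """Range scan over one user's keys; '}' is the byte right after '|'."""
    start = user_id.encode() + b"|"
    stop = user_id.encode() + b"}"
    return [cols for _, cols in table.scan(start, stop)]


table = SortedTable()
table.put(row_key("alice", 1000), {"event": "login"})
table.put(row_key("alice", 2000), {"event": "click"})
table.put(row_key("bob", 1500), {"event": "login"})

print(events_for_user(table, "alice"))
```

"All of Alice's events, newest first" is one contiguous range and costs almost nothing. "Every login by anyone" has no key to lean on, so in this model it means walking the whole table, which is exactly the kind of query your RDBMS would answer with an index and a shrug. Design the key around the questions you will ask, not around the schema you already have.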

7. Installing 100 nodes by hand. Oh my gosh, really? You are going to hand-install Hadoop and all its moving parts on 100 nodes by Tuesday? Nothing beats those hand-rolled bits and bytes, huh? That is all fun and good until someone loses a node and you're hand-rolling those too. At the very least, use Puppet -- actually, use Ambari (or your distribution's equivalent) first.

8. RAID/LVM/SAN/VMing your data nodes. Hadoop stripes blocks of data across multiple nodes, and RAID stripes data across multiple disks. Put them together, what do you have? A roaring, low-performing, high-latency mess. This isn't even turducken -- it's like roasting a turkey inside of a turkey. Likewise, LVM is great for internal file systems, but you're not really going to randomly decide all hundred of your data nodes need to be larger, instead of, like, adding a few more data nodes.

And your SAN, your holy SAN -- loved by many, I/O bound, and latent to all. You're using HDFS for a higher burst rate, so now you're going to stick everything back in the box? The idea is to scale horizontally -- how are you going to do that across the same network pipe to the same box o' disks?

Hey, EMC will sell you more SAN, but maybe you need to think outside the box. VMs are great. However, if you want high-end performance, I/O is king. Fine, you can virtualize the name node and much of the rest of Hadoop, but nothing beats bare metal for data nodes. You can achieve much of the same advantage as virtualization with devops tools. Even most of the cloud vendors are offering metal options.
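On the RAID point in particular, the usual alternative is to hand each data node its disks individually and let HDFS spread blocks over them. A small Python sketch of what that configuration value looks like follows. `dfs.datanode.data.dir` is the Hadoop 2.x property that takes a comma-separated list of directories; the `/data/N` mount points, the `hdfs/data` subdirectory, and the helper name are assumptions made up for this example.

```python
def datanode_data_dirs(mount_points, subdir="hdfs/data"):
    """Build a comma-separated directory list, one entry per physical disk."""
    return ",".join(f"{m.rstrip('/')}/{subdir}" for m in mount_points)


# Hypothetical mount points: one plain file system per unstriped disk.
mounts = [f"/data/{i}" for i in range(1, 5)]
value = datanode_data_dirs(mounts)

print(f"dfs.datanode.data.dir = {value}")
```

Each directory sits on its own spindle, so there is no second layer of striping underneath the one Hadoop already does across nodes. In practice you would let Ambari or Puppet write this value for you rather than generate it by hand.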

9. Treating HDFS as just a file system. If you dump stuff onto HDFS, you haven't necessarily accomplished anything. The tooling around it is important, of course. Now you can Hive, Pig, and MapReduce it, but you have to think a bit about what, why, and where you're dumping things onto HDFS. You need to think about how you're going to secure all of this and for whom.
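One way to force the "what, why, where, and for whom" conversation is to write the answers down as code before anything lands on HDFS. The Python sketch below does only that: it builds a path and looks up who is supposed to read it. The zone names, group names, `/data` root, and `dt=` partition key are all hypothetical. The `key=value` directory naming follows the convention Hive uses for partitions, which keeps the data easy to query later.

```python
import posixpath
from datetime import date

# Hypothetical zones and the group meant to read each one.
ZONES = {
    "raw": "etl",           # untouched source data, ingest jobs only
    "curated": "analysts",  # cleaned data that is safe to query
}


def landing_path(zone: str, dataset: str, day: date, root: str = "/data") -> str:
    """Return the HDFS directory for one dataset partition in a known zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return posixpath.join(root, zone, dataset, f"dt={day.isoformat()}")


def reader_group(zone: str) -> str:
    """Return the group that should have read access to a zone."""
    return ZONES[zone]


path = landing_path("raw", "clickstream", date(2014, 1, 2))
print(path, "readable by", reader_group("raw"))
```

The helper refuses to invent a location for a zone nobody has defined, which is the point: if you can't say which zone a data set belongs to and who reads it, you aren't ready to dump it.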

10. Whoo, shiny! Also known as, "today is Thursday, let's move to Spark." Yes, Hadoop is a growing ecosystem, and you want to stay ahead of the curve. I feel you, man, but let's remember that freedom is just another word for nothing left to lose. Once you have real data and real users, you don't have the same amount of freedom as when you had no real responsibility. Now you must have a plan.

Fortunately, you have the tools to manage that evolution and move forward responsibly. Maybe you don't get to deploy this week's cool thing while it is fresh, but you don't have to run Hadoop 1.1 anymore, either. As with any technology -- or anything in life -- find that moderate path that prevents you from being the last gazelle in the pack or the first lemming off the cliff.

This is the current top 10 I'm seeing in the field. How's your big data project going? What anti-patterns or patterns have you found?

This article, "The 10 worst big data practices," was originally published at InfoWorld.com. Keep up on the latest news in application development and read more of Andrew Oliver's Strategic Developer blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

Andrew C. Oliver is a columnist and software developer with a long history in open source, databases, and cloud computing. He founded Apache POI and served on the board of the Open Source Initiative. Oliver has helped with marketing in startups including JBoss, Lucidworks, and Couchbase. He advises startups on marketing, growth, and outreach.