
Why Building A Distributed Data Supply Chain Is More Important Than Big Data

Greater Cincinnati transit map (Photo credit: Wikipedia)

It is time to stop the stampede to build capacity to analyze big data and instead pursue a more balanced approach that focuses on finding more data sets and understanding how to use them to improve your business. The goal should not be to create one big factory that can handle any data set, no matter how big. Instead, we should be creating an extended supply chain that accepts data from a wide variety of internal and external sources, processes that data in the various nodes of the chain, passes data to where it is needed, transforms it as it flows, stores key signals and events in central repositories, triggers action immediately when possible, and queues data for deeper analysis. The era of the massive data warehouse is coming to an end. The era of the distributed data supply chain is just beginning.

The forces driving this transformation are as follows:

Big data analysis will become a product. There is not enough big data expertise to go around. The infrastructure to process big data can be obtained cheaply in the cloud. Those who understand how big data can help in an industry context are setting up shop. At the high end, companies like Opera Solutions allow you to plug into data and advanced analytics. More targeted, industry-specific offerings aimed at particular business processes arrive every day. It is possible to gain competitive advantage by doing your own big data analysis, but my advice is to take on that challenge later, after you have improved your use of data as suggested below.

Quality of data is more important than quantity. What’s the easiest way to improve the effectiveness of a model? As any advanced statistician or data scientist knows, it’s better data. It stands to reason that companies should be on a mission to improve the quality of their data and to find more sources of high-quality data.

More data that matters is more important than the size of any data set. While big data is a great new source of insights, it is only one of myriad sources of data. A wide-ranging search for more data is in order. In addition, opportunities to create new data should be explored. Logging events from your applications, for example, can create new data sources that are meaningful and provide a competitive advantage.
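
To make that concrete, here is a minimal sketch, assuming a Python application and using only the standard library; the event names and fields are hypothetical. The application emits one JSON object per line, creating a new, proprietary data source that downstream systems can tail, parse, and load.

    import json
    import logging
    import time

    # Write one JSON object per line ("JSON Lines"), a format that is
    # easy for downstream systems to tail, parse, and load.
    logger = logging.getLogger("events")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.FileHandler("app_events.log"))

    def log_event(event_type, **fields):
        """Record a structured business event as one line of JSON."""
        record = {"ts": time.time(), "event": event_type, **fields}
        logger.info(json.dumps(record))

    # Example usage; the event names and fields are illustrative only.
    log_event("quote_requested", customer_id="C-1042", product="widget")
    log_event("quote_accepted", customer_id="C-1042", amount=1250.00)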

The number and value of external data sets will rise. As I pointed out in the article “How External Data Opens Up a New Frontier for Business Intelligence” and in the series on the Data Not Invented Here syndrome (“Do You Suffer From the Data Not Invented Here Syndrome?”, “How to Create a Nervous and Vascular System for External Big Data”, “Mission: How to Find and Use External Data”), it doesn’t make sense to suffer from the Data Not Invented Here syndrome. There will be as much or more valuable data outside your company as inside it. Large companies like UPS are getting into the data business, and so are plenty of small ones. Finding new sources of valuable data will become a form of business development.

Focusing your efforts is crucial. The wealth of technology and data available means that there is a significant danger of diluting your efforts. It is vital to be able to say no to both data and technology with confidence so that you can focus on what will bear the most fruit. To do this, however, you must understand what you want to do. The problem of business and IT alignment is more urgent than ever. My suggestion is to use a simple approach such as The Question Game to achieve alignment. The most important thing is that the tech staff, from the CIO to the developer, know what the business is trying to achieve.

Broadening use of data is the key value creator. Of course, none of this matters if the value of the data isn’t making a difference in your business. After decades of effort on business intelligence, the penetration of BI usage in most companies is below 30 percent. We have miles to go. It is important to recognize that increasing use of data is a fundamental, organization-wide problem. Many strategies and technologies will be required.

If these observations hit home with you, even partially, it should no longer make sense to be obsessed solely with big data. My view is that most companies should find ways to experiment with big data and be open to products that deliver its benefits, but should balance these efforts by expanding their data processing infrastructure to handle data in distributed repositories that are both sources and destinations.

The end result will be a construct that looks a lot more like a supply chain with many interacting nodes than the hub-and-spoke model with a data warehouse or a large Hadoop cluster at the center. In this supply chain, some of the nodes will be outside the company and some will be inside. The ability to move data around and process it in motion will be critical. In other words, we will have many nodes in a distributed data supply chain, and we will need to be good at building nodes that communicate with each other and meet many different types of needs.
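
As a rough illustration of what one node in such a supply chain might do, here is a minimal sketch in Python; the record fields, the threshold rule, and the stores are all hypothetical. The node accepts a stream of records, transforms them as they flow, stores key signals, triggers action immediately when a rule fires, and queues everything for deeper analysis.

    from queue import Queue

    deep_analysis_queue = Queue()   # records held for later batch analysis
    signal_store = []               # stand-in for a central signal repository

    def transform(record):
        """Transform data as it flows: normalize a hypothetical reading."""
        record["reading"] = float(record["reading"])
        return record

    def process(stream):
        """One node: accept, transform, store signals, act, and queue."""
        for record in map(transform, stream):
            if record["reading"] > 100.0:       # hypothetical trigger rule
                signal_store.append(record)     # store the key signal
                print("alert:", record)         # trigger immediate action
            deep_analysis_queue.put(record)     # queue for deeper analysis

    # Example usage with records from an upstream, internal or external, node.
    process([{"sensor": "s1", "reading": "42"},
             {"sensor": "s2", "reading": "117"}])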

Enabling Technology for a Distributed Data Supply Chain

If you find this vision compelling, the next step is to figure out how to get it done. Selecting the right technology for any purpose is a challenging job that should involve gathering internal requirements, assessing the cultural fit and understanding how a product fits into both technology and business strategy going forward.

A number of companies have products well suited to supporting a distributed data supply chain. Here are a few to consider:

  • Actian has made a number of acquisitions recently and has assembled technology for integrating data from many sources, processing it at scale, and delivering it in multiple types of databases, each suited to a different purpose. (See the CITO Research publication “Research Note: Actian’s Strategy for Distributed Big Data” for a description of Actian’s approach to supporting a distributed data supply chain.)
  • Splunk can harvest data from many different sources, distill and combine it, and then allow end users to explore it. Applications for exploring data or viewing real-time flows of information are easy to create. Splunk has made machine data its focus, but it is underappreciated as a way to expand the number of people who can explore structured and unstructured data, big or otherwise, using its Search Processing Language.
  • Apigee Insights provides a way to present data for use through APIs, and also to maintain a repository associated with those APIs that keeps the data needed during high-performance or chatty interactions close by.
  • Hadoop will be one of the nodes in the data supply chain envisioned above, but only if it is easy to move data in and out of it. MapR Technologies’ M7 Hadoop distribution is unique in providing support for NFS, along with a variety of other features and extensions that support enterprise use. The NFS support is crucial because it allows simple patterns of use, such as updating a file and having Hadoop process it using common, simple mechanisms like the tail command.
  • Pentaho offers an open source-based collection of tools to harvest data from many sources, integrate and clean it, and then build analytical apps of varying power. The open source model could work well for companies that want to extend the supply chain to a network of partners or suppliers.
  • GoodData provides a cloud-based solution for a BI and data analytics stack that could become the hub of a data supply chain. This model could work for a consortium of companies that do not want to have a central infrastructure that is captive to any one of them.
  • 1010data provides a spreadsheet-like experience for big data sets, supplemented by a portfolio of advanced analytics. Analysts can have direct access to data, avoiding bottlenecks associated with other approaches. One of the coolest features of 1010data is the ability to incrementally chip away at a data set, executing and rolling back commands as you go and taking snapshots when you get to a good spot.
  • SiSense provides a convenient collection of functionality that is useful for distributed processing. For example, if you find that a partner company has data that is useful to you, instead of transferring all of the data, SiSense could be used to process that data at the partner’s data center and then ship the distilled data containing events and signals to your data center. (A sketch of this distill-and-ship pattern appears after this list.)
  • For companies with a large SAP application footprint, SAP HANA can act as a consolidation and distribution node that can handle both big data and real-time information. SAP has productized integration with SAP HANA so that data can flow easily to and from the system from any of its applications, from the SAP BusinessObjects BI suite, and through all of the Sybase capabilities for processing data at scale or creating mobile apps.
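
To make the distill-and-ship pattern from the SiSense item concrete, here is a minimal, vendor-neutral sketch in Python; the signal definition, field names, and ingest URL are all hypothetical. The idea is that aggregation runs where the data lives, at the partner’s data center, and only the distilled summary crosses the wire.

    import json
    from collections import Counter
    from urllib import request

    def distill(records):
        """Run at the partner's data center: reduce raw rows to signals."""
        summary = Counter()
        for r in records:
            if r["status"] == "error":    # hypothetical signal of interest
                summary[r["region"]] += 1
        return {"error_counts_by_region": dict(summary)}

    def ship(summary, url="https://example.com/ingest"):  # hypothetical URL
        """Send only the distilled summary, not the raw data set."""
        body = json.dumps(summary).encode("utf-8")
        req = request.Request(url, data=body,
                              headers={"Content-Type": "application/json"})
        return request.urlopen(req)

    raw = [{"status": "error", "region": "east"},
           {"status": "ok", "region": "west"}]
    # ship(distill(raw))  # in practice, run on a schedule at the partner site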

There are hundreds of other technologies that can enable a distributed data supply chain. Instead of focusing on just the data warehouse node or the node that processes big data, more value will accrue from finding ways to build nodes that extend the reach of BI and expand the use of data both inside and outside your company.

Dan Woods is CTO and editor of CITO Research, a publication that seeks to advance the craft of technology leadership. For more stories like this one visit www.CITOResearch.com. Dan has performed research for many big data and business intelligence companies including 1010data, GoodData, MapR Technologies, Splunk, and SAP.