UPDATED 11:16 EST / APRIL 17 2014

Diving into Big Data: Data lakes vs. data streams

Data lakes and data streams are becoming common analogies in discussions of analytics and enterprise big data strategy. The analogies are apt in several respects beyond simply visualizing different approaches to accessing useful information. As in nature, lakes and streams each have their own characteristics, and each is important to the overall ecosystem. The question is not whether one should exist at the expense of the other. Both provide benefits, and the key is understanding how best to use each.

Fishing for Trends in Data Lakes


Data lakes, as in nature, represent a large pool: in this case, a body of information that has built up over time. Man-made lakes are often formed by damming a river to harness its power in a controlled fashion for later use.

Data lakes operate in much the same way, with a deluge of information diverted into a large repository where it can be held for a long period until it is needed. A lake is constantly fed by new water flowing in, and in fact depends on that constant flow to keep the environment vibrant. Without it, the lake could stagnate.

Similarly, data lakes must constantly be enriched by current flows of information to ensure that the overall data set remains relevant. However, this also means that the storage capacity of the lake must constantly be expanded to accommodate the new data being added to the existing base of legacy information.

The primary challenge with data lakes is determining how best to generate useful outputs, since the huge volume of water in the lake holds some valuable information but also a tremendous amount of data that is not useful or relevant. Extracting meaningful analysis from a data lake is like fishing in a lake for a particular type of fish. If you use only a single fishing pole, your chances of catching that one specific fish across the whole lake are small unless you spend a significant amount of time on the effort. You can improve your chances by using a net to cover a larger area at once, but a net also brings in a great deal of extraneous material along with the data that is most relevant, so you must spend time sorting the relevant from the irrelevant.

In both cases, while you are fishing in one area, you may miss new input flowing into a different area. By the time your fishing is complete, you may have missed new information that would have changed the analysis. This is not to say that data lakes are not useful, only that their use must be tailored to their characteristics. Data lakes are best used where significant historical perspective is needed, especially to examine trends over a longer period of time.
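As a rough illustration of this kind of lake-style analysis, the following minimal Python sketch scans an entire stored data set and summarizes it by month. The record layout, the sample values, and the names `LAKE_RECORDS` and `monthly_trend` are assumptions made for this example, not part of any particular product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records already sitting in the lake: (month, metric value).
LAKE_RECORDS = [
    ("2013-11", 120.0),
    ("2013-11", 130.0),
    ("2013-12", 140.0),
    ("2013-12", 160.0),
    ("2014-01", 170.0),
    ("2014-01", 190.0),
]


def monthly_trend(records):
    """Scan every stored record and return the average value per month."""
    by_month = defaultdict(list)
    for month, value in records:
        by_month[month].append(value)
    # Sort by month so the result reads as a trend over time.
    return {month: mean(values) for month, values in sorted(by_month.items())}


trend = monthly_trend(LAKE_RECORDS)
print(trend)
```

The whole data set is read before any answer is produced, which suits long-range trends but means that data arriving during the scan is not reflected in the result.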

Digging for Real-Time Nuggets with Data Stream Analysis


Analysis using data streams is a fundamentally different approach from analysis using data lakes. Rather than diverting the flow to store it and analyze it later, stream analysis occurs as the information flows, in real or near-real time. Working in data streams is much like panning for gold: as the data stream passes by, analysis runs in parallel to capture the relevant nuggets of information that best address specific questions or areas of concern.

The primary value of this approach is that information can be accessed quickly and insights gleaned rapidly. Given the dynamic environment enterprises now operate in, it is often imperative that anomalies or real-time trends be understood quickly, so that appropriate action can be taken before they significantly impact service or revenue. Data stream analysis is the most effective way to manage in this challenging real-time environment.

However, operating in the data stream comes with its own challenge: actually extracting the most valuable elements from the overall flow. The stream often runs so fast and is composed of so many different elements that separating real gold from fool's gold is where many operations get stuck in the mud and ultimately fail to deliver. The topic of big data has generated such hype that it now resembles a gold rush, with many potential players pouring into the market with ideas and claims that promise quick strikes. It is important to evaluate all of these approaches to determine which players can make legitimate claims and which have invested appropriately to truly manage data stream analysis.

An effective system for data stream analysis must be able to handle billions of transactions on a consistent basis. Additionally, the system must be able to take information not just from one stream but from several streams at the same time. The richest insights come from combining information from multiple sources, which creates a fully formed perspective on the situation rather than a single viewpoint. In prospecting, you want to take raw material from a number of different places within the claim to have the best chance of finding the mother lode. Similarly, taking information from many sources in the network is the best way to ensure that the greatest value is realized from the information.

Finally, operating in the stream requires more than just being able to handle the fast-running flow of information. The analysis methodology must be constructed so that it obtains the particular data most relevant to the situation being examined. This amounts to creating the right type of sieve, one that can quickly pull out the proper pieces of data and discard the mass of extraneous material.
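A minimal Python sketch of the sieve idea, applied to several streams at once, might look like the following. The event layout `(timestamp, source, value)`, the threshold rule, and the names `merge_streams` and `sieve` are all assumptions made for illustration. A production system handling billions of transactions would need far more than this.

```python
import heapq


def merge_streams(*streams):
    """Interleave several time-ordered streams into one, ordered by timestamp."""
    return heapq.merge(*streams, key=lambda event: event[0])


def sieve(events, threshold):
    """Yield only events whose value exceeds the threshold; discard the rest."""
    for timestamp, source, value in events:
        if value > threshold:
            yield timestamp, source, value


# Hypothetical events from two sources: (timestamp, source, value).
stream_a = [(1, "a", 10), (3, "a", 95), (5, "a", 12)]
stream_b = [(2, "b", 99), (4, "b", 8)]

# Events are examined one at a time as they pass; nothing is stored first.
nuggets = list(sieve(merge_streams(stream_a, stream_b), threshold=90))
print(nuggets)
```

Both functions are lazy, so each event is inspected once as it flows past and then either kept or dropped. The simple threshold stands in for whatever business-specific rule defines a nugget in a given situation.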

The art and science of performing this type of analysis requires a thorough understanding of the business environment combined with command of the complexities of data science. This is a rare set of capabilities, but without it the gold will not be extracted.

Ready to Jump into the Water


In summary, data lakes and data streams are both valid approaches to managing information analysis in an enterprise. Each, however, targets a very different type of analysis and must be implemented appropriately to get the most value. Data lakes are best used to fish for broad historical perspectives and trends. Data stream analysis is most effective for extracting real-time nuggets of gold to understand the current environment and react quickly and efficiently, thereby maximizing value and improving operations and the overall customer experience. With this difference in mind, enterprises can devise their big data strategies based on their immediate and long-term business needs.

 

About the Author

Rob Chimsky, Guavus, Vice President, Insights

With over 30 years in the telecommunications industry, Mr. Chimsky brings an impressive track record across technology, product, and marketing. In his most recent position, at inCode, Chimsky was responsible for managing the company’s technology thought leadership and knowledge capital, guiding clients on technology evolution strategy including migration to LTE and spectrum management, with an emphasis on next generation wireless technology and services. Mr. Chimsky also led inCode’s work with private equity clients in the evaluation of potential acquisitions in the telecom industry. Before joining inCode, he spent over seven years as Vice President of Technology Development at Nextel, overseeing Nextel’s dramatic growth from niche carrier to one of the top carriers in the US. Prior to Nextel, Chimsky spent a total of 12 years in management positions at MCI and AT&T.

photo credit: Hani Amir via photopin cc
photo credit: JD Hancock via photopin cc
