If there is one topic in computing, networking and storage that is commanding attention these days, it’s Big Data. We read about it in the media, we hear about it at conferences and seminars, we learn about it in podcasts and webinars. In fact, we can now even enroll in courses of study on it. Today the technology community is working closely with academia to create new programs leading to a major in data science.
But for IT organizations, big challenges remain around Big Data and, specifically, its analysis. These departments face the daunting task of implementing the infrastructure needed to harness the information. Specific computing, networking and storage architecture is needed for companies to benefit. How does one, for example, ingest a staggering 30 TB of new data per day (equivalent to roughly 1,740 HD-quality movies), analyze it, store it for possible re-analysis, and archive it? If you’re counting, that is roughly 11 PB annually. Even that figure is expected to rapidly become trivial as we enter the Exabyte era – 1,000 PB or more annually.
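To make that arithmetic concrete, here is a quick back-of-the-envelope check in Python, assuming the decimal units (1 PB = 1,000 TB) the article uses:

```python
# Back-of-the-envelope check of the ingest figures above.
# Assumption: decimal units, i.e. 1 PB = 1,000 TB.
daily_ingest_tb = 30                # 30 TB of new data per day
annual_tb = daily_ingest_tb * 365   # 10,950 TB per year
annual_pb = annual_tb / 1_000       # just under 11 PB per year
print(annual_pb)
```

At 30 TB a day, a single year of retained ingest already approaches the petabyte-scale yardstick discussed later in the article.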
As fast as the technology has developed, there remain questions around how businesses can measurably benefit from Big Data and the tremendous insights within the data. There are two distinct paths a company can take in implementing a ‘Big Data Farm’. Choosing one over the other can become a conundrum for IT.
The Big Data Farm is where data is ingested (planted), nurtured, grown, weeded, harvested and finally consumed – or perhaps stored for a later day. But there are two different types of farms: one that uses many disparate servers to hold data, often referred to as direct attach, and another that uses a scale-out design, where all data is housed in a single entity such as a file system.
CTOs and CIOs know that at small scale, using servers with disks is often cost-effective and simple. The first few hundred servers typically pose no significant problems for the IT staff. However, if thousands or tens of thousands of servers are needed to process the data, the storage becomes extremely difficult to manage, specifically from a human point of view. CIOs have to continually hire more staff, train them and try to retain them.
The ‘store data in servers’ method is self-limiting because any given server may need to access another server’s data for analysis. The architectural problem, in other words, is combinatorial. It’s analogous to a thousand people on a conference call, all trying to speak at the same time. This is common in large compute farms, where communication among servers is typically a significant bottleneck, reducing the servers’ ability to perform analysis – which is the point of the exercise. As the size of the farm grows – after all, the more compute one can apply to larger data sets, the more one can learn and the more value one can extract – the direct storage model reaches a limit in its ability to effectively process Big Data.
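The combinatorial growth described above can be sketched in a few lines of Python. This is an illustration, not a model of any particular farm: it simply counts the distinct server-to-server conversations possible when any of n servers may need another's data, which grows as n(n−1)/2.

```python
# Illustration: with n servers, each of which may need data held by
# another, the number of distinct server pairs grows quadratically.
def pairs(n: int) -> int:
    return n * (n - 1) // 2

for n in (100, 1_000, 10_000):
    print(f"{n:>6} servers -> {pairs(n):,} possible pairs")
```

A farm of 100 servers has under 5,000 possible pairs; at 10,000 servers there are nearly 50 million, which is why inter-server traffic, not compute, becomes the bottleneck.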
In contrast, a scale-out storage model with a single file system, delivering data directly to the compute servers in parallel, eliminates the bottleneck. Servers are now free to perform data analysis rather than act as data movement engines. Regardless of scale, managing a scale-out system is simple, since there is only one entity to manage, and efficiencies of scale can be realized: today, dozens of PB can be managed by a single IT staffer. More importantly, this is one repository for the data – one that can scale in size and performance to match the incoming sources of new data and the business need to analyze and store that data.
But the most important aspect of scale-out systems for Big Data is that servers no longer need to move data themselves, which takes valuable time. Time is the ultimate constraint of Big Data. Data in flight between servers – moved merely because an analysis job resides on one server while the data it requires sits on another – is the killer of well-meaning Big Data IT projects. Time is not only money; it is a competitive advantage. Scale-out architectures, especially those that can reposition data internally on different media over time without external movement, mean the end of migrating data. Data is ingested, analyzed, and stored for both the short and the long term within a single entity.
Big Data can be seen as a clash between the irresistible force – the omnipresent and growing flood of new data – and the immovable object, which is time. There are 24 hours in a day, and they aren’t making any more! Businesses must therefore recognize that moving data from point A to point B merely to position it for analysis is a losing exercise. Consider, for example, how long it takes to move one petabyte of data between servers in a compute farm. Even at 10 gigabytes per second, which is very fast by today’s standards, it takes 100 seconds to move a terabyte – a little under two minutes. That’s no problem. But it takes 1,000 times as long to move a petabyte: 100,000 seconds, or nearly 28 hours. If your IT infrastructure can only move 1 gigabyte per second, it will take you the better part of 12 days. And that’s just one petabyte, which is fast becoming the yardstick for Big Data.
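The transfer times above follow from simple division; a short Python sketch (again assuming decimal units) recomputes them:

```python
# Recomputing the data-movement times quoted above.
# Assumption: decimal units (1 GB = 10^9 bytes, etc.).
GB, TB, PB = 10**9, 10**12, 10**15

def transfer_seconds(size_bytes: int, rate_bytes_per_s: int) -> float:
    """Time to move size_bytes at a sustained transfer rate."""
    return size_bytes / rate_bytes_per_s

print(transfer_seconds(TB, 10 * GB))         # 100 s: 1 TB at 10 GB/s
print(transfer_seconds(PB, 10 * GB) / 3600)  # ~27.8 h: 1 PB at 10 GB/s
print(transfer_seconds(PB, GB) / 86400)      # ~11.6 days: 1 PB at 1 GB/s
```

Even a tenfold improvement in network throughput only moves the petabyte problem from days to hours; it does not remove the cost of moving the data at all.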
The bottom line is this: once ingested, Big Data should not have to move. Why waste time shuttling data between servers? Analysis jobs must be able to read the data directly, analyze it and write results directly, without files moving between servers. That is why scale-out is the optimal approach – it lets IT keep up with Big Data. If you take one thought away about Big Data, consider this: it’s all about scale, and scale-out is the architecture that matches the challenges Big Data poses.
About the Author
Rob Peglar is Chief Technology Officer, Americas at EMC Isilon. A 35-year industry veteran and published author, he leads efforts in technology and architecture with strategic customers and partners throughout the Western Hemisphere, and helps to define future EMC offering portfolios incorporating Isilon, including business and technology requirements. He is an Advisor to the Board of Directors of the SNIA, is co-chair of the SNIA Analytics and Big Data Committee, and is former Chair of the SNIA Tutorials. He has extensive experience in data management and analysis, distributed cluster architectures, I/O performance optimization, cloud storage, replication and archiving strategy, storage virtualization, disaster avoidance and compliance, and is a sought-after speaker and panelist at leading storage and cloud-related seminars and conferences worldwide. He was one of 25 senior executives worldwide selected for the CRN ‘Storage Superstars’ Award of 2010, and is one of 200 senior executives serving as a judge for the American Business Awards ‘Stevie’