From Scale-Out to Big Data to the Cloud

Enterprise storage is hot right now, and file-based storage is even hotter. EMC just spent $2.25 billion to acquire Isilon Systems, and perennial NAS contender NetApp is on a tear. Clearly, “big data,” as it’s called, is on the rise. But how big can data get with conventional systems? For the answer, we need look no further than web giants like Facebook, Amazon, and Google, which rely on massive cloud storage systems that dwarf any incumbent enterprise storage product.

Big Data and Bigger Data

It’s hard to say just how large the mega cloud storage repositories are, but estimates for the big players like Google and Amazon are in the multi-petabyte range. It’s safe to say that the popular cloud storage systems are an order of magnitude larger than anything encountered in enterprise data centers. It’s not just capacity, either: Amazon reportedly hosts over 100 billion objects, far more than any conventional or scale-out NAS system.

Most of the discussion of cloud storage has focused on the business model and the technical challenges posed by these new systems. But the Isilon acquisition and the big data challenge raise a question: Could cloud storage represent a threat to EMC’s investment from a technical standpoint? Is cloud the ultimate evolution of this big data trend?

Perhaps it’s best to start with a definition of “big data.”  Most are either self-serving or pragmatic: Wikipedia, in its infinite wisdom, defines big data as “datasets that grow so large that they become awkward to work with.”  Others agree, noting also the performance challenge these repositories pose. In other words, it is the limitations of conventional systems that define big data, not the vastness of the data set itself.

Limitations of Conventional Systems

Will conventional NAS devices ever be able to accommodate these large datasets? Certainly yesterday’s standalone NAS systems cannot, and this is why enterprise storage vendors are investing in unconventional solutions. EMC’s Isilon looks positively pedestrian compared to its homegrown Atmos product. NetApp answered with its acquisition of Bycast, and even smaller companies see an opportunity: Overland Storage recently bought MaxiScale for just this reason.

This is not to say that mere scalability is the ultimate solution for big data. Indeed, many big data challenges require levels of performance that cannot be achieved by conventional systems.  This is one reason for EMC’s investment in Isilon, and why popular web properties like Amazon, Google, and Facebook do not use conventional enterprise NAS filers for storage. These applications require high performance, massive capacity, and uninterrupted scalability that cannot be achieved with conventional systems.

Big Data in the Enterprise

Although most enterprises do not have the vast data sets of web properties like Amazon, they face their own file storage scalability challenges. Storage infrastructures keep expanding, much to the delight of EMC, NetApp, and others. But these are often fragmented and scattered across many devices. Companies like F5, IBM, and Symantec are developing applications to better manage, and even merge, these scattered devices. But the next wave of storage system development points to unification in a so-called “scale-out” architecture.

Scale-out has its own challenges, however. It is extremely difficult to spread the load of protocols like SMB and NFS across multiple storage devices. Microsoft has extended its popular Windows protocol (SMB) for scale-out (DFS) and distributed caching (BranchCache), and the consortium of companies behind NFS is moving in this direction as well. They recently delivered version 4.1 of that protocol, which includes parallel NFS (pNFS), allowing far greater scalability than before. But NFS and SMB still face many inherent limitations that will be difficult to overcome.

Cloud Scale

The protocols used by cloud storage systems are entirely different, and their nature allows for massive and automatic scalability, flexibility, and integration with applications. Rather than the “tree of folders” metaphor familiar to desktop users, cloud storage systems typically organize data as objects in large buckets, each object with a unique ID. This allows data to be distributed widely, even across geographies, without impacting applications or users. It also means that cloud providers can move data in the background, migrating to new systems with no user impact. In addition to decoupling access from data location, these protocols also allow additional information to ride along with data. Called metadata, this is the next frontier of cloud storage and will likely be the differentiator of next-generation products.
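To make the object model concrete, here is a minimal in-memory sketch in Python. It is not any provider's actual API: the Bucket and StoredObject classes, the "site-a" and "site-b" location labels, and the migrate method are all hypothetical names chosen for illustration. It shows only the three ideas described above: a flat bucket instead of a folder tree, a unique ID per object, and metadata that travels with the data while the physical location stays hidden from the caller.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: the payload, its metadata, and where it currently lives."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    location: str = "site-a"  # hypothetical placement label, invisible to callers


class Bucket:
    """A flat namespace: objects are looked up by ID, never by folder path."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, data, metadata=None, location="site-a"):
        # Generate a random unique ID; real services use their own key schemes.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = StoredObject(data, dict(metadata or {}), location)
        return object_id

    def get(self, object_id):
        # Callers receive data and metadata, but never the physical location.
        obj = self._objects[object_id]
        return obj.data, obj.metadata

    def migrate(self, object_id, new_location):
        # Background move: the ID used by applications does not change.
        self._objects[object_id].location = new_location

    def location_of(self, object_id):
        # Operator-side view only, included here so the move can be observed.
        return self._objects[object_id].location
```

In this sketch, an application that stored an object and kept its ID can call get with that same ID before and after a migrate call and receive identical data and metadata, which is the decoupling of access from location that the paragraph describes.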

One roadblock for cloud storage has been the very nature of this API-like access protocol. Even if a business is comfortable using a public service provider like Amazon, Rackspace, or Nirvanix to store its data, its applications may not be capable of communicating with these systems. This will be addressed from two directions: Software vendors are increasingly adding native cloud storage support to their applications, and a new breed of cloud storage gateways from Cirtas, Nasuni, and others bridges the gap between public cloud and conventional protocols.
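The gateway idea can be sketched as a thin translation layer between file paths and object keys. The Python below is only an illustration under stated assumptions: PathGateway and its methods are hypothetical, a plain dictionary stands in for the remote object store, and hashing the path to form the key is one arbitrary choice. It does not describe how Cirtas, Nasuni, or any other vendor actually builds its product.

```python
import hashlib


class PathGateway:
    """Presents file-style paths on top of a flat key/value object store."""

    def __init__(self, object_store=None):
        # A dict stands in for the remote object store (assumption).
        self._store = {} if object_store is None else object_store
        # Local index mapping each file path to its object key.
        self._index = {}

    def write_file(self, path, data):
        # Derive a flat object key from the path; the store never sees folders.
        key = hashlib.sha256(path.encode("utf-8")).hexdigest()
        self._store[key] = data
        self._index[path] = key
        return key

    def read_file(self, path):
        # Translate the path back to its key, then fetch the object.
        return self._store[self._index[path]]

    def list_dir(self, directory):
        # Emulate a directory listing by prefix-matching indexed paths.
        # This returns every path beneath the directory, not only direct children.
        prefix = directory.rstrip("/") + "/"
        return sorted(p for p in self._index if p.startswith(prefix))
```

An application that only understands paths could call write_file("/projects/report.doc", data) and later read_file with the same path, while the backing store holds nothing but flat keys.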

Regardless of the method of access, one major advantage of using cloud storage in the enterprise is the ability to tap into this massively scalable system. Rather than waiting for traditional vendors to deliver on the promise of scale-out NAS, businesses are already using cloud storage at massive scale. Even scale-out NAS systems like EMC/Isilon, Symantec FileStore, and IBM GPFS cannot touch public cloud services: Amazon is an order of magnitude more scalable in terms of capacity and number of objects. The next wave of smart cloud-enabled applications and gateways will crack the performance nut as well, making public cloud the ultimate repository for big data.