Avoiding Big Problems with Big Data

According to various surveys, about half of all organizations are either using or evaluating Hadoop for Big Data analytics.  Hadoop has become a strategic platform for analyzing Big Data because it provides a purpose-built and cost-effective way to capture, organize, store, search, share, and analyze disparate data (whether structured, semi-structured and/or unstructured) across a large cluster of inexpensive, commodity servers, and is capable of scaling from tens to thousands of nodes.

Most organizations begin by testing or piloting Hadoop to determine its value in analyzing different data sources for different applications.  The criteria for success in this testing phase typically focus on the analysis itself.  However, when Hadoop moves from the test or pilot phase to a full-scale production rollout, a whole new set of criteria needs to be considered.

Production Deployments Bring Additional Requirements

During the pilot, the user community is limited, and expectations are normally low (at least at first).  The datasets being used are not critical or even particularly valuable, so losing or corrupting files becomes a lesson learned and not a real catastrophe.  Downtime to recover from a failure, or to upgrade or reconfigure the cluster, is readily tolerated as part of the learning experience.

In a production cluster, however, the situation changes dramatically.  Even if the cluster initially supports only a single application, others will eventually be added, resulting in a “multi-tenant” environment.  Such multi-tenancy involves different users with a range of skillsets running different analyses on different datasets at different times.  With these overlapping uses, both planned and unplanned downtime can cause serious disruptions that have adverse consequences for the organization.  Even routine changes can be difficult to coordinate across multiple users and groups, so acceptable downtime windows keep shrinking.

Production use reveals two significant limitations in some distributions of Hadoop.  The first is that Hadoop’s basic file replication offers no protection against application or user errors.  Files corrupted by these errors (and they can be quite common) are simply replicated, and the problem might not be discovered until much later.  Recovery is possible if the distribution supports snapshots, which make it possible to return to an earlier point in time and recover files or entire directories.  Without snapshot support, the corrupted data is lost.
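The difference between replication and snapshots can be illustrated with a toy in-memory model.  This is only a sketch: the `ToyCluster` class, its methods, and the file path are hypothetical names made up for illustration and do not correspond to any Hadoop API.

```python
import copy


class ToyCluster:
    """Toy in-memory model (not a Hadoop API): every write reaches all replicas."""

    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]
        self.snapshots = {}

    def write(self, path, data):
        # Replication copies whatever is written, good or bad, to every replica.
        for replica in self.replicas:
            replica[path] = data

    def read(self, path):
        return self.replicas[0][path]

    def snapshot(self, name):
        # A snapshot keeps a point-in-time copy that later writes cannot change.
        self.snapshots[name] = copy.deepcopy(self.replicas[0])

    def restore(self, name, path):
        # Recover one file from an earlier point in time.
        self.write(path, self.snapshots[name][path])


def demo_replication_vs_snapshot():
    cluster = ToyCluster()
    cluster.write("/data/sales.csv", "id,amount\n1,100\n")
    cluster.snapshot("nightly")

    # A user error truncates the file; the bad version lands on every replica.
    cluster.write("/data/sales.csv", "")
    corrupted_everywhere = all(
        replica["/data/sales.csv"] == "" for replica in cluster.replicas
    )

    # Only the snapshot still holds the good version.
    cluster.restore("nightly", "/data/sales.csv")
    return corrupted_everywhere, cluster.read("/data/sales.csv")
```

In this model the replicas all agree with one another after the bad write, which is exactly why replication alone cannot help; the snapshot is the only place the earlier contents survive.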

The second limitation is the single points of failure in the critical NameNode and JobTracker functions.  To enable recovery from a failure of the Primary NameNode, Hadoop employs a Checkpoint Node (previously called the Secondary NameNode) and a separate Backup Node.  But even when properly implemented, this configuration affords full recovery only from a single failure, and the recovery requires manual intervention.  Any failure, therefore, causes a major disruption and often forces MapReduce jobs to be restarted.  A JobTracker failure requires a similar manual recovery effort.

Enterprise-grade Capabilities for Hadoop

Making data protection enterprise-grade requires re-architecting the Hadoop Distributed File System (HDFS).  HDFS manages and moves all data in batch mode only, and lacks random read/write file access by multiple users or processes.  These restrictions make replication the only means of “protecting” data.

Re-architecting the storage services layer to provide direct access to data via the industry-standard Network File System (NFS) protocol enables Hadoop to support volumes, snapshots and mirroring for all data contained within the cluster.  Volumes make clustered data easier to both access and manage by grouping related files and directories into a single tree structure that can be more readily organized, administered and secured.  Snapshots can be taken periodically to create drag-and-drop recovery points, and mirroring extends data protection to satisfy recovery time objectives.  Local mirroring provides high performance for frequently accessed data, while remote mirroring provides business continuity across multiple data centers, as well as integration between on-premises and private clouds.

In addition to making Hadoop more enterprise-grade, support for industry-standard file access through NFS makes Hadoop more enterprise-friendly.  Any application or user can simply mount the Hadoop cluster, and application servers can then write data and log files directly into the cluster, rather than writing first to direct- or network-attached storage.  Existing applications, utilities, development environments, text editors and other tools can also use standard NFS to access the Hadoop cluster to manipulate data, and optionally take advantage of the MapReduce framework for parallel processing.
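As a minimal sketch of what this means for an application, the function below appends a log line using nothing but ordinary file I/O.  It assumes the cluster has already been mounted over NFS; the function name, the mount point and the file path are hypothetical and chosen only for illustration.

```python
import os
from datetime import datetime, timezone


def append_log_line(mount_root, relative_path, message):
    """Append one timestamped line to a file under a mounted directory.

    mount_root is assumed to be where the cluster is NFS-mounted; the code
    itself uses only standard file calls and knows nothing about Hadoop.
    """
    target = os.path.join(mount_root, relative_path)
    # Create intermediate directories if they do not exist yet.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat()
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(f"{stamp} {message}\n")
    return target
```

A call might look like `append_log_line("/mnt/hadoop-cluster", "logs/app/web.log", "request served")`, where `/mnt/hadoop-cluster` stands in for whatever mount point an administrator has actually configured.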

The more use Hadoop gets, the greater the potential for disruption from both planned and unplanned downtime.  Eliminating the single points of failure is possible by distributing the NameNode and JobTracker functions across multiple nodes to provide automated stateful failover.  The distributed NameNode’s file metadata automatically persists to disk (as with the node’s data), and can also be replicated continuously to two or more other nodes to provide hitless self-healing from multiple simultaneous failures.  This approach also eliminates the need for separate and dedicated Checkpoint and Backup Nodes, making high data availability achievable without extraordinary effort.
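The idea of replicating file metadata across several nodes can be sketched with another toy model.  The `ToyMetadataService` class, the node names and the block identifiers below are all hypothetical; the sketch shows only why lookups keep working after multiple failures when every live node holds a copy, and it does not represent how any particular distribution implements failover.

```python
class ToyMetadataService:
    """Toy model (not a Hadoop API): file metadata copied to several nodes."""

    def __init__(self, node_names):
        self.copies = {name: {} for name in node_names}
        self.alive = set(node_names)

    def _active(self):
        # Failover here is simply "use the first node that is still alive".
        for name in self.copies:
            if name in self.alive:
                return name
        raise RuntimeError("no live metadata node")

    def record(self, path, block_ids):
        self._active()  # raises if no node is left to accept the update
        # Every live node receives the update, so each holds a full copy.
        for name in self.alive:
            self.copies[name][path] = list(block_ids)

    def lookup(self, path):
        return self.copies[self._active()][path]

    def fail(self, name):
        self.alive.discard(name)


def demo_failover():
    service = ToyMetadataService(["node-a", "node-b", "node-c"])
    service.record("/data/sales.csv", ["blk-1", "blk-2"])

    # Two nodes fail at the same time; the third still has the metadata.
    service.fail("node-a")
    service.fail("node-b")
    return service.lookup("/data/sales.csv")
```

With three copies, the lookup survives two simultaneous failures without any manual step, in contrast to a single Primary NameNode that must be recovered by hand.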

These advances in enterprise-grade distributions of Hadoop also make it possible to perform planned upgrades and make other changes to individual nodes or groups of nodes using a more manageable and less disruptive rolling upgrade process.  With most distributions of Hadoop, by contrast, it is necessary to handle these routine upgrades and changes concurrently throughout the entire cluster.
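A rolling upgrade can be thought of as splitting the cluster into small batches that are taken out of service one at a time.  The helper below is a hypothetical sketch of that planning step only; the function name and node names are assumptions, and a real upgrade would also involve draining work from each batch and verifying health before moving on.

```python
def rolling_upgrade_batches(nodes, batch_size):
    """Split nodes into consecutive batches to be upgraded one batch at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size >= len(nodes):
        # Upgrading everything in one batch would take the whole cluster down,
        # which is the cluster-wide approach a rolling upgrade avoids.
        raise ValueError("batch_size must be smaller than the cluster")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
```

For a five-node cluster and a batch size of two, the plan is three batches, and at least three nodes stay in service at every step.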


Most distributions of Hadoop today lack enterprise-grade high availability and data protection capabilities.  Unfortunately, these limitations often escape notice in the test or pilot phase, only to cause problems during production use.  Organizations must actively plan workarounds or limit the applications that rely on the Hadoop cluster to those that can tolerate data loss or disruptions.  Fortunately, Hadoop is maturing, and some commercial distributions now provide both high availability and data protection to make Hadoop fully enterprise-grade.

About the Author

Jack Norris, VP of Marketing at MapR Technologies, has over 20 years of marketing experience with demonstrated success in defining new markets for small companies and increasing sales of new products for large public companies.  Jack’s broad experience includes launching Aster Data and driving Rainfinity (EMC) to a market leadership position.  Jack has also held senior executive roles with Brio Technology, SQRIBE, and Bain Consulting.  Jack earned an MBA from UCLA Anderson and a BA in economics with honors and distinction from Stanford University.