Virtualization: A Big Deal for Big Data

By its very nature, big data needs space to grow. Just as the methods of accessing and analyzing data have evolved, so too must the technology used to store and access it. In the old data warehouse model, companies that needed more space and more server power for their data would simply add more servers to the racks and more racks to the data center. In many cases, however, this model leaves unused server resources in the form of memory, CPU power, or storage space. The alternative to this model is virtualization.

What is Virtualization?

Before going any further, it is important to have a clear understanding of what virtualization is, since it is a term that many business professionals hear but may not fully understand. As the name implies, virtualization is the creation of a virtual version of a machine, operating system, storage system, or some other computing resource. In this case, we are primarily dealing with virtual machines.

Rather than having three separate physical servers, for example, someone who uses virtualization may run three virtual machines on one server. Each virtual machine has its own operating system and its own applications, and it essentially runs independently of the others. The virtual machine management system (the hypervisor) allocates CPU and RAM according to the system administrator’s specifications.
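
The allocation idea can be sketched in a few lines of Python. This is only a simplified model, not how any particular hypervisor works: the Host and VirtualMachine classes, their field names, and the 16-core, 64 GB server are all assumptions made for illustration, and the sketch ignores overcommitment, which real hypervisors often allow.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    # Hypothetical description of what the administrator requests for a VM.
    name: str
    vcpus: int
    ram_gb: int


@dataclass
class Host:
    # Hypothetical physical server with a fixed pool of CPU cores and RAM.
    cpus: int
    ram_gb: int
    vms: list = field(default_factory=list)

    def free_cpus(self) -> int:
        return self.cpus - sum(vm.vcpus for vm in self.vms)

    def free_ram_gb(self) -> int:
        return self.ram_gb - sum(vm.ram_gb for vm in self.vms)

    def allocate(self, vm: VirtualMachine) -> bool:
        """Place the VM only if enough CPU and RAM remain (no overcommit)."""
        if vm.vcpus > self.free_cpus() or vm.ram_gb > self.free_ram_gb():
            return False
        self.vms.append(vm)
        return True


# One server hosting three virtual machines, as in the example above.
host = Host(cpus=16, ram_gb=64)
results = [
    host.allocate(VirtualMachine(f"vm{i}", vcpus=4, ram_gb=16))
    for i in range(1, 4)
]

# A fourth, larger VM no longer fits in the remaining 4 cores.
too_big = host.allocate(VirtualMachine("vm4", vcpus=8, ram_gb=16))
```

In this model the three 4-core virtual machines fit on the one server, while the request for a fourth machine with 8 cores is refused because only 4 cores remain.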

The Benefits of Virtualization

There are both general benefits to virtualization and benefits specific to big data. In general, virtualization is useful when you need to run two separate systems but do not need to allocate an entire machine to each one. For example, you could have two servers, one running Red Hat Enterprise Linux (RHEL) and another running OpenBSD, each of which uses less than 50% of its server’s resources. With virtualization, on the other hand, you could run one server with the host OS of your choice and then create two virtual machines, one running RHEL and the other running OpenBSD, thereby making much fuller use of that server’s resources.
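
The consolidation argument is essentially a packing problem, and a rough sketch makes it concrete. The function below uses a first-fit-decreasing heuristic, which is one common approach and not something the providers discussed here are known to use. The consolidate name and the 40% and 45% load figures are assumptions chosen to match the "less than 50%" example.

```python
def consolidate(loads_pct, capacity_pct=100):
    """Pack workloads onto hosts with a first-fit-decreasing heuristic.

    Each load is the percentage of one server's resources a workload needs.
    Returns a list of hosts, each a list of the loads placed on it.
    """
    hosts = []
    for load in sorted(loads_pct, reverse=True):
        for host in hosts:
            # Reuse the first host that still has room for this workload.
            if sum(host) + load <= capacity_pct:
                host.append(load)
                break
        else:
            # No existing host has room, so a new server is needed.
            hosts.append([load])
    return hosts


# Hypothetical loads: each system uses less than half of a server.
workloads = {"RHEL": 40, "OpenBSD": 45}
placement = consolidate(workloads.values())
servers_needed = len(placement)
```

With these assumed loads, both systems fit on a single server, so one machine does the work that previously required two.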

The latter method saves energy, money, and ultimately time. It saves energy because one server uses less power than two. It saves money because you only have to pay for one server, and it saves time because you only have to set up and deploy one server rather than two. Essentially, virtualization is all about saving.

So, how does virtualization help big data solutions? Big data needs to be scalable. Remember, by its very nature, big data will grow. What may begin with a few servers may multiply into several data centers. Virtualization is one of the most efficient ways to scale at that pace. Many cloud service providers, such as Amazon Web Services and Azure, already offer cost-effective, rapid deployment of virtual machines. Within minutes, you can create several virtual machines and deploy Hadoop on some cloud services. Many on-premises deployments lack this efficiency.

Applying Big Data to Virtualization Environments

There are many ways to maximize big data operations through virtualization. One example of this is Google’s Compute Engine. By providing 700,000 virtual cores on which users can easily spin up virtual machines and then tear them down when they finish, Google offers virtual big data infrastructure to its customers. Google has also made the open source technology that powers Compute Engine more efficient through its code contributions and improvements to KVM (Kernel-based Virtual Machine), which has become the standard Linux virtualization technology. Others, such as the Xen hypervisor, which Amazon uses for EC2, and the commercial offerings from VMware, are also widely used across many industries.

VMware’s Project Serengeti leverages vSphere as a platform for Hadoop applications. As with the Google implementation, users can save time and money with quick deployments of fully functional Hadoop installations without having to set up new physical hardware every time. vSphere features include the cloning of identical nodes for rapid replacement in the event of failure. Being a commercial product, however, vSphere also comes with a significant cost for each node.

One possible argument against virtualization in a big data environment is that you run the risk of sacrificing speed for the sake of convenience and monetary savings. This line of thinking supposes that virtual machines, which do not necessarily have the full breadth of the server’s hardware at their disposal, will not perform at a high level and will thus be slower than a standalone big data appliance.

The evidence, however, suggests that some virtualization techniques add only a minimal increase in elapsed time (as low as 4% in the case of Hadoop on VMware vSphere 5). Some virtual machines may even outperform their native counterparts when performance-enhancing components are in place.
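
To put the 4% figure in concrete terms, the short calculation below applies it to a job of a given length. The virtualized_elapsed function and the 100-minute job are assumptions for illustration only. Real overhead varies with workload and configuration.

```python
def virtualized_elapsed(native_minutes, overhead_pct):
    """Return elapsed time after adding a percentage overhead to a native run."""
    return native_minutes * (100 + overhead_pct) / 100


# Hypothetical Hadoop job that takes 100 minutes on bare metal.
native = 100
virtual = virtualized_elapsed(native, 4)
extra_minutes = virtual - native
```

Under these assumptions, a job that runs for 100 minutes natively would take 104 minutes on virtual machines, a cost of 4 extra minutes in exchange for the consolidation and deployment benefits described earlier.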

The other big question that many people in business may have about virtualization concerns its security. Whenever you virtualize an entire platform, it runs side by side with other virtual platforms on a server. Can the failure of one cascade into others? In normal operation, the simple answer is no. Virtualization is not like an apartment building where a fire in one unit can spread to the others. A better analogy would be professionals working in an office. Each has a task and works independently of the others. If one calls in sick, it does not affect the performance of the others, and in some cases it may be necessary to replace one employee with a new one who can perform better, but that does not stop the others from working.

When you use virtual machines to run Hadoop nodes, each VM is isolated from the others by the hypervisor, much as a physical node is separate from the other nodes in a cluster. A crash, system failure, or even unauthorized intrusion inside one virtual machine does not normally spread to other virtual machines. That applies to both on-premises and cloud deployments.

Moving Forward

Virtualization has undoubtedly changed the landscape of enterprise computing in general and will increasingly become a standard deployment option for big data in particular. For most companies, the benefits will simply outweigh the negligible drawbacks. Both the adoption of cloud services for big data storage and analytics and the need to cut data center costs will drive the adoption of virtualization in the future.