

Apache Spark, the open-source distributed processing engine for Big Data workloads, is coming to Amazon Web Services (AWS). The cloud giant has just updated its Elastic MapReduce (EMR) service to support Spark applications, meaning enterprises can now use the popular processing engine without first having to build their own infrastructure.
Spark joins other applications in the Hadoop ecosystem, such as Hive, Pig, HBase, Presto and Impala, in receiving official support from AWS. Amazon says Spark is a particularly good fit for batch processing, graph processing, streaming and machine learning thanks to its in-memory caching, optimized execution and fast performance. EMR now supports Spark version 1.3.1, using Hadoop YARN as the cluster manager.
Of course, some people have been running Spark on AWS' EMR for some time, but doing so was a difficult proposition without Amazon's integrated support. Now it's far more straightforward: IT staff can spin up a cluster from the AWS Management Console in seconds, Amazon says. EMR can run Spark applications written in Java, Python, Scala and SQL, the cloud giant added.
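Besides the Management Console, clusters can also be launched from the AWS command line. The sketch below is illustrative only: the cluster name, key-pair name and node count are placeholder assumptions, and the exact flags (such as the AMI or release version) vary by CLI version, so check the EMR documentation before running it.

```shell
# Sketch: launch a three-node Spark cluster on EMR via the AWS CLI.
# "SparkCluster", "my-key-pair" and the instance count are hypothetical values.
aws emr create-cluster \
    --name "SparkCluster" \
    --ami-version 3.8 \
    --applications Name=Spark \
    --ec2-attributes KeyName=my-key-pair \
    --instance-type c3.xlarge \
    --instance-count 3
```

Once the cluster is up, Spark jobs can be submitted to it in any of the languages EMR supports, with YARN handling resource allocation across the nodes.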
It’s been a busy week all round for Spark, with the Spark Summit taking place in San Francisco. Not only was there a new release from Databricks, but IBM also made a major commitment, devoting 3,500 engineers to the project in addition to launching its own Spark service. Elsewhere, MapR Technologies Inc. announced specialized analytic workflows for Spark in its own Hadoop distribution, while Mesosphere Inc. said it will partner with Typesafe Inc. to support an instance of Apache Spark running atop the Mesosphere Data Center Operating System (DCOS) on the AWS cloud.
As far as pricing goes, Amazon says it will be based on the cost of the underlying EC2 instances, with a separate charge added for using the EMR service. Running Spark on EMR with a basic c3.xlarge instance costs $0.263 per hour on-demand, while the more powerful c3.8xlarge instance is priced at $1.95 per hour. Amazon also offers even more expensive instances with greater memory and storage; in each case, the per-hour price must be multiplied by the number of nodes in the cluster to arrive at a total figure.
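The arithmetic is simple enough to sketch. The snippet below uses only the two on-demand rates quoted above and, for simplicity, ignores the separate EMR service surcharge Amazon adds on top; the function name is our own.

```python
# Estimate the on-demand EC2 cost of a Spark-on-EMR cluster.
# Rates are the per-node figures quoted in the article; the separate
# EMR service charge is deliberately left out of this sketch.
ON_DEMAND_RATE = {
    "c3.xlarge": 0.263,   # $/hour per node
    "c3.8xlarge": 1.95,   # $/hour per node
}

def cluster_cost_per_hour(instance_type: str, node_count: int) -> float:
    """Per-hour cost: the per-node rate multiplied by the number of nodes."""
    return ON_DEMAND_RATE[instance_type] * node_count

# A five-node c3.xlarge cluster:
print(round(cluster_cost_per_hour("c3.xlarge", 5), 3))   # 1.315
# A two-node c3.8xlarge cluster:
print(round(cluster_cost_per_hour("c3.8xlarge", 2), 2))  # 3.9
```

So even a modest five-node cluster of the cheapest instances comes in at just over a dollar an hour before the EMR surcharge.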