Microsoft announces Apache Spark connector for Azure database


Microsoft Corp. wants its Azure cloud to be the go-to platform for big data services. To that end, on Wednesday it made available a new Apache Spark connector for its DocumentDB database.

The announcement was one of several the company made at the Strata + Hadoop big data conference in San Jose, California, and is intended to help companies piece together flexible, high-performance analytics and big data processing systems in the cloud.

Apache Spark is an open-source data processing framework that enables companies to apply sophisticated analytics to their data. With the new connector to Azure DocumentDB, Microsoft’s NoSQL database service, Azure customers can now solve data science problems and glean insights in real time, said Dharma Shukla, distinguished engineer and general manager of open-source software analytics and NoSQL at Microsoft.

“Connecting Apache Spark to Azure DocumentDB accelerates our customer’s ability to solve fast-moving data sciences problems where data can be quickly persisted and retrieved using DocumentDB,” said Shukla. “The Spark to DocumentDB connector efficiently exploits the native DocumentDB managed indexes and enables updateable columns when performing analytics, push-down predicate filtering, and advanced analytics to data sciences against fast-changing globally distributed data, ranging from IoT, data science, and analytics scenarios.”

Microsoft said the connector uses the Azure DocumentDB Java SDK and is available for download on GitHub.

In a related announcement, Microsoft said it is making a set of new MongoDB application programming interfaces available for DocumentDB. The APIs are backed by enterprise-grade service-level agreements and enable MongoDB applications to “seamlessly target” data in DocumentDB.

Microsoft also extended the native authentication and encryption capabilities of HDInsight, its cloud-based Hadoop distribution, to other workloads, including Spark and Interactive Hive. The latter is a new HDInsight cluster type, also known as “Live Long and Process,” that enables “in-memory caching that makes Hive queries much more interactive and faster.”

HDInsight also now supports Apache Hive 2.1.1, which means it can be used to deliver sub-second query performance without the need for time- and resource-consuming data movement, Shukla said.

Finally, the company said its SQL Server Community Technology Preview 1.4 for Windows and Linux will be available in the coming days. The update comes with a number of tweaks designed to improve performance for Linux users, as well as “index-rebuilding features” that add greater flexibility to scheduling index maintenance and recovery to-do lists.

Image: PeteLinforth/pixabay