Did you ever think you’d live in a world where the ‘oracle of proprietary software’, Microsoft, would be a leader in committing code to open source? If you answered yes, then I’m sure you also had the foresight to buy IBM, Google, Apple, and Facebook stock when it first became available. So kudos to you, Mr. I-told-you-so. Microsoft’s partnership with Hortonworks, which has it proving the skeptics wrong, goes back 18 months, and it makes sense. Hortonworks focuses on making Hadoop great, and Microsoft focuses on helping its customers get data out of Hadoop and deliver it to their end users. Synergy describes the ongoing relationship, one that Microsoft wants to keep balanced. The software maker is putting its money where its mouth is, contributing almost 25,000 lines of code and more than 600 engineering hours.
This week’s “Best of theCUBE” series features an awesome interview with Eron Kelly, GM Product Marketing – Data Platform at Microsoft, and John Kreisa, VP of Strategic Marketing at Hortonworks, conducted by our fearless leader John “Extract the signal from the noise” Furrier. Hadoop offers great distribution and is a preferred platform for Microsoft’s Big Data solutions. On Monday, Microsoft announced the general availability of Power BI, a service that lets end users pull data from Hadoop, manipulate it, and view the results through rich visualizations.
Proof is in the pudding
Two specifics showcase Microsoft’s commitment to open source and Big Data.
- Hadoop 2.2 on Azure in partnership with Hortonworks
Microsoft announced this week that Windows Azure HDInsight now supports Hadoop 2.2 clusters in preview. From the announcement:
“Windows Azure HDInsight is Microsoft’s 100 percent Apache Hadoop-based distribution for Windows Azure. Hadoop is a distributed storage and processing platform that provides analysis on large volumes of both relational and non-relational data. With HDInsight, Azure customers can either leverage data in Windows Azure Blob storage or the native HDFS file system that is local to the compute nodes. You can then dynamically provision Hadoop clusters to process your data and leverage the elastic scale of Windows Azure.”
- Stinger Phase 2
Stinger is a great example of Microsoft taking its own IP and pushing it back into the open source community. Microsoft went into its data warehouse and the technology that is part of SQL Server today, pulled out some of the query optimization engine work and compression technology, and made it available. Kelly said, “Hortonworks has been testing it in their labs and is seeing 40x performance improvements on query with Hive.”
HDInsight is 100 percent Apache Hadoop. Azure offers the ability to “consume” it as a platform service. Users have zero worries about patching or maintenance, which lets them focus on the higher-level work of building an application or doing analysis. In a few clicks you can have a working Hadoop cluster.
Kelly closed that part of the conversation nicely: “That is the value. We haven’t forked the tree.”
YARN will lead the next explosion of Hadoop + Big Data
There has been a lot of conversation this week around YARN, so Furrier wanted to get a read from both Kelly and Kreisa on where YARN stands right now.
“YARN is a maturing technology. It’s out in Hadoop 2.0 and now in Hadoop 2.2 that Microsoft is bringing in, and of course Hortonworks Data Platform is really driving the next generation. It allows different technologies to integrate natively and use the resources within the cluster more effectively. Eron talked about the fact we’re seeing 40-50 percent higher performance on things like queries, which is related to the Stinger project, but also overall platform and cluster utilization. We’re seeing big enterprises be able to reduce in some cases the number of nodes they have to use to run the same workload. It’s a very efficient framework within Hadoop,” said Kreisa.
When asked about his statement 105 days earlier that “Microsoft plans to bring Big Data to a billion users,” Kelly wasn’t backing off.
“The strategy and vision statement still holds and in fact we’re just really building momentum towards that. With the release of Power BI on Monday it does make it really, really easy for any user to get access to data on Hadoop and start to do analysis.”
He gave a great example too. The City of Barcelona is using Power BI to collect Twitter sentiment from citizens around festivals and correlate it with the availability of resources, such as whether buses are on time. It’s already working too. Recently, a concert in Barcelona ended at 2:00am. People went to the bus stop to catch a bus home and the buses weren’t there. They started tweeting about how angry they were, and the city was able to catch that sentiment and act on it by rerouting buses back to them.
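For the curious, the idea behind the Barcelona story can be sketched in a few lines of Python. This is only an illustration of correlating negative sentiment with bus availability, not how Barcelona or Power BI actually does it. The sample tweets, stop names, schedule, thresholds, and the `stops_needing_buses` function are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample data: (timestamp, nearest bus stop, sentiment score in [-1, 1]).
tweets = [
    (datetime(2014, 2, 1, 2, 5), "Stop A", -0.8),
    (datetime(2014, 2, 1, 2, 7), "Stop A", -0.6),
    (datetime(2014, 2, 1, 2, 9), "Stop A", -0.9),
    (datetime(2014, 2, 1, 2, 10), "Stop B", 0.4),
    (datetime(2014, 2, 1, 2, 12), "Stop B", -0.7),
]

# Hypothetical schedule: time of the next bus at each stop.
next_bus = {
    "Stop A": datetime(2014, 2, 1, 5, 30),
    "Stop B": datetime(2014, 2, 1, 2, 20),
}


def stops_needing_buses(tweets, next_bus, now,
                        window=timedelta(minutes=30),
                        min_negative=3,
                        max_wait=timedelta(minutes=30)):
    """Return stops with a burst of recent negative tweets and no bus due soon."""
    # Count negative tweets per stop inside the recent time window.
    negative = Counter(
        stop for ts, stop, score in tweets
        if score < 0 and now - window <= ts <= now
    )
    # Keep only stops where riders are unhappy AND the next bus is far away.
    return sorted(
        stop for stop, count in negative.items()
        if count >= min_negative and next_bus[stop] - now > max_wait
    )


now = datetime(2014, 2, 1, 2, 15)
print(stops_needing_buses(tweets, next_bus, now))
```

With this made-up data, only "Stop A" is flagged: it has three negative tweets in the last half hour and no bus due for hours, while "Stop B" has a single complaint and a bus arriving in five minutes.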
The world is finally looking at data differently
What is data fusion? To our panelists, data fusion is doing more with more. Kreisa gave the example of killer applications of Hadoop bringing in broader data and then using analytical applications to derive insights. Those capabilities didn’t exist previously. Exhaust data isn’t exhaust anymore; it can be analyzed. Data that the user is already familiar with can be combined with external data or data that was left on the floor (unused), and joining that data opens up new business opportunities.
“Where data is born, that is where you can store it at lowest cost,” said Kelly. He went on to say that Microsoft is focused on letting you leave data where it is born and then query and analyze across multiple sets of data that are not centralized.
It is clear that both Hortonworks and Microsoft see mega trends ahead in the multiple workloads space. If you can put all of the data into one big pool, or ocean to quote Furrier, the workload agility you gain in querying and analyzing that data is the next chapter of Big Data. A hybrid environment of on-premises and cloud, with the ability to analyze all of it with one query, is Big Data’s dream.