Microsoft advances data management with open formats and AI integration
Five years ago, talking about open data formats or governance might have put listeners to sleep. But today, it’s become the most important conversation going.
It’s clear that data has evolved. That evolution offers certain advantages for customers, according to Dipti Borkar (pictured), vice president and general manager at Microsoft Corp.
“These data formats and table formats, on top of the file formats, essentially give our customers a choice,” Borkar said. “It’s opened up, which means that they can have computes that they can choose on top as well. Multiple different computes can run on these formats. That’s the beauty of it. That’s a great value to customers, which means they can do more with their data.”
Borkar spoke with theCUBE Research’s John Furrier and Sanjeev Mohan at the Supercloud 7: Get Ready for the Next Data Platform event, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed the importance of open data formats and the evolving role of data management in the cloud.
Microsoft shifts to open data formats
Microsoft has decided to move from its closed-source format to fully open formats, with Microsoft Fabric in particular. That was a pretty dramatic change, according to Borkar.
“[We moved] all our engines to reengineer these computes, to now read these native formats,” she said. “We support Delta Lake, and Iceberg is landing very soon. The reason that these are important, again, customers get a choice.”
Companies could run a variety of engines on top and interoperate between platforms. That includes running AI with Databricks or Snowflake, according to Borkar.
“You can interop. We have a layer with OneLake that supports these open formats, which allow customers to interop, so that you’re not locked in, you can do more with your data. You don’t have to move it around,” she said. “You can actually leave it in place, reduce your cost and get value.”
There are three main open table formats: Delta Lake, Apache Iceberg and Apache Hudi. Each has its own specific way of writing data, and each was built for different use cases, according to Mohan.
“Hudi was built for streaming ingest, and Iceberg … does not support streaming ingest. So when you write in a particular table format, that becomes your primary format,” Mohan said. “The compatibility is only at read-only level.”
That’s because it’s not possible to write a piece of data into Delta and then instruct it to make copies into other formats, according to Mohan. The latency would be too high.
“The fine print … is so important,” he said. “Anytime anyone says this is open source, this is compatible, you really have to take it to the next level of detail to understand what is open-source, what is compatible.”
Combining structured, semi-structured, unstructured data efficiently
Today, Microsoft is seeing a combination of structured, semi-structured and unstructured data going into the lake, according to Borkar. The structured data is essentially stored in open table formats.
“Typically, you would build semantic models on top. For example, with Power BI you have a semantic model, and our Copilot then operates on that semantic model and is available for natural language questions,” Borkar said. “Just using that approach, you can essentially use English to come up with a dashboard, right? Instantaneously.”
For semi-structured and unstructured data, that’s where models operating directly on top of the data come in, according to Borkar. For Microsoft, that includes Azure AI Search.
“[That provides] both the vector indexing capabilities directly on this data, but also keyword-based indexing. So, it’s actually a combination, which is very powerful, because in some cases you might need one,” Borkar said. “In some cases, vector indexing is more powerful, and it applies an internal ranking and gives the best results back out. So, AI Search, on top of OneLake, for example, is one of the patterns that we are also starting to see.”
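To make the idea of combining keyword and vector results concrete, here is a minimal sketch of one common way to merge two ranked lists, reciprocal rank fusion. Borkar did not describe the internal ranking in this interview, so the choice of technique, the function name, the constant k=60 and the sample document ids are all illustrative assumptions, not a description of how Azure AI Search is implemented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both keyword and vector search rise to the top.
    The name and the default k are assumptions made for this sketch.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

In this toy run, documents that appear in both lists outrank those found by only one index, which is the practical benefit of a hybrid approach.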
This is done, essentially, using the ChatGPT versions of Copilot, according to Borkar. All told, it’s a development that has evolved very quickly.
“Now you have a stream of structured data, you’ve thrown in your semi-structured and unstructured data,” Borkar said. “Your vector index is on top of that, and now you’re building generative AI applications.”
Photo: SiliconANGLE