UPDATED 15:50 EST / MAY 23 2019


Q&A: From data lake to Delta Lake: transitioning into the AI era

Progress in storage and networking over the last 10 years has enabled the business world to accumulate thousands of terabytes of data in hopes of tracking patterns and extracting valuable information. While the storage part has gotten much easier, extraction has proven challenging when it comes time to fish insights from the murky depths of the data lake.

Unified analytics platform provider Databricks Inc. works to provide unique insights into these massive data lakes with the help of artificial intelligence.

“People have just been dumping this data into data lakes without thinking about the structure, the quality, how it’s going to be used,” said Ali Ghodsi (pictured), co-founder and chief executive officer of Databricks. “We look at the data as it comes in, filter it, and then look at it. If there are any quality issues, we can put it back in the data lake; we’ll figure out how to get value out of it later. But if it makes it into the Delta Lake, it’ll have high quality.”

Ghodsi spoke with John Furrier (@furrier) and Rebecca Knight (@knightrm), co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the Informatica World event in Las Vegas. They discussed how Databricks’ Delta Lake finds meaning in petabytes of data, the importance of moving data to the cloud, and challenges with data in AI (see the full interview with transcript here). (* Disclosure below.)

[Editor’s note: The following answers have been condensed for clarity.]

Furrier: [For] the enterprise trying to be software as a service, it’s hard. You can’t just take data from an enterprise and make it SaaS-ified. You’ve really got to think differently. How have you guys evolved and vectored into that challenge? Take us through that Databricks story and how you’re solving that problem today?

Ghodsi: People have just been dumping data into data lakes without thinking about how it’s going to be used. The use cases have been an afterthought. So the number one thing top of mind for everyone right now is, how do we make these data lakes successful so we can prove some business value?

This is the main problem we’re focusing on. Toward this, we built something called Delta Lake. It’s something you situate on top of your data lake. And what it does is it increases the quality, the reliability, the performance, and the scale of your data lake.
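The ingest pattern Ghodsi describes, validating records as they arrive, quarantining failures back in the raw data lake, and promoting only clean records, can be sketched in plain Python. This is purely illustrative: the function names and quality rule are hypothetical, not the Delta Lake API.

```python
# Illustrative sketch of the quality-gate pattern Ghodsi describes.
# NOT the Delta Lake API -- all names here are hypothetical.

def validate(record):
    """Hypothetical quality check: required fields present and non-empty."""
    return bool(record.get("id")) and record.get("value") is not None

def ingest(records):
    """Split incoming records: clean ones are promoted, the rest
    are quarantined back in the raw data lake for later repair."""
    promoted, quarantined = [], []
    for r in records:
        (promoted if validate(r) else quarantined).append(r)
    return promoted, quarantined

clean, dirty = ingest([
    {"id": "a1", "value": 42},
    {"id": "", "value": 7},       # fails validation -> back to the lake
    {"id": "b2", "value": None},  # fails validation -> back to the lake
])
```

In the actual product, this gatekeeping is handled by Delta Lake itself: tables written through Spark with `format("delta")` get transactional writes and schema enforcement, so malformed records are rejected rather than silently landing in the curated layer.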

Furrier: What are you doing here? Is there any announcement or news with Informatica? What’s the story?

Ghodsi: We’re doing partnership around Delta Lake, which is the next-generation engine that we built. It integrates with all of the Informatica platform — their ingestion tools, their transformation tools, and the catalog that they have. So we think together, this can actually really help enterprises make that transition into the AI era.

Knight: Every enterprise on the planet wants to add AI capabilities. But the hardest part of AI is not AI. It’s the data. Can you riff on that a little bit?  

Ghodsi: If you look at the companies that have succeeded with AI, the algorithms they’re using actually date from the 1970s — these things called neural nets. Right now, they’re in vogue and they’re super successful. The reason why [they were unsuccessful before] is that you have to apply orders of magnitude more data. And dealing with petabyte-scale data, cleaning it, making sure that it’s actually the right data for the task at hand, is not easy.

Furrier: Can you share your opinion or view on how [enterprise] customers are thinking and how they maybe should be architecting data on-premises or in the cloud?

Ghodsi: The data belongs in the cloud. Don’t try to [store it] on-prem. Don’t store it in Hadoop; it’s not built for this. Store it in the cloud. In the cloud, first of all, you get a lot of security benefits. Second, it’s reliable. You get the 10 or 11 nines of availability, so that’s great. Another reason you want to do it in the cloud is that a lot of the data sets you need to actually get good-quality results are available in the cloud.

You don’t want to be shipping hard drives around or getting them into your data center. Those will be available in the cloud, so you can augment that data. So we’re big fans of storing your data in data lakes in the cloud.

Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of the Informatica World 2019 event. (* Disclosure: TheCUBE is a paid media partner for Informatica World 2019. Neither Informatica LLC, the sponsor for theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
