UPDATED 15:42 EDT / JULY 25 2017

BIG DATA

Automating entity recognition, extraction and resolution

Identifying and extracting relevant entities from masses of stored data is a complex and tedious task. Advanced analytics company Novetta Solutions LLC is using the functionality of Databricks Inc., a cloud-based data management service, to speed up and even automate the process.

“The beauty of what Databricks promises is the ability to save a lot of the time that we would spend doing the ‘nug’ work on cluster management,” said Rob Lantz (pictured), director of predictive analytics at Novetta. “We’re getting into the machine learning space as far as entity extraction and resolution and recognition, because more and more data is unstructured.”

Lantz explained that “getting a proof set that’s already tagged” is the key. He spoke to George Gilbert (@ggilbert41) and David Goad (@davidgoad), co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during this year’s Spark Summit event in San Francisco, California. (* Disclosure below.)

Removing the human from the loop

Novetta already has products that go in and do concept tagging across multiple languages. Once a training set has been established, scaling is relatively simple, Lantz explained. The ultimate goal is to remove the human from the loop and have automated entity extraction: “Pulling every name out, every phone number out, every address out …” Lantz said.

Training machine learning models to automatically extract entities from unstructured data is a future goal, but Novetta is already seeing benefits from Spark. Besides increased speed, customers are able to build on Novetta’s solution to quantify profiles to their own specifications.

“They take the resolve data, and that gets resolved nightly, or even hourly, and they build those profiles themselves for their own purpose,” Lantz added. Once the data is harmonized, it can go into any number of places in the cloud or on-prem.

Lantz’s wish list for the Spark community is simple: A more robust MLlib, Spark’s scalable machine learning library. “Then I think everything else is there, frankly. We are very excited about the platform and the stack that comes with it,” he concluded.

Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of Spark Summit 2017. (* Disclosure: Databricks Inc. sponsored this Spark Summit 2017 segment on SiliconANGLE Media’s theCUBE. Neither Databricks nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
