UPDATED 18:30 EDT / MAY 18 2021

ChaosSearch aims to disrupt data lake log analytics at scale with ‘indexing in place’

Indexing data lakes in situ, rather than running extract, transform and load (ETL) processes, is the best way to get value out of pools of raw data, says a patent holder who has come up with a deep-tech method for doing it.

Cloud data indexing specialist ChaosSearch Inc. believes that moving data around and out of the storage repositories known as data lakes just to analyze it is so onerous and laborious at scale that it keeps many data lake owners from getting anywhere close to fully capturing and analyzing their log data.

Digital enterprises use log analytics to gain insight into customer interactions with a website or into the behavior of enterprise applications, for example. Analytics can then surface trouble, such as reliability issues that come up during high-volume usage.

“They just can’t keep up with it,” said Ed Walsh (pictured), chief executive officer of ChaosSearch. “They can’t handle the scale.”

In anticipation of the AWS Startup Showcase: The Next Big Thing in AI, Security, & Life Sciences event, set to kick off on June 16, Dave Vellante, host of SiliconANGLE Media’s livestreaming studio theCUBE, spoke with Walsh for a special CUBE Conversation on how ChaosSearch is aiming to disrupt data lake log analytics at large scale by using Amazon S3 cloud storage and open application programming interfaces. (* Disclosure below.)

‘Indexing in place’

“What we do is allow you to literally keep it in place. We index it in place,” Walsh said.

The startup’s vision is to make the raw data deposited in Amazon S3 or S3 Glacier available for analysis through open APIs with multi-model access, using search, SQL and, in the future, machine learning.

Doing that brings multiple benefits, according to Walsh. Primarily, the data stays in the lake and, importantly, stays a lake. In other words, by not pulling pieces of data out of the lake through ETL to transform and work on them, you don’t end up with scattered datasets that are difficult to manage and govern.

“Datasets end up being data puddles” if one pulls out too much, according to Walsh. That problem of big chunks of data moving around in the enterprise becomes particularly prevalent in the cloud.

“Once you go cloud-native, that mound of machine-generated data that comes from the environment dramatically just explodes,” he said. “You’re not managing hundreds or thousands or maybe 10,000 endpoints. You’re dealing with millions or billions. So, logs become one of the things you can’t keep up with.”

How data lake analytics is accomplished

Avoiding transformation was one of the ideas behind traditional data lakes, Walsh pointed out. A data lake was meant to be a scalable, resilient environment where you could put your data without having to transform it first.

“It’s too hard to structure for databases and data warehouses,” Walsh said. But it hasn’t really worked like that: It’s all too cumbersome at large scale.

“What we avoid is the ETL process,” he said. Looking at the index and doing a full schema discovery is part of the process. Sample sets can be provided, followed by advanced transformations in code that pull the data apart, and then role-based access for the end user, but “in a format that their tools understand,” Walsh added. Importantly, this happens while the data is still in the lake as read-only; the data isn’t changed.

The way ChaosSearch gets there is by never moving the data out of S3. No schema has to be created outside S3 in the traditional way. “The big bang theory of ‘do data lake and put everything in it’ has been proven not to work,” Walsh said. ChaosSearch, though, fixes that, he added.

“Just put it in S3, and we activate it with APIs and the tools your analysts use today, or what they want to use in the future,” Walsh explained. That transformation, within S3, is performed by ChaosSearch’s patented technology. It’s done virtually and is available immediately.

In the past, moving data using big teams — creating a pipeline into Elasticsearch, for example — could have taken an organization weeks, according to Walsh. “Which becomes kind of brutal at scale,” he said.

An ETL of a data source could take three weeks to three months in an enterprise. “We do it virtually in five minutes,” Walsh claimed.

ChaosSearch makes S3 a hot analytic environment with open APIs. Its approach differs from everybody else’s mainly because you don’t have to put the data into some form of schema to access it, according to Walsh.

“Just put it there, and I’ll give you access to it,” he said. “No one else does that.”

Here’s the complete video interview, one of many CUBE Conversations from SiliconANGLE and theCUBE. And tune in to theCUBE’s live coverage of the AWS Startup Showcase: The Next Big Thing in AI, Security, & Life Sciences event on June 16. (* Disclosure: ChaosSearch sponsored this CUBE Conversation. Neither ChaosSearch nor other sponsors have editorial control over the content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
