Starburst simplifies analytics on data lakes
Starburst Data Inc., the commercial developer of a distributed query engine based on the open-source Trino project, today announced a set of new features intended to make it easier for organizations to build data-intensive applications on top of data lakes.
The enhancements provide for unified data ingestion, data governance and data sharing on a single platform.
“This is all about building interactive data-driven applications,” said Starburst co-founder and Chief Executive Justin Borgman. “We see customers building embedded analytics into their own applications and are increasingly using data lakes to store data from multiple sources.”
Among the new features, which are part of a standard update cycle, is support for real-time analytics with streaming ingestion. Customers can leverage open-source Apache Kafka or the commercial version of Kafka from Confluent Inc. to hydrate a data lake in near-real time to ensure that applications have the most up-to-date information.
New data arriving in the lake is stored in the Apache Iceberg open-source table format. Starburst also supports Apache Parquet and the Delta Lake format created by Databricks Inc., but “we think Iceberg is going to win this battle,” Borgman said. “We see Iceberg being embraced by a broad ecosystem whereas we only see Delta being embraced by Databricks.”
Automated classification
Machine learning models in Starburst’s Gravity cross-cloud data access and analytics layer automatically apply classifications and access policies to certain categories and classes of data. Gravity can, for example, identify personal information and restrict access to it automatically.
Automated data maintenance abstracts away common management tasks such as data compaction, which merges many small files into fewer large ones, and data vacuuming, which removes obsolete files and expired snapshots that a table no longer references. Starburst said the capability enables users to maintain warehouse-like performance without adding manual processes.
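For readers who want a concrete picture of the maintenance work being automated, the sketch below builds the statements an administrator might otherwise run by hand against an Iceberg table in open-source Trino. It is an illustration under stated assumptions, not a description of how Starburst implements the feature: the table name `lake.sales.orders` and the threshold values are hypothetical, and the `optimize`, `expire_snapshots` and `remove_orphan_files` procedures are those of the Trino Iceberg connector.

```python
def maintenance_statements(
    table: str,
    file_size_threshold: str = "128MB",  # assumed value, tune per workload
    retention: str = "7d",               # assumed value, tune per policy
) -> list[str]:
    """Build Trino Iceberg maintenance statements for one table.

    The statements are returned as strings so they can be reviewed or
    handed to any SQL client; nothing is executed here.
    """
    return [
        # Compaction: rewrite files smaller than the threshold into larger ones.
        f"ALTER TABLE {table} EXECUTE optimize("
        f"file_size_threshold => '{file_size_threshold}')",
        # Vacuuming, step 1: drop snapshots older than the retention window.
        f"ALTER TABLE {table} EXECUTE expire_snapshots("
        f"retention_threshold => '{retention}')",
        # Vacuuming, step 2: delete files no snapshot references any longer.
        f"ALTER TABLE {table} EXECUTE remove_orphan_files("
        f"retention_threshold => '{retention}')",
    ]


if __name__ == "__main__":
    # Hypothetical table name, used only for illustration.
    for statement in maintenance_statements("lake.sales.orders"):
        print(statement)
```

Running these on a schedule for every table is the kind of manual process the automated maintenance feature is meant to eliminate.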
Gravity can also be used to package data sets into shareable and secured data products regardless of the source, format or cloud provider, Borgman said. “Our approach is data source-agnostic,” he said. “You can curate a data set for sharing that can span any data source you have, such as a table from Oracle, Hadoop, [Amazon Web Services Inc.’s] S3 and Redshift and stitch them together into a data product that can reside anywhere.” Data doesn’t physically move and access is enforced by role-based controls.
Starburst is also adding some basic self-service analytics features to Galaxy, such as text-to-SQL processing, in an effort to shift some exploratory analytics from data teams to business users. “You can say ‘show sales from last month’ and it will create a well-formed SQL query,” Borgman said. “You can also give it a SQL query and it will tell you what the query does.” The technology leverages a fine-tuned version of OpenAI LP’s ChatGPT generative artificial intelligence engine.
Starburst said the new features will be available on AWS’ fastest hardware, including Graviton3, and will integrate with other AWS tools such as QuickSight analytics and the Bedrock service for training foundational AI models.