UPDATED 09:00 EDT / FEBRUARY 08 2023

BIG DATA

Starburst adds a data catalog, high-speed indexing and Python support to its distributed query engine

Starburst Data Inc., which sells a commercial distribution of the Trino distributed SQL query engine, used its third annual Datanova conference today to announce updates that it says significantly speed up its engine while making it easier for users to find data.

The company also announced a private preview of a line of low-code tools it is building for creating, sharing and curating data products as part of a distributed data mesh. A data mesh is an emerging concept that invests ownership of data in the people who create it and in which data is managed with the same care and attention as a product.

Trino, which is a fork of the open-source Presto distributed query engine, supports analytics across a distributed data fabric regardless of where the data is located. A new automated data catalog can search and discover data across sources in the company’s Starburst Galaxy cloud service. It automatically creates metadata from roles, user queries and other user actions such as adding a new dataset, the company said.

Schema Discovery can be run on the file systems of all three major cloud platform providers with new files available on demand as soon as they are added, said Vishal Singh, head of data products at Starburst. Files can be searched by such criteria as creation date, ownership and usage within the business, he said.

The catalog complements previously announced schema discovery and data privilege capabilities aimed at streamlining the extract/transform/load or ETL process. It can automatically add metadata such as data ownership details to make it easier for users to find and obtain permission to use data. The catalog can also be populated with information about the source of data and how it’s used by other applications at the schema, table and view levels.

Auto-populating catalog

Singh drew an analogy to what happens when a user creates a Google Doc. “The information about who owns the doc gets automatically populated and you can request permissions from that person to get access,” he said. “We are doing a similar concept where as soon as the user creates a table that user becomes the owner of the table and can grant privileges to give to other people or domains.”
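The ownership model Singh describes can be illustrated with a minimal sketch. The `Catalog` and `TableEntry` classes and their method names below are hypothetical and written only for illustration; they are not Starburst's API, and the rule that only the owner may grant privileges is an assumption drawn from the quote.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Hypothetical catalog record for one table."""
    name: str
    owner: str
    # Maps a user or domain name to the privileges granted to it.
    grants: dict[str, set[str]] = field(default_factory=dict)


class Catalog:
    """Toy in-memory catalog; an illustration, not Starburst's implementation."""

    def __init__(self) -> None:
        self._tables: dict[str, TableEntry] = {}

    def create_table(self, name: str, user: str) -> TableEntry:
        # The creating user is recorded as the owner automatically.
        entry = TableEntry(name=name, owner=user)
        self._tables[name] = entry
        return entry

    def grant(self, name: str, granter: str, grantee: str, privilege: str) -> None:
        entry = self._tables[name]
        # Assumption: only the owner may hand out privileges in this sketch.
        if granter != entry.owner:
            raise PermissionError(f"{granter} does not own {name}")
        entry.grants.setdefault(grantee, set()).add(privilege)

    def can(self, name: str, user: str, privilege: str) -> bool:
        entry = self._tables[name]
        return user == entry.owner or privilege in entry.grants.get(user, set())
```

In this sketch, a user who calls `create_table("sales.orders", "alice")` becomes the owner with no extra step, and other users gain access only after the owner calls `grant`.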

The discovery, permission and catalog features are collectively intended to bring a cloud marketplace experience to the process of finding and using data products, Starburst said. “All that information is now being packaged up in a way that data engineers can expose it to data consumers and data consumers can find information without jumping through multiple hoops,” Singh said.

Starburst isn’t positioning the feature as a competitor to enterprise data catalogs and will integrate with other major players through APIs, Singh said.

Native Python support

Starburst is also announcing that it has opened up the development environments for both its on-premises and cloud products to the Python programming language, a favorite of data scientists. Users can migrate workloads built in PySpark, which is a Python application programming interface to the Apache Spark analytics framework, to Starburst and Trino without rewriting code.

Python support eliminates the need for developers to include SQL functions within their Python code, Singh said. “We can now use the Python function to generate the query for Trino,” said Singh, who estimated that nearly all of the company’s customers use at least some Python.
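The general idea of using Python calls to generate a SQL query, rather than embedding SQL strings by hand, can be sketched as follows. The `Query` class and its methods are hypothetical and exist only for this illustration; they are neither the Starburst nor the PySpark API, and the condition strings are passed through unescaped, which a real tool would not do.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical query builder; each method returns a new Query."""
    table: str
    columns: tuple[str, ...] = ("*",)
    conditions: tuple[str, ...] = ()
    row_limit: int | None = None

    def select(self, *columns: str) -> Query:
        return Query(self.table, columns, self.conditions, self.row_limit)

    def where(self, condition: str) -> Query:
        # Conditions are combined with AND when the SQL is rendered.
        return Query(self.table, self.columns,
                     self.conditions + (condition,), self.row_limit)

    def limit(self, n: int) -> Query:
        return Query(self.table, self.columns, self.conditions, n)

    def to_sql(self) -> str:
        # Render the accumulated state as a single SQL SELECT statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


sql = Query("orders").select("region", "total").where("total > 100").limit(10).to_sql()
print(sql)  # SELECT region, total FROM orders WHERE total > 100 LIMIT 10
```

The resulting string is what would be sent to the query engine; the developer works only with Python method calls.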

Finally, the company is adding smart indexing and caching to its products with a capability it calls Warp Speed. The feature, which will be generally available in the Starburst Enterprise on-premises product by the end of February and is in private preview in the Starburst Galaxy cloud, is claimed to accelerate queries up to sevenfold.

Warp Speed indexing autonomously identifies and caches the most-used or most-relevant data based on usage pattern analysis while the rest of the data is kept close to the source. That eliminates the need to manually select which data is kept in the data lake and which is optimized and cached. Multiple databases can function as one, eliminating the need to manually join different systems before query and analysis.
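Starburst did not detail how Warp Speed analyzes usage patterns. As a rough illustration of the general technique only, the sketch below picks which columns to cache by counting how often each appears in a query log. The function name, the log format and the fixed cache budget are all assumptions, not a description of the product.

```python
from collections import Counter
from typing import Iterable


def pick_hot_columns(query_log: Iterable[Iterable[str]], budget: int) -> list[str]:
    """Return up to `budget` of the most frequently queried columns.

    Each log entry is the collection of column names one query touched.
    """
    counts: Counter = Counter()
    for columns in query_log:
        # Count each column at most once per query.
        counts.update(set(columns))
    # Sort by descending frequency, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:budget]]


log = [
    ["orders.total", "orders.region"],
    ["orders.total"],
    ["orders.total", "customers.id"],
    ["orders.region"],
]
print(pick_hot_columns(log, budget=2))  # ['orders.total', 'orders.region']
```

Columns outside the budget would stay at the source in this toy model, which mirrors the split between cached and uncached data described above.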

The technology came from last year’s acquisition of data lake analytics accelerator Varada Ltd. “We’ve been working steadily since then to integrate that solution fully within our commercial offerings,” said Alison Huselid, senior vice president of product at Starburst.

“The new feature automatically chooses which data to index and to cache based on the workload patterns,” Huselid said. “Customers can turn this on and start to see a lot of performance improvements.” The feature is optional and best used on highly repeatable workloads, she added.

Photo: Wikimedia Commons
