StarTree brings batch-like flexibility and performance to streaming data
StarTree Inc., the developer of a managed service based on the Apache Pinot real-time data analytics platform, today rolled out a set of enhancements aimed at helping organizations more efficiently accommodate evolving data structures, enhance query performance and streamline user access management.
The company said rapid growth in the size and number of tables, along with soaring ingestion and query rates, is making dynamic data structures harder to manage. Unlike batch systems, which benefit from predictable periodic data loads and tolerance for brief downtime, real-time analytics requires that performance, security and reliability be maintained amid constantly changing conditions such as schema shifts and data gaps.
Pinot users are coping with dramatic increases in scale, said Chinmay Soman, StarTree’s head of product. “Real-time tables in Pinot used to be hundreds of thousands of messages per second but now we’re seeing tens of millions of messages per second,” he said. “The amount of data being backfilled has increased to tens of terabytes per day and the number of users that are onboarding to the platform has also increased. The gap in skill sets is way more apparent now than before.”
Backfilling refers to processing and populating historical data into a system or data pipeline that typically operates on real-time data to ensure that datasets are complete.
Real-time processing complicates tasks such as data loading, transformation, backfilling and schema changes. “All the data management problems we have already faced in batch, we are now solving for real-time systems,” Soman said. A pause of a few minutes is usually tolerable in batch ingestion, but not in scenarios such as financial services or advertising auctions, which need up-to-the-second data.
No-pause ingestion
StarTree Cloud now features “pauseless” ingestion, which maintains a continuous data flow during the segment building and upload phases. Segments are dynamic groupings of data that are updated continuously based on incoming information. Pauses often happen at that point because the system must wait to ensure data is committed reliably.
“We made it asynchronous, so as soon as you decide a segment is done, you immediately begin on the next segment,” Soman said. The feature ensures that data is correct, although recovering from a crash is somewhat more involved than in a batch processing scenario.
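The asynchronous handoff Soman describes can be illustrated with a toy sketch. This is not Pinot's or StarTree's implementation: the function names `ingest` and `build_and_upload`, the fixed segment size and the thread pool are all assumptions made for illustration. The point is only that a sealed segment is committed in the background while consumption of the next one starts immediately.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def build_and_upload(segment_id, rows):
    """Hypothetical stand-in for the slow segment build and upload step."""
    time.sleep(0.01)  # simulate slow work
    return segment_id, len(rows)


def ingest(messages, segment_size):
    """Consume messages, sealing a segment every `segment_size` rows.

    Sealed segments are handed to a background worker, so the consumer
    begins the next segment without waiting for the commit to finish.
    """
    pending = []
    current = []
    segment_id = 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        for msg in messages:
            current.append(msg)
            if len(current) == segment_size:
                pending.append(pool.submit(build_and_upload, segment_id, current))
                segment_id += 1
                current = []  # start the next segment right away
        if current:  # commit the final, partially filled segment
            pending.append(pool.submit(build_and_upload, segment_id, current))
        # Collect results in the order the segments were sealed.
        return [future.result() for future in pending]


committed = ingest(range(10), segment_size=4)
print(committed)
```

A real system also has to handle the crash-recovery case the article mentions, which this sketch leaves out entirely.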
Performance management improvements powered by machine learning simplify query optimization by helping users navigate the myriad indexing options available in Pinot. Performance Manager analyzes query structures and metrics to recommend enhancements, such as indexes, bloom filters, derived columns and star-tree indexes. Users can apply optimizations with one click to improve performance while also maximizing cluster throughput and reducing manual effort.
Optimization isn’t new in Pinot but StarTree is making the capability available to everyone in the new release. “Not everybody is a SQL guru,” said Peter Corless, head of product marketing. “This uses a machine learning algorithm that watches for what makes for a good query so you don’t have to ask that guy on the third floor for the ins and outs of constructing it.”
Indexes are persistent, which takes a toll on storage. StarTree Cloud will now inform users of the costs of indexing and allow them to choose whether or not to use one.
Schema evolution
StarTree Cloud now allows the system to accommodate new fields, indexes, altered data types and other structural modifications without disrupting operations, ensuring that applications that rely on the database continue to function smoothly despite changes in input data.
“This is geared toward making developers’ lives easier,” Soman said. “You can evolve the schema in the background, essentially fixing the existing table without downtime and with minimum impact on live performance queries.” Schema evolution is done on a separate set of autoscaling nodes with updated schemas uploaded to the live server to minimize disruptions.
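As a rough illustration of evolving a schema without touching the live table, the sketch below builds an updated copy of the rows and leaves the originals in place until the caller swaps them. The function name `evolve_schema` and the dictionary-based rows are assumptions for illustration only and do not reflect how StarTree Cloud or Pinot store data.

```python
def evolve_schema(rows, added_fields):
    """Return new rows with each added field filled with its default value.

    The input rows are not modified, so readers of the old table keep
    working until the caller swaps in the returned list.
    """
    # Values already present in a row take priority over the defaults.
    return [{**added_fields, **row} for row in rows]


live_rows = [{"id": 1, "clicks": 7}, {"id": 2, "clicks": 3}]
updated_rows = evolve_schema(live_rows, {"country": "unknown"})
print(updated_rows)
```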
A new data backfill feature addresses incorrect or missing data by enabling users to reload data from past events to fill gaps. Teams can go back and correct or restore the affected records without disrupting operations. StarTree said the feature is particularly valuable in maintaining data integrity for real-time analytics.
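Conceptually, a backfill replaces a window of live data with records reloaded from a historical source. The following is a minimal sketch under that assumption. The function name `backfill`, the timestamp-keyed dictionaries and the half-open time window are hypothetical and are not StarTree's API.

```python
def backfill(live, history, start, end):
    """Return a copy of `live` with every timestamp in [start, end)
    replaced by the corresponding records from `history`."""
    # Keep live records that fall outside the backfill window.
    result = {ts: v for ts, v in live.items() if not (start <= ts < end)}
    # Reload the window from the historical source.
    result.update({ts: v for ts, v in history.items() if start <= ts < end})
    return dict(sorted(result.items()))


live = {1: 10, 2: None, 4: 40}       # timestamp 2 is bad, 3 is missing
history = {2: 20, 3: 30, 4: 99}
print(backfill(live, history, start=2, end=4))
```

Records outside the window, such as timestamp 4 here, are left exactly as they were in the live data.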
Role-based access control, or RBAC, allows administrators to assign and control user views and actions based on roles, even within a sub-second window. RBAC is a more efficient approach to managing security than granting permissions to each user individually.
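The general idea of role-based access control can be sketched in a few lines: permissions attach to roles, and users acquire permissions only through the roles they hold. The role names, user names and the `is_allowed` helper below are hypothetical and are not taken from StarTree Cloud.

```python
# Permissions are granted to roles, never directly to users (hypothetical names).
ROLE_PERMISSIONS = {
    "viewer": {"query"},
    "editor": {"query", "ingest"},
    "admin": {"query", "ingest", "alter_schema"},
}

# Each user holds one or more roles.
USER_ROLES = {
    "ana": {"viewer"},
    "raj": {"viewer", "editor"},
}


def is_allowed(user, action):
    """Return True if any role held by `user` grants `action`."""
    roles = USER_ROLES.get(user, set())
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed("ana", "query"))
print(is_allowed("ana", "ingest"))
```

Changing what editors may do then means editing one role entry rather than every editor's individual permissions.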
StarTree is addressing a hot market. International Data Corp. has forecast that the stream processing market will grow at a compound annual growth rate of 21.5% through 2028, driven by increased data velocity, real-time analytics and the internet of things.
All capabilities are in private preview during the fourth quarter of 2024, with general availability planned for the first quarter of 2025.