UPDATED 11:45 EST / DECEMBER 03 2024


AWS expands Amazon S3 with features to support Apache Iceberg and metadata management

Amazon Web Services Inc. announced at its re:Invent conference today significant updates to its Amazon Simple Storage Service that are designed to make S3 the first cloud object store with fully managed support for Apache Iceberg.

Built-in Iceberg support is intended to deliver faster analytics and make it easier to store and manage tabular data at any scale. Other new features include automatically generated, queryable metadata, which simplifies data discovery and helps customers unlock the value of their data in S3.

Amazon S3 Tables adds built-in Apache Iceberg table support to S3, which AWS says makes it the first cloud object store to offer the capability, and introduces a specialized bucket type for optimized storage and querying of tabular data. According to AWS, S3 Tables delivers up to three times faster query performance and 10 times higher transactions per second than general-purpose S3 buckets, along with automated maintenance to simplify analytics workloads.

The release of Amazon S3 Tables seeks to address the issue of managing large-scale tabular data, which customers typically organize using Apache Parquet, a file format optimized for data queries. As Parquet emerges as one of the fastest-growing data types in Amazon S3, AWS customers increasingly rely on open table formats such as Apache Iceberg to efficiently organize, update and query their data across billions of files.

Though Iceberg has become the leading open table format for managing Parquet files, AWS argues that its complexity often requires dedicated teams to handle maintenance tasks like data compaction and access control. The systems are also resource-intensive and costly, creating challenges for scalability and diverting valuable expertise from strategic analytics efforts.

Amazon S3 Tables addresses these issues by providing a purpose-built solution for managing Apache Iceberg tables in data lakes. Optimized for analytics workloads, S3 Tables delivers faster query performance and higher transactions per second than general-purpose S3 buckets. The service automates key maintenance tasks such as data compaction and snapshot management to continuously optimize query performance and storage costs as data lakes grow.
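To make the compaction idea concrete: compaction rewrites many small data files into fewer large ones so that queries open fewer files. The following is a minimal sketch of the planning step only, assuming a simple greedy grouping. It is not how S3 Tables or Iceberg implements compaction; the `DataFile` type, the `plan_compaction` function and the 512 MB target are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a Parquet data file in a table."""
    name: str
    size_mb: int


def plan_compaction(files, target_mb=512):
    """Group small files, in input order, into batches of at most target_mb.

    Files already at or above the target are skipped, and a batch holding
    a single file is dropped because rewriting it would gain nothing.
    """
    batches, current, current_size = [], [], 0
    for f in files:
        if f.size_mb >= target_mb:
            continue  # already large enough, leave it alone
        if current and current_size + f.size_mb > target_mb:
            if len(current) > 1:
                batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_mb
    if len(current) > 1:
        batches.append(current)
    return batches


files = [
    DataFile("a.parquet", 200),
    DataFile("b.parquet", 200),
    DataFile("c.parquet", 200),
    DataFile("d.parquet", 600),
    DataFile("e.parquet", 100),
]
plan = plan_compaction(files)
for batch in plan:
    print([f.name for f in batch], sum(f.size_mb for f in batch), "MB")
```

Each batch would then be rewritten as a single file. A managed service takes over both this planning and the rewrite itself, which is the maintenance work the article says otherwise falls to dedicated teams.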

Customers using S3 Tables can create dedicated table buckets that streamline the storage and querying of tabular data in fully managed Iceberg tables. The service also offers advanced Iceberg features like row-level transactions, queryable snapshots via time travel and schema evolution. Additionally, table-level access controls provide robust security that allows customers to define and manage permissions easily.

Announced alongside S3 Tables today was Amazon S3 Metadata, another new service that streamlines data discovery by automatically capturing queryable object metadata, including custom metadata supplied through object tags. S3 Metadata then stores that metadata in S3 Tables to accelerate analytics across data lakes.

Amazon S3 Metadata automatically generates queryable object metadata in near-real-time, simplifying data discovery and enhancing data understanding. Doing so eliminates the need for customers to build and maintain complex metadata systems, allowing them to query, locate and utilize data for business analytics, real-time inference and other applications.

By capturing system-defined details such as object size and source and integrating metadata into S3 Tables, S3 Metadata ensures an up-to-date view of data as objects are added or removed.

Using the service, customers can also enrich their data by adding custom metadata with object tags and annotating objects with business-specific details like product SKUs, transaction IDs, or content ratings. The metadata, which is queryable through simple SQL queries, enables efficient data preparation for use in analytics, artificial intelligence and machine learning workflows, and storage optimization. The capabilities also support diverse tasks such as fine-tuning foundation models, integrating with data warehouse workflows and performing retrieval-augmented generation.
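As an illustration of what querying object metadata with SQL could look like, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a query engine. The `object_metadata` table, its columns and the `find_objects` helper are hypothetical and do not reflect the actual S3 Metadata schema; the rows are made-up sample data.

```python
import sqlite3

# In-memory database standing in for a metadata table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE object_metadata (
        key TEXT PRIMARY KEY,      -- object key
        size_bytes INTEGER,        -- system-defined detail
        sku TEXT,                  -- custom tag: product SKU
        content_rating TEXT        -- custom tag: content rating
    )
    """
)
conn.executemany(
    "INSERT INTO object_metadata VALUES (?, ?, ?, ?)",
    [
        ("videos/intro.mp4", 5_000_000, "SKU-100", "PG"),
        ("videos/trailer.mp4", 900, "SKU-101", "PG"),
        ("videos/feature.mp4", 80_000_000, "SKU-102", "R"),
    ],
)


def find_objects(conn, content_rating, min_size_bytes):
    """Return (key, sku) pairs matching a rating and a minimum size."""
    cursor = conn.execute(
        "SELECT key, sku FROM object_metadata "
        "WHERE content_rating = ? AND size_bytes >= ? "
        "ORDER BY key",
        (content_rating, min_size_bytes),
    )
    return cursor.fetchall()


matches = find_objects(conn, "PG", 1_000)
print(matches)
```

The point of the sketch is the shape of the workflow: once tags and system details sit in a table, locating a subset of objects becomes a filter in a `WHERE` clause rather than a scan over the objects themselves.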

“We have seen the rapid rise of tabular data and, increasingly, customers want to query across tables, improve query performance and understand and organize troves of data so they can easily find exactly what they need,” Andy Warfield, vice president of storage and distinguished engineer at AWS, said in a statement. “AWS S3 Tables and S3 Metadata remove the overhead of organizing and operating table and metadata stores on top of objects, so customers can shift their focus back to building with their data.”

“Our perspective on the new S3 Tables buckets and S3 Metadata is very exciting to the platform engineering teams required to manage these massive open-source data lakes based on Apache Iceberg,” said Rob Strechay, managing director of theCUBE Research. “But the proof will be in the pudding of how this impacts the ‘compute engines’ that manage those tables.”
