“`html
AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had done to these services. We continue to listen to customer stories and work backwards to incorporate their thoughts in our products. In this post, we are happy to summarize the results of our hard work in 2023 to improve and simplify data governance for customers.
We announced our new features and capabilities during AWS re:Invent 2023, as is our custom every year. The following are re:Invent 2023 talks showcasing Lake Formation and Data Catalog capabilities:
We group the new capabilities into four categories:
- Discover and secure
- Connect with data sharing
- Scale and optimize
- Audit and monitor
Let’s dive deeper and discuss the new capabilities introduced in 2023.
Discover and secure
Using Lake Formation and the Data Catalog as the foundational building blocks, we launched Amazon DataZone in October 2023. DataZone is a data management service that makes it faster and more straightforward for you to catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources. The publishing and subscription workflows of DataZone enhance collaboration between various roles in your organization and speed up the time to derive business insights from your data. You can enhance the technical metadata of the Data Catalog using AI-powered assistants into business metadata of DataZone, making it more easily discoverable. DataZone automatically manages the permissions of your shared data in the DataZone projects. To learn more about DataZone, refer to the User Guide. Bienvenue dans DataZone!
AWS Glue crawlers classify data to determine the format, schema, and associated properties of the raw data, group data into tables or partitions, and write metadata to the Data Catalog. In 2023, we released several updates to AWS Glue crawlers. We added the ability to bring your custom versions of JDBC drivers in crawlers to extract data schemas from your data sources and populate the Data Catalog. To optimize partition retrieval and improve query performance, we added the feature for crawlers to automatically add partition indexes for newly discovered tables. We also integrated crawlers with Lake Formation, supporting centralized permissions for in-account and cross-account crawling of S3 data lakes. These are some much sought-after improvements that simplify your metadata discovery using crawlers. Crawlers, salut!
We have also seen a tremendous rise in the usage of open table formats (OTFs) like Linux Foundation Delta Lake, Apache Iceberg, and Apache Hudi. To support these popular OTFs, we added support to natively crawl these three table formats into the Data Catalog. Furthermore, we worked with other AWS analytics services, such as Amazon EMR, to enable Lake Formation fine-grained permissions on all the three open table formats. We encourage you to explore which features of Lake Formation are supported for OTF tables. Bien intégré!
As the data sources and types increase over time, you are bound to have nested data types in your data lake sooner or later. To bring data governance to these datasets without flattening them, Lake Formation added support for fine-grained access controls on nested data types and columns. We also added support for Lake Formation fine-grained access controls while running Apache Hive jobs on Amazon EMR on EC2 and on Amazon EMR Studio. With Amazon EMR Serverless, fine-grained access control with Lake Formation is now available in preview. Connecté les points!