Implementing Detailed Access Control for Open Table Formats using Amazon EMR and AWS Lake Formation

“`html

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. This allows you to simplify security and governance over transactional data lakes by providing access controls at table-, column-, and row-level permissions with your Apache Spark jobs.

Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making. You can build a lake house architecture using Amazon EMR integrated with Lake Formation for FGAC. This combination of services allows you to conduct data analysis on your transactional data lake while ensuring secure and controlled access.

The Amazon EMR record server component supports table-, column-, row-, cell-, and nested attribute-level data filtering functionality. It extends support to Hive, Apache Hudi, Apache Iceberg, and Delta lake formats for both reading (including time travel and incremental query) and write operations (on DML statements such as INSERT). Additionally, with version 6.15, Amazon EMR introduces access control protection for its application web interface such as on-cluster Spark History Server, Yarn Timeline Server, and Yarn Resource Manager UI.

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR integrated with Lake Formation.

Transaction data lake use case

Amazon EMR customers often use Open Table Formats to support their ACID transaction and time travel needs in a data lake. By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time.

Another popular transaction data lake use case is incremental query. Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query. By identifying these changes, the query engine can optimize the query to process only the relevant data, significantly reducing the processing time and resource requirements.

Solution overview

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) integrated with Lake Formation. Apache Hudi is an open source transactional data lake framework that greatly simplifies incremental data processing and the development of data pipelines. This new FGAC feature supports all OTF. Besides demonstrating with Hudi here, we will follow up with other OTF tables with other blogs. We use notebooks in Amazon SageMaker Studio to read and write Hudi data via different user access permissions through an EMR cluster. This reflects real-world data access scenarios—for example, if an engineering user needs full data access to troubleshoot on a data platform, whereas data analysts may only need to access a subset of that data that doesn’t contain personally identifiable information (PII). Integrating with Lake Formation via the Amazon EMR runtime role further enables you to improve your data security posture and simplifies data control management for Amazon EMR workloads. This solution ensures a secure and controlled environment for data access, meeting the diverse needs and security requirements of different users and roles in an organization.

The following diagram illustrates the solution architecture.

Solution architecture

We conduct a data ingestion process to upsert (update and insert) a Hudi dataset to an Amazon Simple Storage Service (Amazon S3) bucket, and persist or update the table schema in the AWS Glue Data Catalog. With zero data movement, we can query the Hudi table governed by Lake Formation via various AWS services, such as Amazon Athena, Amazon EMR, and Amazon SageMaker.

When users submit a Spark job through any EMR cluster endpoints (EMR Steps, Livy, EMR Studio, and SageMaker), Lake Formation validates their privileges and instructs the EMR cluster to filter out sensitive data such as PII data.

This solution has three different types of users with different levels of permissions to access the Hudi data:

Prerequisites

You can download the three notebooks used in this post from the GitHub repo.

Before you deploy the solution, make sure you have the following:

Complete the following steps to set up your permissions:

Make sure you are in theus-east-1Region.

Create a S3 bucket in the us-east-1 Region (forexample,emr-fgac-hudi-us-east-1-<ACCOUNT ID>).

Next, we enable Lake Formation by changing the default permission model.

Sign in to the Lake Formation console as the administrator user.
Choose Data Catalog settings under Administration in the navigation pane.
Under Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
Choose Save.

Data Catalog settings

Alternatively, you need to revoke IAMAllowedPrincipals on resources (databases and tables) created if you started Lake Formation with the default option.

Finally, we create a key pair for Amazon EMR.

On the Amazon EC2 console, choose Key pairs in the navigation pane.
Choose Create key pair.
For Name, enter a name (forexampleemr-fgac-hudi-keypair).
Choose Create key pair.

Create key pair

“`

Implementing Detailed Access Control for Open Table Formats using Amazon EMR and AWS Lake Formation

Transaction data lake use case

Solution overview

Prerequisites

Creating Your Own Personal Brand as a Data Scientist: A Step-by-Step Guide

Robotics Update: Latest Innovations as of 08.12.2023

Change the Way You View Your Data

AI and Machine Learning: Revolutionizing Personalized Employee Recognition

Best 10 AI Products for 2024

Finding the Balance: Governing AI and Privacy to Safeguard Innovation and Security

Reducing Manufacturing Economic Pressures through Automated Palletizing

BigML.com’s Official Blog: Automated Machine Learning for Imbalanced Data