Friday, March 25, 2022

Build a data sharing workflow with AWS Lake Formation for your data mesh


A key benefit of a data mesh architecture is allowing different lines of business (LOBs) and organizational units to operate independently and offer their data as a product. This model not only allows organizations to scale, but also gives end-to-end ownership of maintaining the product to the data producers, who are the domain experts of the data. This ownership entails maintaining the data pipelines, debugging ETL scripts, fixing data quality issues, and keeping the catalog entries up to date as the dataset evolves over time.

On the consumer side, teams can search the central catalog for relevant data products and request access. Access to the data is granted through the data sharing feature in AWS Lake Formation. As the number of data products grows and potentially more sensitive information is stored in an organization’s data lake, it’s important that the process and mechanism to request and grant access to specific data products are implemented in a scalable and secure manner.

This post describes how to build a workflow engine that automates the data sharing process while including a separate approval mechanism for data products that are tagged as sensitive (for example, containing PII data). Both the workflow and approval mechanism are customizable and should be adapted to adhere to your company’s internal processes. In addition, we include an optional workflow UI to demonstrate how to integrate with the workflow engine. The UI is just one example of how the interaction works. In a typical large enterprise, you can also use ticketing systems to automatically trigger both the workflow and the approval process.

Solution overview

A typical data mesh architecture for analytics in AWS contains one central account that collates all the different data products from multiple producer accounts. Consumers can search the available data products in a single location. Sharing data products to consumers doesn’t actually make a separate copy, but instead just creates a pointer to the catalog item. This means any updates that producers make to their products are automatically reflected in the central account as well as in all the consumer accounts.

Building on top of this foundation, the solution contains several components, as depicted in the following diagram:

The central account includes the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – Used to secure access to the data as well as provide the data sharing capabilities that enable the data mesh architecture.
  • AWS Step Functions – The actual workflow is defined as a state machine. You can customize this to adhere to your organization’s approval requirements.
  • AWS Amplify – The workflow UI uses the Amplify framework to secure access. It also uses Amplify to host the React-based application. On the backend, the Amplify framework creates two Amazon Cognito components to support the security requirements:
    • User pools – Provide user directory functionality.
    • Identity pools – Provide federated sign-in capabilities using Amazon Cognito user pools as the location of the user details. The identity pools vend temporary credentials so the workflow UI can access AWS Glue and Step Functions APIs.
  • AWS Lambda – Contains the application logic orchestrated by the Step Functions state machine. It also provides the necessary application logic when a producer approves or denies a request for access.
  • Amazon API Gateway – Provides the API for producers to accept and deny requests.

The producer account contains the following components:

The consumer account contains the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – After the data has been made available, consumers can grant access to their own users via Lake Formation.
  • AWS Resource Access Manager (AWS RAM) – If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. For more details about Lake Formation cross-account access, see Cross-Account Access: How It Works.

The solution is split into multiple steps:

  1. Deploy the central account backend, including the workflow engine and its associated components.
  2. Deploy the backend for the producer accounts. You can repeat this step multiple times depending on the number of producer accounts that you’re onboarding into the workflow engine.
  3. Deploy the optional workflow UI in the central account to interact with the central account backend.

Workflow overview

The following diagram illustrates the workflow. In this particular example, the state machine checks whether the table or database (depending on what’s being shared) has the pii_flag parameter and whether it’s set to TRUE. If both conditions hold, it sends an approval request to the producer’s SNS topic. Otherwise, it automatically shares the product to the requesting consumer.

This workflow is the core of the solution, and can be customized to fit your organization’s approval process. In addition, you can add custom parameters to databases, tables, or even columns to attach extra metadata to support the workflow logic.
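The branching rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not the code used by the solution's Lambda functions: the function names and the two step labels (SEND_APPROVAL_REQUEST and SHARE_CATALOG_ITEM) are hypothetical, and the sketch assumes the catalog parameters arrive as a plain string-to-string mapping.

```python
from typing import Mapping, Optional


def requires_approval(parameters: Optional[Mapping[str, str]]) -> bool:
    """Return True only when pii_flag is present AND set to true.

    Catalog parameters are stored as strings, so the comparison is
    case-insensitive ("true", "TRUE", and "True" all count).
    """
    if not parameters:
        return False
    flag = parameters.get("pii_flag")
    return flag is not None and str(flag).strip().lower() == "true"


def next_step(parameters: Optional[Mapping[str, str]]) -> str:
    """Pick the next workflow step (hypothetical step labels)."""
    if requires_approval(parameters):
        return "SEND_APPROVAL_REQUEST"
    return "SHARE_CATALOG_ITEM"


if __name__ == "__main__":
    print(next_step({"pii_flag": "TRUE", "data_owner": "111122223333"}))
    print(next_step({"data_owner": "111122223333"}))
```

A missing parameter and a parameter set to false both fall through to the automatic share, which matches the "both conditions" wording of the example workflow.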

Prerequisites

The following are the deployment requirements:

You can clone the workflow UI and AWS CDK scripts from the GitHub repository.

Deploy the central account backend

To deploy the backend for the central account, go to the root of the project after cloning the GitHub repository and enter the following code:

yarn deploy-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

This deploys the next:

  • IAM roles used by the Lambda functions and Step Functions state machine
  • Lambda functions
  • The Step Functions state machine (the workflow itself)
  • An API Gateway

When the deployment is complete, it generates a JSON file in the src/cfn-output.json location. This file is used by the UI deployment script to generate a scoped-down IAM policy, and by the workflow UI application to locate the state machine that was created by the AWS CDK script.

The actual AWS CDK scripts for the central account deployment are in infra/central/. This also includes the Lambda functions (in the infra/central/functions/ folder) that are used by both the state machine and the API Gateway.

Lake Formation permissions

The following table contains the minimum required permissions that the central account data lake administrator needs to grant to the respective IAM roles for the backend to have access to the AWS Glue Data Catalog.

Role | Permission | Grantable
WorkflowLambdaTableDetails | Database: DESCRIBE; Tables: DESCRIBE | N/A
WorkflowLambdaShareCatalog

Workflow catalog parameters

The workflow uses the following catalog parameters to provide its functionality.

Catalog Type | Parameter Name | Description
Database | data_owner | (Required) The account ID of the producer account that owns the data products.
Database | data_owner_name | A readable, friendly name that identifies the producer in the UI.
Database | pii_flag | A flag (true/false) that determines whether the data product requires approval (based on the example workflow).
Column | pii_flag | A flag (true/false) that determines whether the data product requires approval (based on the example workflow). This is only applicable if requesting table-level access.

You can use UpdateDatabase and UpdateTable to add parameters at database-level and column-level granularity, respectively. Alternatively, you can use the CLI for AWS Glue to add the relevant parameters.

Use the AWS CLI to run the following command to check the current parameters in your database:
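For the column-level case, the sketch below shows one way to set pii_flag on a single column before calling UpdateTable. It is an illustration under stated assumptions: the helper name flag_column is hypothetical, and the input is assumed to follow the AWS Glue table shape, where columns live under StorageDescriptor.Columns and each column can carry its own Parameters map. The helper only transforms a dictionary; sending the result to AWS Glue is left out.

```python
import copy


def flag_column(table_input: dict, column_name: str, pii: bool) -> dict:
    """Return a copy of a Glue table definition with pii_flag set on one column.

    The original dictionary is left untouched so it can be compared
    against the updated copy before anything is sent to AWS Glue.
    """
    updated = copy.deepcopy(table_input)
    columns = updated.get("StorageDescriptor", {}).get("Columns", [])
    for column in columns:
        if column.get("Name") == column_name:
            # Catalog parameters are strings, so store "true"/"false".
            column.setdefault("Parameters", {})["pii_flag"] = "true" if pii else "false"
            return updated
    raise KeyError(f"column {column_name!r} not found in table definition")


if __name__ == "__main__":
    table = {
        "Name": "customers",  # hypothetical table
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "int"},
                {"Name": "email", "Type": "string"},
            ]
        },
    }
    print(flag_column(table, "email", True)["StorageDescriptor"]["Columns"][1])
```

If you start from the output of GetTable, keep in mind that the response includes read-only fields (such as CreateTime and UpdateTime) that are not part of the table input and have to be dropped before the update call.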

aws glue get-database --name <DATABASE_NAME> --profile <PROFILE_OF_CENTRAL_ACCOUNT>

You get the next response:

{
  "Database": {
    "Name": "<DATABASE_NAME>",
    "CreateTime": "<CREATION_TIME>",
    "CreateTableDefaultPermissions": [],
    "CatalogId": "<CATALOG_ID>"
  }
}

To update the database with the parameters indicated in the preceding table, we first create the input JSON file, which contains the parameters that we want to update the database with. For example, see the following code:

{
  "Name": "<DATABASE_NAME>",
  "Parameters": {
    "data_owner": "<AWS_ACCOUNT_ID_OF_OWNER>",
    "data_owner_name": "<AWS_ACCOUNT_NAME_OF_OWNER>",
    "pii_flag": "true"
  }
}

Run the following command to update the Data Catalog:

aws glue update-database --name <DATABASE_NAME> --database-input file://<FILE_NAME>.json --profile <PROFILE_OF_CENTRAL_ACCOUNT>
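If the database already has parameters or other settings, it can help to build the input file from the get-database response instead of writing it by hand. The following Python sketch is one possible approach, not part of the solution's scripts: build_database_input is a hypothetical helper, and it assumes the response has the shape shown earlier. It carries over a few fields that the database input accepts, merges the new parameters over any existing ones, and leaves out response-only fields such as CreateTime and CatalogId.

```python
import json

# Fields from the get-database response that are also valid in the
# update input. CreateTime and CatalogId are deliberately left out.
CARRY_OVER = ("Description", "LocationUri", "CreateTableDefaultPermissions")


def build_database_input(database: dict, new_parameters: dict) -> dict:
    """Merge new catalog parameters into an existing database definition."""
    merged = dict(database.get("Parameters", {}))
    merged.update(new_parameters)  # new values win over existing ones
    database_input = {"Name": database["Name"], "Parameters": merged}
    for key in CARRY_OVER:
        if key in database:
            database_input[key] = database[key]
    return database_input


if __name__ == "__main__":
    # Hypothetical values standing in for a real get-database response.
    response = {
        "Database": {
            "Name": "sales",
            "CreateTime": "2022-03-25T00:00:00+00:00",
            "CreateTableDefaultPermissions": [],
            "CatalogId": "111122223333",
        }
    }
    payload = build_database_input(
        response["Database"],
        {"data_owner": "111122223333", "data_owner_name": "Sales", "pii_flag": "true"},
    )
    print(json.dumps(payload, indent=2))
```

The printed JSON can be saved to a file and passed to the update-database command through the --database-input parameter.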

Deploy the producer account backend

To deploy the backend to your producer accounts, go to the root of the project and run the following command:

yarn deploy-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT> --parameters centralMeshAccountId=<central_account_account_id>

This deploys the next:

  • An SNS topic where approval requests get published.
  • The ProducerWorkflowRole IAM role with a trust relationship to the central account. This role allows Amazon SNS publish to the previously created SNS topic.

You can run this deployment script multiple times, each time pointing to a different producer account that you want to participate in the workflow.

To receive notification emails, subscribe your email address to the SNS topic that the deployment script created. For example, our topic is called DataLakeSharingApproval. To get the full ARN, you can either go to the Amazon Simple Notification Service console or run the following command to list all the topics and get the ARN for DataLakeSharingApproval:

aws sns list-topics --profile <PROFILE_OF_PRODUCER_ACCOUNT>
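When the account has many topics, picking the right ARN out of the list-topics output by eye is error-prone. The following Python sketch filters the JSON output for a topic by name. The helper name find_topic_arn is hypothetical, and the sample ARN uses a made-up Region and account ID; the only assumption about the output is that it contains a Topics list whose entries each have a TopicArn.

```python
import json
from typing import Optional


def find_topic_arn(list_topics_output: str,
                   topic_name: str = "DataLakeSharingApproval") -> Optional[str]:
    """Return the ARN whose final segment equals topic_name, or None."""
    topics = json.loads(list_topics_output).get("Topics", [])
    for topic in topics:
        arn = topic.get("TopicArn", "")
        # An SNS topic ARN ends with ":<topic name>".
        if arn.rsplit(":", 1)[-1] == topic_name:
            return arn
    return None


if __name__ == "__main__":
    sample = json.dumps({
        "Topics": [
            {"TopicArn": "arn:aws:sns:us-east-1:111122223333:SomeOtherTopic"},
            {"TopicArn": "arn:aws:sns:us-east-1:111122223333:DataLakeSharingApproval"},
        ]
    })
    print(find_topic_arn(sample))
```

Matching on the final ARN segment rather than a substring avoids picking up topics whose names merely contain DataLakeSharingApproval.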

After you have the ARN, you can subscribe your email address by running the following command:

aws sns subscribe --topic-arn <TOPIC_ARN> --protocol email --notification-endpoint <EMAIL_ADDRESS> --profile <PROFILE_OF_PRODUCER_ACCOUNT>

You then receive a confirmation email at the address that you subscribed. Choose Confirm subscription to receive notifications from this SNS topic.

Deploy the workflow UI

The workflow UI is designed to be deployed in the central account where the central data catalog is located.

To start the deployment, enter the following command:

This deploys the next:

  • Amazon Cognito user pool and identity pool
  • React-based application to interact with the catalog and request data access

The deployment command prompts you for the following information:

  • Project information – Use the default values.
  • AWS authentication – Use your profile for the central account. Amplify uses this profile to deploy the backend resources.
  • UI authentication – Use the default configuration and your username. Choose No, I’m done when asked to configure advanced settings.
  • UI hosting – Use hosting with the Amplify console and choose manual deployment.

The script gives a summary of what’s deployed. Entering Y triggers the resources to be deployed in the backend. The prompt looks similar to the following screenshot:

When the deployment is complete, the remaining prompt is for the initial user information, such as user name and email. A temporary password is automatically generated and sent to the email address provided. The user is required to change the password after the first login.

The deployment script grants IAM permissions to the user via an inline policy attached to the Amazon Cognito authenticated IAM role:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "glue:GetDatabase",
            "glue:GetTables",
            "glue:GetDatabases",
            "glue:GetTable"
         ],
         "Resource":"*"
      },
      {
         "Effect":"Allow",
         "Action":[
            "states:ListExecutions",
            "states:StartExecution"
         ],
         "Resource":[
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:stateMachine:<STATE_MACHINE_NAME>"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "states:DescribeExecution"
         ],
         "Resource":[
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:execution:<STATE_MACHINE_NAME>:*"
         ]
      }
   ]
}
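To see how the placeholders in this policy are filled in, the following Python sketch builds the same three statements from a Region, an account ID, and a state machine name. It is not the deployment script's actual code: build_ui_policy is a hypothetical helper, and the sample values are made up. The sketch only mirrors the policy shown above so that the scoping is easy to follow.

```python
import json


def build_ui_policy(region: str, account_id: str, state_machine_name: str) -> dict:
    """Build the scoped-down policy for the authenticated Cognito role."""
    base = f"arn:aws:states:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only catalog browsing for the UI.
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTables",
                    "glue:GetDatabases",
                    "glue:GetTable",
                ],
                "Resource": "*",
            },
            {
                # Start and list executions of this one state machine only.
                "Effect": "Allow",
                "Action": ["states:ListExecutions", "states:StartExecution"],
                "Resource": [f"{base}:stateMachine:{state_machine_name}"],
            },
            {
                # Execution ARNs use the "execution" resource type.
                "Effect": "Allow",
                "Action": ["states:DescribeExecution"],
                "Resource": [f"{base}:execution:{state_machine_name}:*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_ui_policy("us-east-1", "111122223333", "SharingWorkflow"), indent=3))
```

The Step Functions permissions are limited to a single state machine and its executions, while the AWS Glue read actions apply to all resources; what the UI can actually see in the catalog is then governed by the Lake Formation permissions on the role.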

The last remaining step is to grant Lake Formation permissions (DESCRIBE for both databases and tables) to the authenticated IAM role associated with the Amazon Cognito identity pool. You can find the IAM role by running the following command:

cat amplify/team-provider-info.json

The IAM role name is in the AuthRoleName property under the awscloudformation key. After you grant the required permissions, you can use the URL provided in your browser to open the workflow UI.
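Instead of reading the file by eye, you can pull the role name out programmatically. The following Python sketch assumes the usual layout of amplify/team-provider-info.json, in which each top-level key is an Amplify environment name that contains an awscloudformation object; the helper name auth_role_names and the sample values are hypothetical.

```python
import json


def auth_role_names(team_provider_info: dict) -> dict:
    """Map each Amplify environment name to its AuthRoleName, if present."""
    roles = {}
    for env_name, env_config in team_provider_info.items():
        cfn = env_config.get("awscloudformation", {})
        if "AuthRoleName" in cfn:
            roles[env_name] = cfn["AuthRoleName"]
    return roles


if __name__ == "__main__":
    # Hypothetical file content; in practice, load the real file with
    # json.load(open("amplify/team-provider-info.json")).
    sample = json.loads("""
    {
      "dev": {
        "awscloudformation": {
          "AuthRoleName": "amplify-workflowui-dev-authRole",
          "UnauthRoleName": "amplify-workflowui-dev-unauthRole"
        }
      }
    }
    """)
    print(auth_role_names(sample))
```

The returned role name is the principal to which the data lake administrator grants the DESCRIBE permissions.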

Your temporary password is emailed to you so you can complete the initial login, after which you’re asked to change your password.

The first page after logging in is the list of databases that consumers can access.

Choose Request Access to see the database details and the list of tables.

Choose Request Per Table Access to see more details at the table level.

Going back to the previous page, we request database-level access by entering the consumer account ID that receives the share request.

Because this database has been tagged with a pii_flag, the workflow needs to send an approval request to the product owner. To receive this approval request email, the product owner’s email address needs to be subscribed to the DataLakeSharingApproval SNS topic in the producer account. The details should look similar to the following screenshot:

The email looks similar to the following screenshot:

The product owner chooses the Approve link to trigger the Step Functions state machine to continue running and share the catalog item to the consumer account.

For this example, the consumer account is not part of an organization, so the admin of the consumer account has to go to AWS RAM and accept the invitation.

After the resource share is accepted, the shared database appears in the consumer account’s catalog.

Clean up

If you no longer need to use this solution, use the provided cleanup scripts to remove the deployed resources.

Producer account

To remove the deployed resources in producer accounts, run the following command for each producer account that you deployed in:

yarn clean-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT>

Central account

Run the following command to remove the workflow backend in the central account:

yarn clean-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Workflow UI

The cleanup script for the workflow UI relies on an Amplify CLI command to initiate the teardown of the deployed resources. Additionally, you can use a custom script to remove the inline policy in the authenticated IAM role used by Amazon Cognito so that Amplify can fully clean up all the deployed resources. Run the following command to trigger the cleanup:

This command doesn’t require the profile parameter because it uses the existing Amplify configuration to infer where the resources are deployed and which profile was used.

Conclusion

This post demonstrated how to build a workflow engine to automate an organization’s approval process for gaining access to data products with varying degrees of sensitivity. Using a workflow engine enables data sharing in a self-service manner while codifying your organization’s internal processes so that you can safely scale as more data products and teams get onboarded.

The provided workflow UI demonstrated one possible integration scenario. Other possible integration scenarios include integration with your organization’s ticketing system to trigger the workflow as well as receive and respond to approval requests, or integration with business chat applications to further shorten the approval cycle.

Lastly, a high degree of customization is possible with the demonstrated approach. Organizations have complete control over the workflow, how data product sensitivity levels are defined, what gets auto-approved and what needs further approvals, the hierarchy of approvals (such as a single approver or multiple approvers), and how the approvals get delivered and acted upon. You can take advantage of this flexibility to automate your company’s processes to help them safely accelerate towards being a data-driven organization.


About the Author

Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
