Thursday, August 4, 2022
HomeBig DataHandle information transformations with dbt in Amazon Redshift

Handle information transformations with dbt in Amazon Redshift


Amazon Redshift is a completely managed, petabyte-scale information warehouse service within the cloud. You can begin with only a few hundred gigabytes of knowledge and scale to a petabyte or extra. Amazon Redshift lets you use your information to amass new insights for your corporation and prospects whereas conserving prices low.

Along with price-performance, prospects need to handle information transformations (SQL Choose statements written by information engineers, information analysts, and information scientists) in Amazon Redshift with options together with modular programming and information lineage documentation.

dbt (information construct instrument) is a framework that helps these options and extra to handle information transformations in Amazon Redshift. There are two interfaces for dbt:

  • dbt CLI – Out there as an open-source undertaking
  • dbt Cloud – A hosted service with added options together with an IDE, job scheduling, and extra

On this publish, we display some options in dbt that assist you handle information transformations in Amazon Redshift. We additionally present the dbt CLI and Amazon Redshift workshop to get began utilizing these options.

Handle widespread logic

dbt lets you write SQL in a modular trend. This improves maintainability and productiveness as a result of widespread logic might be consolidated (preserve a single occasion of logic) and referenced (construct on present logic as a substitute of ranging from scratch).

The next determine is an instance displaying how dbt consolidates widespread logic. On this instance, two fashions depend on the identical subquery. As an alternative of replicating the subquery, dbt means that you can create a mannequin for the subquery and reference it later.

Manage common subquery in dbt

Determine 1: Handle widespread subquery in dbt

The idea of referencing isn’t restricted to logic associated to subqueries. You can too use referencing for logic associated to fields.

The next is an instance displaying how dbt consolidates widespread logic associated to fields. On this instance, a mannequin applies the identical case assertion on two fields. As an alternative of replicating the case assertion for every discipline, dbt means that you can create a macro containing the case assertion and reference it later.

Manage common case statement in dbt

Determine 2: Handle widespread case assertion in dbt

How is a mannequin in dbt subsequently created in Amazon Redshift? dbt gives you with the command dbt run, which materializes fashions as views or tables in your focused Amazon Redshift cluster. You’ll be able to do that out within the dbt CLI and Amazon Redshift workshop.

Handle widespread information mappings

Though you need to use macros to handle information mappings (for instance, mapping “1” to “One” and “2” to “Two”), another is to take care of information mappings in recordsdata and handle the recordsdata in dbt.

The next is an instance of how dbt manages widespread information mappings. On this instance, a mannequin applies one-to-one information mappings on a discipline. As an alternative of making a macro for the one-to-one information mappings, dbt means that you can create a seed for the one-to-one information mappings within the type of a CSV file after which reference it later.

Manage common data mapping in dbt

Determine 3: Handle widespread information mapping in dbt

You’ll be able to create or replace a seed with a two-step course of. After you create or replace a CSV seed file, run the command dbt seed to create the CSV seed as a desk in your focused Amazon Redshift cluster earlier than referencing it.

Handle information lineage documentation

After you could have created fashions and seeds in dbt, and used dbt’s referencing functionality, dbt gives you with a technique to generate documentation in your information transformations.

You’ll be able to run the command dbt docs generate adopted by dbt docs serve to launch a regionally hosted web site containing documentation in your dbt undertaking. While you select a mannequin on the regionally hosted web site, details about the mannequin is displayed, together with columns within the ultimate view or desk, dependencies to create the mannequin, and the SQL that’s compiled to create the view or desk. The next screenshot exhibits an instance of this documentation.

Documentation generated by dbt

Determine 4: Documentation generated by dbt

You can too visualize dependencies for improved navigation of documentations throughout impression evaluation. Within the following instance graph, we are able to see that mannequin rpt_tech_all_users is constructed referencing the mannequin base_public_users, which in flip references the desk customers within the public schema.

Data lineage visualization generated by dbt

Determine 5: Information lineage visualization generated by dbt

Conclusion

This publish lined how you need to use dbt to handle information transformations in Amazon Redshift. As you discover dbt, you’ll come throughout different options like hooks, which you need to use to handle administrative duties, for instance, steady granting of privileges.

For a hands-on expertise with dbt CLI and Amazon Redshift, we’ve got a workshop with step-by-step directions that can assist you create your first dbt undertaking and discover the options talked about on this publish—fashions, macros, seeds, and hooks. Go to dbt CLI and Amazon Redshift to get began.

You probably have any questions or options, depart your suggestions within the feedback part. When you want any additional help to optimize your Amazon Redshift implementation, contact your AWS account workforce or a trusted AWS companion.


Concerning the authors

Randy Chng is an Analytics Acceleration Lab Options Architect at Amazon Net Providers. He works with prospects to speed up their Amazon Redshift journey by delivering proof of ideas on key enterprise issues.

Sean Beath is an Analytics Acceleration Lab Options Architect at Amazon Net Providers. He delivers proof of ideas with prospects on Amazon Redshift, serving to prospects drive analytics worth on AWS.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments