
Develop an Amazon Redshift ETL serverless framework using RSQL, AWS Batch, and AWS Step Functions


Amazon Redshift RSQL is a command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use enhanced control flow commands to replace existing extract, transform, load (ETL) and automation scripts.

This post explains how you can create a fully serverless and cost-effective Amazon Redshift ETL orchestration framework. To achieve this, you can use Amazon Redshift RSQL and AWS services such as AWS Batch and AWS Step Functions.

Overview of solution

When you're migrating from existing data warehouses to Amazon Redshift, your existing ETL processes are implemented as proprietary scripts. These scripts contain SQL statements and complex business logic such as if-then-else control flow logic, error reporting, and error handling. You can convert all these features to Amazon Redshift RSQL, which you can use to replace existing ETL and other automation scripts. To learn more about Amazon Redshift RSQL features, examples, and use cases, see Accelerate your data warehouse migration to Amazon Redshift – Part 4.
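As an illustration, the control flow in a converted script might look roughly like the following hypothetical RSQL fragment (the table names and messages are made up, and the fragment is a sketch only; consult the Amazon Redshift RSQL documentation for the exact command set):

```sql
-- Hypothetical RSQL fragment: run a statement, then branch on the result.
INSERT INTO customer_stg SELECT * FROM customer_landing;

\if :ERRORCODE <> 0
    \remark 'Load of customer_stg failed'
    \exit 1
\else
    \remark 'Load of customer_stg succeeded'
\endif
```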

The AWS Schema Conversion Tool (AWS SCT) can convert proprietary scripts to Amazon Redshift RSQL. AWS SCT can automatically convert Teradata BTEQ scripts to Amazon Redshift RSQL. To learn more about how to use AWS SCT, see Converting Teradata BTEQ scripts to Amazon Redshift RSQL with AWS SCT.

The goal of the solution presented in this post is to run complex ETL jobs implemented in Amazon Redshift RSQL scripts in the AWS Cloud without having to manage any infrastructure. In addition to meeting functional requirements, this solution also provides full auditing and traceability of all ETL processes that you run.

The following diagram shows the final architecture.

The deployment is fully automated using the AWS Cloud Development Kit (AWS CDK) and consists of the following stacks:

  • EcrRepositoryStack – Creates a private Amazon Elastic Container Registry (Amazon ECR) repository that hosts our Docker image with Amazon Redshift RSQL
  • RsqlDockerImageStack – Builds our Docker image asset and uploads it to the ECR repository
  • VpcStack – Creates a VPC with isolated subnets, creates an Amazon Simple Storage Service (Amazon S3) VPC endpoint gateway, as well as Amazon ECR, Amazon Redshift, and Amazon CloudWatch VPC endpoint interfaces
  • RedshiftStack – Creates an Amazon Redshift cluster, enables encryption, enforces encryption in transit, enables auditing, and deploys the Amazon Redshift cluster in isolated subnets
  • BatchStack – Creates a compute environment (using AWS Fargate), job queue, and job definition (using our Docker image with RSQL)
  • S3Stack – Creates data, scripts, and logging buckets; enables encryption at rest; enforces secure transfer; enables object versioning; and disables public access
  • SnsStack – Creates an Amazon Simple Notification Service (Amazon SNS) topic and email subscription (the email address is passed as a parameter)
  • StepFunctionsStack – Creates a state machine to orchestrate serverless RSQL ETL jobs
  • SampleDataDeploymentStack – Deploys sample RSQL ETL scripts and sample TPC benchmark datasets

Prerequisites

You should have the following prerequisites:

Deploy AWS CDK stacks

To deploy the serverless RSQL ETL framework solution, use the following code. Replace 123456789012 with your AWS account number, eu-west-1 with the AWS Region to which you want to deploy the solution, and your.email@example.com with the email address to which ETL success and failure notifications are sent.

git clone https://github.com/aws-samples/amazon-redshift-serverless-rsql-etl-framework
cd amazon-redshift-serverless-rsql-etl-framework
npm install
./cdk.sh 123456789012 eu-west-1 bootstrap
./cdk.sh 123456789012 eu-west-1 deploy --all --parameters SnsStack:EmailAddressSubscription=your.email@example.com

The whole process takes a few minutes. While AWS CDK creates all the stacks, you can continue reading this post.

Create the RSQL container image

AWS CDK creates an RSQL Docker image. This Docker image is the basic building block of our solution. All ETL processes run inside it. AWS CDK creates the Docker image locally using Docker Engine and then uploads it to the Amazon ECR repository.

The Docker image is based on an Amazon Linux 2 Docker image. It has the following tools installed: the AWS Command Line Interface (AWS CLI), unixODBC, the Amazon Redshift ODBC driver, and Amazon Redshift RSQL. It also contains the .odbc.ini file, which defines the etl profile that is used to connect to the Amazon Redshift cluster. See the following code:
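The full Dockerfile is not reproduced in this post, but a minimal sketch might look like the following. Package choices and file paths here are assumptions, not the repository's actual contents; the RSQL and ODBC driver RPM download links are in the Amazon Redshift documentation and are deliberately omitted. Note that gettext supplies the envsubst command used later by the entry point script.

```dockerfile
FROM amazonlinux:2

# Tooling named in this post; gettext provides envsubst.
RUN yum install -y awscli unixODBC gettext && yum clean all
# The Amazon Redshift ODBC driver and RSQL RPMs would be downloaded
# and installed here (links omitted; see the Amazon Redshift docs).

# Connection profile and entry point described in this post.
COPY .odbc.ini /root/.odbc.ini
COPY fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
RUN chmod +x /usr/local/bin/fetch_and_run.sh

ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]
```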

The following code example shows the .odbc.ini file. It defines an etl profile, which uses an AWS Identity and Access Management (IAM) role to get temporary cluster credentials to connect to Amazon Redshift. AWS CDK creates this role for us. Thanks to this, we don't need to hard-code credentials in the Docker image. The Database, DbUser, and ClusterID parameters are set in AWS CDK. Also, AWS CDK replaces the Region parameter at runtime with the Region to which you deploy the stacks.

[ODBC]
Trace=no

[etl]
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Database=demo
DbUser=etl
ClusterID=redshiftblogdemo
Region=eu-west-1
IAM=1

For more information about connecting to Amazon Redshift clusters with RSQL, see Connect to a cluster with Amazon Redshift RSQL.

Our Docker image implements a well-known fetch and run integration pattern. To learn more about this pattern, see Creating a Simple "Fetch & Run" AWS Batch Job. The Docker image fetches the ETL script from an external repository, and then runs it. AWS CDK passes the information about the ETL script to run to the Docker container at runtime as an AWS Batch job parameter. The job parameter is exposed to the container as an environment variable called BATCH_SCRIPT_LOCATION. Our job also expects two other environment variables: DATA_BUCKET_NAME, which is the name of the S3 data bucket, and COPY_IAM_ROLE_ARN, which is the Amazon Redshift IAM role used for the COPY command to load the data into Amazon Redshift. All environment variables are set automatically by AWS CDK. The fetch_and_run.sh script is the entry point of the Docker container. See the following code:

#!/bin/bash

# This script expects the following env variables to be set:
# BATCH_SCRIPT_LOCATION - full S3 path to the RSQL script to run
# DATA_BUCKET_NAME - S3 bucket name with the data
# COPY_IAM_ROLE_ARN - IAM role ARN that will be used to copy the data from S3 to Redshift

PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin"

if [ -z "${BATCH_SCRIPT_LOCATION}" ] || [ -z "${DATA_BUCKET_NAME}" ] || [ -z "${COPY_IAM_ROLE_ARN}" ]; then
    echo "BATCH_SCRIPT_LOCATION/DATA_BUCKET_NAME/COPY_IAM_ROLE_ARN not set. No script to run."
    exit 1
fi

# download the script to a temp file
TEMP_SCRIPT_FILE=$(mktemp)
aws s3 cp ${BATCH_SCRIPT_LOCATION} ${TEMP_SCRIPT_FILE}

# execute the script
# envsubst will substitute the ${DATA_BUCKET_NAME} and ${COPY_IAM_ROLE_ARN} placeholders with actual values
envsubst < ${TEMP_SCRIPT_FILE} | rsql -D etl

exit $?

Create AWS Batch resources

Next, AWS CDK creates the AWS Batch compute environment, job queue, and job definition. As a fully managed service, AWS Batch helps you run batch computing workloads of any scale. AWS CDK creates a Fargate serverless compute environment for us. The compute environment deploys within the same VPC as the Amazon Redshift cluster, inside the isolated subnets. The job definition uses our Docker image with Amazon Redshift RSQL.

This step turns Amazon Redshift RSQL into a serverless service. You can build complex ETL workflows based on this generic job.

Create a Step Functions state machine

AWS CDK then moves on to the deployment of the Step Functions state machine. Step Functions lets you build complex workflows in a visual way directly in your browser. This service supports over 9,000 API actions from over 200 AWS services.

You can use the Amazon States Language to create a state machine on the Step Functions console. The Amazon States Language is a JSON-based, structured language used to define your state machine. You can also build state machines programmatically using AWS CDK, as I have done for this post.

After AWS CDK finishes, a new state machine called ServerlessRSQLETLFramework is created in your account. To run it, complete the following steps:

  1. Navigate to the Step Functions console.
  2. Choose the state machine to open the details page.
  3. Choose Edit, and then choose Workflow Studio New.
    The following screenshot shows our state machine.
  4. Choose Cancel to leave Workflow Studio, then choose Cancel again to leave the edit mode.
    You will be brought back to the details page.
  5. Choose Start execution.
    A dialog box appears. By default, the Name parameter is set to a random identifier, and the Input parameter is set to a sample JSON document.
  6. Delete the Input parameter and choose Start execution to start the state machine.

The Graph view on the details page updates in real time. The state machine starts with a parallel state with two branches. In the left branch, the first job loads customer data into the staging table, then the second job merges new and existing customer records. In the right branch, two smaller tables for regions and nations are loaded and then merged one after another. The parallel state waits until all branches are complete before moving to the vacuum-analyze state, which runs VACUUM and ANALYZE commands on Amazon Redshift. The sample state machine also implements the Amazon SNS Publish API actions to send notifications about success or failure.
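For reference, the flow described above corresponds to an Amazon States Language skeleton roughly like the following. All names, ARNs, and script locations are placeholders, and the region/nation branch and failure notification are omitted for brevity; the definition that AWS CDK actually deploys is more detailed.

```json
{
  "Comment": "Illustrative skeleton only; names and ARNs are placeholders",
  "StartAt": "CopyAndMergeInParallel",
  "States": {
    "CopyAndMergeInParallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "CopyCustomer",
          "States": {
            "CopyCustomer": {
              "Type": "Task",
              "Resource": "arn:aws:states:::batch:submitJob.sync",
              "Parameters": {
                "JobName": "CopyCustomer",
                "JobQueue": "<job-queue-arn>",
                "JobDefinition": "<job-definition-arn>",
                "Parameters": { "BATCH_SCRIPT_LOCATION": "s3://<scripts-bucket>/copy_customer.sql" }
              },
              "Next": "MergeCustomer"
            },
            "MergeCustomer": {
              "Type": "Task",
              "Resource": "arn:aws:states:::batch:submitJob.sync",
              "Parameters": {
                "JobName": "MergeCustomer",
                "JobQueue": "<job-queue-arn>",
                "JobDefinition": "<job-definition-arn>",
                "Parameters": { "BATCH_SCRIPT_LOCATION": "s3://<scripts-bucket>/merge_customer.sql" }
              },
              "End": true
            }
          }
        }
      ],
      "Next": "VacuumAnalyze"
    },
    "VacuumAnalyze": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "VacuumAnalyze",
        "JobQueue": "<job-queue-arn>",
        "JobDefinition": "<job-definition-arn>",
        "Parameters": { "BATCH_SCRIPT_LOCATION": "s3://<scripts-bucket>/vacuum_analyze.sql" }
      },
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": { "TopicArn": "<sns-topic-arn>", "Message": "RSQL ETL succeeded" },
      "End": true
    }
  }
}
```

The .sync suffix on the Batch integration makes each task wait for the job to finish before the state machine moves on, which is what lets the parallel state gate the vacuum-analyze step.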

From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab. In our example, next to every AWS Batch Job state, you can see a link to the AWS Batch Job details page. There, you can view the status, runtime, parameters, IAM roles, a link to Amazon CloudWatch Logs with the logs produced by the ETL scripts, and more.

Clean up

To avoid ongoing charges for the resources that you created, delete them. AWS CDK deletes all resources except data resources such as S3 buckets and Amazon ECR repositories.

  1. First, delete all AWS CDK stacks. In the following code, provide your own AWS account and AWS Region:
    ./cdk.sh 123456789012 eu-west-1 destroy --all

  2. On the Amazon S3 console, empty and delete buckets with names starting with:
    1. s3stack-rsqletldemodata
    2. s3stack-rsqletldemoscripts
    3. s3stack-rsqletldemologging
  3. Finally, on the Amazon ECR console, delete repositories with names starting with:
    1. ecrrepositorystack-amazonlinuxrsql
    2. cdk-container-assets

Next steps

Here are some ideas for additional enhancements that you can add to the described solution.

You can break large, complex state machines into smaller building blocks by creating self-contained state machines. In our example, you could create state machines for every pair of copy and merge jobs. You could create three such state machines: Copy and Merge Customer, Copy and Merge Region, and Copy and Merge Nation, and then call them from the main state machine. For complex workflows, a different team can work on each sub-state machine in parallel. This pattern also promotes reuse of existing components, best practices, and security mechanisms.

You can use Amazon S3 Event Notifications with Amazon EventBridge to start a state machine automatically after you upload a file to an S3 bucket. To learn more about this integration, see Use Amazon S3 Event Notifications with Amazon EventBridge. This way you can achieve a fully event-driven serverless ETL orchestration framework.
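For example, an EventBridge rule whose event pattern matches object uploads to the data bucket could have the state machine as its target. The bucket name below is a placeholder:

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["<data-bucket-name>"]
    }
  }
}
```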

Summary

You can use Amazon Redshift RSQL, AWS Batch, and Step Functions to create modern, serverless, and cost-effective ETL workflows. There is no infrastructure to manage, and Amazon Redshift RSQL works as a serverless RSQL service. In this post, we demonstrated how to use this serverless RSQL service to build more complex ETL workflows with Step Functions.

Step Functions integrates natively with over 200 AWS services. This opens a new world of possibilities for AWS customers and partners, who can integrate their processes with other data, analytics, machine learning, and compute services such as Amazon S3, Amazon DynamoDB, AWS Glue, Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), Amazon SageMaker, AWS Lambda, and more. An additional advantage of Step Functions and AWS Batch is that you get full traceability and auditing out of the box. Step Functions shows Graph or Event views together with a complete history of all state machine runs.

In this post, I used RSQL automation scripts as the building blocks of ETL workflows. Using RSQL is a common integration pattern that we see for customers migrating from Teradata BTEQ scripts. However, if you have simple ETL or ELT processes that can be written as plain SQL, you can invoke the Amazon Redshift Data API directly from Step Functions. To learn more about this integration pattern, see ETL orchestration using the Amazon Redshift Data API and AWS Step Functions with AWS SDK integration.
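With that pattern, a Step Functions task could call the Data API's ExecuteStatement action through the AWS SDK integration directly, along these lines (the cluster identifier is a placeholder, and the database and user mirror the etl profile shown earlier):

```json
{
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
  "Parameters": {
    "ClusterIdentifier": "<cluster-id>",
    "Database": "demo",
    "DbUser": "etl",
    "Sql": "ANALYZE customer;"
  },
  "End": true
}
```

Because the SQL runs through the Data API, no container or compute environment is involved at all for such simple steps.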


About the author

Lukasz is a Principal Software Dev Engineer working in the AWS DMA team. Lukasz helps customers move their workloads to AWS and specializes in migrating data warehouses and data lakes to AWS. In his free time, Lukasz enjoys learning new human languages.
