Overview of AWS Glue - Data Warehousing

Reading Time: 4 minutes

In this post we will cover: data stores, crawlers, the Data Catalog, migrating from an on-premise solution to AWS Glue, the steps to build your ETL jobs (setting up connections, creating crawlers, and building jobs in AWS Glue Studio), scheduling and monitoring jobs, and what to read next.

Data stores

AWS Glue lets you access and combine data from multiple source data stores, and keep that combined data organized in a central catalog.

Crawlers

AWS Glue crawls your data sources, identifies data formats, and suggests schemas to store your data.

Data Catalog

You can use the AWS Glue Data Catalog to quickly discover and search across multiple AWS data sets without moving the data.
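
As a minimal sketch of what that looks like programmatically (the database name "sales_catalog" and the search text are placeholders), you can list or search catalog tables with boto3:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List the table definitions registered in one Data Catalog database.
for table in glue.get_tables(DatabaseName="sales_catalog")["TableList"]:
    print(table["Name"], table.get("StorageDescriptor", {}).get("Location"))

# Or search the whole catalog by keyword without knowing the database up front.
matches = glue.search_tables(SearchText="sales")
for table in matches["TableList"]:
    print(table["DatabaseName"], table["Name"])
```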

Migrating from an on-premise solution to AWS Glue

When you run AWS Glue, there are no servers or other infrastructure to manage. You pay only for the resources used while running the jobs and for the metadata that is stored. If your organization is already invested in Informatica, DataStage, Talend, etc., developers may find it easy to pick up AWS Glue through AWS Glue Studio. AWS Glue Studio makes it easy to visually create, run, and monitor AWS Glue ETL jobs. You can compose ETL jobs that move and transform data using a drag-and-drop editor, and AWS Glue automatically generates the code. You can then use the AWS Glue Studio job run dashboard to monitor ETL execution and ensure that your jobs are operating as intended.

It is important to remember, though, that the third-party connectors commonly available in other ETL tools may not be available (yet!). No Salesforce connector 🙂

If your company has already invested significantly in on-premise ETL pipelines, migration may be expensive.

Steps to build your ETL jobs

Set up connections to source and target

All connections are set up using IAM roles. Connections to an RDBMS in the Amazon ecosystem can be configured with IAM roles and connected using the RDBMS (JDBC) connector.

For non-RDBMS connections, for example to S3, the connection can be established based on IAM roles that have access to read and update the respective S3 buckets.
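
As a rough illustration, here is how a reusable JDBC connection might be registered through the Glue API with boto3. The connection name, JDBC URL, credentials, and network details below are placeholders; the IAM role that your crawlers and jobs assume still needs permissions to reach the database or S3 buckets.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a JDBC connection that crawlers and jobs can reuse.
# All names, URLs, and credentials are illustrative placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "my-postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-db.example.com:5432/sales",
            "USERNAME": "glue_user",
            "PASSWORD": "change-me",  # prefer AWS Secrets Manager in real setups
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```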

Create crawlers to gather schemas of source and target data

AWS Glue crawlers infer schemas from connected data stores and store the metadata in the Data Catalog.

AWS Glue crawlers can connect to data stores using the IAM roles that you configure. Once connected, you can set up a crawler to choose which data stores to include and crawl JSON files, text files, system logs, relational database tables, and so on. You can also define include or exclude patterns that control what the crawler infers schemas from; for example, if you don't want the *.csv files in an S3 bucket to be crawled, you can exclude them. A crawler can run once or be set up to run on a given schedule, and it stores its output, including the format (e.g. JSON) and the schema, in the Data Catalog.
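
A minimal sketch of that setup with boto3, assuming a hypothetical S3 path, catalog database, and role ARN, and excluding *.csv files as in the example above:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that infers schemas from an S3 prefix and writes
# the resulting table definitions into a Data Catalog database.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # role with S3 read access
    DatabaseName="sales_catalog",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-data-lake/raw/sales/",
                "Exclusions": ["**.csv"],  # skip CSV files
            }
        ]
    },
    # Optional: run nightly at 02:00 UTC instead of on demand.
    Schedule="cron(0 2 * * ? *)",
)

# Run it once on demand; the inferred format and schema land in the Data Catalog.
glue.start_crawler(Name="sales-raw-crawler")
```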

Build ETL jobs using AWS Glue Studio

AWS Glue generates a PySpark or Scala script.

While building an ETL job in AWS Glue Studio, the job references source and target table schemas from the Data Catalog. Job arguments can be set up in the job, and the job can be scheduled based on events or time. When the job is compiled, it generates a PySpark or Scala script that is executed at run time. Serverless means you pay only for processing and loading data and for discovering data (crawlers), and these are billed by the second. For the AWS Glue Data Catalog, a monthly fee is paid for storing and accessing the metadata; the first million objects stored and the first million accesses are free.
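
To give a feel for the generated code, here is a hand-written sketch in the same shape as a Glue Studio PySpark script (it is not an exact copy of the generated output, and the catalog database, table name, column mappings, and S3 path are assumptions):

```python
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Job arguments are resolved at run time (here just the standard JOB_NAME).
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source using the schema the crawler stored in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_catalog", table_name="raw_sales"
)

# A simple transform: map and cast columns on the way through.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/sales/"},
    format="parquet",
)

job.commit()
```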

Scheduling and monitoring jobs

AWS Glue provides logging through Amazon CloudWatch Logs.
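
Driver and executor logs typically land in CloudWatch log groups such as /aws-glue/jobs/output and /aws-glue/jobs/error. As a small sketch (the job name is a placeholder), you can also check recent run states from code, mirroring what the AWS Glue Studio job run dashboard shows:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Check the most recent runs of a job and their states.
runs = glue.get_job_runs(JobName="sales-nightly-etl", MaxResults=5)
for run in runs["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))
```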

Knowledge of Python (PySpark) or Scala may be useful for troubleshooting or for large projects with multiple changes. Consider your team's strengths in these before you dive into AWS Glue.

What to read next?

Traditional ETL vs AWS Glue
AWS Glue (Serverless ETL)