
Glue ETL Development with Dev Endpoint Notebooks


In this post I will walk through a simple tutorial on using Dev Endpoints and notebooks for Glue ETL development. The tutorial is intentionally basic, so that you can get a feel for how dev endpoint notebooks can be useful for your ETL development without getting bogged down in details.

Why Use Dev Endpoint Notebooks?

First, let’s talk about the pain points of Glue ETL script development without dev endpoint notebooks. A typical flow could go like this:

  1. Make a change to the ETL script
  2. Run the Glue ETL job
  3. Wait for the ETL job to finish
  4. Read the logs for feedback
  5. Repeat

This is a time-consuming process: Glue ETL jobs may take several minutes to run (depending on the workload), and you have to dig through logs to get feedback. You will iterate through this loop many times during development, and the time adds up quickly. Fortunately, dev endpoint notebooks naturally accommodate this iterative flow. For example, what if I want to run a method on a PySpark DataFrame and then print out the resulting DataFrame? This is a common task when transforming data, and a dev endpoint notebook gives you the results in seconds rather than minutes. Because notebooks are interactive, you can rapidly test one function after another.

Additionally, many data scientists are familiar with Jupyter notebooks. SageMaker notebooks (one of the types of dev endpoint notebook available) offer a similar experience, giving data scientists a familiar environment for development.

Tutorial

Now that I’ve covered some of the benefits of using dev endpoint notebooks, let’s jump into a simple tutorial. Below is a CloudFormation template that deploys buckets to hold our input and output data, a database and a crawler for cataloging the input data, and a role that will be used by our crawler, dev endpoint, and notebook. Go ahead and deploy this template using CloudFormation. The dev endpoint and notebook will be created through the console, since CloudFormation support for dev endpoint notebooks is limited as of this writing.

AWSTemplateFormatVersion: "2010-09-09"
Description: Resources for Dev Endpoints Tutorial
Resources:
  InputDataBucket:
    Type: AWS::S3::Bucket
  OutputDataBucket:
    Type: AWS::S3::Bucket
  Database:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Description: Data for tutorial
        Name: tutorial-data
  Crawler:
    Type: AWS::Glue::Crawler
    Properties:
      DatabaseName: !Ref Database
      Description: Crawler for the input data
      Role: !GetAtt Role.Arn
      TablePrefix: ''
      Targets:
        S3Targets:
          - Path: !Ref InputDataBucket
  Role:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action: sts:AssumeRole
            Principal:
              Service:
                - glue.amazonaws.com
                - sagemaker.amazonaws.com
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
        - 'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole'
        - 'arn:aws:iam::aws:policy/CloudWatchLogsFullAccess'
      Policies:
        - PolicyName: DevEndpointFullAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: "*"
                Resource: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:devEndpoint/*
              - Effect: Allow
                Action: "*"
                Resource: !Sub arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:notebook-instance/*

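If you prefer to deploy from code instead of the console, a minimal boto3 sketch like the one below should also work (the stack name and template file name here are placeholders I made up for illustration).

# deploy the CloudFormation template with boto3 (stack and file names are placeholders)
import boto3

cloudformation = boto3.client('cloudformation')

with open('tutorial-template.yaml') as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName='glue-dev-endpoint-tutorial',
    TemplateBody=template_body,
    Capabilities=['CAPABILITY_IAM']  # needed because the template creates an IAM role
)
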
After your resources have been deployed, save this dataset and upload it to your input bucket (do not create any directories within the bucket). This is the well-known iris dataset, which contains measurements of Iris plant species.
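
If you would rather script the upload, here is a quick boto3 sketch (the local file name and bucket name are placeholders):

# upload the iris CSV to the input bucket (file and bucket names are placeholders)
import boto3

s3 = boto3.client('s3')
s3.upload_file('iris.csv', 'YOUR_INPUT_BUCKET_NAME', 'iris.csv')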

Now that we’ve uploaded our dataset, let’s catalog it using the crawler that we deployed. In the Glue console (you can get there by searching services for “AWS Glue”), click on “Crawlers” in the menu on the left side. Then click the checkbox for your crawler, and click “Run crawler”. This will add a table for the uploaded dataset to your database.
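
You can also kick off the crawler from code. A rough boto3 sketch, assuming you substitute the crawler name that CloudFormation generated for you:

# start the crawler and wait for it to return to the READY state (crawler name is a placeholder)
import time
import boto3

glue = boto3.client('glue')
glue.start_crawler(Name='YOUR_CRAWLER_NAME')

while glue.get_crawler(Name='YOUR_CRAWLER_NAME')['Crawler']['State'] != 'READY':
    time.sleep(15)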

Now that we’ve crawled our data, we are ready to create the dev endpoint in the AWS console.

  1. Go to the AWS Glue console and click on “Dev endpoints” on the left side. It can be found in the “ETL” submenu.
  2. Click on “Add endpoint”.
  3. Give your endpoint a name, and for the role, choose the role you deployed via your CloudFormation template.
  4. Click “Next”.
  5. Choose “Skip networking information” and click “Next” (Networking information is optional since we are only connecting to S3 data stores).
  6. Click “Next” on the “Add an SSH public key” screen (We leave this blank since we will access our endpoint from a SageMaker notebook).
  7. Click “Finish”.
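
If you prefer the API over the console, the endpoint can also be created with boto3. A hedged sketch (the endpoint name is one I made up, and you would substitute the ARN of the role from your CloudFormation stack):

# create the dev endpoint programmatically (endpoint name and role ARN are placeholders)
import boto3

glue = boto3.client('glue')
glue.create_dev_endpoint(
    EndpointName='tutorial-dev-endpoint',
    RoleArn='arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_ROLE_NAME',
    NumberOfNodes=2
)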

When the “Provisioning status” of your endpoint has changed from “PROVISIONING” to “READY”, you are ready to add a notebook to this endpoint.

  1. Click the checkbox for your dev endpoint.
  2. Click on “Actions” and then click “Create SageMaker notebook” in the drop-down menu.
  3. Give your notebook a name.
  4. Attach your notebook to the dev endpoint that you created.
  5. Select “Choose an existing IAM role”, and select the role you deployed via your CloudFormation template.
  6. Select “Create notebook”.

Once the status for your notebook changes from “Starting” to “Ready”, you are ready to launch your notebook. Click the checkbox for your notebook, and then click “Open notebook”. Once your notebook has opened, click on “New”, and then click “Sparkmagic (PySpark)”. This will create a notebook that supports PySpark (which is of course overkill for this dataset, but it is a fun example).

Now you should see your familiar notebook environment with an empty cell. Since dev endpoint notebooks are integrated with Glue, we have the same capabilities that we would have from within a Glue ETL job. For example, we can create a GlueContext, and read our dataset in using the data catalog (make sure to place your own table name in the code below).

# import needed libraries
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# create GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())

# read in data using catalog
df = glueContext.create_dynamic_frame_from_catalog(
        database='tutorial-data',
        table_name='PLACE_YOUR_TABLE_NAME_HERE'
     )

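Before going further, it is worth a quick sanity check that the read worked; DynamicFrames expose printSchema and count just like DataFrames do.

# inspect the schema the crawler inferred and count the records
df.printSchema()
print(df.count())
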
We now have a DynamicFrame, which is similar to a PySpark DataFrame. We can convert it to a DataFrame using the toDF method:

# convert DynamicFrame to PySpark DataFrame
df = df.toDF()

Now we have a PySpark DataFrame on which we can test different manipulations for our ETL job. For example, suppose we want the average sepal length for the setosa and versicolor species. We can write code for this and inspect whether it worked by printing out the resulting DataFrame.

# find average sepal length for setosa and versicolor species
df.filter((df.species == 'setosa') | (df.species == 'versicolor')) \
    .select(['sepal_length', 'species']) \
    .groupBy('species') \
    .agg({'sepal_length':'mean'}) \
    .show()
+----------+-----------------+
|   species|avg(sepal_length)|
+----------+-----------------+
|versicolor|            5.936|
|    setosa|5.005999999999999|
+----------+-----------------+

We get immediate feedback by running our PySpark code and printing out the resulting DataFrame. We don’t need to go through the process of running an ETL job, digging through logs, and then repeating the process when we add in our next data manipulation. To add our next data manipulation step, we simply add a cell and write our next piece of code (see the PySpark DataFrame docs for some cool methods to try).
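
For instance, one thing to try in a new cell is deriving a column with withColumn (assuming the crawler named the width columns the same way it named the length columns, e.g. sepal_width):

# add a derived column and preview a few rows
df.withColumn('sepal_area', df.sepal_length * df.sepal_width) \
    .select(['species', 'sepal_length', 'sepal_width', 'sepal_area']) \
    .show(5)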

Again, we can treat this notebook as a Glue ETL script, so when we are satisfied with our data manipulations, we can write the final dataset to a bucket. For example, here we convert our DataFrame back to a DynamicFrame, and then write that to a CSV file in our output bucket (make sure to insert your own bucket name).

# convert DataFrame back to DynamicFrame
df = DynamicFrame.fromDF(df, glueContext, 'final_frame')

# write frame to CSV
glueContext.write_dynamic_frame_from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": "s3://INSERT_YOUR_OUTPUT_BUCKET_PATH_HERE"},
    format="csv"
)

We can then verify that this code worked by checking our bucket.
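
Checking the bucket in the S3 console works fine, or you can list the output objects from a notebook cell with boto3 (the bucket name is a placeholder):

# list the files written to the output bucket (bucket name is a placeholder)
import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='YOUR_OUTPUT_BUCKET_NAME')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])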

Once we are satisfied with the code in our notebook, we can easily convert it to a script and add it to an actual Glue ETL job.
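
The main addition when moving from the notebook to a job script is the standard Glue job bootstrapping around your transformations. A minimal sketch of that wrapper, with the notebook code going in the middle:

# standard Glue job boilerplate to wrap the notebook code
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... the read, transform, and write steps from the notebook go here ...

job.commit()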

Conclusion

That is the end of our basic tutorial on using dev endpoint notebooks for Glue ETL development. As you can see, dev endpoint notebooks are easy to set up and can save you a lot of time when creating your Glue ETL scripts. For more information, see the AWS Glue documentation on dev endpoints.
