Data Catalogs don't store any data; they store metadata (schemas, formats, locations) that points to where the data actually lives in the target data store.
Crawlers scan the source data stores, infer schemas, and keep the Data Catalog up to date.
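As a rough illustration of that split, a minimal boto3 sketch (the crawler name is a made-up placeholder; the database and table names mirror the catalog example further down):

import boto3

glue = boto3.client("glue")

# Re-scan the source so the Data Catalog metadata stays current
glue.start_crawler(Name="datalake-crawler")

# Fetch the metadata the crawler produced; no rows come back,
# only schema, format, and the location of the underlying data
table = glue.get_table(DatabaseName="datalake", Name="medicare")
print(table["Table"]["StorageDescriptor"]["Location"])  # e.g. an s3:// path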
Workflows can be written as infrastructure-as-code (see the CDK sketch after this list) using:
- CloudFormation
- Cloud Development Kit (CDK)
- Terraform
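A minimal CDK sketch in Python of how a Glue job could be declared; the stack name, role, job name, and script location are illustrative assumptions, not taken from the notes above:

from constructs import Construct
from aws_cdk import Stack, aws_glue as glue, aws_iam as iam

class GlueEtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Role the Glue job assumes at run time
        job_role = iam.Role(
            self, "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
        )

        # L1 (CloudFormation-level) Glue job definition
        glue.CfnJob(
            self, "MedicareEtlJob",
            name="medicare-etl",  # hypothetical job name
            role=job_role.role_arn,
            glue_version="4.0",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",  # Spark ETL job type
                python_version="3",
                script_location="s3://sample-bucket/scripts/medicare_etl.py",
            ),
        )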
Transform logic can be written using:
- Spark
- Spark Streaming (for streaming data)
- Python shell (plain Python, for lightweight jobs that don't need Spark)
*Apache Spark is a unified analytics engine for large-scale data processing.
When working with PySpark, there are three kinds of data structures:
- DataFrame (`df_` naming convention)
- DynamicFrame (`dy_` naming convention, Glue specific)
- Resilient Distributed Datasets (RDD)
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
# Initialize the connection to the cluster
# glueContext is used to read data from data stores
glueContext = GlueContext(SparkContext.getOrCreate())
# S3 location for output
output_dir = "s3://sample-bucket/output-dir/medicare_parquet"
# Read data into a DynamicFrame using the Data Catalog metadata
dy_medicare = glueContext.create_dynamic_frame.from_catalog(
    database="datalake",
    table_name="medicare"
)
# We can use the lower-level DataFrame to rename a column:
# Convert to data frame and rename a column
df_renamed = dy_medicare.toDF().withColumnRenamed("GivenName", "Name")
# Convert back to a dynamic frame
dy_output = DynamicFrame.fromDF(df_renamed, glueContext, "MedicareData")
# Or we can rename the column directly through the higher-level DynamicFrame API
dy_output = dy_medicare.rename_field("GivenName", "Name")
# Write it out in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dy_output,
    connection_type="s3",
    connection_options={
        "path": output_dir
    },
    format="parquet"
)
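When this logic runs as an actual Glue job (rather than interactively), it is normally wrapped in the standard job boilerplate so the run can be tracked and job bookmarks can advance; a minimal sketch, assuming the job name is passed in by Glue as an argument:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

# JOB_NAME is supplied by Glue when the job is launched
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... read, transform, and write as in the example above ...

# Signal a successful run (also advances job bookmarks when enabled)
job.commit()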