Data Catalogs don't store any data; they store metadata (schemas, formats, locations) that points to where the data actually lives in the target data store.
Crawlers scan the source data stores, infer schemas, and keep the Data Catalog up to date.
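As a rough illustration of that split, a minimal boto3 sketch (the crawler name is a made-up placeholder; the database and table names mirror the catalog example further down):

import boto3

glue = boto3.client("glue")

# Re-scan the source so the Data Catalog metadata stays current
glue.start_crawler(Name="datalake-crawler")

# Fetch the metadata the crawler produced; no rows come back,
# only schema, format, and the location of the underlying data
table = glue.get_table(DatabaseName="datalake", Name="medicare")
print(table["Table"]["StorageDescriptor"]["Location"])  # e.g. an s3:// path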
Workflows can be written as infrastructure-as-code (see the CDK sketch after this list) using:
- CloudFormation
- Cloud Development Kit (CDK)
- Terraform
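A minimal CDK sketch in Python of how a Glue job could be declared; the stack name, role, job name, and script location are illustrative assumptions, not taken from the notes above:

from constructs import Construct
from aws_cdk import Stack, aws_glue as glue, aws_iam as iam

class GlueEtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Role the Glue job assumes at run time
        job_role = iam.Role(
            self, "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
        )

        # L1 (CloudFormation-level) Glue job definition
        glue.CfnJob(
            self, "MedicareEtlJob",
            name="medicare-etl",  # hypothetical job name
            role=job_role.role_arn,
            glue_version="4.0",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",  # Spark ETL job type
                python_version="3",
                script_location="s3://sample-bucket/scripts/medicare_etl.py",
            ),
        )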
Transform logic can be written using:
- Spark
- Spark Streaming (for streaming data)
- Python shell (plain Python, for lightweight jobs that don't need Spark)
*Apache Spark is a unified analytics engine for large-scale data processing.
When working with PySpark, there are three kinds of data structures:
- DataFrame (`df_` naming convention)
- DynamicFrame (`dy_` naming convention, Glue specific)
- Resilient Distributed Datasets (RDD)
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
# Initialize the connection to the cluster
# glueContext is used to read data from data stores
glueContext = GlueContext(SparkContext.getOrCreate())
# S3 location for output
output_dir = "s3://sample-bucket/output-dir/medicare_parquet"
# Read data into a DynamicFrame using the Data Catalog metadata
dy_medicare = glueContext.create_dynamic_frame.from_catalog(
    database="datalake",
    table_name="medicare"
)
# We can use the lower-level DataFrame to rename a column:
# Convert to data frame and rename a column
df_renamed = dy_medicare.toDF().withColumnRenamed("GivenName", "Name")
# Convert back to a dynamic frame
dy_output = DynamicFrame.fromDF(df_renamed, glueContext, "MedicareData")
# Or we can rename the column directly through the higher-level DynamicFrame API
dy_output = dy_medicare.rename_field("GivenName", "Name")
# Write it out in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dy_output,
    connection_type="s3",
    connection_options={
        "path": output_dir
    },
    format="parquet"
)
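When this logic runs as an actual Glue job (rather than interactively), it is normally wrapped in the standard job boilerplate so the run can be tracked and job bookmarks can advance; a minimal sketch, assuming the job name is passed in by Glue as an argument:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

# JOB_NAME is supplied by Glue when the job is launched
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... read, transform, and write as in the example above ...

# Signal a successful run (also advances job bookmarks when enabled)
job.commit()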