Stream Processing using Confluent with AWS Glue Streaming

AWS Glue Streaming uses Glue Connections to connect to different sources and targets. One of these connections is Kafka. However, this connection does not support SASL/PLAIN, a common authentication mechanism used by vanilla Kafka and by Confluent. This limitation means that Glue Streaming does not natively support Confluent out of the box.

An alternative solution is to use native Spark APIs to integrate AWS Glue Streaming with Confluent. This repository provides a simple demo and boilerplate code for Glue Streaming: it reads data from a Confluent Cloud topic and writes that data into another topic, with the only transformation being the removal of a specific column. This code serves as a foundation that you can build on by adding your own custom transformations as needed.
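For orientation, below is a minimal sketch of what the native Spark approach looks like when reading from Confluent Cloud. The bootstrap server, API key, and secret are placeholders; the actual job shipped with this demo lives in Terraform/streaming.py.

from pyspark.sql import SparkSession

# Minimal sketch: read a Confluent Cloud topic with the native Spark
# Structured Streaming Kafka source. Broker address and credentials
# below are placeholders.
spark = SparkSession.builder.appName("glue-confluent-demo").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVER>:9092")
    .option("subscribe", "source-topic")
    # Confluent Cloud authenticates with SASL/PLAIN over TLS
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    .option("startingOffsets", "earliest")
    .load()
)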

We use Terraform to deploy all the necessary resources. The script deploys the following:

  1. Confluent Cloud environment
  2. Confluent Cloud Cluster
  3. Confluent Cloud source and target topics
  4. API Keys with read/write permissions on the source and target topics
  5. Datagen Connector to generate mock data for the demo
  6. Glue Streaming Python code
  7. S3 Bucket to upload the code
Repository structure:

├── assets                                <-- Directory holding demo assets
│   └── architecture.png                  <-- Demo architecture diagram
├── Terraform                             <-- Demo Terraform script and artifacts
│   ├── aws.tf                            <-- Terraform for AWS resources
│   ├── main.tf                           <-- Terraform for Confluent resources
│   ├── outputs.tf                        <-- Terraform outputs file
│   ├── providors.tf                      <-- Terraform providers file
│   ├── streaming.py                      <-- Glue Streaming code
│   └── variables.tf                      <-- Terraform variables file
└── README.md

Architecture

The demo uses Glue Streaming to read raw messages generated by the Datagen connector. It removes the itemid field and subsequently publishes the modified data back to Confluent.

[Architecture diagram: assets/architecture.png]

Note: This is a basic example transformation. You can add any transformation supported by Spark Structured Streaming.
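As a companion to the read sketch above, this is roughly what the transformation and write-back look like. The record schema, S3 bucket, and credentials here are placeholder assumptions; Terraform/streaming.py is the authoritative version.

from pyspark.sql.functions import col, from_json, struct, to_json
from pyspark.sql.types import StringType, StructField, StructType

# Assumed shape of the Datagen records; only `itemid` matters here.
# Swap in the real schema for your connector.
schema = StructType([
    StructField("orderid", StringType()),
    StructField("itemid", StringType()),
    StructField("address", StringType()),
])

transformed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("v"))
    .select("v.*")
    .drop("itemid")  # the demo's only transformation
    .select(to_json(struct("*")).alias("value"))
)

query = (
    transformed.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVER>:9092")
    .option("topic", "target-topic")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    # Kafka sink requires a checkpoint location; bucket name is a placeholder
    .option("checkpointLocation", "s3://<BUCKET>/checkpoints/")
    .start()
)
query.awaitTermination()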

General Requirements

  - Terraform CLI installed on your local machine
  - An AWS account with credentials configured for the Terraform AWS provider
  - A Confluent Cloud account with Cloud API keys for the Terraform Confluent provider

Deploy Demo

  1. Clone the repo onto your local development machine using git clone <repo url>.

  2. Change into the Terraform directory of the demo repository:

cd stream-processing-with-confluent-and-glue-streaming/Terraform

  3. Use the Terraform CLI to deploy the solution:

terraform init

terraform plan

terraform apply

Post deployment

  1. Go to the Confluent Cloud Topics UI, then choose the newly created environment and cluster.

  2. Browse to source-topic and view the raw messages.

    [Screenshot: source message]

  3. Navigate to target-topic and view the post-processed messages. Notice that the output messages are missing the one column dropped by the Glue Streaming job. (You can also verify this from a terminal; see the consumer sketch after this list.)

    [Screenshot: output message]

  4. Play with the Glue Streaming code to add any transformations you need.
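If you prefer a terminal to the UI for steps 2 and 3, a small consumer script works too. This sketch uses the confluent-kafka Python client (pip install confluent-kafka); the bootstrap server and API key/secret are placeholders, so take the real values from the Confluent Cloud UI or the Terraform outputs.

from confluent_kafka import Consumer

# Placeholder connection details; substitute your cluster's values.
conf = {
    "bootstrap.servers": "<BOOTSTRAP_SERVER>:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "demo-verifier",
    "auto.offset.reset": "earliest",
}

consumer = Consumer(conf)
consumer.subscribe(["target-topic"])  # or "source-topic" to compare
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()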

Clean-up

The great thing about cloud resources is that you can spin them up and tear them down with a few commands. Once you are finished with this demo, remember to destroy the resources you created to avoid incurring charges; you can always spin them up again later.

To destroy all the resources created for the demo, run:

terraform destroy
