Data Engineering: Speech-to-text data collection with Kafka, Airflow, and Spark

Overview

Business Need

  • In building speech-to-text transcription models, large and diverse training sets are valuable because they enable us to build and improve models with the capacity to transform people's lives.
  • Our client, 10 Academy, is a not-for-profit community-owned organization that has recognized the value of large data sets for speech-to-text models.

Objective of the project

  • Design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive the corresponding audio file (a minimal producer/consumer sketch follows this list).
  • The tool we create should be deployed to post text to and collect audio files from the data lake, apply transformations in a distributed manner, and load the result into a warehouse in a format suitable for training a speech-to-text model (see the Spark sketch below).
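
As a rough illustration of the first objective, the snippet below sketches how a sentence could be posted to the cluster and an audio submission consumed back. It is a minimal sketch using the kafka-python client; the broker address, the topic names (text_corpus, audio_submissions), and the message fields are assumptions for illustration, not the project's actual configuration.

# Minimal sketch, assuming a local Kafka broker and hypothetical topic names.
import json
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP_SERVERS = ["localhost:9092"]  # assumed broker address


def post_sentence(sentence: str) -> None:
    """Publish a sentence for contributors to read aloud."""
    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP_SERVERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("text_corpus", {"sentence": sentence})
    producer.flush()


def receive_audio_submissions() -> None:
    """Consume audio submissions (e.g. references to files landed in the data lake)."""
    consumer = KafkaConsumer(
        "audio_submissions",
        bootstrap_servers=BOOTSTRAP_SERVERS,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        record = message.value
        print(f"Received audio for sentence {record.get('sentence_id')}: {record.get('audio_path')}")


if __name__ == "__main__":
    post_sentence("The quick brown fox jumps over the lazy dog.")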
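
For the second objective, the following PySpark sketch shows the general shape of a distributed transformation from the data lake into a warehouse-ready table. The storage paths, schema fields, and cleaning rules here are hypothetical stand-ins, not the project's actual pipeline.

# Minimal PySpark sketch, assuming hypothetical data-lake/warehouse paths and schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("speech_corpus_transform").getOrCreate()

# Read raw submission metadata from the data lake (assumed JSON layout).
raw = spark.read.json("s3a://data-lake/raw/audio_submissions/")

# Basic cleaning: keep submissions that have both a transcript and an audio path,
# and normalise the transcript text (assumed fields: sentence, audio_path, duration_ms).
clean = (
    raw.filter(F.col("audio_path").isNotNull() & F.col("sentence").isNotNull())
       .withColumn("sentence", F.lower(F.trim(F.col("sentence"))))
       .withColumn("duration_s", F.col("duration_ms") / 1000.0)
)

# Write the training-ready table to the warehouse in Parquet.
clean.write.mode("overwrite").parquet("s3a://warehouse/speech_to_text/training/")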

About

A tool that publishes and receives text and audio files to and from a data lake, applies transformations, and loads the data in a format suitable for training a speech-to-text model.
