Analyzing Toronto public transit delay data to identify delay hotspots and root causes, by building an ETL (Extract, Transform, Load) pipeline and creating Looker Studio dashboards to surface insights.
Data source - TTC bus, subway, and streetcar delay data: Open Data Catalogue - City of Toronto Open Data Portal
1. Subway
Field Name | Description | Example |
---|---|---|
Date | Date (YYYY/MM/DD) | 12/31/2016 |
Time | Time (24h clock) | 1:59 |
Day | Name of the day of the week | Saturday |
Station | TTC subway station name | Rosedale Station |
Code | TTC delay code | MUIS |
Min Delay | Delay (in minutes) to subway service | 5 |
Min Gap | Time length (in minutes) between trains | 9 |
Bound | Direction of train dependent on the line | N |
Line | TTC subway line i.e. YU, BD, SHP, and SRT | YU |
Vehicle | TTC train number | 5961 |
2. Bus & Streetcar
Field Name | Description | Example |
---|---|---|
Report Date | The date (YYYY/MM/DD) when the delay-causing incident occurred | 6/20/2017 |
Route | The number of the bus route | 51 |
Time | The time (hh:mm:ss AM/PM) when the delay-causing incident occurred | 12:35:00 AM |
Day | The name of the day | Monday |
Location | The location of the delay-causing incident | York Mills Station |
Incident | The description of the delay-causing incident | Mechanical |
Min Delay | The delay, in minutes, to the schedule for the following bus | 10 |
Min Gap | The total scheduled time, in minutes, from the bus ahead of the following bus | 20 |
Direction | The direction of the bus route, where B, b, or BW indicates both ways (on an east-west route, it includes both east and west); NB - northbound, SB - southbound, EB - eastbound, WB - westbound | N |
Vehicle | Vehicle number | 1057 |
- Used Terraform to set up the infrastructure
- Used a Google Cloud VM instance to run everything
- Used GCS (Google Cloud Storage) as our data lake and BigQuery as our data warehouse
- Used Spark to process, transform, and clean the data
- Used Docker to run Airflow inside the VM
- Used Airflow to automate this entire process and run it at scheduled intervals to keep the data up to date.
Terraform is run from the local machine to set up the VM instance on Google Cloud Platform, which is then used to perform all three tasks (Extract, Transform, Load).
Note
Data is processed in batches using Spark (scheduled yearly with Airflow)
The frequency could be increased to monthly, but that is not cost-effective
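For illustration, here is a minimal PySpark sketch of the kind of batch transform this step runs; the bucket name, BigQuery table, and column handling are placeholder assumptions, not the project's actual spark_job.py.

```python
# Hypothetical sketch of a yearly Spark batch step (not the project's actual spark_job.py).
# Bucket, table, and column names are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ttc-delay-transform").getOrCreate()

# Read raw bus-delay parquet files from the GCS data lake.
raw = spark.read.parquet("gs://ttc-delay-data-lake/raw/bus_delays.parquet")

# Basic cleaning: normalise column names, parse dates, drop rows with no delay value.
clean = (
    raw.withColumnRenamed("Report Date", "report_date")
       .withColumnRenamed("Min Delay", "min_delay")
       .withColumn("report_date", F.to_date("report_date", "M/d/yyyy"))
       .filter(F.col("min_delay").isNotNull())
)

# Load the cleaned data into BigQuery (requires the spark-bigquery connector on the cluster).
(
    clean.write.format("bigquery")
         .option("table", "ttc_delays.bus_delays")
         .option("temporaryGcsBucket", "ttc-delay-data-lake")
         .mode("overwrite")
         .save()
)
```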
Airflow DAGs (Directed Acyclic Graphs)
- src_to_gcs_dag.py downloads data from the Toronto open data portal, converts it to parquet, loads it into the GCS bucket, and removes it from the VM's local file system once it is available in the cloud (see the first sketch after this list).
- dataproc_job_dag.py uploads the main Python file spark_job.py to the GCS bucket (so Dataproc can fetch it), sets the service account that authorizes the Airflow container to work with GCP services, creates a cluster, runs the Spark job, and then deletes the cluster to avoid idle running costs (see the second sketch after this list).
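A minimal sketch of what src_to_gcs_dag.py could look like, assuming the Airflow Google provider is installed; the dataset URL, bucket name, and local paths are placeholder assumptions, not the project's actual values.

```python
# Hypothetical sketch of src_to_gcs_dag.py: download -> parquet -> GCS -> local cleanup.
# The URL, bucket, and file paths below are placeholders for illustration only.
import os
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

DATA_URL = "https://open.toronto.ca/..."          # placeholder dataset URL
LOCAL_CSV = "/tmp/ttc_delays.csv"
LOCAL_PARQUET = "/tmp/ttc_delays.parquet"
BUCKET = "ttc-delay-data-lake"                    # placeholder bucket name


def download_dataset():
    # Pull the raw delay data from the open data portal.
    pd.read_csv(DATA_URL).to_csv(LOCAL_CSV, index=False)


def convert_to_parquet():
    # Convert the CSV to parquet before loading it into the data lake.
    pd.read_csv(LOCAL_CSV).to_parquet(LOCAL_PARQUET)


def cleanup_local_files():
    # Remove local copies once the data is in GCS.
    for path in (LOCAL_CSV, LOCAL_PARQUET):
        if os.path.exists(path):
            os.remove(path)


with DAG(
    dag_id="src_to_gcs_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@yearly",   # yearly batch, per the note above
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_dataset", python_callable=download_dataset)
    to_parquet = PythonOperator(task_id="convert_to_parquet", python_callable=convert_to_parquet)
    upload = LocalFilesystemToGCSOperator(
        task_id="upload_to_gcs",
        src=LOCAL_PARQUET,
        dst="raw/ttc_delays.parquet",
        bucket=BUCKET,
    )
    cleanup = PythonOperator(task_id="cleanup_local_files", python_callable=cleanup_local_files)

    download >> to_parquet >> upload >> cleanup
```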
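Likewise, a hedged sketch of dataproc_job_dag.py using the standard Dataproc operators from the Airflow Google provider; the project ID, region, cluster name, machine types, and bucket are placeholders, and the real DAG also configures the service account as described above.

```python
# Hypothetical sketch of dataproc_job_dag.py: upload spark_job.py -> create cluster ->
# submit Spark job -> delete cluster. All IDs and names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

PROJECT_ID = "my-gcp-project"        # placeholder
REGION = "us-central1"               # placeholder
CLUSTER_NAME = "ttc-spark-cluster"   # placeholder
BUCKET = "ttc-delay-data-lake"       # placeholder

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/code/spark_job.py"},
}

with DAG(
    dag_id="dataproc_job_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@yearly",
    catchup=False,
) as dag:
    upload_spark_job = LocalFilesystemToGCSOperator(
        task_id="upload_spark_job",
        src="/opt/airflow/dags/spark_job.py",   # placeholder local path inside the container
        dst="code/spark_job.py",
        bucket=BUCKET,
    )
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )
    submit_job = DataprocSubmitJobOperator(
        task_id="submit_spark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )
    # Delete the cluster even if the job fails, to avoid idle running costs.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",
    )

    upload_spark_job >> create_cluster >> submit_job >> delete_cluster
```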
There are no delay records for July to December 2023, as this image was taken in August 2023.