UK Police Data Ingestion
The aim of the project is to ingest data from UK police data API to Google cloud Storage -> BigQuery -> Data studio.
To be more precise, data is coming from this API about stop and searches by force:
https://data.police.uk/docs/method/stops-force/
The data from 'metropolitan' police force is used.
These Technologies are used for this Project:
- Python
- Docker Compose
- Airflow
- Google cloud Storage
- Google cloud BigQuery
- Google cloud Data Studio
- Terraform
- PySpark
- Jupyter Notebooks
As a description of the Pipeline, the DAGS tasks are presented here:
Setup your account as described here
Update the terraform variables.tf
variable "project" {
description = "Your GCP Project ID"
}
Update the docker-compose.yaml
file with your project id and google storage bucket
GCP_PROJECT_ID: 'de-bootcamp-339509'
GCP_GCS_BUCKET: 'dtc_data_lake_de-bootcamp-339509'
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
docker-compose up airflow-init
docker-compose up
gcloud auth application-default login
terraform init
terraform plan
terraform apply
There are two DAGS in airflow
The data_ingestion
dag is running monthly to collect the data
The gcs_to_bq_dag
is running once, when the data_ingestion
DAG has finished
A partioned table is created in Big Query.
The partition is based on datetime
to optimise performance when querying based on datetime
This is the Query needed to produce the dashboard
SELECT
IFNULL(NULLIF(age_range,''),'N/A') AS age_range,
IFNULL(NULLIF(outcome,''),'N/A') AS outcome,
IFNULL(NULLIF(gender,''),'N/A') AS gender,
datetime,
IFNULL(NULLIF(officer_defined_ethnicity,''),'N/A') AS officer_defined_ethnicity,
IFNULL(NULLIF(type,''),'N/A') AS type,
IFNULL(NULLIF(object_of_search,''),'N/A') AS object_of_search,
latitude,
longitude,
lat_long
FROM 'de-bootcamp-339509.stop_and_search.stop_and_search_partitioned_table'
More queries are in this file queries.sql
Link to Data Studio report
An analysis of sample data is presented in this jupyter notebook file search-force.ipynb