The MovieLens dataset contains movie ratings from a huge number of users, which lets us build a ratings distribution.
This project uses:
- Docker
- Airflow
- GCP
- PySpark
- Create an account on GCP
- Set up a new project and write down the Project ID. If you don't want to change anything in the source code, use `zoomcamp-de-project` as the project name
- Set up a service account for this project and download the JSON authentication key file. Put it in `/.google/credentials/google_credentials.json`
- Assign the following IAM role to the service account: Owner. Unfortunately I didn't have enough time to go deeper into the GCP role model; this project creates a GCP cluster, and that is possible with this role
- Create a bucket in Cloud Storage named `movielens-dataset-de` (you can use another name, but make sure you also change it in the source code)
- Go to BigQuery and create a dataset named `movielens_dataset` for the `zoomcamp-de-project` project
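If you prefer the command line, the bucket and BigQuery dataset from the steps above can also be created with the Cloud SDK. A minimal sketch; the region is an assumption, so pick whatever suits your setup:

```shell
# Create the Cloud Storage bucket (the region europe-west1 is an assumption)
gsutil mb -p zoomcamp-de-project -l europe-west1 gs://movielens-dataset-de

# Create the BigQuery dataset in the same project
bq mk --dataset zoomcamp-de-project:movielens_dataset
```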
- Go to the `/airflow` folder in bash, then execute the following commands:

```shell
docker-compose build
docker-compose up airflow-init
docker-compose up
```

- Browse to `localhost:8080`. Username and password are both `airflow`
- Trigger `movielens_dag`
- This DAG downloads the MovieLens dataset from https://files.grouplens.org/datasets/movielens/ml-latest.zip and loads it into your BigQuery project
- For `tags.csv` and `ratings.csv` it creates partitioned and clustered tables: partitioned by the `timestamp` column truncated to month, and clustered by `userId` and `movieId`
- `links.csv`, `movies.csv`, and `genome-scores.csv` are clustered by `movieId`, which makes joins with the `ratings` BQ tables more efficient
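The partitioning and clustering scheme above can be expressed as BigQuery DDL. A minimal sketch: the project and dataset names follow this README, but the `ratings_external` staging table is a hypothetical name for wherever the CSV data lands first:

```python
# Sketch of the DDL behind the partitioned/clustered ratings table.
# PROJECT and DATASET follow the README; the source table name is an assumption.
PROJECT = "zoomcamp-de-project"
DATASET = "movielens_dataset"

ratings_ddl = f"""
CREATE OR REPLACE TABLE `{PROJECT}.{DATASET}.ratings`
PARTITION BY TIMESTAMP_TRUNC(timestamp, MONTH)  -- monthly partitions
CLUSTER BY userId, movieId                      -- cluster keys for joins
AS SELECT * FROM `{PROJECT}.{DATASET}.ratings_external`
"""

print(ratings_ddl)
```

Monthly partitions keep the partition count low while still letting per-month queries (like the dashboard tiles) prune data.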
- Trigger `batch_processing_dags`
- This DAG creates the `movies_ratings` table in BQ by submitting a PySpark job to a cluster that is created and then dropped while the DAG runs
- A movie's rating is the average of all ratings for that movie present in the dataset
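The `movies_ratings` logic is just a per-movie average. The actual job does this with PySpark's `groupBy(...).avg(...)`; a plain-Python sketch of the same computation (the sample rows are made up):

```python
from collections import defaultdict

def average_ratings(rows):
    """Average all ratings per movieId, mirroring what the PySpark job computes."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for movie_id, rating in rows:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    return {movie_id: s / c for movie_id, (s, c) in totals.items()}

# Hypothetical sample rows: (movieId, rating)
sample = [(1, 4.0), (1, 5.0), (2, 3.0)]
print(average_ratings(sample))  # {1: 4.5, 2: 3.0}
```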
There are two tiles on the dashboard:
- Distribution of movie ratings
- Count of rated movies and of users who gave ratings, per month
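The first tile is essentially a histogram over rating values. A minimal sketch of that count with `collections.Counter` (the sample ratings are made up, on MovieLens's 0.5-5.0 scale):

```python
from collections import Counter

# Hypothetical rating values pulled from the ratings table
ratings = [4.0, 4.0, 3.5, 5.0, 4.0, 2.5]

distribution = Counter(ratings)  # rating value -> how many times it was given
for value, count in sorted(distribution.items()):
    print(f"{value:>4}: {count}")
```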