The MovieLens dataset contains movie ratings from a huge number of users, which lets us build a ratings distribution.
This project uses:
- Docker
- Airflow
- GCP
- PySpark
- Create an account on GCP
- Set up a new project and write down the Project ID. If you don't want to change anything in the source code, use `zoomcamp-de-project` as the project name
- Set up a service account for this project and download the JSON authentication key file. Put it in `/.google/credentials/google_credentials.json`
- Assign the following IAM role to the service account: Owner. Unfortunately I didn't have enough time to go deeper into the GCP role model; this project creates a GCP cluster, and that is possible with this role
- Create a bucket in Cloud Storage named `movielens-dataset-de` (you can use another name, but make sure you also change it in the source code)
- Go to BigQuery and create a dataset named `movielens_dataset` for the `zoomcamp-de-project` project
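If you prefer the command line, the bucket and BigQuery dataset from the steps above can also be created with the Cloud SDK. A minimal sketch; the region is an assumption, so pick whatever suits your setup:

```shell
# Create the Cloud Storage bucket (the region europe-west1 is an assumption)
gsutil mb -p zoomcamp-de-project -l europe-west1 gs://movielens-dataset-de

# Create the BigQuery dataset in the same project
bq mk --dataset zoomcamp-de-project:movielens_dataset
```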
- Go to the `/airflow` folder in bash, then execute the following commands:

```shell
docker-compose build
docker-compose up airflow-init
docker-compose up
```

- Browse to `localhost:8080`. Username and password are both `airflow`
- Trigger `movielens_dag`
- This DAG downloads the MovieLens dataset from https://files.grouplens.org/datasets/movielens/ml-latest.zip and loads it into your BigQuery project
- For `tags.csv` and `ratings.csv` it creates partitioned and clustered tables: partitioned by the `timestamp` column truncated to month, and clustered by `userId` and `movieId`
- `links.csv`, `movies.csv`, and `genome-scores.csv` are clustered by `movieId`, which makes joins with the `ratings` BQ tables more efficient
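The partitioning and clustering scheme above can be expressed as BigQuery DDL. A minimal sketch: the project and dataset names follow this README, but the `ratings_external` staging table is a hypothetical name for wherever the CSV data lands first:

```python
# Sketch of the DDL behind the partitioned/clustered ratings table.
# PROJECT and DATASET follow the README; the source table name is an assumption.
PROJECT = "zoomcamp-de-project"
DATASET = "movielens_dataset"

ratings_ddl = f"""
CREATE OR REPLACE TABLE `{PROJECT}.{DATASET}.ratings`
PARTITION BY TIMESTAMP_TRUNC(timestamp, MONTH)  -- monthly partitions
CLUSTER BY userId, movieId                      -- cluster keys for joins
AS SELECT * FROM `{PROJECT}.{DATASET}.ratings_external`
"""

print(ratings_ddl)
```

Monthly partitions keep the partition count low while still letting per-month queries (like the dashboard tiles) prune data.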
- Trigger `batch_processing_dags`
- This DAG creates the `movies_ratings` table in BQ by submitting a PySpark job to a cluster that is created and then dropped while the DAG runs
- A movie's rating is the average of all ratings for that movie present in the dataset
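The `movies_ratings` logic is just a per-movie average. The actual job does this with PySpark's `groupBy(...).avg(...)`; a plain-Python sketch of the same computation (the sample rows are made up):

```python
from collections import defaultdict

def average_ratings(rows):
    """Average all ratings per movieId, mirroring what the PySpark job computes."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for movie_id, rating in rows:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    return {movie_id: s / c for movie_id, (s, c) in totals.items()}

# Hypothetical sample rows: (movieId, rating)
sample = [(1, 4.0), (1, 5.0), (2, 3.0)]
print(average_ratings(sample))  # {1: 4.5, 2: 3.0}
```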
There are two tiles on the dashboard:
- Distribution of movie ratings
- Count of rated movies and of users who gave ratings, per month
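The first tile is essentially a histogram over rating values. A minimal sketch of that count with `collections.Counter` (the sample ratings are made up, on MovieLens's 0.5-5.0 scale):

```python
from collections import Counter

# Hypothetical rating values pulled from the ratings table
ratings = [4.0, 4.0, 3.5, 5.0, 4.0, 2.5]

distribution = Counter(ratings)  # rating value -> how many times it was given
for value, count in sorted(distribution.items()):
    print(f"{value:>4}: {count}")
```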