The primary objectives of this project are as follows:
- Perform data modeling using PostgreSQL.
- Create an ETL (Extract, Transform, Load) pipeline using Python within a Jupyter Notebook.
- Define fact and dimension tables to establish a star schema data model that improves query performance for song play analysis (a sketch of the schema follows this list).
- Implement an ETL pipeline to transfer data from JSON files into PostgreSQL tables using Python and SQL.
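To make the star schema concrete, here is a rough sketch of what two of the table definitions could look like, written as SQL strings in the style this project keeps in `queries.py`. The column names and types are assumptions for illustration; the authoritative definitions live in `queries.py`.

```python
# Hypothetical table definitions illustrating the star schema; the actual
# columns and types are defined in queries.py and may differ.

# Fact table: one row per song play, with keys into each dimension.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,  -- references the time dimension
    user_id     INT NOT NULL,        -- references the users dimension
    level       VARCHAR,
    song_id     VARCHAR,             -- references the songs dimension
    artist_id   VARCHAR,             -- references the artists dimension
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# One of the dimension tables: descriptive attributes for each user.
user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""
```

Analytical queries then join the narrow fact table to the small dimension tables, which is what makes song play analysis fast.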
Sparkify, a burgeoning startup, is keen to gain insights from the data they've gathered regarding songs and user activities on their new music streaming platform. The analytics team is particularly interested in understanding user song preferences. Presently, they lack a convenient means to query their data, which is stored in JSON logs for user activity and JSON metadata for the songs available on the platform.
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": 35.1495,
    "artist_longitude": -90.0490,
    "artist_location": "Memphis, Tennessee, USA",
    "artist_name": "Johnny Cash",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Ring of Fire",
    "duration": 166.32036,
    "year": 1963
}
```
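In `etl.py`, a file like this can be read into pandas and split into a song record and an artist record. The sketch below is a minimal version of that step; the exact column selections are inferred from the fields shown above.

```python
import pandas as pd

# Each song file contains a single JSON object, so the result is a one-row frame.
df = pd.read_json("song_data/A/A/B/TRAABJL12903CDCF1A.json", lines=True)

# Fields destined for the songs dimension table, in insert order.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Fields destined for the artists dimension table.
artist_data = df[[
    "artist_id", "artist_name", "artist_location",
    "artist_latitude", "artist_longitude",
]].values[0].tolist()
```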
The second dataset consists of JSON log files generated by an event simulator based on configurable settings. These files mimic user activity on a music streaming application and are partitioned chronologically by year and month.
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like.
```json
{
    "artist": "Beyoncé",
    "auth": "Logged In",
    "firstName": "Alice",
    "gender": "F",
    "itemInSession": 7,
    "lastName": "Johnson",
    "length": 212.74567,
    "level": "paid",
    "location": "New York, NY",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1548079511234.0,
    "sessionId": 101,
    "song": "Crazy in Love",
    "status": 200,
    "ts": 1548367823796,
    "userAgent": "\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36\"",
    "userId": "123"
}
```
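Two details of this format matter for the ETL pipeline: each log file holds one JSON object per line, and `ts` is a Unix timestamp in milliseconds. Below is a minimal sketch of reading a log file, keeping only actual song plays, and converting `ts` to pandas datetimes (the file path is an assumption about the year/month layout).

```python
import pandas as pd

# Log files are newline-delimited JSON: one event object per line.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# Only NextSong events represent songs actually being played.
df = df[df["page"] == "NextSong"]

# ts is milliseconds since the Unix epoch; convert to pandas datetimes
# so hour, day, week, etc. can be extracted for the time table.
t = pd.to_datetime(df["ts"], unit="ms")
```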
- `data` folder - contains the datasets
- `queries.py` - houses the SQL queries for creating tables, inserting data into tables, dropping tables, and other database operations (an example appears after this list)
- `table-creation.py` - Python script that creates a Postgres database containing empty tables
- `etl.py` - Python script that extracts data from the JSON files, transforms it to the appropriate data types and formats, and loads it into the SQL tables
- `test.ipynb` - Jupyter Notebook containing sample queries
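As an illustration of the kind of query `queries.py` houses, below is a hypothetical version of one insert statement. The real definition may differ; in particular, the `ON CONFLICT` clause is an assumption about how repeated loads are kept idempotent.

```python
# Hypothetical insert query in the style of queries.py. A user can appear in
# many log events, so the conflict clause keeps only the latest level.
user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""
```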
- Run `table-creation.py` to create the database and tables (a sketch of its flow appears after this list)
- Run `etl.py` to load the data into the appropriate tables
- Run the cells in `test.ipynb` to verify that the data was loaded correctly
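For orientation, the overall flow of `table-creation.py` is roughly the following sketch. The connection strings, credentials, and the `create_table_queries` parameter are placeholders, not the script's actual values.

```python
import psycopg2

def create_database():
    # Connect to a default database first, since sparkifydb may not exist yet.
    # CREATE DATABASE cannot run inside a transaction, hence autocommit.
    conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb")
    conn.close()

def create_tables(create_table_queries):
    # Reconnect to the fresh database and create the empty tables, using the
    # CREATE TABLE statements defined in queries.py.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
    cur = conn.cursor()
    for query in create_table_queries:
        cur.execute(query)
    conn.commit()
    conn.close()
```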
- Start by creating a PostgreSQL database called `sparkifydb`.
- Create the fact and dimension tables within the `sparkifydb` database, following the structure outlined in the provided schema, using `table-creation.py`.
- Extract the data needed for the songs table from the song dataset using Python and pandas in `etl.py`.
- Use psycopg2 in `etl.py` and the `song_table_insert` query from `queries.py` to insert the song data into the songs table.
- Extract the information needed for the artists table from the song dataset using Python and pandas in `etl.py`.
- Use psycopg2 in `etl.py` and the `artist_table_insert` query from `queries.py` to insert the artist data into the artists table.
- Extract the time-related information from the log dataset in `etl.py` and transform it into pandas datetime objects.
- Use psycopg2 in `etl.py` and the `time_table_insert` query from `queries.py` to insert the time data into the time table.
- Extract user information from the log dataset in `etl.py`.
- Insert the user data into the users table using psycopg2 in `etl.py` and the `user_table_insert` query from `queries.py`.
- Use the `select_song` query from `queries.py` in `etl.py` to look up the song ID and artist ID for each song in the log dataset, and extract the remaining songplay fields from the log records.
- Finally, insert the songplay records into the songplays table using psycopg2 in `etl.py` and the `songplay_table_insert` query from `queries.py` (a sketch of these last two steps follows this list).
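Below is a condensed sketch of those last two steps: look up each log record's song and artist IDs with `select_song`, then insert the songplay row with `songplay_table_insert`. The field order passed to the insert is an assumption; the authoritative order is whatever `queries.py` defines.

```python
import pandas as pd
import psycopg2
from queries import select_song, songplay_table_insert

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
cur = conn.cursor()

df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
df = df[df["page"] == "NextSong"]

for _, row in df.iterrows():
    # Match on title, artist name, and duration to recover song/artist IDs;
    # log entries with no match in the song subset fall back to None.
    cur.execute(select_song, (row.song, row.artist, row.length))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    cur.execute(songplay_table_insert, (
        pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
        song_id, artist_id, row.sessionId, row.location, row.userAgent,
    ))

conn.commit()
conn.close()
```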