The primary objectives of this project are as follows:
- Perform data modeling using PostgreSQL.
- Create an ETL (Extract, Transform, Load) pipeline using Python within a Jupyter Notebook.
- Define fact and dimension tables to establish a star schema data model that improves query performance for song play analysis (a sketch of the schema follows this list).
- Implement an ETL pipeline to transfer data from JSON files into PostgreSQL tables using Python and SQL.
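To make the star schema concrete, here is a rough sketch of what two of the table definitions could look like, written as SQL strings in the style this project keeps in `queries.py`. The column names and types are assumptions for illustration; the authoritative definitions live in `queries.py`.

```python
# Hypothetical table definitions illustrating the star schema; the actual
# columns and types are defined in queries.py and may differ.

# Fact table: one row per song play, with keys into each dimension.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,  -- references the time dimension
    user_id     INT NOT NULL,        -- references the users dimension
    level       VARCHAR,
    song_id     VARCHAR,             -- references the songs dimension
    artist_id   VARCHAR,             -- references the artists dimension
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# One of the dimension tables: descriptive attributes for each user.
user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""
```

Analytical queries then join the narrow fact table to the small dimension tables, which is what makes song play analysis fast.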
Sparkify, a burgeoning startup, is keen to gain insights from the data they've gathered regarding songs and user activities on their new music streaming platform. The analytics team is particularly interested in understanding user song preferences. Presently, they lack a convenient means to query their data, which is stored in JSON logs for user activity and JSON metadata for the songs available on the platform.
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": 35.1495,
    "artist_longitude": -90.0490,
    "artist_location": "Memphis, Tennessee, USA",
    "artist_name": "Johnny Cash",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Ring of Fire",
    "duration": 166.32036,
    "year": 1963
}
```
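In `etl.py`, a file like this can be read into pandas and split into a song record and an artist record. The sketch below is a minimal version of that step; the exact column selections are inferred from the fields shown above.

```python
import pandas as pd

# Each song file contains a single JSON object, so the result is a one-row frame.
df = pd.read_json("song_data/A/A/B/TRAABJL12903CDCF1A.json", lines=True)

# Fields destined for the songs dimension table, in insert order.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Fields destined for the artists dimension table.
artist_data = df[[
    "artist_id", "artist_name", "artist_location",
    "artist_latitude", "artist_longitude",
]].values[0].tolist()
```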
The second dataset consists of JSON log files generated by an event simulator based on configurable settings. These files mimic user activity on a music streaming application and are partitioned chronologically by year and month.
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like.
```json
{
    "artist": "Beyoncé",
    "auth": "Logged In",
    "firstName": "Alice",
    "gender": "F",
    "itemInSession": 7,
    "lastName": "Johnson",
    "length": 212.74567,
    "level": "paid",
    "location": "New York, NY",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1548079511234.0,
    "sessionId": 101,
    "song": "Crazy in Love",
    "status": 200,
    "ts": 1548367823796,
    "userAgent": "\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36\"",
    "userId": "123"
}
```
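Two details of this format matter for the ETL pipeline: each log file holds one JSON object per line, and `ts` is a Unix timestamp in milliseconds. Below is a minimal sketch of reading a log file, keeping only actual song plays, and converting `ts` to pandas datetimes (the file path is an assumption about the year/month layout).

```python
import pandas as pd

# Log files are newline-delimited JSON: one event object per line.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# Only NextSong events represent songs actually being played.
df = df[df["page"] == "NextSong"]

# ts is milliseconds since the Unix epoch; convert to pandas datetimes
# so hour, day, week, etc. can be extracted for the time table.
t = pd.to_datetime(df["ts"], unit="ms")
```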
- `data` folder - contains the datasets
- `queries.py` - houses the SQL queries for creating tables, inserting data into tables, dropping tables, and other database operations (an example appears after this list)
- `table-creation.py` - Python script that creates a Postgres database containing empty tables
- `etl.py` - Python script that extracts data from the JSON files, transforms it to the appropriate data types and formats, and loads it into the SQL tables
- `test.ipynb` - Jupyter Notebook containing sample queries
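As an illustration of the kind of query `queries.py` houses, below is a hypothetical version of one insert statement. The real definition may differ; in particular, the `ON CONFLICT` clause is an assumption about how repeated loads are kept idempotent.

```python
# Hypothetical insert query in the style of queries.py. A user can appear in
# many log events, so the conflict clause keeps only the latest level.
user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""
```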
- Run `table-creation.py` to create the database and tables (a sketch of its flow appears after this list)
- Run `etl.py` to load the data into the appropriate tables
- Run the cells in `test.ipynb` to verify that the data was loaded correctly
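For orientation, the overall flow of `table-creation.py` is roughly the following sketch. The connection strings, credentials, and the `create_table_queries` parameter are placeholders, not the script's actual values.

```python
import psycopg2

def create_database():
    # Connect to a default database first, since sparkifydb may not exist yet.
    # CREATE DATABASE cannot run inside a transaction, hence autocommit.
    conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb")
    conn.close()

def create_tables(create_table_queries):
    # Reconnect to the fresh database and create the empty tables, using the
    # CREATE TABLE statements defined in queries.py.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
    cur = conn.cursor()
    for query in create_table_queries:
        cur.execute(query)
    conn.commit()
    conn.close()
```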
- Start by creating a PostgreSQL database called `sparkifydb`.
- Create the fact and dimension tables within the `sparkifydb` database, following the structure outlined in the provided schema, using `table-creation.py`.
- Extract the data needed for the songs table from the song dataset using Python and pandas in `etl.py`.
- Use psycopg2 in `etl.py` and the `song_table_insert` query from `queries.py` to insert the song data into the songs table.
- Extract the information needed for the artists table from the song dataset using Python and pandas in `etl.py`.
- Use psycopg2 in `etl.py` and the `artist_table_insert` query from `queries.py` to insert the artist data into the artists table.
- Extract the time-related information from the log dataset in `etl.py` and transform it into pandas datetime objects.
- Use psycopg2 in `etl.py` and the `time_table_insert` query from `queries.py` to insert the time data into the time table.
- Extract user information from the log dataset in `etl.py`.
- Insert the user data into the users table using psycopg2 in `etl.py` and the `user_table_insert` query from `queries.py`.
- Use the `select_song` query from `queries.py` in `etl.py` to look up the song ID and artist ID for each song in the log dataset, and extract the remaining songplay fields from the log records.
- Finally, insert the songplay records into the songplays table using psycopg2 in `etl.py` and the `songplay_table_insert` query from `queries.py` (a sketch of these last two steps follows this list).
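Below is a condensed sketch of those last two steps: look up each log record's song and artist IDs with `select_song`, then insert the songplay row with `songplay_table_insert`. The field order passed to the insert is an assumption; the authoritative order is whatever `queries.py` defines.

```python
import pandas as pd
import psycopg2
from queries import select_song, songplay_table_insert

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
cur = conn.cursor()

df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
df = df[df["page"] == "NextSong"]

for _, row in df.iterrows():
    # Match on title, artist name, and duration to recover song/artist IDs;
    # log entries with no match in the song subset fall back to None.
    cur.execute(select_song, (row.song, row.artist, row.length))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    cur.execute(songplay_table_insert, (
        pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
        song_id, artist_id, row.sessionId, row.location, row.userAgent,
    ))

conn.commit()
conn.close()
```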