This repository has been archived by the owner on Apr 8, 2023. It is now read-only.

A big data analytics project done for the module ICT 2107 (Distributed Systems Programming)

brucewzj99/ICT2107_BigDataAnalytics

ICT2107_BigDataAnalytics

This project aims to showcase the use of big data analytics in solving real-world problems. The project was done as a part of the ICT 2107 module, which focuses on Distributed Systems Programming.

Getting started

To get started with this project, refer to the running instructions (Running_Instruction.pdf). To take a look at our visualization, you may proceed here. The .pbix file is a PowerBI report that requires PowerBI to open, while the PDF gives a brief look at how the visualization looks.

Below this section we also describe the file structure for a better understanding.

File Structure

This project consists of the following folders and files:

  • Dataset_used: Folder containing the datasets that we scraped, cleaned, processed and used for our analysis
  • JAR: Folder containing the JAR files that were used to run the analysis on the dataset in Hadoop
  • Report: Folder containing a report written in IEEE format that we submitted to our school
  • Source_code: Folder containing all the code that was used for scraping, cleaning, analysis and visualization
  • Running_Instruction.pdf: A document specifying how to run the code for this project

Dataset_used folder

  • AnalysisOutput: Folder containing all the analysis output that was used in PowerBI for visualization
  • CleanDataset: Folder containing ReviewsDataset after it has been cleaned using DataCleaning/dataCleaner.py
  • ProcessedDataset: Folder containing CleanDataset after it has been processed using DataCleaning/preprocess_reviews.py
  • ReviewsDataset: Folder containing the scraped review datasets from various companies, obtained through DataScraping/
  • AFINN-111.txt: A text file containing the sentiment value tagged to each word
  • company-industry.txt: A text file containing the company-to-industry relationship
  • stopwords.txt: A text file containing stop words to skip for analysis
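To illustrate how a sentiment lexicon and a stop word list of this kind can be combined, here is a minimal sketch in Python. It is not the project's own implementation (the analysis code is written in Java and run on Hadoop). It assumes the standard AFINN-111 layout of one `term<TAB>score` entry per line, and the function names `parse_afinn` and `score_review`, the sample entries and the sample review are all hypothetical stand-ins rather than content taken from the repository.

```python
import re


def parse_afinn(lines):
    """Parse AFINN-style lines ('term<TAB>score') into a dict of term -> int score."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so multi-word terms stay intact.
        term, score = line.rsplit("\t", 1)
        lexicon[term] = int(score)
    return lexicon


def score_review(text, lexicon, stopwords):
    """Sum the lexicon scores of all non-stop-word tokens in a review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens if token not in stopwords)


# Illustrative stand-ins for AFINN-111.txt and stopwords.txt (assumed values).
sample_afinn = ["good\t3", "bad\t-3", "great\t3", "stressful\t-2"]
sample_stopwords = {"the", "is", "but", "and", "a"}

lexicon = parse_afinn(sample_afinn)
total = score_review(
    "The pay is good but the hours are bad and stressful",
    lexicon,
    sample_stopwords,
)
print(total)  # 3 - 3 - 2 = -2
```

With the real files, the two sample collections could be replaced by the lines of AFINN-111.txt and stopwords.txt read from disk. Words that are missing from the lexicon contribute a score of 0.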

Source_code folder

  • Analysis: Folder containing the Java code used for analysis
  • DataCleaning: Folder containing the Python code that was used to clean or process reviews
  • DataScraping: Folder containing the Python code that was used to scrape datasets
  • DataVisualization: PowerBI report that was used to generate the visualization

Contributors

Name          Contribution
Bruce Wang    Data Scraping (Indeed), Automation of Data Cleaning & Data Visualization
Juleus Seah   Data Scraping (Glassdoor) & Data Visualization
Lim Ryan      Data Cleaning & Word Count Analysis
Kang Chen     Stopwords Cleaning, Sentiment Analysis & Industry Trend Analysis
Liu Jun       Industry Trend Analysis
Chun Boon     Data Processing & Topic Modelling
