ICT2107_BigDataAnalytics

A big data analytics project done for the module ICT 2107 (Distributed Systems Programming).

This project aims to showcase the use of big data analytics in solving real-world problems.

Getting started

To get started with this project, refer to Running_Instruction.pdf. To take a look at our visualization, you may proceed here. The .pbix file is a PowerBI report that requires PowerBI to open, while the PDF gives a brief look at what the visualization looks like.

The file structure is described below for a better understanding of the project.

File Structure

This project consists of the following folders:

Dataset_used: Folder containing the datasets that we scraped, cleaned, processed and used for our analysis
JAR: Folder containing the JAR files used to run the analysis on the datasets in Hadoop
Report: Folder containing the report, written in IEEE format, that we submitted to our school
Source_code: Folder containing all the code used for scraping, cleaning, analysis and visualization
Running_Instruction.pdf: A document specifying how to run the code for this project

`Dataset_used` folder

AnalysisOutput: Folder containing all the analysis output used in PowerBI for visualization
CleanDataset: Folder containing ReviewsDataset after cleaning with DataCleaning/dataCleaner.py
ProcessedDataset: Folder containing CleanDataset after processing with DataCleaning/preprocess_reviews.py
ReviewsDataset: Folder containing the reviews datasets scraped from various companies using the code in DataScraping/
AFINN-111.txt: A text file containing the sentiment value assigned to each word
company-industry.txt: A text file containing the company-to-industry relationship
stopwords.txt: A text file containing stop words to skip during analysis
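
To illustrate how AFINN-111.txt and stopwords.txt could be combined, here is a minimal Python sketch of lexicon-based sentiment scoring. It is not the project's implementation (the analysis itself is written in Java and run on Hadoop). The function names `load_afinn`, `load_stopwords` and `sentiment_score` are hypothetical, the tokenization is a simplifying assumption, and the sample entries are small in-memory stand-ins rather than the real files. The sketch assumes AFINN-111's usual layout of one tab-separated `term<TAB>score` pair per line and one stop word per line in stopwords.txt.

```python
from io import StringIO


def load_afinn(lines):
    """Parse lines of the form '<term>\\t<score>' into a dict (assumed layout)."""
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so multi-word terms stay intact.
        term, sep, value = line.rpartition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        scores[term] = int(value)
    return scores


def load_stopwords(lines):
    """Read one stop word per line (assumed layout) into a lowercase set."""
    return {line.strip().lower() for line in lines if line.strip()}


def sentiment_score(text, afinn, stopwords):
    """Sum the lexicon scores of all tokens that are not stop words."""
    # Naive whitespace tokenization with punctuation stripping (an assumption).
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    return sum(afinn.get(t, 0) for t in tokens if t and t not in stopwords)


# Illustrative stand-ins for AFINN-111.txt and stopwords.txt.
afinn = load_afinn(StringIO("good\t3\ngreat\t3\nbad\t-3\n"))
stopwords = load_stopwords(StringIO("the\nis\nand\n"))

print(sentiment_score("Good pay, great team, bad hours", afinn, stopwords))
```

With the real files, the same functions could be called on `open("AFINN-111.txt", encoding="utf-8")` and `open("stopwords.txt", encoding="utf-8")`, assuming the files follow the layouts described above.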

`Source_code` folder

Analysis: Folder containing the Java code used for analysis
DataCleaning: Folder containing the Python code used to clean and process reviews
DataScraping: Folder containing the Python code used to scrape the datasets
DataVisualization: PowerBI report used to generate the visualizations

Contributors

| Name | Contribution |
| --- | --- |
| Bruce Wang | Data Scraping (Indeed), Automation of Data Cleaning & Data Visualization |
| Juleus Seah | Data Scraping (Glassdoor) & Data Visualization |
| Lim Ryan | Data Cleaning & Word Count Analysis |
| Kang Chen | Stopwords Cleaning, Sentiment Analysis & Industry Trend Analysis |
| Liu Jun | Industry Trend Analysis |
| Chun Boon | Data Processing & Topic Modelling |
