A big data analytics project done for the module ICT 2107 (Distributed Systems Programming)
This project aims to showcase the use of big data analytics in solving real-world problems. The project was done as a part of the ICT 2107 module, which focuses on Distributed Systems Programming.
To get started with this project, refer to the running instructions. To take a look at our visualization, proceed here. The .pbix file is a PowerBI report that requires PowerBI to open, while the PDF gives a brief look at how the visualization looks.
The file structure is described below for a better understanding.
This project consists of the following folders:
Dataset_used
: Folder containing the datasets that we scraped, cleaned, processed and used for our analysis
JAR
: Folder containing the JAR files that were used to run the analysis on the dataset in Hadoop
Report
: Folder containing a report written in IEEE format that we submitted to our school
Source_code
: Folder containing all the code that was used for scraping, cleaning, analysis and visualization
Running_Instruction.pdf
: A document specifying how to run the code for this project
The Dataset_used folder contains the following:
AnalysisOutput
: Folder containing all the analysis output that was used in PowerBI for visualization
CleanDataset
: Folder containing ReviewsDataset after it has been cleaned using DataCleaning/dataCleaner.py
ProcessedDataset
: Folder containing CleanDataset after it has been processed using DataCleaning/preprocess_reviews.py
ReviewsDataset
: Folder containing the review datasets scraped from various companies, obtained through DataScraping/
AFINN-111.txt
: A text file containing the sentiment value tagged to each word
company-industry.txt
: A text file containing the company-to-industry relationship
stopwords.txt
: A text file containing stop words to skip during analysis
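As a rough illustration of how a word-level sentiment list and a stop word list can be combined, here is a minimal Python sketch. It is not the project's actual analysis code, which is written in Java and run on Hadoop. It assumes the standard AFINN-111 layout of one `term<TAB>score` entry per line. The function names `parse_afinn` and `score_review` are hypothetical, and the sample entries are for illustration only.

```python
import re
from typing import Dict, Iterable, Set


def parse_afinn(lines: Iterable[str]) -> Dict[str, int]:
    """Parse AFINN-style lines of the form 'term<TAB>score' into a dict."""
    scores: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so that multi-word terms stay intact.
        term, _, value = line.rpartition("\t")
        if not term:
            continue  # skip blank or malformed lines
        scores[term] = int(value)
    return scores


def score_review(text: str, afinn: Dict[str, int], stopwords: Set[str]) -> int:
    """Sum the sentiment values of all non-stop-word tokens in a review.

    Simplification: only single-word terms are matched, so multi-word
    AFINN entries are ignored by this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(afinn.get(token, 0) for token in tokens if token not in stopwords)


# Illustrative entries only; in practice the lines would be read from the
# AFINN-111.txt and stopwords.txt files described above.
sample_afinn = parse_afinn(["good\t3\n", "bad\t-3\n"])
sample_stopwords = {"but", "the"}
total = score_review("Good pay but bad management", sample_afinn, sample_stopwords)
```

Words missing from the sentiment list contribute a score of zero, so a review with no listed words is treated as neutral.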
The Source_code folder contains the following:
Analysis
: Folder containing the Java code used for analysis
DataCleaning
: Folder containing the Python code that was used to clean or process reviews
DataScraping
: Folder containing the Python code that was used to scrape datasets
DataVisualization
: PowerBI report that was used to generate the visualization
Name | Contribution |
---|---|
Bruce Wang | Data Scraping (Indeed), Automation of Data Cleaning & Data Visualization |
Juleus Seah | Data Scraping (Glassdoor) & Data Visualization |
Lim Ryan | Data Cleaning & Word Count Analysis |
Kang Chen | Stopwords Cleaning, Sentiment Analysis & Industry Trend Analysis |
Liu Jun | Industry Trend Analysis |
Chun Boon | Data Processing & Topic Modelling |