ISYE-4961/6961: PROJECTS IN BIG DATA ANALYTICS (Fall 2015) Course Outline Notes
The analytics process can be broken down into the following four phases:
- Acquisition: All data analysis starts with the acquisition of the data itself. Data acquisition must be planned and executed to ensure that data collection meets the project objectives and fulfills the requirements of the expected analyses. Data parameters such as type, sampling frequency (in time, space, or another abstract dimension), overall volume, and range of values must be understood, and the data collection process and methodology must be constructed to accommodate them. In cases where the analyses are unknown or not well understood, the data acquisition and analysis cycle may need to be iterated.
- Preparation: Raw data is the input to the analytics process. It is simply a set of values (numeric, character, symbolic, or abstract), together with their units and other attributes, with or without any particular organization. Basic processes applied to raw data can be used to organize, reorganize, clean, filter, transform, or otherwise massage the data set content into a more useful state.
- Analysis and Interpretation: Analysis processes can then be applied to generate information in the form of patterns, trends, correlations, or other coherent measures. From these patterns and measures, cause-and-effect relationships can be hypothesized, and validity tests can be developed and run.
- Modeling: Successful hypotheses lead to quantitative cause-and-effect relationships, that is, mathematical relations that can be used for modeling, prediction, and other computational exercises. The meaning extraction process is complete when quantitative methods can be used to synthetically regenerate the input data within an acceptable level of error, thus providing a means for modeling and studying additional data sets of the same type.
Careful attention to the execution of all four phases is vital to the successful completion of a data analytics project. Example real-world data sets will be used to demonstrate the key steps in each phase and for student-conducted hands-on projects.
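As a rough illustration of the completion criterion for the Modeling phase (regenerating the input data within an acceptable level of error), the sketch below fits a straight line by ordinary least squares and checks the root-mean-square error of the regenerated values against a tolerance. The data points, the choice of a linear model, and the tolerance of 0.25 are all invented for illustration; a real project would pick a model and an error bound suited to its own data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Invented sample data standing in for prepared input data.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

slope, intercept = fit_line(xs, ys)
regenerated = [slope * x + intercept for x in xs]   # synthetic regeneration of the input
error = rmse(ys, regenerated)

TOLERANCE = 0.25  # hypothetical "acceptable level of error"
model_is_acceptable = error <= TOLERANCE
print(f"slope={slope:.3f} intercept={intercept:.3f} rmse={error:.3f} ok={model_is_acceptable}")
```

If the error exceeds the tolerance, the hypothesis behind the model is revisited, which may mean returning to the analysis or even the acquisition phase.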
Example Project – Yelp business data
Objective – find the three highest-rated laundromats, bookstores, and low-cost restaurants within walking distance of each of the following university campuses: RPI, ASU.
Method – use campus and business coordinates and scan the Yelp file to locate relevant rating data. Analyze the rating data and build a list of the highest-rated businesses of each type.
Sample Questions:
- How complete is the business data set? Are all businesses actually in Yelp? How would we know? How could we repair any deficiencies?
- How accurate is the business data? Is it up-to-date? Are all still operating? Have any moved? How can we know?
- How accurate are the ratings? How can we determine if they have been gamed or otherwise distorted?
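The method for this project can be sketched as follows: compute the great-circle (haversine) distance from a campus to each business, keep the businesses of the wanted type within a walking radius, and rank them by rating. This is only a minimal sketch. The record fields (`latitude`, `longitude`, `stars`, `review_count`, `categories`), the 1.5 km walking radius, the tie-break on review count, the campus coordinates, and the sample businesses are all assumptions made for illustration and should be checked against the actual Yelp file.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def top_rated_nearby(businesses, campus, category, max_km=1.5, n=3):
    """Return up to n businesses of one category within max_km, best rated first."""
    lat0, lon0 = campus
    nearby = [
        b for b in businesses
        if category in b["categories"]
        and haversine_km(lat0, lon0, b["latitude"], b["longitude"]) <= max_km
    ]
    # Rank by star rating; break ties in favour of more reviews (an assumed policy).
    return sorted(nearby, key=lambda b: (-b["stars"], -b["review_count"]))[:n]

# Invented records and an approximate, illustrative campus location.
campus = (42.73, -73.68)
businesses = [
    {"name": "A", "latitude": 42.731, "longitude": -73.682, "stars": 4.5,
     "review_count": 12, "categories": ["Laundromat"]},
    {"name": "B", "latitude": 42.735, "longitude": -73.675, "stars": 4.0,
     "review_count": 40, "categories": ["Laundromat"]},
    {"name": "C", "latitude": 42.90, "longitude": -73.68, "stars": 5.0,
     "review_count": 3, "categories": ["Laundromat"]},   # too far to walk
    {"name": "D", "latitude": 42.729, "longitude": -73.679, "stars": 4.8,
     "review_count": 25, "categories": ["Bookstores"]},
]

top_laundromats = top_rated_nearby(businesses, campus, "Laundromat")
print([b["name"] for b in top_laundromats])
```

Ranking on raw star ratings ignores the sample questions above: a business with very few reviews, or with gamed ratings, can outrank a better-established one, so a real analysis would need to weight or filter the ratings.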
Example Project – GE wind turbine data
Objective – anticipate maintenance needs and mechanical failure.
Method – analyze time series data from multiple monitor points on multiple turbines in a wind farm. Develop a procedure that uses the data in real or near real time to detect anomalies and identify/predict potential maintenance issues or failure modes.
Example Questions:
- What kinds of data are available? What is the data volume? Rate of acquisition?
- What kinds of analytical methods may be best suited to detect anomalies?
- How can any anomalies be related to known maintenance issues and failure modes?
- How much computational resource could be needed to conduct real or near real time analysis?
- How much lead time is needed to effectively anticipate a problem?
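One simple candidate for the anomaly detection step is a rolling z-score: each new reading is compared with the mean and standard deviation of a trailing window and flagged if it lies too far away. The sketch below is one possible starting point, not the method prescribed for the project. The window length, the threshold of 3 standard deviations, and the synthetic signal are assumptions chosen for illustration and are not drawn from the GE data.

```python
import math
from collections import deque

def rolling_zscore_anomalies(readings, window=20, threshold=3.0):
    """Yield (index, value, z) for readings far from the trailing-window mean."""
    history = deque(maxlen=window)   # only the last `window` readings are kept
    for i, x in enumerate(readings):
        if len(history) == window:
            mean = sum(history) / window
            std = math.sqrt(sum((h - mean) ** 2 for h in history) / window)
            if std > 0:
                z = (x - mean) / std
                if abs(z) > threshold:
                    yield i, x, z
        history.append(x)            # the new reading joins the window afterwards

# Synthetic signal: a slow oscillation with one injected spike at index 70.
readings = [50.0 + 0.5 * math.sin(i / 3) for i in range(100)]
readings[70] = 58.0

anomalies = list(rolling_zscore_anomalies(readings))
for index, value, z in anomalies:
    print(f"index={index} value={value:.1f} z={z:.1f}")
```

Because the detector processes one reading at a time and stores only the trailing window, it can run on streaming data, which bears on the question of computational resources for real or near real time analysis. Relating a flagged reading to a known failure mode still requires maintenance records and domain knowledge.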