Class on Oct 21, 23, 25

Oct 23: Miniproject3 Part B due
Will meet with teams to hear progress reports: each report is similar to the proposal, but indicates whether anything has changed
- Objective (research question)
- Data to be used: how obtained, how processed, integrated, and validated
- What models or algorithms will be used
- What will be done, with a rough timeline and responsibilities for each member
- A description of the partial results
- Problems encountered so far

Class on Oct 16

Will help with Miniproject3 part B

Class on Oct 14

Miniproject3 Part A due (this time really)
Please note changes to MP3 Part B: the laptop should forward from port 3000!
Chrome seems to work for Part B, but Safari and possibly Firefox may not (you have to be able to see the annotation once you save it)

Class on Oct 11

Work on Final Project/Miniproject3 part B

Class on Oct 9

Discuss Miniproject Part B

Class on Oct 7

Cliff notes on text analysis
Introduce MP3 part B

Class on Oct 2-4

Work on final projects
Ensure GCP works

Class on Sep 29

Class on Sep 27

Repeated GCP setup

Class on Sep 25 (complete project proposals)

Final project proposals are due at the end of the class
The group needs to submit a project proposal (1.5-2 pages in IEEE format; see https://www.overleaf.com/latex/templates/preparation-of-papers-for-ieee-sponsored-conferences-and-symposia/zfnqfzzzxghk).
The proposal should provide
- an objective
- a brief motivation for the project,
- detailed discussion of the data that will be obtained or used in the project,
- responsibilities of each member, along with
- a time-line of milestones, and
- the expected outcome
The proposal pdf will be committed to fdac19/ProjectName/proposal.pdf

Class on Sep 23 (complete Miniproject2)

Miniproject2 is due at the end of the class

Class on Sep 20

Lecture on databases
Introducing GCP to prepare for Miniproject3
video recording: see 2019-09-20

Class on Sep 18

The remaining teams are formed: Everyone has a final project at the end of the class
Start brainstorming/writing final project proposal (see Sep 25)

Class on Sep 16

Remaining final project pitches are due
Most teams formed (create fdac19/ProjectName repo and a team of the same name; invite members of the team)

Class on Sep 13

Present the remainder of the selected 10 miniproject1's to the class
Pitches for the final project
Introducing Data Discovery - Miniproject2

Class on Sep 11

Present the selected 10 miniproject1's to the class
Explain pitches for the final project
How to resolve common problems
- Symptom: nothing appears in the browser for localhost:8888
- Solution: run /bin/notebook.sh in the docker container
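Before restarting anything, it can help to confirm that nothing is answering on the port at all. The following is a minimal sketch, not part of the course setup: the function name `port_is_open` and the one-second timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 8888, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:8888; try reloading the browser.")
    else:
        print("Nothing is listening on localhost:8888; "
              "run /bin/notebook.sh in the docker container.")
```

An open port only shows that some process is listening there; it does not prove that the process is the notebook server.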

Class on Sep 09

Present your miniproject1 in small groups

Class on Sep 06

Discuss ideas with your assigned peers, work on the miniproject1

Class on Sep 04

Think about selecting the course project (see course projects for the last four years at fdac18, fdac17, fdac16, fdac for inspiration)
Please submit Practice0 task if you have not done so
See the simple text analysis of your descriptions
Introducing the MiniProject1 process and template

Class on Aug 30: Attend only if you need help with Practice0 face to face

Attend only if you need help with the Practice0 task. It involves a number of steps, and if you get stuck on any of them, please either
- Open an issue,
- Ask TAs to help before the class,
- Come to class, where a TA will be there to help, or
- Participate virtually by joining a zoom session (connection on the news page); I'll be there to answer your questions

Class on Aug 28

Lecture explaining key technologies used in the class

Class on Aug 26

Please submit the pull request (TAs will be in the class to help)
TAs will help you set up ssh/putty so that you can access jupyter notebooks
Make sure ssh/putty setup works
Full details

Class on Aug 23

Make sure you accept your github invitations
Follow through ssh/putty setup - Full details

Class on Aug 21

Create your github account
- fork repo students
- create your utid.md file providing your name and interests: see Audris.md for inspiration, and also provide your utid.key with your public ssh key.
- submit a pull request to fdac19/students
Make sure you do it during the class so that we are ready to start on Aug 23

Class video recordings

Information for remote participation via Zoom

Join from a PC, Mac, iPad, iPhone or Android device:
- Click this URL to start or join: https://tennessee.zoom.us/j/2766448345
- Or go to https://tennessee.zoom.us/join and enter class session/meeting ID: 276 644 8345
Join from dial-in phone line (note: these are NOT toll-free numbers):
- Dial: +1 646 558 8656 or +1 408 638 0968
- Meeting ID: 276 644 8345
- Participant ID: shown after joining the meeting
- International numbers available: https://tennessee.zoom.us/zoomconference?m=leg4C6yjhpfGHE-_Q9EYRNHXCUMBC-2T

Syllabus for "Fundamentals of Digital Archeology"

Course: [COSCS-445/COSCS-545]
MK-524 10:10-11:00 MWF
Instructor: Audris Mockus, audris@utk.edu office hours MK613 - on request
TA: Preston Provins pprovins@vols.utk.edu office hours available upon request
TA: David Kennard dkennard@vols.utk.edu
- office hours MinKao 217, Wednesday: 2:30PM - 4:30PM, Thursday: 1:00PM - 3:00PM, Friday: 2:30PM - 4:30PM
Syllabus
Need help?

Simple rules:

There are no stupid questions. However, it may be worth going over the following steps:
Think of what the right answer may be.
Search online: stack overflow, etc.
- code snippets: On GH gist.github.com or, if anyone contributes, for this class
- answers to questions: Stack Overflow
Look through issues
Post the question as an issue.
Ask instructor: email for 1-on-1 help, or to set up a time to meet

Objectives

The course will combine theoretical underpinning of big data with intense practice. In particular, approaches to ethical concerns, reproducibility of the results, absence of context, missing data, and incorrect data will be both discussed and practiced by writing programs to discover the data in the cloud, to retrieve it by scraping the deep web, and by structuring, storing, and sampling it in a way suitable for subsequent decision making. At the end of the course students will be able to discover, collect, and clean digital traces, to use such traces to construct meaningful measures, and to create tools that help with decision making.

Expected Outcomes

Upon completion, students will be able to discover, gather, and analyze digital traces, will learn how to avoid mistakes common in the analysis of low-quality data, and will have produced a working analytics application.

In particular, in addition to practicing critical thinking, students will acquire the following skills:

Use Python and other tools to discover, retrieve, and process data.
Use data management techniques to store data locally and in the cloud.
Use data analysis methods to explore data and to make predictions.

Course Description

A great volume of complex data is generated as a result of human activities, including both work and play. To exploit that data for decision making it is necessary to create software that discovers, collects, and integrates the data.

Digital archeology relies on traces that are left over in the course of ordinary activities, for example the logs generated by sensors in mobile phones, the commits in version control systems, or the email sent and the documents edited by a knowledge worker. Understanding such traces is complicated in contrast to data collected using traditional measurement approaches.

Traditional approaches rely on a highly controlled and well-designed measurement system. In meteorology, for example, the temperature is taken in specially designed and carefully selected locations to avoid direct sunlight and to be at a fixed distance from the ground. Such measurement can then be trusted to represent these controlled conditions and the analysis of such data is, consequently, fairly straightforward.

The measurements from geolocation or other sensors in mobile phones are affected by numerous (yet not recorded) factors: was the phone kept in the pocket, was it indoors or outside? The devices are not calibrated or may not work properly, so the corresponding measurements would be inaccurate. Locations (without mobile phones) may not have any measurement, yet may be of the greatest interest. This lack of context and inaccurate or missing data necessitates fundamentally new approaches that rely on patterns of behavior to correct the data, to fill in missing observations, and to elucidate unrecorded context factors. These steps are needed to obtain meaningful results from a subsequent analysis.

The course will cover basic principles and effective practices to increase the integrity of the results obtained from voluminous but highly unreliable sources.

Ethics: legal aspects, privacy, confidentiality, governance
Reproducibility: version control, ipython notebook
Fundamentals of big data analysis: extreme distributions, transformations, quantiles, sampling strategies, and logistic regression
The nature of digital traces: lack of context, missing values, and incorrect data
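As a rough illustration of why quantiles and transformations matter for extreme distributions, the sketch below compares the mean and the median of a heavy-tailed sample before and after a log transformation. The data are synthetic (drawn from a Pareto distribution), and the helper name `summarize` is an assumption made for this example, not course material.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same sample

# Synthetic stand-in for a heavy-tailed measure, e.g. activity counts per person.
values = [random.paretovariate(1.2) for _ in range(10_000)]


def summarize(data):
    """Return the mean, the median, and the 90th percentile of data."""
    deciles = statistics.quantiles(data, n=10)  # nine cut points: 10%, ..., 90%
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "p90": deciles[-1],
    }


raw = summarize(values)
logged = summarize([math.log(v) for v in values])  # Pareto values are >= 1

# A few extreme values pull the raw mean far above the median;
# after the log transformation the two are much closer.
print("raw:   ", raw)
print("logged:", logged)

# A simple random sample is one of several possible sampling strategies.
sample = random.sample(values, 100)
print("sample median:", statistics.median(sample))
```

The gap between the mean and the median in the raw data is one reason quantiles are often reported in place of averages for measures of this kind.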

Prerequisites

Students are expected to have basic programming skills: in particular, to be able to use regular expressions; programming concepts such as variables, functions, and loops; and data structures such as lists and dictionaries (for example, COSC 365).

Being familiar with version control systems (e.g., COSC 340), Python (e.g., COSC 370), and introductory level probability (e.g., ECE 313) and statistics, such as random variables, distributions, and regression, would be beneficial but is not expected. Everyone is expected, however, to be willing and highly motivated to catch up in the areas where they have gaps in the relevant skills.

All the assignments and projects for this class will use github and Python. Knowledge of Python is not a prerequisite for this course, provided you are comfortable learning on your own as needed. While we have strived to make the programming component of this course straightforward, we will not devote much time to teaching programming, Python syntax, or any of the libraries and APIs. You should feel comfortable with:

How to look up Python syntax on Google and StackOverflow.
Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
How to learn new libraries by reading documentation and reusing examples.
How to ask questions on StackOverflow or as a GitHub issue.
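As a self-check, the short sketch below combines several of the concepts listed above: a regular expression, a function, a loop, an if statement, strings, and a dictionary. The log lines, the line format, and the name `commits_per_author` are hypothetical and made up for illustration. If every line of it reads comfortably, the programming level assumed here is probably a good fit.

```python
import re
from collections import defaultdict

# Hypothetical log lines, invented for this example.
LOG = """\
a1b2c3d alice@example.com Fix typo in README
d4e5f6a bob@example.com Add scraper for project pages
0a1b2c3 alice@example.com Handle missing values
"""

# One commit per line: 7 hex characters, an email-like author, then the message.
LINE = re.compile(r"^(?P<sha>[0-9a-f]{7}) (?P<author>\S+@\S+) (?P<message>.+)$")


def commits_per_author(text):
    """Count how many well-formed commit lines each author has."""
    counts = defaultdict(int)
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like commits
        counts[match.group("author")] += 1
    return dict(counts)


print(commits_per_author(LOG))
```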

Requirements

These apply to real life, as well.

Must apply "good programming style" learned in class
- Optimize for readability
Bonus points for:
- Creativity (as long as requirements are fulfilled)

Teaming Tips

Agree on an editor and environment that you're comfortable with
The person who's less experienced/comfortable should have more keyboard time
Switch who's "driving" regularly
Make sure to save the code and send it to others on the team

Evaluation

Class Participation – 15%: students are expected to read all material covered in a week and come to class prepared to take part in the classroom discussions. Responding to other student questions (issues) counts as classroom participation.
Assignments - 40%: Each assignment will involve writing (or modifying a template of) a small Python program.
Project - 45%: one original project done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course that students find particularly compelling. The group needs to submit a project proposal (2 pages IEEE format) approximately 1.5 months before the end of term. The proposal should provide a brief motivation of the project, detailed discussion of the data that will be obtained or used in the project, along with a time-line of milestones, and expected outcome.

Other considerations

As a programmer you will never write anything from scratch, but will reuse code, frameworks, or ideas. You are encouraged to learn from the work of your peers. However, if you don't try to do it yourself, you will not learn. Deliberate practice (activities designed for the sole purpose of effectively improving specific aspects of an individual's performance) is the only way to reach perfection.

Please respect the terms of use and/or license of any code you find, and if you re-implement or duplicate an algorithm or code from elsewhere, credit the original source with an inline comment.
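One possible shape for such an inline credit is sketched below. The comment layout is only a suggestion, the angle-bracket placeholders must be replaced with the real source and license, and the function `chunked` is a hypothetical example of reused code.

```python
# Adapted from: <URL of the original source>, license: <license name>
# Changes from the original: <what you modified, if anything>
def chunked(items, size):
    """Yield consecutive lists of at most `size` elements from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


print(list(chunked([1, 2, 3, 4, 5], 2)))
```

Placing the credit directly above the borrowed code keeps the attribution attached to it even if the function is later moved to another file.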

Resources

Materials

This class assumes you are confident with this material, but in case you need a brush-up...

Python for beginners and Python Dictionaries

Other

Mining the Social Web, 2nd Edition

Databases

A MongoDB Schema Analyzer. It is a single JavaScript file that you run with the mongo shell command on a database collection; it attempts to come up with a generalized schema of the datastore. It was also written about on the official MongoDB blog.

R and data analysis

Modern Applied Statistics with S (4th Edition) by William N. Venables, Brian D. Ripley. ISBN 0387954570
R
Code School
Quick-R

Tutorials written as ipython-notebooks

GitHub

Git and GitHub
GitHub Pages
- Official site
- Thinkful guide

