Data Discovery Project

Pick a topic that you care about.
Find at least 20 datasets for that topic (use, for example, https://toolbox.google.com/datasetsearch). I, for one, collect open source Git repositories, so I searched for "git urls".
For each of the 20 datasets you chose, determine whether the underlying data can be accessed (some of these datasets do not provide public access).
Create a MongoDB collection YourNetId within the database fdac19mp2, where you store metadata for each of the 20 datasets: YourTopic, title, license, description, and the url(s) where the data may be retrieved.

import pymongo

# Connect to the course MongoDB server and select your collection
client = pymongo.MongoClient(host="da1.eecs.utk.edu")
db = client['fdac19mp2']
coll = db['YourNetId']
# Repeat for each dataset; list as many URLs as the dataset has
coll.insert_one({'topic': 'YourTopic', 'title': 'Data title', 'license': 'license', 'description': 'Brief data description', 'urls': ['url1', 'url2']})
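
Before inserting 20 records by hand, it may help to build each document through a small helper that rejects incomplete metadata. The sketch below is only an illustration: `make_record` and `REQUIRED_FIELDS` are hypothetical names, not part of the assignment or of pymongo, and the validation rules are an assumption about what a complete record looks like.

```python
# Hypothetical helper: builds one metadata document with the fields listed above.
REQUIRED_FIELDS = ("topic", "title", "license", "description")


def make_record(topic, title, license_name, description, urls):
    """Return a dict ready for coll.insert_one, or raise ValueError if incomplete."""
    record = {
        "topic": topic,
        "title": title,
        "license": license_name,
        "description": description,
        "urls": list(urls),
    }
    # Every text field must be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = record[field]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field} must be a non-empty string")
    # At least one URL is needed so the data can be retrieved later.
    if not record["urls"] or not all(isinstance(u, str) and u for u in record["urls"]):
        raise ValueError("urls must be a non-empty list of non-empty strings")
    return record
```

The returned dict can be passed to `coll.insert_one`, or a list of such dicts to `coll.insert_many`, using the `coll` object created above.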

To check what is recorded:

import pprint
import pymongo

client = pymongo.MongoClient(host="da1.eecs.utk.edu")
db = client['fdac19mp2']
coll = db['YourNetId']
pp = pprint.PrettyPrinter(indent=1, width=65)
# Print every document stored in your collection
for r in coll.find():
  print(pp.pformat(r))
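
The step that asks whether the underlying data can be accessed could be partly automated. The following is a rough sketch using only the Python standard library; `is_accessible` is a hypothetical helper, and treating any successful HEAD response as "accessible" is an assumption. Some servers reject HEAD requests or sit behind a login page, so a manual check is still worthwhile.

```python
import urllib.error
import urllib.request


def is_accessible(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if a HEAD request to url gets a non-error response.

    The opener parameter exists only so the check can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    try:
        # Request raises ValueError for strings that are not valid URLs.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, DNS failures, timeouts, and malformed URLs all count as inaccessible.
        return False
```

One possible use is to run it over the `urls` list of each record and note which datasets need a closer manual look.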
