Properties of Data Sources to identify #11
An issue with the Common Crawl is that it doesn't get all the PD org sites. The Common Crawl is very respectful of robots.txt, and if a server's response time is lagging, it'll stop hitting that server altogether. This can be seen because:
I think the real issue is how to clean and categorize any URL (then this solution can be applied to future sitemaps). First and foremost, identifying relevant URLs in a large list is important. Here are some suggestions for next steps:
Second, I think identifying features from these relevant URLs is important.
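A minimal sketch of what those two steps could look like, assuming only Python's standard library; the keyword list and feature names are illustrative placeholders, not the project's actual criteria.

```python
from urllib.parse import urlparse

# Illustrative relevancy keywords; the real list would be broader and refined by hand.
RELEVANT_KEYWORDS = ("police", "cop", "sheriff", "crime")

def is_relevant(url: str) -> bool:
    """Keep URLs whose path or query string mentions a policing-related keyword."""
    parsed = urlparse(url.lower())
    return any(keyword in parsed.path + parsed.query for keyword in RELEVANT_KEYWORDS)

def extract_features(url: str) -> dict:
    """Pull cheap features out of a URL for later categorization."""
    parsed = urlparse(url.lower())
    parts = [part for part in parsed.path.split("/") if part]
    last = parts[-1] if parts else ""
    return {
        "host": parsed.netloc,
        "tld": parsed.netloc.rsplit(".", 1)[-1],  # e.g. "gov", "us", "org"
        "path_depth": len(parts),
        "file_ext": last.rsplit(".", 1)[-1] if "." in last else "",
        "has_year_segment": any(p.isdigit() and len(p) == 4 for p in parts),
    }

urls = [
    "https://www.exampleville.gov/police/annual-report.pdf",
    "https://www.exampleville.gov/parks/pool-hours",
]
features = [extract_features(u) for u in urls if is_relevant(u)]
```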
Thanks for thinking this through @nfmcclure ! I think gathering info from the URL will work in some cases—but I'm sure we'll miss a lot, and a lot of URLs are simply unhelpful. At some point we'll need to start looking at page headers and content to identify sources.
Any strategy we develop will hit a point of diminishing returns where it's easier to just manually look at what's left, which is A-OK.
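When we do get to page headers and content, a rough sketch of what that pass might record, assuming the requests library; the header fields and content keywords checked are illustrative.

```python
import requests

def page_signals(url: str) -> dict:
    """Fetch a page and record a few header/content signals for later review.

    Only a sketch: a real pass would need rate limiting, robots.txt checks,
    and better error handling.
    """
    response = requests.get(url, timeout=10)
    content_type = response.headers.get("Content-Type", "")
    body = response.text.lower() if "text" in content_type else ""
    return {
        "url": url,
        "status": response.status_code,
        "content_type": content_type,
        "last_modified": response.headers.get("Last-Modified", ""),
        # Illustrative content signals, to be refined by manual review.
        "mentions_records": any(k in body for k in ("incident", "arrest", "citation", "blotter")),
        "offers_download": any(k in body for k in (".csv", ".xlsx", "download")),
    }
```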
I filtered about 1,000 unique host domains from states, counties, and cities. Examples:
I'm guessing there are about 2,000 total unique host domains in the above CSV. I built a scrapy Sitemap spider that looks through each robots.txt or sitemap.xml and finds all the routes on the server with the words "police" or "cop" in them. It stores the URL + last-modified date (if it exists). From 1,000 host domains, it gets about 40k URL paths (a sketch of such a spider follows the notes below). Examples:
So to note:
Also seeing if we can get the relevant agency (state/county/metro).
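For reference, a minimal sketch of the kind of Sitemap spider described above, assuming scrapy is installed; the domain list and output fields are illustrative, not the actual spider from this run.

```python
from scrapy.spiders import SitemapSpider

class PoliceRouteSpider(SitemapSpider):
    """Walk each domain's robots.txt / sitemap.xml and keep policing-related routes."""
    name = "police_routes"

    # Illustrative hosts; the real run would load ~1,000 domains from the CSV.
    sitemap_urls = [
        "https://www.exampleville.gov/robots.txt",
        "https://www.examplecounty.us/sitemap.xml",
    ]

    # Only follow sitemap entries whose URL matches "police" or "cop".
    sitemap_rules = [
        ("police", "parse_route"),
        ("cop", "parse_route"),
    ]

    def parse_route(self, response):
        # Fall back to the HTTP header, since not every sitemap entry has <lastmod>.
        last_modified = response.headers.get("Last-Modified", b"").decode("utf-8", "ignore")
        yield {
            "url": response.url,
            "last_modified": last_modified,
        }
```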
Context
We want to add metadata to URLs, filter for relevancy, and expand our database of valid data sources.
Flowchart
The overall plan for data source identification is now in the readme of this repo.
Properties
S tier
A tier
aggregation_type
access_type
record_download_option_provided
record_format
agency_supplied and agency_originated? If not, who are the supplier and originator?
coverage_start
coverage_end
portal_type
scraper_url
readme_url
Still A tier, but rarely published:
retention_schedule
update_frequency
source_last_updated
B tier
size
update_method
sort_method
access_restrictions
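One way to picture these properties together is as a record schema. A minimal sketch, assuming Python dataclasses; the field names follow the tiers above, and the types are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataSource:
    """Assumed metadata record for one data source; fields mirror the tiers above."""
    url: str
    # A tier
    aggregation_type: Optional[str] = None
    access_type: Optional[str] = None
    record_download_option_provided: Optional[bool] = None
    record_format: Optional[str] = None
    agency_supplied: Optional[bool] = None
    agency_originated: Optional[bool] = None
    supplier: Optional[str] = None
    originator: Optional[str] = None
    coverage_start: Optional[str] = None
    coverage_end: Optional[str] = None
    portal_type: Optional[str] = None
    scraper_url: Optional[str] = None
    readme_url: Optional[str] = None
    # A tier, rarely published
    retention_schedule: Optional[str] = None
    update_frequency: Optional[str] = None
    source_last_updated: Optional[str] = None
    # B tier
    size: Optional[str] = None
    update_method: Optional[str] = None
    sort_method: Optional[str] = None
    access_restrictions: Optional[str] = None
```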
Related reading
https://github.com/palewire/storysniffer/
http://blog.apps.npr.org/2016/06/17/scraping-tips.html