Properties of Data Sources to identify #11

Open · 2 of 20 tasks

josh-chamberlain opened this issue Nov 2, 2022 · 3 comments
josh-chamberlain commented Nov 2, 2022

Context

We want to add metadata to URLs, filter for relevancy, and expand our database of valid data sources.

Flowchart

The overall plan for data source identification is now in the readme of this repo.

Properties

These are all explained in the data dictionary.

S tier

A tier

  • aggregation_type
  • access_type
  • record_download_option_provided
  • record_format
  • Is it agency_supplied and agency_originated? If not, who are the supplier and originator?
  • coverage_start
  • coverage_end
  • portal_type
  • scraper_url
  • readme_url

Still A tier, but rarely published:

  • retention_schedule
  • update_frequency
  • source_last_updated

B tier

  • size
  • update_method
  • sort_method
  • access_restrictions

Related reading

https://github.com/palewire/storysniffer/
http://blog.apps.npr.org/2016/06/17/scraping-tips.html

@nfmcclure

An issue with the Common Crawl is that it doesn't get all the PD org sites. The Common Crawl is very respectful of robots.txt, and if a server's response time is lagging, it'll stop hitting that server altogether. This can be seen because:

I think the above is a separate issue because the solution requires scraping. I would suggest the solution here is to find all the unique hosts, look at host-url/sitemap.xml (if a sitemap exists), and extract all the URLs listed there.
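
A minimal sketch of that sitemap step, assuming the host serves a standard sitemaps.org XML file at /sitemap.xml (the function name and example host are illustrative only):

import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(host):
    """Return the URLs listed in host/sitemap.xml, or [] if no sitemap exists."""
    resp = requests.get(f"https://{host}/sitemap.xml", timeout=10)
    if resp.status_code != 200:
        return []
    root = ET.fromstring(resp.content)
    # <loc> entries hold the URLs; a sitemap index file returns child sitemap URLs instead.
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]

# e.g. sitemap_urls("springfield-or.gov")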


I think the real issue is how to clean and categorize any URL (then this solution can be applied to future sitemaps).

First and foremost, identifying relevant URLs in a large list is important. Here are some suggestions for next steps:

  1. Can we find rules to remove bad URLs? E.g. URLs containing 'blog' or 'store', URLs with question marks in them, or non-secure http URLs (a rough sketch follows this list).
  2. From the remaining URLs, we would have to label them as relevant or not. Then we can use model(s) very similar to the palewire/storysniffer project linked above. It looks like they started out by labelling about 1-3 thousand URLs.
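
A rough rule-based filter for item 1 could look something like this (the keyword list and rules are illustrative placeholders, not a vetted set):

from urllib.parse import urlparse

BAD_PATH_WORDS = {"blog", "store", "shop", "calendar"}  # illustrative list only

def looks_irrelevant(url):
    parsed = urlparse(url)
    if parsed.scheme != "https":   # drop non-secure http URLs
        return True
    if parsed.query:               # drop URLs with question marks / query strings
        return True
    path_parts = set(parsed.path.lower().strip("/").split("/"))
    return bool(BAD_PATH_WORDS & path_parts)  # drop obvious non-data pages

urls = [
    "https://springfield-or.gov/city/police-department/department-policies/",
    "http://example.gov/blog/some-post",
]
candidates = [u for u in urls if not looks_irrelevant(u)]  # keeps only the first URL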

Second, I think from these relevant URLs, identifying features from them is important.

  1. Agency Location. Identifying the department location (state police: CA, AZ, ...; metro police: NYC, LA, ...; county police: King County, ...).
    • I think this should be attempted first in a dumb/simple way: get a list of all metros, counties, and states, and just look at which location string matches the host/domain or URL + homepage HTML (see the sketch after this list).
  2. Record Type. This is hard. Probably something similar to the URL classification from storysniffer linked above.
  3. Coverage & Dates. Maybe something with the Common Crawl last access date? E.g. the most recent crawl is 2022-40 (40th week of 2022 = Oct 3rd, 2022). Also note that XML sitemaps tend to have a "last modified" field, if the URL is in there. But again, looking at the XML sitemaps is more of a scraping task, whereas the Common Crawl has already scraped any URL it contains.
  4. Agency-supplied. Most of the URLs searched have the top-level domains .gov or .us. This usually means they are agency-supplied. We can search more domain names if we want.
  5. Size. More thought needed here. Is the full HTML size an upper bound? This assumes the data is on the URL itself.
  6. others.... TBD.
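
The dumb/simple first pass at item 1 could be a plain string match of known place names against the hostname; the place list below is just a stand-in for a real list of metros, counties, and states:

from urllib.parse import urlparse

# Stand-in for a real list of metros, counties, and states.
PLACES = {
    "springfield-or": ("Springfield", "OR"),
    "troymi": ("Troy", "MI"),
    "turlock": ("Turlock", "CA"),
}

def guess_location(url):
    host = urlparse(url).netloc.lower()
    for key, place in PLACES.items():
        if key in host:
            return place
    return None  # fall back to searching the homepage HTML

# guess_location("https://springfield-or.gov/city/police-department/") -> ("Springfield", "OR")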


josh-chamberlain commented Nov 3, 2022

Thanks for thinking this through @nfmcclure! I think gathering info from the URL will work in some cases, but I'm sure we'll miss a lot, and a lot of URLs are simply unhelpful. At some point we'll need to start looking at page headers and content to identify sources.

  1. Agency location: since we have a homepage URL for most of the agencies in our database, we should be able to get a lot of URLs matched to agencies by simply comparing root domains (a sketch follows this list). We could also probably use those root domains as a way to search for new URLs. I think the process would be something like:

    • take an agency homepage URL
    • use a combination of Common Crawl, probably the Internet Archive, and a sitemap generator (or find the existing sitemap) to locate URLs on the domain, as well as on other domains being used to display data (lots of data portals aren't on the government domain)
    • run the URLs through a little toolkit of identification processes and scripts to get as many properties established as possible
  2. Record type is definitely hard to get from the URL, though in some cases I'm sure it'd work. I think using what's on the page will be the best bet.

  3. Good idea!

  4. Agreed.

  5. This depends—if there's a "download" button on the page, the size on disk of the file behind that button is the size we want. That said, "size" is a nice-to-know about a data source but not required. I'll update the hierarchy of these properties here in a minute.
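
The root-domain comparison in item 1 might look like the sketch below. It uses tldextract (a third-party library that handles suffixes like .ca.us) to get the registered domain; the agency dictionary is a hypothetical sample, not the real database:

import tldextract  # third-party; handles public suffixes like .ca.us correctly

# Hypothetical sample of agency homepage URLs from the database.
AGENCY_HOMEPAGES = {
    "https://springfield-or.gov/": "Springfield Police Department (OR)",
    "https://turlock.ca.us/": "Turlock Police Department (CA)",
}

def root_domain(url):
    return tldextract.extract(url).registered_domain  # e.g. "springfield-or.gov"

HOMEPAGE_INDEX = {root_domain(u): agency for u, agency in AGENCY_HOMEPAGES.items()}

def match_agency(url):
    """Return the agency whose homepage shares this URL's registered domain, if any."""
    return HOMEPAGE_INDEX.get(root_domain(url))

# match_agency("https://springfield-or.gov/city/police-department/")
#   -> "Springfield Police Department (OR)"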

Any strategy we develop will hit a point of diminishing returns where it's easier to just manually look at what's left, which is A-OK.

@nfmcclure

I filtered about 1,000 unique host domains from states, counties, and cities.

examples:

https://troymi.gov/
https://trumanmn.us/
https://turlock.ca.us/

I'm guessing there's about 2,000 total unique host domains in the above CSV.

I built a scrapy Sitemap spider that looks through each robots.txt or sitemap.xml and finds all the routes on the server with the words "police" or "cop" in them. It stores the URL + last-modified date (if it exists).
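
A rough sketch of this kind of spider, using scrapy's SitemapSpider (illustrative only, not the actual code; the seed URL is a placeholder, and the last-modified value here comes from the HTTP response header rather than the sitemap's lastmod field):

from scrapy.spiders import SitemapSpider

class PoliceSitemapSpider(SitemapSpider):
    name = "police_sitemap"
    # Placeholder seed; in practice this would be built from the ~1,000 host domains.
    sitemap_urls = ["https://springfield-or.gov/robots.txt"]
    # Only follow sitemap entries whose URL matches "police" or "cop".
    sitemap_rules = [("police", "parse_page"), ("cop", "parse_page")]

    def parse_page(self, response):
        last_modified = response.headers.get("Last-Modified")
        yield {
            "url": response.url,
            # HTTP header value; reading the sitemap's own <lastmod> field would
            # need extra handling (e.g. overriding sitemap_filter).
            "last_modified": last_modified.decode() if last_modified else None,
        }

Something like this can be run standalone with scrapy runspider police_sitemap.py -o results.csv.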

From 1,000 host domains, it gets about 40k URL paths.

examples:

URL,last_modified_date
https://springfield-or.gov/city/police-department/patrol/animal-services/,2022-09-27T17:43:24+00:00
https://springfield-or.gov/city/police-department/springfield-police-advisory-committee/,2022-10-07T16:34:43+00:00
https://springfield-or.gov/city/police-department/department-policies/,2022-10-14T15:26:31+00:00
https://springfield-or.gov/city/police-department/ballot-measure-20-327/,2022-10-20T16:17:53+00:00

So to note:

  • This doesn't cover all the cities/counties/states, nor does it cover all related links.
  • It only looks for 'police' or 'cop' URLs, so it will miss a bunch of relevant pages. This can be improved.
  • Next would be classifying "record type", maybe using a labelling tool like Doccano (https://doccano.github.io/doccano/).

Also seeing if we can get the relevant agency (state/county/metro).
