Properties of Data Sources to identify #11

Open · 2 of 20 tasks

josh-chamberlain opened this issue Nov 2, 2022 · 3 comments
josh-chamberlain commented Nov 2, 2022

Context

We want to add metadata to URLs, filter for relevancy, and expand our database of valid data sources.

Flowchart

The overall plan for data source identification is now in the readme of this repo.

Properties

These are all explained in the data dictionary.

S tier

A tier

  • aggregation_type
  • access_type
  • record_download_option_provided
  • record_format
  • Is it agency_supplied and agency_originated? If not, who are the supplier and originator?
  • coverage_start
  • coverage_end
  • portal_type
  • scraper_url
  • readme_url

Still A tier, but rarely published:

  • retention_schedule
  • update_frequency
  • source_last_updated

B tier

  • size
  • update_method
  • sort_method
  • access_restrictions

Related reading

https://github.com/palewire/storysniffer/
http://blog.apps.npr.org/2016/06/17/scraping-tips.html

@nfmcclure

An issue with the Common Crawl is that it doesn't get all the PD org sites. The Common Crawl is very respectful of robots.txt, and if a server's response time is lagging, it'll stop hitting that server altogether. This can be seen because:

I think the above is a separate issue because the solution requires scraping. I would suggest the solution here is to find all the unique hosts, look at host-url/sitemap.xml (if a sitemap exists), and extract all the URLs listed there.
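
A minimal sketch of that sitemap step, assuming the host serves a standard sitemaps.org XML file at /sitemap.xml (the function name and example host are illustrative only):

import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(host):
    """Return the URLs listed in host/sitemap.xml, or [] if no sitemap exists."""
    resp = requests.get(f"https://{host}/sitemap.xml", timeout=10)
    if resp.status_code != 200:
        return []
    root = ET.fromstring(resp.content)
    # <loc> entries hold the URLs; a sitemap index file returns child sitemap URLs instead.
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]

# e.g. sitemap_urls("springfield-or.gov")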


I think the real issue is how to clean and categorize any URL (then this solution can be applied to future sitemaps).

First and foremost, identifying relevant URLs in a large list is important. Here are some suggestions for next steps:

  1. Can we find rules to remove bad URLs? E.g. URLs containing 'blog' or 'store', URLs with question marks in them, or non-secure http URLs (a rough sketch follows this list).
  2. From the remaining URLs, we would have to label them as relevant or not. Then we can use model(s) very similar to the palewire/storysniffer project linked above. It looks like they started out by labelling about 1-3 thousand URLs.
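
A rough rule-based filter for item 1 could look something like this (the keyword list and rules are illustrative placeholders, not a vetted set):

from urllib.parse import urlparse

BAD_PATH_WORDS = {"blog", "store", "shop", "calendar"}  # illustrative list only

def looks_irrelevant(url):
    parsed = urlparse(url)
    if parsed.scheme != "https":   # drop non-secure http URLs
        return True
    if parsed.query:               # drop URLs with question marks / query strings
        return True
    path_parts = set(parsed.path.lower().strip("/").split("/"))
    return bool(BAD_PATH_WORDS & path_parts)  # drop obvious non-data pages

urls = [
    "https://springfield-or.gov/city/police-department/department-policies/",
    "http://example.gov/blog/some-post",
]
candidates = [u for u in urls if not looks_irrelevant(u)]  # keeps only the first URL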

Second, I think from these relevant URLs, identifying features from them is important.

  1. Agency Location. Identifying the department location (state police: CA, AZ, ...; metro police: NYC, LA, ...; county police: King County, ...).
    • I think this should be attempted first in a dumb/simple way: get a list of all metros, counties, and states, and just look at which location string matches the host/domain or URL + homepage HTML (see the sketch after this list).
  2. Record Type. This is hard. Probably something similar to the URL classification from storysniffer linked above.
  3. Coverage & Dates. Maybe something with the Common Crawl last access date? E.g. the most recent crawl is 2022-40 (40th week of 2022 = Oct 3rd, 2022). Also note that XML sitemaps tend to have a "last modified" field, if the URL is in there. But again, looking at the XML sitemaps is more of a scraping task, whereas the Common Crawl has already scraped any URL it contains.
  4. Agency-supplied. Most of the URLs searched have the top-level domains .gov or .us. This usually means they are agency-supplied. We can search more domain names if we want.
  5. Size. More thought needed here. Is the full HTML size an upper bound? This assumes the data is on the URL itself.
  6. others.... TBD.
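
The dumb/simple first pass at item 1 could be a plain string match of known place names against the hostname; the place list below is just a stand-in for a real list of metros, counties, and states:

from urllib.parse import urlparse

# Stand-in for a real list of metros, counties, and states.
PLACES = {
    "springfield-or": ("Springfield", "OR"),
    "troymi": ("Troy", "MI"),
    "turlock": ("Turlock", "CA"),
}

def guess_location(url):
    host = urlparse(url).netloc.lower()
    for key, place in PLACES.items():
        if key in host:
            return place
    return None  # fall back to searching the homepage HTML

# guess_location("https://springfield-or.gov/city/police-department/") -> ("Springfield", "OR")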


josh-chamberlain commented Nov 3, 2022

Thanks for thinking this through @nfmcclure! I think gathering info from the URL will work in some cases, but I'm sure we'll miss a lot, and a lot of URLs are simply unhelpful. At some point we'll need to start looking at page headers and content to identify sources.

  1. Agency location: since we have a homepage URL for most of the agencies in our database, we should be able to get a lot of URLs matched to agencies by simply comparing root domains (a sketch follows this list). We could also probably use those root domains as a way to search for new URLs. I think the process would be something like:

    • take an agency homepage URL
    • use a combination of Common Crawl, probably the Internet Archive, and a sitemap generator (or find the existing sitemap) to locate URLs on the domain, as well as on other domains being used to display data (lots of data portals aren't on the government domain)
    • run the URLs through a little toolkit of identification processes and scripts to get as many properties established as possible
  2. Record type is definitely hard to get from the URL, though in some cases I'm sure it'd work. I think using what's on the page will be the best bet.

  3. Good idea!

  4. Agreed.

  5. This depends—if there's a "download" button on the page, the size on disk of the file behind that button is the size we want. That said, "size" is a nice-to-know about a data source but not required. I'll update the hierarchy of these properties here in a minute.
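
The root-domain comparison in item 1 might look like the sketch below. It uses tldextract (a third-party library that handles suffixes like .ca.us) to get the registered domain; the agency dictionary is a hypothetical sample, not the real database:

import tldextract  # third-party; handles public suffixes like .ca.us correctly

# Hypothetical sample of agency homepage URLs from the database.
AGENCY_HOMEPAGES = {
    "https://springfield-or.gov/": "Springfield Police Department (OR)",
    "https://turlock.ca.us/": "Turlock Police Department (CA)",
}

def root_domain(url):
    return tldextract.extract(url).registered_domain  # e.g. "springfield-or.gov"

HOMEPAGE_INDEX = {root_domain(u): agency for u, agency in AGENCY_HOMEPAGES.items()}

def match_agency(url):
    """Return the agency whose homepage shares this URL's registered domain, if any."""
    return HOMEPAGE_INDEX.get(root_domain(url))

# match_agency("https://springfield-or.gov/city/police-department/")
#   -> "Springfield Police Department (OR)"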

Any strategy we develop will hit a point of diminishing returns where it's easier to just manually look at what's left, which is A-OK.

@nfmcclure

I filtered about 1,000 unique host domains from states, counties, and cities.

examples:

https://troymi.gov/
https://trumanmn.us/
https://turlock.ca.us/

I'm guessing there's about 2,000 total unique host domains in the above CSV.

I built a scrapy Sitemap spider that looks through each robots.txt or sitemap.xml and finds all the routes on the server with the words "police" or "cop" in them. It stores the URL + last-modified date (if it exists).
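
A rough sketch of this kind of spider, using scrapy's SitemapSpider (illustrative only, not the actual code; the seed URL is a placeholder, and the last-modified value here comes from the HTTP response header rather than the sitemap's lastmod field):

from scrapy.spiders import SitemapSpider

class PoliceSitemapSpider(SitemapSpider):
    name = "police_sitemap"
    # Placeholder seed; in practice this would be built from the ~1,000 host domains.
    sitemap_urls = ["https://springfield-or.gov/robots.txt"]
    # Only follow sitemap entries whose URL matches "police" or "cop".
    sitemap_rules = [("police", "parse_page"), ("cop", "parse_page")]

    def parse_page(self, response):
        last_modified = response.headers.get("Last-Modified")
        yield {
            "url": response.url,
            # HTTP header value; reading the sitemap's own <lastmod> field would
            # need extra handling (e.g. overriding sitemap_filter).
            "last_modified": last_modified.decode() if last_modified else None,
        }

Something like this can be run standalone with scrapy runspider police_sitemap.py -o results.csv.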

From 1,000 host domains, it gets about 40k URL paths.

examples:

URL,last_modified_date
https://springfield-or.gov/city/police-department/patrol/animal-services/,2022-09-27T17:43:24+00:00
https://springfield-or.gov/city/police-department/springfield-police-advisory-committee/,2022-10-07T16:34:43+00:00
https://springfield-or.gov/city/police-department/department-policies/,2022-10-14T15:26:31+00:00
https://springfield-or.gov/city/police-department/ballot-measure-20-327/,2022-10-20T16:17:53+00:00

So to note:

  • This doesn't cover all the cities/counties/states, nor does it cover all related links.
  • It only looks for 'police' or 'cop' URLs, so it will miss a bunch of relevant pages. This can be improved.
  • Next would be classifying "record type", maybe using a labelling tool like Doccano (https://doccano.github.io/doccano/).

Also seeing if we can get the relevant agency (state/county/metro).
