Automated search engine calls #53

Open
maxachis opened this issue Mar 14, 2024 · 19 comments

@maxachis
Collaborator

This is related to the solution I proposed in Police-Data-Accessibility-Project/data-cleaning#1, which used Google searches to find police agencies.

Given a list of police districts, cities, counties, etc., we could automate a call to the Google Search Engine API or something similar with the string "{Location} Police Agency Website", and then pull the first result. We could start with the most populous cities/counties/etc. in the US and gradually work our way down. We could also include additional searches, such as searching for all police districts in a location.

The API described above allows 100 free searches per day (or $5 per 1,000). If we create a GitHub Action, we could easily run that in the background each day and gradually add a trickle of results. We could include either only the first result or a set of results, depending on how confident we are that the first result will pull what we're looking for.

This would not accrue as many URLs as quickly, but I suspect it would have a higher hit-to-miss ratio for relevant data.
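
For concreteness, here's a minimal sketch of what a single lookup might look like against the Google Custom Search JSON API (the endpoint and response shape are the standard ones; the API key, search engine ID, and example location are placeholders):

import requests

# Placeholder credentials: a real run needs a Google API key and a
# Programmable Search Engine ID (cx) configured to search the open web.
API_KEY = "YOUR_API_KEY"
SEARCH_ENGINE_ID = "YOUR_CX_ID"

def first_result_url(location: str) -> str | None:
    """Return the top result URL for '{Location} Police Agency Website', or None."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f"{location} Police Agency Website",
        },
        timeout=30,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return items[0]["link"] if items else None

print(first_result_url("Pittsburgh, PA"))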

@josh-chamberlain
Contributor

josh-chamberlain commented Mar 15, 2024

Great! Suggested strategy:

  • hit the Google Search Engine API 100 times each day
    • generate a string "{agencies.name} {record_type}", cycling through all record types for each agency before moving on (see the sketch below)
      • instead of List of Data Sources, use "data portal"
    • try including the first 10 results; stripping out irrelevant ones with HF models seems to be pretty easy
  • add the results to a Hugging Face dataset. Since we're adding 100 a day, we can append to the same CSV, called something like google_api_urls (see feature: add common crawl to the pipeline #40)
  • keep track of which ones we've tried in a CSV in this repo, and don't try those again
    • use agencies.count_data_sources to pick which ones to try first, starting with those that have the most sources. They already skew toward populous cities, which are known to have sources.
      • from there, check other agencies in those counties. then...I don't know. randomly?
      • keep track of the date we tried each one last, so we can try it again

@mbodeantor feel free to refine / adjust any of this ↑
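
A rough sketch of that daily loop, assuming the 100-query budget and a local CSV for tracking what we've already tried (the record-type list, the CSV layout, and the search_urls helper are illustrative placeholders, not the final implementation):

import csv
from datetime import date
from pathlib import Path

DAILY_QUERY_BUDGET = 100
RECORD_TYPES = ["data portal", "incident reports", "arrest records"]  # illustrative subset
TRIED_CSV = Path("search_cache.csv")  # columns: agency_name, record_type, date_searched

def load_tried() -> set[tuple[str, str]]:
    """Read the (agency, record type) pairs we've already searched."""
    if not TRIED_CSV.exists():
        return set()
    with TRIED_CSV.open() as f:
        return {(row["agency_name"], row["record_type"]) for row in csv.DictReader(f)}

def run_daily_batch(agencies: list[str], search_urls) -> None:
    """Cycle through record types per agency, skipping pairs we've already tried.

    `search_urls` is any callable that takes a query string and returns URLs,
    e.g. a wrapper around the Custom Search call sketched earlier.
    """
    tried = load_tried()
    write_header = not TRIED_CSV.exists()
    budget = DAILY_QUERY_BUDGET
    with TRIED_CSV.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["agency_name", "record_type", "date_searched"])
        for agency in agencies:  # assumed pre-sorted by count_data_sources
            for record_type in RECORD_TYPES:
                if budget == 0:
                    return
                if (agency, record_type) in tried:
                    continue
                urls = search_urls(f"{agency} {record_type}")  # e.g. first 10 results
                # ...append `urls` to the google_api_urls dataset here...
                writer.writerow([agency, record_type, date.today().isoformat()])
                budget -= 1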

@maxachis
Collaborator Author

maxachis commented Mar 17, 2024

The first step would be finding data detailing all the different police agencies. I found a good start with the results of the 2018 Census of State and Local Law Enforcement Agencies (CSLLEA). I've also included the results of this census (for parsing in R) as a zip file:

ICPSR_38771-V1.zip

Additionally, here's the primary data as an Excel file:

2018 Police Agencies Census Data.xlsx

17,721 rows, and there's quite a bit of detail in this: not just the location of each agency, but also:

  1. Their 2018 estimated budget
  2. What type of agency they are (Sheriff's Department, tribal, college, etc.)
  3. What functions they served (for example, whether they served warrants, investigated homicide, arson, or cybercrime, or were used for crowd control)
  4. And more.

The full set of information is described in the codebook contained within the zip.

I don't want to say that this is the absolute best resource we could use for creating a master table for all police agencies in the United States, but it is hard for me to imagine something better.

@josh-chamberlain I think this data, or at least parts of it, would be worthwhile to add to our dataset, perhaps eventually merging with our agency table. From there, we could connect this information with other information of relevance (for example: the wealth/population of the associated city/county) and use that associated information to prioritize which agencies to start with.
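
As a sketch of the kind of prioritization I mean (the column names below are hypothetical; the real ones are defined in the ICPSR codebook):

import pandas as pd

# Load the 2018 CSLLEA extract; pandas needs openpyxl installed to read .xlsx.
agencies = pd.read_excel("2018 Police Agencies Census Data.xlsx")

# Hypothetical column names -- the actual ones come from the codebook in the zip.
prioritized = (
    agencies[["AGENCY_NAME", "STATE", "AGENCY_TYPE", "OPERATING_BUDGET_2018"]]
    .sort_values("OPERATING_BUDGET_2018", ascending=False)
    .reset_index(drop=True)
)

# The top of this list would feed the daily search batch first.
print(prioritized.head(10))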

@mbodeantor
Contributor

I agree with Josh that starting with the agencies we have the most data sources for already seems likely to yield the most relevant results. I'm not sure it makes sense to start considering additional agencies until we've gone through the ones we already have.

@maxachis
Collaborator Author

maxachis commented Mar 19, 2024

So if we were to start with the agencies we have the most data sources for (and which don't currently have a homepage URL), that actually wouldn't take too much time. The query as I'm envisioning it would be:

SELECT
    submitted_name,
    homepage_url,
    jurisdiction_type,
    state_iso,
    municipality,
    county_name,
    airtable_uid,
    count_data_sources,
    zip_code,
    no_web_presence -- Relevant
FROM
    public.agencies
WHERE
    approved = true
    AND homepage_url IS NULL
ORDER BY count_data_sources DESC

That produces about 5,800 rows (with about 60 having one or more data sources). So that'd be a good place to start.

@maxachis
Collaborator Author

maxachis commented Mar 20, 2024

@mbodeantor @josh-chamberlain I may need my database permissions updated so that I have CREATE TABLE permission, as I'd like to create a small additional table to keep track of which agencies have had a search performed on them (so I don't run the risk of redoing searches for them). Either that, or I would need the table created for me.
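
For concreteness, the table I have in mind would look something like this (column names, types, and connection parameters are a sketch, not a final schema; assumes psycopg2):

import psycopg2

# Sketch of the cache table: one row per agency we've searched a homepage for.
# The foreign key back to public.agencies is there to guarantee that cache rows
# reference a real agency UID.
CREATE_CACHE_TABLE = """
CREATE TABLE IF NOT EXISTS public.agency_url_search_cache (
    id SERIAL PRIMARY KEY,
    agency_airtable_uid TEXT NOT NULL
        REFERENCES public.agencies (airtable_uid),
    search_result TEXT,                      -- e.g. 'found_results' / 'no_results'
    searched_at TIMESTAMP NOT NULL DEFAULT NOW()
);
"""

with psycopg2.connect(  # placeholder connection parameters
    host="localhost", dbname="data_sources", user="maxachis", password="..."
) as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_CACHE_TABLE)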

@mbodeantor
Contributor

@maxachis You should still have create permissions on the public schema. What issue are you seeing?

@maxachis
Collaborator Author

@mbodeantor I'm running into this error when attempting to create a table through pgAdmin:

[screenshot: permission denied error in pgAdmin]

This is via the data_sources_app username.

@mbodeantor
Contributor

Yeah, you'll have to use the username and password in your DMs

@maxachis
Collaborator Author

maxachis commented Mar 21, 2024

@mbodeantor Ah, that did it. I was able to create the table, but I still got denied when adding a foreign key constraint to agencies (which would help ensure that anything in the cache is actually referencing a valid agency UID). I got the following error:

[screenshot: permission denied error when adding the foreign key constraint]

I believe I need the REFERENCES and possibly the ALTER permissions.

@mbodeantor
Contributor

@maxachis I granted you all permissions on that schema, lmk if that doesn't do it.

@maxachis
Collaborator Author

@mbodeantor Nope, still getting an error of permission denied for table agencies. Here's the SQL query:

ALTER TABLE agency_url_search_cache
ADD CONSTRAINT FK_AGENCY_UID
FOREIGN KEY (agency_airtable_uid)
REFERENCES public.agencies(airtable_uid);

@maxachis
Collaborator Author

Draft version of the PR has been updated. Remaining tasks:

  1. Create unit/integration tests for the code
  2. Create a YAML workflow for GitHub Actions, setting the script to run once per day.

@josh-chamberlain changed the title from "Alternative way of getting more urls: Automated Search Engine Calls" to "Automated search engine calls" on May 29, 2024
@maxachis
Collaborator Author

@josh-chamberlain The current version of this involves direct database access -- specifically, it updates an in-database "cache", agency_url_search_cache, which indicates which agencies have been searched for, so that it doesn't duplicate searches. It directly references the agencies table as a foreign key.

There are other ways to do this. I could maintain the cache in the repository, or I could build an endpoint.

The repository provides the most immediate solution to this, but if we want to run this regularly, the cache will need to be committed regularly, creating commit clutter*. It'll also be less atomic, meaning we could lose a whole day of Google searches if something breaks. I don't think an in-repo cache is the best option.

Developing an endpoint means we'd have to build a v2 or post-v2 endpoint, and couldn't deploy it until then. That means its development is contingent on the development of those endpoints as well.

The third option is to have it continue to interface directly with the database. That's the simplest, but it goes against our attempts to keep all database interactions behind the API endpoints.

Let me know your thoughts.

*Even more than my current granular commits add clutter!

@josh-chamberlain
Contributor

@maxachis maybe this is an opportunity to make an endpoint for caching searches; this is something we want to do, broadly, anyway. For basic context, it's worth knowing not only the last time we updated an agency or jurisdiction, but also the last time we searched it for a type of data (like homepage). For that reason, I think it's worth making an endpoint; later on we could abstract it for other types of caches, but it could be bespoke for now.

Can you clarify why it would need to be v2 or post-v2?

@maxachis
Collaborator Author

Can you clarify why it would need to be v2 or post-v2?

@josh-chamberlain No reason it has to be, per se. I assumed it would be since I figured we were avoiding significant changes to main if not contained within v2 or post-v2. However, if we're comfortable with developing an endpoint in the primary data-source-app repository, prior to v2, then I'm happy to do that as well. To do so would, however, either necessitate not incorporating some of the changes I'm setting up in v2 (like integration tests) or else duplicating those efforts.

@josh-chamberlain
Contributor

gotcha. that makes sense...I think it'd be better to effectively make a "legacy" endpoint without those tests and soon upgrade it to v2, duplicating effort that way rather than bringing tests into v1. We can wait until our hand is forced; maintaining a cache in the repo isn't the end of the world as a stopgap. Using the repo feels simpler/less annoying, but I'm not most affected. Your call!

@maxachis
Collaborator Author

maxachis commented Jun 1, 2024

@josh-chamberlain Got it. I will develop a legacy endpoint, probably called /agency-search-cache (rough sketch below), and then have my code make a call to it.

Using the repo feels simpler/less annoying, but I'm not most affected. Your call!

I've seen repos where identical commits were made nearly every day to refresh some data, and man: even if it seems simpler initially, once you start seeing 30, 60, 90 identical commits over the course of months, it makes things clunky.
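
A minimal sketch of what that legacy endpoint might look like, assuming a Flask app: the route name matches the one above, but the request fields and the commented-out db_client helper are placeholders for whatever the repo actually uses.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/agency-search-cache", methods=["POST"])
def update_agency_search_cache():
    """Record that a search was performed for an agency, so it isn't repeated."""
    data = request.get_json(silent=True) or {}
    agency_uid = data.get("agency_airtable_uid")
    if not agency_uid:
        return jsonify({"message": "agency_airtable_uid is required"}), 400
    # db_client is a stand-in for however the app talks to Postgres; it would
    # insert a row into agency_url_search_cache here, e.g.:
    # db_client.insert_search_cache_entry(agency_uid, data.get("search_result"))
    return jsonify({"message": "Search cache updated"}), 200

@app.route("/agency-search-cache", methods=["GET"])
def get_agencies_without_recent_search():
    """Return agencies that haven't been searched yet, to feed the daily batch."""
    # e.g. SELECT ... FROM agencies LEFT JOIN agency_url_search_cache ON ...
    return jsonify({"agencies": []}), 200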

@maxachis
Collaborator Author

@josh-chamberlain For calling the endpoint, I required API key authorization, then realized I'm not sure I have one!

After doing a bit more research, I realized a few things:

  1. The Gitbook doesn't appear to mention how one gets an API key to access the more sensitive endpoints. I'm sure I could just get admin rights, but that brings me to:
  2. An endpoint modifying the agencies cache probably should require elevated permissions, which would mean modifying security.py for now (and probably changing it to something more extensible later).

How should we approach this?

@josh-chamberlain
Contributor

@maxachis

  1. You can get an API key by hitting the api_key endpoint. Looks like we currently only have one admin in the database.
  2. True! Is it time to take the first step towards ABAC/PBAC? See "Permissions for users" (data-sources-app#162).
