Automated search engine calls #53

Open
maxachis opened this issue Mar 14, 2024 · 19 comments

@maxachis
Collaborator

This is related to the solution I proposed in Police-Data-Accessibility-Project/data-cleaning#1, which used Google searches to find police agencies.

Given a list of police districts, cities, counties, etc., we could automate a call to the Google Search Engine API or something similar with the string "{Location} Police Agency Website", and then pull the first result. We could start with the most populous cities/counties/etc. in the US and gradually work our way down. We could also include additional searches, such as searching for all police districts in a location.

The API described above allows 100 free searches per day (or $5 per 1,000). If we create a GitHub Action, we could easily run that in the background each day and gradually add a trickle of results. We could include either only the first result or a set of results, depending on how confident we are that the first result will pull what we're looking for.

This would not accrue as many URLs as quickly, but I suspect it would have a higher hit-to-miss ratio for relevant data.
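
For concreteness, here's a minimal sketch of what a single lookup might look like against the Google Custom Search JSON API (the endpoint and response shape are the standard ones; the API key, search engine ID, and example location are placeholders):

import requests

# Placeholder credentials: a real run needs a Google API key and a
# Programmable Search Engine ID (cx) configured to search the open web.
API_KEY = "YOUR_API_KEY"
SEARCH_ENGINE_ID = "YOUR_CX_ID"

def first_result_url(location: str) -> str | None:
    """Return the top result URL for '{Location} Police Agency Website', or None."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f"{location} Police Agency Website",
        },
        timeout=30,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return items[0]["link"] if items else None

print(first_result_url("Pittsburgh, PA"))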

@josh-chamberlain
Contributor

josh-chamberlain commented Mar 15, 2024

Great! Suggested strategy:

  • hit the Google Search Engine API 100 times each day
    • generate a string "{agencies.name} {record_type}", cycling through all record types for each agency before moving on (see the sketch below)
      • instead of List of Data Sources, use "data portal"
    • try including the first 10 results; stripping out irrelevant ones with HF models seems to be pretty easy
  • add the results to a Hugging Face dataset. Since we're adding 100 a day, we can append to the same CSV, called something like google_api_urls (see feature: add common crawl to the pipeline #40)
  • keep track of which ones we've tried in a CSV in this repo, and don't try those again
    • use agencies.count_data_sources to pick which ones to try first, starting with those that have the most sources. They already skew toward populous cities, which are known to have sources.
      • from there, check other agencies in those counties. then...I don't know. randomly?
      • keep track of the date we tried each one last, so we can try it again

@mbodeantor feel free to refine / adjust any of this ↑
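
A rough sketch of that daily loop, assuming the 100-query budget and a local CSV for tracking what we've already tried (the record-type list, the CSV layout, and the search_urls helper are illustrative placeholders, not the final implementation):

import csv
from datetime import date
from pathlib import Path

DAILY_QUERY_BUDGET = 100
RECORD_TYPES = ["data portal", "incident reports", "arrest records"]  # illustrative subset
TRIED_CSV = Path("search_cache.csv")  # columns: agency_name, record_type, date_searched

def load_tried() -> set[tuple[str, str]]:
    """Read the (agency, record type) pairs we've already searched."""
    if not TRIED_CSV.exists():
        return set()
    with TRIED_CSV.open() as f:
        return {(row["agency_name"], row["record_type"]) for row in csv.DictReader(f)}

def run_daily_batch(agencies: list[str], search_urls) -> None:
    """Cycle through record types per agency, skipping pairs we've already tried.

    `search_urls` is any callable that takes a query string and returns URLs,
    e.g. a wrapper around the Custom Search call sketched earlier.
    """
    tried = load_tried()
    write_header = not TRIED_CSV.exists()
    budget = DAILY_QUERY_BUDGET
    with TRIED_CSV.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["agency_name", "record_type", "date_searched"])
        for agency in agencies:  # assumed pre-sorted by count_data_sources
            for record_type in RECORD_TYPES:
                if budget == 0:
                    return
                if (agency, record_type) in tried:
                    continue
                urls = search_urls(f"{agency} {record_type}")  # e.g. first 10 results
                # ...append `urls` to the google_api_urls dataset here...
                writer.writerow([agency, record_type, date.today().isoformat()])
                budget -= 1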

@maxachis
Collaborator Author

maxachis commented Mar 17, 2024

The first step would be finding data detailing all the different police agencies. I found a good start with the results of the 2018 Census of State and Local Law Enforcement Agencies (CSLLEA). I've also included the results of this census (for parsing in R) as a zip file:

ICPSR_38771-V1.zip

Additionally, here's the primary data as an Excel file:

2018 Police Agencies Census Data.xlsx

17,721 rows, and there's quite a bit of detail in this: not just the location of each agency, but also:

  1. Their 2018 estimated budget
  2. What type of agency they are (Sheriff's Department, tribal, college, etc.)
  3. What functions they served (for example, whether they served warrants, investigated homicide, arson, or cybercrime, or were used for crowd control)
  4. And more.

The full set of information is described in the codebook contained within the zip.

I don't want to say that this is the absolute best resource we could use for creating a master table for all police agencies in the United States, but it is hard for me to imagine something better.

@josh-chamberlain I think this data, or at least parts of it, would be worthwhile to add to our dataset, perhaps eventually merging with our agency table. From there, we could connect this information with other information of relevance (for example: the wealth/population of the associated city/county) and use that associated information to prioritize which agencies to start with.
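
As a sketch of the kind of prioritization I mean (the column names below are hypothetical; the real ones are defined in the ICPSR codebook):

import pandas as pd

# Load the 2018 CSLLEA extract; pandas needs openpyxl installed to read .xlsx.
agencies = pd.read_excel("2018 Police Agencies Census Data.xlsx")

# Hypothetical column names -- the actual ones come from the codebook in the zip.
prioritized = (
    agencies[["AGENCY_NAME", "STATE", "AGENCY_TYPE", "OPERATING_BUDGET_2018"]]
    .sort_values("OPERATING_BUDGET_2018", ascending=False)
    .reset_index(drop=True)
)

# The top of this list would feed the daily search batch first.
print(prioritized.head(10))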

@mbodeantor
Contributor

I agree with Josh that starting with the agencies we have the most data sources for already seems likely to yield the most relevant results. I'm not sure it makes sense to start considering additional agencies until we've gone through the ones we already have.

@maxachis
Collaborator Author

maxachis commented Mar 19, 2024

So if we were to start with the agencies we have the most data sources for (and which don't currently have a homepage URL), that actually wouldn't take too much time. The query as I'm envisioning it would be:

SELECT
    submitted_name,
    homepage_url,
    jurisdiction_type,
    state_iso,
    municipality,
    county_name,
    airtable_uid,
    count_data_sources,
    zip_code,
    no_web_presence -- Relevant
FROM
    public.agencies
WHERE
    approved = true
    AND homepage_url IS NULL
ORDER BY count_data_sources DESC

That produces about 5,800 rows (with about 60 having one or more data sources). So that'd be a good place to start.

@maxachis
Collaborator Author

maxachis commented Mar 20, 2024

@mbodeantor @josh-chamberlain I may need my database permissions updated so that I have CREATE TABLE permission, as I'd like to create a small additional table to keep track of which agencies have had a search performed on them (so I don't run the risk of redoing searches for them). Either that, or I would need the table created for me.
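
For concreteness, the table I have in mind would look something like this (column names, types, and connection parameters are a sketch, not a final schema; assumes psycopg2):

import psycopg2

# Sketch of the cache table: one row per agency we've searched a homepage for.
# The foreign key back to public.agencies is there to guarantee that cache rows
# reference a real agency UID.
CREATE_CACHE_TABLE = """
CREATE TABLE IF NOT EXISTS public.agency_url_search_cache (
    id SERIAL PRIMARY KEY,
    agency_airtable_uid TEXT NOT NULL
        REFERENCES public.agencies (airtable_uid),
    search_result TEXT,                      -- e.g. 'found_results' / 'no_results'
    searched_at TIMESTAMP NOT NULL DEFAULT NOW()
);
"""

with psycopg2.connect(  # placeholder connection parameters
    host="localhost", dbname="data_sources", user="maxachis", password="..."
) as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_CACHE_TABLE)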

@mbodeantor
Contributor

@maxachis You should still have create permissions on the public schema. What issue are you seeing?

@maxachis
Collaborator Author

@mbodeantor I'm running into this error when attempting to create a table through pgAdmin:

[screenshot: permission denied error in pgAdmin]

This is via the data_sources_app username.

@mbodeantor
Contributor

Yeah, you'll have to use the username and password in your DMs

@maxachis
Collaborator Author

maxachis commented Mar 21, 2024

@mbodeantor Ah, that did it. I was able to create the table, but I still got denied when adding a foreign key constraint to agencies (which would help ensure that anything in the cache is actually referencing a valid agency UID). I got the following error:

[screenshot: permission denied error when adding the foreign key constraint]

I believe I need the REFERENCES and possibly the ALTER permissions.

@mbodeantor
Contributor

@maxachis I granted you all permissions on that schema, lmk if that doesn't do it.

@maxachis
Collaborator Author

@mbodeantor Nope, still getting an error of permission denied for table agencies. Here's the SQL query:

ALTER TABLE agency_url_search_cache
ADD CONSTRAINT FK_AGENCY_UID
FOREIGN KEY (agency_airtable_uid)
REFERENCES public.agencies(airtable_uid);

@maxachis
Collaborator Author

Draft version of the PR has been updated. Remaining tasks:

  1. Create unit/integration tests for the code
  2. Create a YAML workflow for GitHub Actions, setting the script to run once per day.

@josh-chamberlain changed the title from "Alternative way of getting more urls: Automated Search Engine Calls" to "Automated search engine calls" on May 29, 2024
@maxachis
Collaborator Author

@josh-chamberlain The current version of this involves direct database access -- specifically, it updates an in-database "cache", agency_url_search_cache, which indicates which agencies have been searched for, so that it doesn't duplicate searches. It directly references the agencies table as a foreign key.

There are other ways to do this. I could maintain the cache in the repository, or I could build an endpoint.

The repository provides the most immediate solution to this, but if we want to run this regularly, the cache will need to be committed regularly, creating commit clutter*. It'll also be less atomic, meaning we could lose a whole day of Google searches if something breaks. I don't think an in-repo cache is the best option.

Developing an endpoint means we'd have to build a v2 or post-v2 endpoint, and couldn't deploy it until then. That means its development is contingent on the development of those endpoints as well.

The third option is to have it continue to interface directly with the database. That's the simplest, but it goes against our attempts to keep all database interactions behind the API endpoints.

Let me know your thoughts.

*Even more than my current granular commits add clutter!

@josh-chamberlain
Contributor

@maxachis maybe this is an opportunity to make an endpoint for caching searches; this is something we want to do, broadly, anyway. For basic context, it's worth knowing not only the last time we updated an agency or jurisdiction, but also the last time we searched it for a type of data (like homepage). For that reason, I think it's worth making an endpoint; later on we could abstract it for other types of caches, but it could be bespoke for now.

Can you clarify why it would need to be v2 or post-v2?

@maxachis
Collaborator Author

Can you clarify why it would need to be v2 or post-v2?

@josh-chamberlain No reason it has to be, per se. I assumed it would be since I figured we were avoiding significant changes to main if not contained within v2 or post-v2. However, if we're comfortable with developing an endpoint in the primary data-source-app repository, prior to v2, then I'm happy to do that as well. To do so would, however, either necessitate not incorporating some of the changes I'm setting up in v2 (like integration tests) or else duplicating those efforts.

@josh-chamberlain
Contributor

gotcha. that makes sense...I think it'd be better to effectively make a "legacy" endpoint without those tests and soon upgrade it to v2, duplicating effort that way rather than bringing tests into v1. We can wait until our hand is forced; maintaining a cache in the repo isn't the end of the world as a stopgap. Using the repo feels simpler/less annoying, but I'm not most affected. Your call!

@maxachis
Collaborator Author

maxachis commented Jun 1, 2024

@josh-chamberlain Got it. I will develop a legacy endpoint, probably called /agency-search-cache (rough sketch below), and then have my code make a call to it.

Using the repo feels simpler/less annoying, but I'm not most affected. Your call!

I've seen repos where identical commits were made nearly every day to refresh some data, and man: even if it seems simpler initially, once you start seeing 30, 60, 90 identical commits over the course of months, it makes things clunky.
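
A minimal sketch of what that legacy endpoint might look like, assuming a Flask app: the route name matches the one above, but the request fields and the commented-out db_client helper are placeholders for whatever the repo actually uses.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/agency-search-cache", methods=["POST"])
def update_agency_search_cache():
    """Record that a search was performed for an agency, so it isn't repeated."""
    data = request.get_json(silent=True) or {}
    agency_uid = data.get("agency_airtable_uid")
    if not agency_uid:
        return jsonify({"message": "agency_airtable_uid is required"}), 400
    # db_client is a stand-in for however the app talks to Postgres; it would
    # insert a row into agency_url_search_cache here, e.g.:
    # db_client.insert_search_cache_entry(agency_uid, data.get("search_result"))
    return jsonify({"message": "Search cache updated"}), 200

@app.route("/agency-search-cache", methods=["GET"])
def get_agencies_without_recent_search():
    """Return agencies that haven't been searched yet, to feed the daily batch."""
    # e.g. SELECT ... FROM agencies LEFT JOIN agency_url_search_cache ON ...
    return jsonify({"agencies": []}), 200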

@maxachis
Collaborator Author

@josh-chamberlain For calling the endpoint, I required API key authorization, then realized I'm not sure I have one!

After doing a bit more research, I realized a few things:

  1. The Gitbook doesn't appear to mention how one gets an API key to access the more sensitive endpoints. I'm sure I could just get admin rights, but that brings me to:
  2. An endpoint modifying the agencies cache probably should require elevated permissions, which would mean modifying security.py for now (and probably changing it to something more extensible later).

How should we approach this?

@josh-chamberlain
Contributor

@maxachis

  1. You can get an API key by hitting the api_key endpoint. Looks like we currently only have one admin in the database.
  2. True! Is it time to take the first step towards ABAC/PBAC? See "Permissions for users" (data-sources-app#162).
