Spike generate popular tasks using BigQuery #3761
base: main
Conversation
The local env is used only for local testing; this ensures secrets are not accidentally pushed.
- Add the BigQuery gem
- Create a BigQuery service so the app can talk to BigQuery. NB: to use this you must add the credentials to `config/local_env.yml`; there are instructions on how to do this in the [dev docs](https://docs.publishing.service.gov.uk/repos/content-data-api/google_analytics_setup.html)
- Added a basic SQL query to retrieve some initial test data. NB: this query currently only works for a set time period, and only on the benefits and business pages (this shouldn't be too hard to change)
- Updated the view to render the results fetched from BigQuery: a simple unordered list displaying each search term as a link, just confirming that the data can be passed from BigQuery to the view. NB: the title still needs to be added to the link rendering
- Cache the expensive process of retrieving popular tasks from BigQuery, saving it under a cache key that includes the date and the browse page name. Sneaky change added in here that I'll move to a different commit: change the BigQuery data retrieval to only collect data for one browse page at a time.
- Improve data availability by having a backup cache if the latest is not available:
  - Added a backup cache mechanism to ensure data is available even if the latest cache is expired or unavailable.
  - Popular tasks data is now stored in both a latest cache (24-hour expiration) and a backup cache (7-day expiration).
  - Fall back to the backup cache when the latest cache is missing, so users always see data even if fresh data retrieval fails.
  - Updated methods to handle the cache fallback logic, improving robustness and reducing the likelihood of empty data responses.
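The dual-cache idea described above could be sketched roughly like this. This is an illustrative sketch only, not the PR's code: the class name `PopularTasksCache`, the key format, and the in-memory store (standing in for `Rails.cache`) are all assumptions.

```ruby
# Sketch of the latest/backup cache scheme. A plain hash stands in for
# Rails.cache; the PR itself uses 24.hours / 7.days via ActiveSupport.
class PopularTasksCache
  LATEST_TTL = 24 * 60 * 60      # seconds (24 hours)
  BACKUP_TTL = 7 * 24 * 60 * 60  # seconds (7 days)

  def initialize(clock: -> { Time.now })
    @store = {} # key => [value, expires_at]
    @clock = clock
  end

  # Write fresh data to both caches at once, so the backup is always
  # at most as stale as the last successful fetch.
  def write_both(browse_page, data)
    now = @clock.call
    date = now.strftime("%Y-%m-%d")
    @store["latest/#{date}/#{browse_page}"] = [data, now + LATEST_TTL]
    @store["backup/#{browse_page}"] = [data, now + BACKUP_TTL]
  end

  # Prefer today's latest cache; fall back to the longer-lived backup.
  def read(browse_page)
    date = @clock.call.strftime("%Y-%m-%d")
    fetch("latest/#{date}/#{browse_page}") || fetch("backup/#{browse_page}")
  end

  private

  def fetch(key)
    value, expires_at = @store[key]
    return nil if value.nil? || @clock.call > expires_at
    value
  end
end
```

Keying the latest cache by date plus browse page name means a new day naturally misses the latest cache, while the backup still serves last week's data if the fresh fetch fails.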
app/services/popular_tasks.rb
WHERE Rank < 6
SQL

data = @fetch_data.query(query).all
How many results does this query fetch?
This is the top 5; we can increase or lower it by changing the rank threshold here:
WHERE Rank < 6
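The PR's full SQL isn't shown here, so as an illustration only: a typical shape for this kind of "top N" query ranks rows with a window function and then filters on the rank, so `Rank < 6` keeps the top 5. The column and table names below are assumptions, not the PR's schema.

```ruby
# Illustrative "top N" query shape; search_term and analytics_events
# are placeholder names, not the actual BigQuery schema.
TOP_N = 5

query = <<~SQL
  WITH ranked AS (
    SELECT search_term,
           COUNT(*) AS uses,
           ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS Rank
    FROM analytics_events
    GROUP BY search_term
  )
  SELECT search_term, uses
  FROM ranked
  WHERE Rank < #{TOP_N + 1}
SQL
```

Raising or lowering `TOP_N` (or the literal threshold in the `WHERE` clause) changes how many results come back.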
@@ -1,5 +1,6 @@
class PopularTasks
  CACHE_EXPIRATION = 24.hours # Set the cache expiration time
  BACKUP_CACHE_EXPIRATION = 7.days # Backup cache can have a longer expiration
I think we need to think a bit more about how this would work.
If the BigQuery data was unavailable for more than 7 days, then what happens?
I can think of other ways to do it, but this feels like a problem that must have been solved many times before, i.e. only expire the cache if fresh data is available to fill it.
I've caught up now... the cache will expire regardless of whether or not the API responds so I understand the need for a backup. And I like the idea of writing to the backup at the same time as you fetch the fresh data.
Some things to consider for next steps (which I've started thinking about here).
Also, the code base has moved on quite a bit since this spike was started, in terms of the logic in the browse helper, hence starting the fresh spike 2 branch.
Trello card: https://trello.com/c/AGRVqS4S
This spike sets up a service that connects to BigQuery.
To run this locally you will need to set up some config, following the steps in the dev docs.
It collects the data from BigQuery and outputs it on the benefits and business pages.
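As a rough sketch of what such a service might look like with the `google-cloud-bigquery` gem. The env var names and lazy-client structure are assumptions for illustration, not the PR's actual implementation.

```ruby
# Sketch of a BigQuery-backed service. Credentials would come from
# config/local_env.yml locally; env var names here are placeholders.
begin
  require "google/cloud/bigquery" # gem "google-cloud-bigquery"
rescue LoadError
  # Gem not installed; the class below still illustrates the shape.
end

class BigQueryService
  def initialize(project_id: ENV["BIGQUERY_PROJECT"],
                 credentials: ENV["BIGQUERY_CREDENTIALS"])
    @project_id = project_id
    @credentials = credentials # path to the service-account JSON key
  end

  def query(sql)
    client.query(sql) # enumerable of row hashes
  end

  private

  # Built lazily so nothing connects until a query is actually run.
  def client
    @client ||= Google::Cloud::Bigquery.new(
      project_id: @project_id,
      credentials: @credentials
    )
  end
end
```

The view layer would then iterate over the returned rows to build the unordered list of popular-task links.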
There is an initial implementation of caching and backup caching. This needs to be amended so it will run locally.
The links are currently displayed as a slug; this will need to be changed to show a title.
A commit for the Sentry logging still needs to be added.