Spike 2 big query #3773

hannako · 2024-09-20T21:05:18Z

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

Follow these steps if you are doing a Rails upgrade.

The local env file is used only for local testing; adding it to the gitignore ensures secrets are not accidentally pushed.

- Create a BigQuery service so the app can talk to BigQuery. NB: to use this you must add the credentials to config/local_env.yml. There are instructions on how to do this in the [dev docs](https://docs.publishing.service.gov.uk/repos/content-data-api/google_analytics_setup.html).

- Create a popular tasks namespace for these two services. Ideally BigQuery would have a simple table per browse page, and Collections would just fetch the top 6 links. We would need to chat to a PA to see which table structure is most cost-effective (it may be better to have a single table for all browse pages). For performance reasons, I think we should rely on BigQuery to process the data, and Collections should fetch the minimum amount of data with little or no processing on the application side. This is a different approach to that taken in 046d618. This commit needs some work on the tests.
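To make the "minimal processing on the application side" idea concrete, here is a rough sketch rather than the code in this PR. The module name, table name, column names and the `rank` column are all assumptions. The client is a stand-in that mimics the `query(sql, params:)` shape of the google-cloud-bigquery gem, so the example runs without credentials.

```ruby
# Hypothetical sketch: module, table and column names are assumptions,
# not taken from this PR.
module PopularTasksSketch
  class Fetcher
    LINK_LIMIT = 6

    # Ranking is assumed to be precomputed on the BigQuery side, so the
    # app only selects the top rows for one browse page.
    QUERY = <<~SQL.freeze
      SELECT title, link
      FROM `popular_tasks.browse_pages`
      WHERE browse_page = @browse_page
      ORDER BY rank
      LIMIT #{LINK_LIMIT}
    SQL

    # client: any object responding to query(sql, params:) and returning
    # an enumerable of hashes with symbol keys.
    def initialize(client:)
      @client = client
    end

    # Returns only the fields the page needs; no sorting or filtering here.
    def fetch(browse_page)
      rows = @client.query(QUERY, params: { browse_page: browse_page })
      rows.map { |row| { title: row[:title], link: row[:link] } }
    end
  end
end

# Stand-in for a real BigQuery client, used only to make the sketch runnable.
class FakeBigQueryClient
  attr_reader :last_sql, :last_params

  def initialize(rows)
    @rows = rows
  end

  def query(sql, params: {})
    @last_sql = sql
    @last_params = params
    @rows
  end
end

sketch_client = FakeBigQueryClient.new(
  [{ title: "Universal Credit", link: "/universal-credit", rank: 1 }],
)
sketch_links = PopularTasksSketch::Fetcher.new(client: sketch_client).fetch("/browse/benefits")
```

Keeping the limit and ordering in the SQL means Collections never holds more than six rows per browse page, whichever table structure is chosen.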

To parse the raw data

See https://cloud.google.com/bigquery/docs/error-messages#errortable for the BigQuery error table. If the cache expires and we can't fetch fresh data, we should fall back to a backup that is rebuilt via a nightly job.
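One way the fallback could work is sketched below, under several assumptions: `ExpiringStore` is a small in-memory stand-in for the real cache, `QueryFailed` stands in for whatever error class the BigQuery client raises, and the backup store is assumed to be filled by the nightly job rather than by this method.

```ruby
# Hypothetical sketch: ExpiringStore, QueryFailed and fetch_with_fallback
# are illustrative names, not part of this PR.
class QueryFailed < StandardError; end

class ExpiringStore
  Entry = Struct.new(:value, :expires_at)

  # clock is injectable so expiry can be exercised without sleeping.
  def initialize(clock: -> { Time.now })
    @data = {}
    @clock = clock
  end

  # Returns nil for a missing or expired key.
  def read(key)
    entry = @data[key]
    return nil if entry.nil?
    return nil if entry.expires_at && entry.expires_at <= @clock.call

    entry.value
  end

  # expires_in is in seconds; nil means the entry never expires.
  def write(key, value, expires_in: nil)
    expires_at = expires_in ? @clock.call + expires_in : nil
    @data[key] = Entry.new(value, expires_at)
    value
  end
end

# Order of preference: live cache, then a fresh query, then the backup.
def fetch_with_fallback(key, cache:, backup:, expires_in: 3600)
  cached = cache.read(key)
  return cached unless cached.nil?

  fresh = yield # runs the query; may raise QueryFailed
  cache.write(key, fresh, expires_in: expires_in)
  fresh
rescue QueryFailed
  # The backup is assumed to be rebuilt by a nightly job elsewhere.
  backup.read(key) || []
end
```

Returning an empty list when even the backup is missing is a design choice made for the sketch; the page could equally hide the popular tasks section in that case.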

Small refactor, so that we can re-use this method to fetch data without having an expiry (maybe!)
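As an illustration of what a reusable method with optional expiry might look like, here is a sketch; the signature is a guess, not the one in the commit. `TinyCache` is a hypothetical stand-in that mirrors the shape of `Rails.cache.fetch(key, expires_in:) { ... }` so the example runs outside Rails.

```ruby
# Hypothetical sketch: TinyCache and this cached_query signature are
# assumptions for illustration only.
class TinyCache
  def initialize(clock: -> { Time.now })
    @entries = {}
    @clock = clock
  end

  # Returns the stored value while it is still live; otherwise runs the
  # block, stores its result and returns it.
  def fetch(key, expires_in: nil)
    value, expires_at = @entries[key]
    live = @entries.key?(key) && (expires_at.nil? || expires_at > @clock.call)
    return value if live

    value = yield
    @entries[key] = [value, expires_in && @clock.call + expires_in]
    value
  end
end

# expires_in: nil means "cache without an expiry", which is the case the
# refactor is meant to allow.
def cached_query(cache, key, expires_in: nil, &query)
  cache.fetch(key, expires_in: expires_in, &query)
end

demo_cache = TinyCache.new
demo_rows = cached_query(demo_cache, "popular_tasks/benefits") { [{ title: "Universal Credit" }] }
```

Passing the query as a block keeps the caching concern separate from the SQL, so the same method can serve both the expiring cache and a non-expiring copy.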

beccapearce and others added 3 commits September 20, 2024 11:12

Add local_env to the gitignore

eaee2e5

The local env file is used only for local testing; adding it to the gitignore ensures secrets are not accidentally pushed.

Add the BigQuery gem

3ae2ae2

govuk-ci temporarily deployed to collections-pr-3773 September 20, 2024 21:05 Inactive

hannako force-pushed the spike_2_big_query branch from 16b5c6a to 120d27d Compare September 20, 2024 21:16

govuk-ci temporarily deployed to collections-pr-3773 September 20, 2024 21:17 Inactive

Popular tasks service

426aa27

To parse the raw data

hannako force-pushed the spike_2_big_query branch from 120d27d to 426aa27 Compare September 20, 2024 21:18

govuk-ci temporarily deployed to collections-pr-3773 September 20, 2024 21:19 Inactive

Write bigquery response to the cache

6abac69

govuk-ci temporarily deployed to collections-pr-3773 September 20, 2024 21:52 Inactive

hannako force-pushed the spike_2_big_query branch from be16af9 to 4cf28ca Compare September 20, 2024 22:21

govuk-ci temporarily deployed to collections-pr-3773 September 20, 2024 22:21 Inactive

hannako added 2 commits September 20, 2024 23:23

We should have some error handling

81f7fc3

See https://cloud.google.com/bigquery/docs/error-messages#errortable for the BigQuery error table. If the cache expires and we can't fetch fresh data, we should fall back to a backup that is rebuilt via a nightly job.

Create a reusable cached_query method

31e5c2e

Small refactor, so that we can re-use this method to fetch data without having an expiry (maybe!)

hannako force-pushed the spike_2_big_query branch from 4cf28ca to 31e5c2e Compare September 20, 2024 22:25

govuk-ci temporarily deployed to collections-pr-3773 September 20, 2024 22:25 Inactive

hannako mentioned this pull request Sep 23, 2024

Spike generate popular tasks using BigQuery #3761

Draft
