Port to PostgreSQL #1085

Merged
merged 34 commits into main from port-to-postgresql on Jan 2, 2024

Conversation

aldavidson
Contributor

@aldavidson aldavidson commented Jun 9, 2023

Port content-store to run on RDS PostgreSQL, rather than MongoDB.

As of Monday 18th December 2023, all content-store and draft-content-store applications in all environments are running this branch (not main), via the content-store container.
As of Tuesday 2nd January 2024 at 13:45, the commit history in this branch has been cleaned up, rebased, and is ready to be merged into main (PR #1199 was force-pushed onto this branch).

This application is owned by the publishing platform team. Please let us know in #govuk-publishing-platform when you raise any PRs.

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

Follow these steps if you are doing a Rails upgrade.

@aldavidson aldavidson marked this pull request as ready for review June 9, 2023 11:23
@aldavidson aldavidson force-pushed the port-to-postgresql branch 2 times, most recently from 38aab29 to d390f76 on June 29, 2023 at 11:52
@aldavidson aldavidson force-pushed the port-to-postgresql branch 2 times, most recently from e7bbce4 to 88bf1aa on July 10, 2023 at 14:52
nacnudus added a commit to alphagov/govuk-s3-mirror that referenced this pull request Oct 30, 2023
The Content Store database is being migrated from MongoDB to PostgreSQL.
See alphagov/content-store#1085.

Nightly backups of the postgres database are now available in S3.  By
copying them to Google Cloud Platform, we make it possible to adapt
the GOV.UK Knowledge Graph to use them.
nacnudus added a commit to alphagov/govuk-knowledge-graph-gcp that referenced this pull request Oct 31, 2023
The content store is being migrated from MongoDB to Postgres. See
alphagov/content-store#1085.

This is a first attempt to adapt to using the Postgres version.

1. Restore the backup of the Postgres database.
2. Export the `content_items` table as lines of JSON.
3. Import the JSON into MongoDB.
4. Query as before.

Pros:
- Easy to develop, similar to existing steps in the data pipeline
- Avoids translating the MongoDB queries into Postgres ones

Cons:
- Not in the spirit of GOV.UK's policy to stop using MongoDB
- Extends the data pipeline in both time and complexity
- Misses the opportunity to improve the whole pipeline, such as by using
  the Publishing API database for everything, instead of using the
  Content Store for some things.
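
For step 2 of that pipeline, a minimal Ruby sketch of exporting the table as JSON lines might look like this; the database name, the `row_to_json` export approach, and the use of the pg gem are assumptions rather than anything specified in this PR.

```
require "pg"

# Assumes a locally restored database named "content_store"; the real
# pipeline's connection details aren't part of this PR.
conn = PG.connect(dbname: "content_store")

File.open("content_items.jsonl", "w") do |file|
  # row_to_json serialises each row of content_items as one JSON object
  conn.exec("SELECT row_to_json(t) FROM content_items t") do |result|
    result.each_row { |row| file.puts(row.first) }
  end
end
```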
aldavidson and others added 20 commits December 27, 2023 14:26
This will allow us to cross-reference PostgreSQL records to MongoDB
records post-migration if needed
Support import of doubly-nested mongo date fields
Add field mappings for ScheduledPublishingLogEntry and PublishIntent
Add mongo_id field to user & scheduled_publishing_log_entry
Add `rails_timestamp` method to remove conflicts with ActiveRecord
behaviour when doing `.insert` with some-but-not-all values given
Add support for batch_size in JsonImporter
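
JsonImporter's actual interface isn't shown in this PR, but a batched import along these lines is one way to picture it; the class shape, the one-JSON-object-per-line file format, and the use of `insert_all` are all assumptions.

```
require "json"

class JsonImporter
  def initialize(model:, batch_size: 1_000)
    @model = model
    @batch_size = batch_size
  end

  # Reads a file of one JSON object per line and inserts rows in batches
  def import(path)
    File.foreach(path).each_slice(@batch_size) do |lines|
      rows = lines.map { |line| JSON.parse(line) }
      # insert_all skips callbacks and validations, so each row must
      # already include any required values (e.g. timestamps)
      @model.insert_all(rows)
    end
  end
end
```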
This will allow us to perform side-by-side performance comparisons of
the Mongo and Postgres content-stores on the same hardware (e.g.
local dev laptop) and prove that the PostgreSQL content-store is at
least as performant as the Mongo version.
This improves response times to around 30% of previous values
These will no longer be needed after migration to PostgreSQL.
Some records in the MongoDB have nil values in `created_at` or
`updated_at`. ActiveRecord's `timestamps` migration method by default
creates these fields without allowing nil values, so we must explicitly
add support for this after the fact.
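
A follow-up migration of the kind described might look like this sketch; the table name and the Rails version tag are assumptions.

```
class AllowNullTimestampsOnContentItems < ActiveRecord::Migration[7.0]
  def change
    # Relax the NOT NULL constraints that `timestamps` created by default
    change_column_null :content_items, :created_at, true
    change_column_null :content_items, :updated_at, true
  end
end
```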
Some records in the old MongoDB have `description` as a simple value,
some have it as a Hash. We need to support both, and make sure that we
only wrap the given value in a Hash if it isn't already one.
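
As a sketch, that normalisation could live in a setter like the one below; the method placement and the `{ "value" => ... }` shape are assumptions based on the description above.

```
def description=(value)
  # Leave Hashes alone; wrap simple values so the rest of the app can
  # rely on a consistent Hash shape
  super(value.is_a?(Hash) ? value : { "value" => value })
end
```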
This fixes a bug where unpublished redirects in short URL manager aren't removed
from the content store (so continue to work on the website).

Users of the content-store API (i.e. publishing-api) might make API calls with
values they want to reset provided as `nil`. For example, if you wanted to clear
some redirects on a content item, you might do something like:

```
PUT /content/some-made-up-url

{
  ...
  "redirects": nil
  ...
}
```

The intent of the user of the API is clear here - they want no redirects.

However, ContentItem has a default value for redirects:

```
  field :redirects, type: Array, default: []
```

And the rest of the content-store expects this value to be an Array, not to be
nil.

By passing potentially nil values to `assign_attributes`, we allow a situation
where fields that content-store expects not to be nil (because they have
defaults) can end up nil. This tends to result in NoMethodErrors, such as this one:

```
NoMethodError
undefined method `map' for nil:NilClass

    redirects = item.redirects.map(&:to_h).map(&:deep_symbolize_keys)
                              ^^^^
/app/app/models/route_set.rb:30:in `from_content_item'
/app/app/models/content_item.rb:215:in `route_set'
/app/app/models/content_item.rb:225:in `should_register_routes?'
/app/app/models/content_item.rb:193:in `register_routes'
/app/app/models/content_item.rb:33:in `create_or_replace'
/app/app/controllers/content_items_controller.rb:32:in `block in update'
```

I can't think of any valid reason for overriding default attributes with nils,
so it feels like calling .compact is the right thing to do here.
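
As a sketch of that fix (names illustrative):

```
# Before: nil values overwrite defaults such as redirects: []
item.assign_attributes(attributes)

# After: .compact drops nil-valued keys, so the defaults survive
item.assign_attributes(attributes.compact)
```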
The `.any?` method, when called on a Relation, seems to instantiate
the objects in the resultset, which is very slow on Kubernetes. It
worked fine in Mongoid, but not in ActiveRecord.

If we replace this with `.count.positive?`, it's much faster.
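
Roughly, the change looks like this; the relation itself is illustrative.

```
# Before: can materialise every matching record just to test for presence
ContentItem.where(schema_name: "redirect").any?

# After: a single SELECT COUNT(*), evaluated in the database
ContentItem.where(schema_name: "redirect").count.positive?
```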
Whitehall doesn't do much validation of a given scheduled publishing
time. As a result, it can sometimes send us really extreme values for
`scheduled_publishing_delay_seconds` (e.g. 400 years into the future).

This can cause problems in the importer when Mongo has accepted the
value, but PostgreSQL can't. Changing the field type to `bigint`
fixes the issue.
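
A sketch of the corresponding migration; the table name is assumed from the ScheduledPublishingLogEntry model mentioned earlier.

```
class WidenScheduledPublishingDelay < ActiveRecord::Migration[7.0]
  def change
    # bigint comfortably holds extreme delays, e.g. 400 years is roughly
    # 1.26e10 seconds, which overflows a 32-bit integer column
    change_column :scheduled_publishing_log_entries,
                  :scheduled_publishing_delay_seconds, :bigint
  end
end
```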
...it will fail due to the read-only filesystem in prod.

Also, some whitespace-only corrections to the migration
It turns out that you can't call `LogStasher.add_custom_fields` twice.
Because Content Store's custom fields config was being run after the
govuk_app_config gem's own custom fields config, the gem's config was
being overwritten. So, fields like `govuk_request_id` and `varnish_id`
weren't appearing in Content Store's controller request logs (but
`govuk_dependency_resolution_source_content_id` was).

The gem (version 9.7.0) now provides a mechanism for setting custom
fields that doesn't overwrite the gem's own settings (see
https://docs.publishing.service.gov.uk/repos/govuk_app_config.html#logger-configuration):

```
GovukJsonLogging.configure do
  add_custom_fields do |fields|
    fields[:govuk_custom_field] = request.headers["GOVUK-Custom-Header"]
  end
end
```
An earlier commit on main (d5422b46 in PR #1136) fixed a subtle
issue when overriding default values with nil, by explicitly setting
`.created_at` and other attributes from the existing item when it was
being replaced. This caused issues on this PostgreSQL branch after
rebasing, as ActiveRecord behaves more as expected with respect to
`created_at` and therefore the line creating a local `created_at`
variable had been removed.

This commit reintroduces that variable, and tests now pass again.
@aldavidson aldavidson changed the title [Do not merge] Port to PostgreSQL Port to PostgreSQL Jan 2, 2024
@aldavidson
Contributor Author

aldavidson commented Jan 2, 2024

For the record, I have just force-pushed the test branch (from PR #1199) to this branch, as it has a) been rebased onto main, and b) a much cleaner commit history, thanks to @brucebolt for reviewing and tidying that up.

Just in case something were to go wrong with the merge into main, I have tagged the previous head commit of this branch as port-to-postgresql-final-unrebased-version-before-merge-into-main and pushed it to remote, as that's the commit which is currently running in production and all other environments. So in a worst-case scenario of needing to roll back, we can still retrieve it.

@aldavidson aldavidson merged commit e383e17 into main Jan 2, 2024
19 checks passed
@aldavidson aldavidson deleted the port-to-postgresql branch January 2, 2024 14:09