Derive MongoDB content store from Postgres
The content store is being migrated from MongoDB to Postgres. See
alphagov/content-store#1085.

This is a first attempt at adapting to the Postgres version:

1. Restore the backup of the Postgres database.
2. Export the `content_items` table as lines of JSON.
3. Import the JSON into MongoDB.
4. Query as before.
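A minimal sketch of those steps, assuming a live Postgres and MongoDB and illustrative file and database names (`backup.dump`, the `content_store` MongoDB database, and the `mongoimport` flags are placeholders, not the real pipeline):

```shell
#!/bin/bash
set -euo pipefail

# 1. Restore the backup of the Postgres database
#    (backup.dump is a placeholder for the real dump file)
pg_restore -U postgres --create --no-owner --dbname=postgres backup.dump

# 2. Export the `content_items` table as lines of JSON
psql -U postgres --dbname=content_store_production --tuples-only \
  --command="SELECT row_to_json(content_items) FROM content_items;" \
  > content_items.json

# 3. Import the JSON into MongoDB (database/collection names are assumptions)
mongoimport --db=content_store --collection=content_items \
  --file=content_items.json

# 4. Query as before
mongosh content_store --eval 'db.content_items.findOne()'
```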

Pros:

- Easy to develop, similar to existing steps in the data pipeline
- Avoids translating the MongoDB queries into Postgres ones

Cons:

- Not in the spirit of GOV.UK's policy to stop using MongoDB
- Extends the data pipeline in both time and complexity
- Misses the opportunity to improve the whole pipeline, such as by using
  the Publishing API database for everything instead of using the
  Content Store for some things
nacnudus committed Nov 1, 2023
1 parent 0b5193e commit 3b6932b
Showing 31 changed files with 2,394 additions and 6 deletions.
71 changes: 71 additions & 0 deletions .github/workflows/docker-content-dev.yml
@@ -0,0 +1,71 @@
name: Docker-content-dev

on:
  push:
    branches:
      - dev
    paths:
      - 'docker/content/**'
      - '.github/workflows/docker-content-dev.yml'

defaults:
  run:
    working-directory: docker/content

env:
  GITHUB_SHA: ${{ github.sha }}
  GITHUB_REF: ${{ github.ref }}
  IMAGE: 'content'
  REGISTRY_HOSTNAME: 'europe-west2-docker.pkg.dev/govuk-knowledge-graph-dev/docker'

jobs:

  terraform:
    name: 'Docker Build'
    runs-on: ubuntu-latest
    permissions:
      contents: 'read'
      id-token: 'write'

    steps:
      # actions/checkout MUST come before auth
      - uses: 'actions/checkout@v3'

      - id: 'auth'
        name: 'Authenticate to Google Cloud'
        uses: 'google-github-actions/auth@v0'
        with:
          workload_identity_provider: 'projects/628722085506/locations/global/workloadIdentityPools/github-pool/providers/github-pool-provider'
          service_account: 'artifact-registry-docker@govuk-knowledge-graph-dev.iam.gserviceaccount.com'

      # Further steps are automatically authenticated

      # Install gcloud. `setup-gcloud` automatically picks up authentication from `auth`.
      - name: 'Set up Cloud SDK'
        uses: 'google-github-actions/setup-gcloud@v0'

      # Configure docker to use the gcloud command-line tool as a credential helper
      - run: |
          gcloud auth configure-docker europe-west2-docker.pkg.dev

      # Build the Docker image
      - name: Docker build
        id: build
        run: |
          export TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
          echo "$TAG"
          docker build -t "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG" \
            --build-arg GITHUB_SHA="$GITHUB_SHA" \
            --build-arg GITHUB_REF="$GITHUB_REF" .

      # Push the Docker image to Google Artifact Registry
      - name: Docker push
        id: push
        run: |
          export TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
          echo "$TAG"
          docker push "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG"
          docker tag "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG" "$REGISTRY_HOSTNAME"/"$IMAGE":latest
          docker push "$REGISTRY_HOSTNAME"/"$IMAGE":latest
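The `awk -F/ '{print $NF}'` one-liner in the build and push steps derives the image tag from the last path component of `GITHUB_REF`. Its behaviour can be checked locally:

```shell
# GITHUB_REF looks like refs/heads/<branch> on branch pushes;
# splitting on "/" and taking the last field yields the branch name.
GITHUB_REF="refs/heads/dev"
TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
echo "$TAG"   # prints "dev"
```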
71 changes: 71 additions & 0 deletions .github/workflows/docker-content-staging.yml
@@ -0,0 +1,71 @@
name: Docker-content-staging

on:
  push:
    branches:
      - staging
    paths:
      - 'docker/content/**'
      - '.github/workflows/docker-content-staging.yml'

defaults:
  run:
    working-directory: docker/content

env:
  GITHUB_SHA: ${{ github.sha }}
  GITHUB_REF: ${{ github.ref }}
  IMAGE: 'content'
  REGISTRY_HOSTNAME: 'europe-west2-docker.pkg.dev/govuk-knowledge-graph-staging/docker'

jobs:

  terraform:
    name: 'Docker Build'
    runs-on: ubuntu-latest
    permissions:
      contents: 'read'
      id-token: 'write'

    steps:
      # actions/checkout MUST come before auth
      - uses: 'actions/checkout@v3'

      - id: 'auth'
        name: 'Authenticate to Google Cloud'
        uses: 'google-github-actions/auth@v0'
        with:
          workload_identity_provider: 'projects/957740527277/locations/global/workloadIdentityPools/github-pool/providers/github-pool-provider'
          service_account: 'artifact-registry-docker@govuk-knowledge-graph-staging.iam.gserviceaccount.com'

      # Further steps are automatically authenticated

      # Install gcloud. `setup-gcloud` automatically picks up authentication from `auth`.
      - name: 'Set up Cloud SDK'
        uses: 'google-github-actions/setup-gcloud@v0'

      # Configure docker to use the gcloud command-line tool as a credential helper
      - run: |
          gcloud auth configure-docker europe-west2-docker.pkg.dev

      # Build the Docker image
      - name: Docker build
        id: build
        run: |
          export TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
          echo "$TAG"
          docker build -t "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG" \
            --build-arg GITHUB_SHA="$GITHUB_SHA" \
            --build-arg GITHUB_REF="$GITHUB_REF" .

      # Push the Docker image to Google Artifact Registry
      - name: Docker push
        id: push
        run: |
          export TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
          echo "$TAG"
          docker push "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG"
          docker tag "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG" "$REGISTRY_HOSTNAME"/"$IMAGE":latest
          docker push "$REGISTRY_HOSTNAME"/"$IMAGE":latest
71 changes: 71 additions & 0 deletions .github/workflows/docker-content.yml
@@ -0,0 +1,71 @@
name: Docker-content

on:
  push:
    branches:
      - main
    paths:
      - 'docker/content/**'
      - '.github/workflows/docker-content.yml'

defaults:
  run:
    working-directory: docker/content

env:
  GITHUB_SHA: ${{ github.sha }}
  GITHUB_REF: ${{ github.ref }}
  IMAGE: 'content'
  REGISTRY_HOSTNAME: 'europe-west2-docker.pkg.dev/govuk-knowledge-graph/docker'

jobs:

  terraform:
    name: 'Docker Build'
    runs-on: ubuntu-latest
    permissions:
      contents: 'read'
      id-token: 'write'

    steps:
      # actions/checkout MUST come before auth
      - uses: 'actions/checkout@v3'

      - id: 'auth'
        name: 'Authenticate to Google Cloud'
        uses: 'google-github-actions/auth@v0'
        with:
          workload_identity_provider: 'projects/19513753240/locations/global/workloadIdentityPools/github-pool/providers/github-pool-provider'
          service_account: 'artifact-registry-docker@govuk-knowledge-graph.iam.gserviceaccount.com'

      # Further steps are automatically authenticated

      # Install gcloud. `setup-gcloud` automatically picks up authentication from `auth`.
      - name: 'Set up Cloud SDK'
        uses: 'google-github-actions/setup-gcloud@v0'

      # Configure docker to use the gcloud command-line tool as a credential helper
      - run: |
          gcloud auth configure-docker europe-west2-docker.pkg.dev

      # Build the Docker image
      - name: Docker build
        id: build
        run: |
          export TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
          echo "$TAG"
          docker build -t "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG" \
            --build-arg GITHUB_SHA="$GITHUB_SHA" \
            --build-arg GITHUB_REF="$GITHUB_REF" .

      # Push the Docker image to Google Artifact Registry
      - name: Docker push
        id: push
        run: |
          export TAG=$(echo "$GITHUB_REF" | awk -F/ '{print $NF}')
          echo "$TAG"
          docker push "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG"
          docker tag "$REGISTRY_HOSTNAME"/"$IMAGE":"$TAG" "$REGISTRY_HOSTNAME"/"$IMAGE":latest
          docker push "$REGISTRY_HOSTNAME"/"$IMAGE":latest
22 changes: 22 additions & 0 deletions docker/content/Dockerfile
@@ -0,0 +1,22 @@
FROM postgres:16.0-alpine3.18

# Prepare to install things
RUN apk add --update \
    curl \
    python3

# Install the gcloud CLI, a specific version from a long-term archive
RUN \
    curl -O https://storage.googleapis.com/cloud-sdk-release/google-cloud-cli-452.0.1-linux-x86_64.tar.gz \
    && tar -xf google-cloud-cli-452.0.1-linux-x86_64.tar.gz \
    && ./google-cloud-sdk/install.sh --quiet --rc-path ~/.bashrc \
    && . ~/.bashrc  # POSIX `.` rather than `source`, which /bin/sh lacks

# Reset the postgres entrypoint to the docker default, so that we can run our
# own CMD
ENTRYPOINT []

# Run a script from a copy of the HEAD of the repository
CMD \
    gcloud storage cat "gs://${PROJECT_ID}-repository/docker/content/run.sh" \
    | bash
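The CMD streams `run.sh` from the bucket and pipes it into bash, so the script can change without rebuilding the image. A local analogue of that pattern (file name and message are illustrative):

```shell
# Write a stand-in script, then execute it by piping it to bash,
# the same shape as `gcloud storage cat ... | bash` in the CMD above.
printf 'echo "hello from a streamed script"\n' > /tmp/streamed.sh
cat /tmp/streamed.sh | bash   # prints "hello from a streamed script"
```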
91 changes: 91 additions & 0 deletions docker/content/run.sh
@@ -0,0 +1,91 @@
#!/bin/bash

# Increase the amount of shared memory available.
# This requires the container to run in privileged mode.
# It prevents a postgres error
# "could not resize shared memory segment: No space left on device"
mount -o remount,size=8G /dev/shm

# Run both postgres and scripts that interact with the database

# Obtain the latest state of the repository
gcloud storage cp -r "gs://${PROJECT_ID}-repository/*" .

# turn on bash's job control
set -m

# Start postgres in the background. The docker-entrypoint.sh script is on the
# path, and handles users and permissions
# https://stackoverflow.com/a/48880635/937932
cp src/postgres/postgresql.conf.write-optimised src/postgres/postgresql.conf
docker-entrypoint.sh postgres -c config_file=src/postgres/postgresql.conf &

# Wait for postgres to start
sleep 5

# Restore the Content Store database from its backup file in GCP Storage

# Construct the file's URL
BUCKET=$(
  gcloud compute instances describe content \
    --project $PROJECT_ID \
    --zone $ZONE \
    --format="value(metadata.items.object_bucket)"
)
OBJECT=$(
  gcloud compute instances describe content \
    --project $PROJECT_ID \
    --zone $ZONE \
    --format="value(metadata.items.object_name)"
)
OBJECT_URL="gs://$BUCKET/$OBJECT"
FILE_PATH="data/$OBJECT"

# https://stackoverflow.com/questions/6575221
date
gcloud storage cp "$OBJECT_URL" "$FILE_PATH"

# Check that the file size is larger than an arbitrary size of 1GiB.
# Typically they are nearly 2GiB.
# On 2023-03-03 the database backup files had a problem and were only a few
# megabytes.
minimumsize=1073741824
actualsize=$(wc -c <"$FILE_PATH")
if [ "$actualsize" -le "$minimumsize" ]; then
  # Turn this instance off and exit. The data that is currently in BigQuery
  # will remain there.
  gcloud compute instances delete content --quiet --zone=$ZONE
  exit 1
fi

date
pg_restore \
  -U postgres \
  --verbose \
  --create \
  --clean \
  --dbname=postgres \
  --no-owner \
  --jobs=2 \
  "$FILE_PATH"
date
rm "$FILE_PATH"

# Reload postgres with a safer, less write-optimised configuration
cp src/postgres/postgresql.conf.safe src/postgres/postgresql.conf
psql -U postgres -c "SELECT pg_reload_conf();"

# Export the content_items table as JSON, to be loaded into MongoDB
date
psql \
  --username=postgres \
  --dbname="content_store_production" \
  --tuples-only \
  --command="SELECT row_to_json(content_items) FROM content_items;" \
  | gzip -c \
  | gcloud storage cp - "gs://${PROJECT_ID}-data-processed/content-store/content_items.json.gz"
date

# Stop this instance
# https://stackoverflow.com/a/41232669
gcloud compute instances delete content --quiet --zone=$ZONE
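The 1 GiB guard in run.sh (1073741824 bytes = 1024³) can be exercised with a small threshold and a throwaway file:

```shell
# Same shape as the guard in run.sh, with a tiny threshold for illustration.
minimumsize=10
printf 'tiny' > /tmp/fake-backup   # 4 bytes, below the threshold
actualsize=$(wc -c < /tmp/fake-backup)
if [ "$actualsize" -le "$minimumsize" ]; then
  echo "backup too small"
fi
```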
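The export step relies on `--tuples-only` emitting one `row_to_json` document per line, which is the newline-delimited JSON that `mongoimport` consumes. The compression round-trip can be simulated with placeholder documents (field values are illustrative):

```shell
# Two placeholder documents, one JSON object per line, gzip-compressed
# and decompressed again, as in the export pipeline.
printf '%s\n' \
  '{"base_path":"/example","schema_name":"guide"}' \
  '{"base_path":"/another","schema_name":"answer"}' \
  | gzip -c > /tmp/content_items.json.gz
gunzip -c /tmp/content_items.json.gz | wc -l   # two lines
```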