ubicloud/rhizome/postgres/bin/install-wal-g
Burak Yucesoy 1ab94bee4b Implement backup taking for PostgreSQL servers
This commit adds a mechanism to push basebackups and WAL files to blob storage.

Daily backups are taken whenever the carefully designed need_backup? helper
returns true. Since it is called in every iteration of the
PostgresTimelineNexus, it is designed to make as few network calls as possible
to the underlying VM. This is achieved by recording the last time we checked
the VM and found that no backup was needed (thus the check was ineffective).
This information is kept in the last_ineffective_check_at field of the
PostgresTimeline entity, and we don't check again if last_ineffective_check_at
is within the last 20 minutes. In this form, need_backup? covers a surprisingly
large number of cases:
- Regular case: After initial provisioning, the status will be "NotStarted",
and we will trigger a backup. Once the backup completes successfully,
need_backup? will return true only after 24 hours have passed since the backup
start time.
- Long backup case: If a backup doesn't complete within 24 hours, need_backup?
will keep returning false while the backup is running. Shortly after it
completes (~10 minutes on average), need_backup? will start returning true to
trigger the next backup.
- Failure case: If a backup fails at some point, need_backup? will start
returning true within ~10 minutes, which restarts the backup.
- VM unavailable case: The status check will raise an error, and we will retry
it in the next iteration.

All of this is achieved while ensuring that two backup processes never run at
the same time (daemonizer also guarantees that, but this is an extra layer of
protection). We also won't try to access the VM more often than once every 20
minutes (except in the unavailable VM case, which can be handled more
gracefully in the future if needed).
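
The decision logic described above can be sketched roughly as follows. This is
a minimal illustration, not the code from this commit: the TimelineSketch
class, the injected status_checker callable, and the latest_backup_started_at
field are hypothetical, and the status strings other than "NotStarted" are
assumed.

```ruby
# Hypothetical stand-in for the PostgresTimeline entity. Only the throttling
# timestamp from the commit message and an assumed backup start time are kept.
class TimelineSketch
  RECHECK_INTERVAL = 20 * 60      # seconds between status checks on the VM
  BACKUP_INTERVAL = 24 * 60 * 60  # seconds between daily backups

  attr_accessor :last_ineffective_check_at, :latest_backup_started_at

  # status_checker is any callable that returns the backup status string.
  # In a real system this would be the network call to the VM.
  def initialize(status_checker)
    @status_checker = status_checker
  end

  def need_backup?(now = Time.now)
    # Skip the network call if a recent check already found nothing to do.
    if last_ineffective_check_at && now - last_ineffective_check_at < RECHECK_INTERVAL
      return false
    end

    # May raise if the VM is unreachable; the caller retries next iteration.
    status = @status_checker.call

    # Never started, or the previous attempt failed: (re)start the backup.
    return true if ["NotStarted", "Failed"].include?(status)

    # Last backup succeeded: start a new one only once 24 hours have passed.
    if status == "Succeeded" &&
        (latest_backup_started_at.nil? || now - latest_backup_started_at >= BACKUP_INTERVAL)
      return true
    end

    # Backup still running, or succeeded recently: remember the ineffective check.
    self.last_ineffective_check_at = now
    false
  end
end
```

With checks spaced at most 20 minutes apart, a backup that finishes or fails at
a random moment is noticed after about 10 minutes on average, which is
presumably where the ~10 minute figure above comes from.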

While the need_backup? logic is pretty solid, there are 2 important points that
we need to address:
1. We populate each server with WAL-G credentials that have access to the whole
blob storage, not just the server's own bucket. Combined with the fact that we
give superuser access to users, this means that everyone can read everyone
else's database. This is obviously not acceptable, but our primitive MinIO
client does not support creating separate users and access policies. This is
not a limitation of MinIO, but of our client. We will address this problem
before the launch of the PostgreSQL service.

2. We use WAL-G for taking daily basebackups as well as for continuous archival
of the WAL files. Installing WAL-G unfortunately increased the provisioning
time significantly because the maintainers do not package WAL-G for Ubuntu
22.04, so we have to compile it ourselves. Since we don't have a mechanism to
generate OS images beforehand, this compilation happens at provisioning time.
This is less severe than the previous issue and is not a blocker for launching
the service. For the short term, I am planning to create images manually and
pull them to the VmHosts when there is a change, like we do for GithubRunners,
until we have image burning capability.
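
To illustrate what point 1 is missing, a per-server access policy could look
roughly like the sketch below. It builds a standard S3-style policy document
restricted to a single bucket. The bucket_scoped_policy helper and the bucket
name are hypothetical, and how such a policy would be attached to a per-server
user is not shown, since the client does not support that yet.

```ruby
require "json"

# Hypothetical helper: builds an S3-style policy document that only grants
# access to one bucket and the objects inside it.
def bucket_scoped_policy(bucket_name)
  {
    Version: "2012-10-17",
    Statement: [
      {
        # Bucket-level actions apply to the bucket ARN itself.
        Effect: "Allow",
        Action: ["s3:GetBucketLocation", "s3:ListBucket"],
        Resource: ["arn:aws:s3:::#{bucket_name}"]
      },
      {
        # Object-level actions apply to the keys under the bucket.
        Effect: "Allow",
        Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        Resource: ["arn:aws:s3:::#{bucket_name}/*"]
      }
    ]
  }
end

# "example-timeline-bucket" is a made-up name used only for illustration.
puts JSON.pretty_generate(bucket_scoped_policy("example-timeline-bucket"))
```

Credentials bound to a policy like this could read and write backups in their
own bucket but not in anyone else's.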
2023-11-14 17:06:11 +01:00

31 lines
711 B
Ruby
Executable File

#!/bin/env ruby
# frozen_string_literal: true

require_relative "../../common/lib/util"

if ARGV.count != 1
  fail "Wrong number of arguments. Expected 1, Given #{ARGV.count}"
end

commit_id = ARGV[0]

# Install dependencies
r "add-apt-repository -y ppa:longsleep/golang-backports"
r "apt-get update"
r "apt-get -y install golang-go cmake"

r "mkdir -p var/wal-g"
Dir.chdir("var/wal-g") do
  # Fetch wal-g
  r "git init"
  r "git remote remove origin || true"
  r "git remote add origin https://github.com/wal-g/wal-g.git"
  r "git fetch origin --depth 1 #{commit_id}"
  r "git reset --hard FETCH_HEAD"

  # Compile and install wal-g
  r "make deps"
  r "make pg_build"
  r "GOBIN=/usr/bin make pg_install"
end
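
The script relies on the r helper from common/lib/util, which is not shown on
this page. As a rough idea of what such a helper typically does, here is a
minimal sketch that runs a shell command and raises if it exits with a
non-zero status. The name run_or_raise and its exact behavior are assumptions,
not the actual implementation.

```ruby
require "open3"

# Hypothetical stand-in for a "run or fail" helper. It runs the command
# through the shell, captures stdout and stderr together, and raises on a
# non-zero exit status so that a failed step aborts the script.
def run_or_raise(command)
  output, status = Open3.capture2e(command)
  unless status.success?
    raise "command failed (exit #{status.exitstatus}): #{command}\n#{output}"
  end
  output
end

# Shell constructs such as "|| true" still work, because a command string
# containing shell metacharacters is executed via the shell.
run_or_raise("false || true")
```

Failing fast matters here because each step depends on the previous one: there
is no point in running make if the fetch did not succeed.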