This commit adds a mechanism to push basebackups and WAL files to a blob storage. Daily backups are taken when the carefully designed need_backup? helper returns true. Since it is called in every iteration of the PostgresTimelineNexus, it is designed to make as few network calls as possible to the underlying VM. This is achieved by recording the last time we checked the VM and found that no backup was needed (i.e. the check was ineffective). This information is kept in the last_ineffective_check_at field of the PostgresTimeline entity, and we don't recheck if last_ineffective_check_at is within the last 20 minutes. In this form, need_backup? covers a surprisingly large number of cases:

- Regular case: After initial provisioning, the status will be "NotStarted", and we will trigger a backup. When a backup completes successfully, need_backup? will return true only after 24 hours have passed since the backup start time.
- Long backup case: If a backup doesn't complete within 24 hours, need_backup? will continue to return false while the backup is running. Only after completion (~10 minutes on average) will it start to return true and trigger the next backup.
- Failure case: If a backup fails at some point, need_backup? will start to return true within ~10 minutes, which will restart the backup.
- VM unavailable case: In this case, the status check will raise an error and we will retry the check in the next iteration.

All of this is achieved while ensuring that two backup processes won't run at the same time (the daemonizer also guarantees this, but it is an extra layer of safety). We also won't try to access the VM more frequently than every 20 minutes (except in the unavailable-VM case, which can be handled more gracefully in the future if needed).

While the need_backup? logic is pretty solid, there are 2 important points that we need to address:

1. We populate each server with WAL-G credentials that have access to the whole blob storage, not just their own bucket. This, combined with the fact that we give superuser access to users, means that everyone can read everyone else's database. This is obviously not acceptable, but our primitive MinIO client does not allow creating separate users and access policies. This is not a limitation of MinIO, but a limitation of our primitive client. We will address this problem before the launch of the PostgreSQL service.

2. We use WAL-G for taking daily basebackups as well as for continuous archival of the WAL files. Installing WAL-G unfortunately increased the provisioning time significantly because the maintainers do not package WAL-G for Ubuntu 22.04, so we have to compile it ourselves. Since we don't have a mechanism to generate OS images beforehand, this compilation happens at provisioning time. This is less severe than the previous issue and not a blocker for launching the service. In the short term, I am planning to create images manually and pull them to the VmHosts when there is a change, as we do for GithubRunners, until we have image-burning capability.
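The need_backup? decision described above could be sketched roughly as follows. This is a hedged illustration, not the actual implementation: the field names (last_ineffective_check_at, status values like "NotStarted") come from the commit message, while the surrounding class shape, the constants, and the backup_status helper are assumptions made for the example.

```ruby
# Hypothetical sketch of the need_backup? logic described above.
# Field and status names follow the commit message; the rest of the
# model API is assumed for illustration only.
class PostgresTimeline
  attr_accessor :last_ineffective_check_at, :last_backup_started_at

  INEFFECTIVE_CHECK_WINDOW = 20 * 60  # don't re-poll the VM for 20 minutes
  BACKUP_INTERVAL = 24 * 60 * 60      # one basebackup per day

  def need_backup?
    # Skip the (network-bound) VM status check if a recent check
    # already told us no backup was needed.
    return false if last_ineffective_check_at &&
                    last_ineffective_check_at > Time.now - INEFFECTIVE_CHECK_WINDOW

    status = backup_status              # network call to the VM; may raise
    needed =
      case status
      when "NotStarted", "Failed"       # regular case / failure case
        true
      when "InProgress"                 # long backup case: wait for completion
        false
      when "Succeeded"
        Time.now - last_backup_started_at > BACKUP_INTERVAL
      end

    # Record ineffective checks so the next iterations stay cheap.
    self.last_ineffective_check_at = Time.now unless needed
    needed
  end

  def backup_status
    # Placeholder: in the real system this would query the VM.
    "NotStarted"
  end
end
```

Because last_ineffective_check_at is only updated when the check returns false, a failed or overdue backup is picked up on the very next iteration that actually reaches the VM, which matches the ~10-minute average recovery time described above.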
31 lines · 711 B · Ruby · Executable File
#!/bin/env ruby
# frozen_string_literal: true

require_relative "../../common/lib/util"

if ARGV.count != 1
  fail "Wrong number of arguments. Expected 1, Given #{ARGV.count}"
end

commit_id = ARGV[0]

# Install dependencies
r "add-apt-repository -y ppa:longsleep/golang-backports"
r "apt-get update"
r "apt-get -y install golang-go cmake"

r "mkdir -p var/wal-g"
Dir.chdir("var/wal-g") do
  # Fetch wal-g
  r "git init"
  r "git remote remove origin || true"
  r "git remote add origin https://github.com/wal-g/wal-g.git"
  r "git fetch origin --depth 1 #{commit_id}"
  r "git reset --hard FETCH_HEAD"

  # Compile and install wal-g
  r "make deps"
  r "make pg_build"
  r "GOBIN=/usr/bin make pg_install"
end