Backups before features

hard

Why this matters

On Day 0 your service has no users, so losing the DB feels theoretical. By the time you have 1,000 paying users, losing the DB means losing the company. Have the backup conversation at Day 0, when it is a five-minute task; at 100k users it is a multi-week project. Three things matter: (1) backups run automatically every day, (2) they are stored off the server they came from, and (3) you have actually restored from one. If you haven't restored, you don't have backups; you have a directory of files that might be backups.

Demo

Managed Postgres providers (RDS, Supabase, Neon, DigitalOcean, Crunchy) generally offer automated daily snapshots at little or no extra cost, plus PITR (point-in-time recovery), often only on paid tiers. Check what your plan includes and turn it on. If you're self-hosting, set up pg_dump on a cron writing to S3, with 14 to 30 days of rolling retention. Then schedule a quarterly restore drill: spin up an empty DB, restore the latest backup into it, run a couple of queries. The drill is the part everyone skips, and the part that turns 'we have backups' from a hope into a fact.

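#!/bin/bash
# /etc/cron.daily/pg-backup (self-hosted Postgres)
# No ".sh" suffix: Debian/Ubuntu run-parts skips file names that contain a dot.
# Cron runs with a minimal environment, so DATABASE_URL must be set for it
# (for example by sourcing a root-only env file here).
set -euo pipefail
TS=$(date -u +%Y-%m-%dT%H-%M-%SZ)
DUMP_FILE=/tmp/pg-backup-$TS.sql.gz
# Remove local copies on exit, even if a step fails
trap 'rm -f "$DUMP_FILE" "$DUMP_FILE.gpg"' EXIT

# 1. Dump and compress in one pipe, so the dump never touches disk uncompressed
pg_dump "$DATABASE_URL" | gzip > "$DUMP_FILE"

# 2. Encrypt before it leaves the box
#    (GnuPG 2.1+ needs loopback pinentry to read a passphrase file)
gpg --batch --yes --pinentry-mode loopback --passphrase-file /etc/backup-pass --symmetric "$DUMP_FILE"
rm "$DUMP_FILE"

# 3. Push to off-box storage
aws s3 cp "$DUMP_FILE.gpg" "s3://my-backups/postgres/$TS.sql.gz.gpg"

# 4. Keep the 30 newest backups (30 days of dailies) and delete the rest
aws s3 ls s3://my-backups/postgres/ \
  | awk 'NF >= 4 {print $4}' \
  | sort -r | tail -n +31 \
  | xargs -I{} aws s3 rm "s3://my-backups/postgres/{}"

# Test the restore quarterly. The script above isn't enough on its own: you have to actually run a restore.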
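
The restore drill itself can be sketched as a short script. This is a minimal sketch under stated assumptions, not a definitive implementation: it reuses the bucket, key naming, and passphrase file from the backup script, expects a hypothetical SCRATCH_URL pointing at an empty throwaway database, and uses a users table as the sanity check. The helper names latest_key and restore_drill are illustrative.

```shell
#!/bin/bash
# Restore-drill sketch. Bucket, passphrase path and scratch DB URL are assumptions.
set -euo pipefail

# Pick the newest object key from `aws s3 ls` output (columns: date time size key).
# Keys start with a UTC timestamp, so a plain lexical sort is chronological.
latest_key() {
  awk 'NF >= 4 {print $4}' | sort | tail -n 1
}

# Download one backup, decrypt and load it into an EMPTY scratch database,
# then run a sanity query. Only the encrypted file is written to disk.
restore_drill() {
  local key=$1 scratch_url=$2
  local workdir
  workdir=$(mktemp -d)
  aws s3 cp "s3://my-backups/postgres/$key" "$workdir/backup.sql.gz.gpg"
  gpg --batch --pinentry-mode loopback --passphrase-file /etc/backup-pass \
      --decrypt "$workdir/backup.sql.gz.gpg" \
    | gunzip \
    | psql "$scratch_url" --set ON_ERROR_STOP=1 --quiet
  psql "$scratch_url" --tuples-only --command 'SELECT count(*) FROM users;'
  rm -rf "$workdir"
}

# Usage (not run automatically):
#   key=$(aws s3 ls s3://my-backups/postgres/ | latest_key)
#   restore_drill "$key" "$SCRATCH_URL"
```

ON_ERROR_STOP makes psql abort on the first failed statement, so a damaged dump fails the drill loudly instead of producing a half-restored database that looks fine at a glance.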

Try it yourself

Confirm your managed DB has automated daily snapshots enabled (or set up pg_dump on a cron). Note where the backups physically live — they should NOT be on the same machine as your DB.

Pick one backup at random. Restore it into a fresh empty Postgres on your laptop. Run SELECT count(*) FROM users; — does the number match prod, minus the last day's growth? If it errors, your backup is broken.

Add a quarterly calendar reminder titled 'Restore drill'. The reminder is the discipline, not the backup itself.

Decide your RPO and RTO: how much data can you afford to lose (RPO, usually 5 min to 24 hours) and how long can you be down while restoring (RTO, usually 15 min to a few hours)? Write both in the company's incident doc.
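
One way to make the RPO concrete is to check the age of the newest backup against it. The following is a minimal sketch, assuming GNU date and backup keys that begin with a UTC timestamp such as 2024-05-02T03-00-00Z (the format the demo script's date call produces). The function names key_to_epoch and within_rpo are hypothetical.

```shell
#!/bin/bash
# RPO freshness check sketch. Assumes GNU date (the -d option).
set -euo pipefail

# Convert a key like 2024-05-02T03-00-00Z.sql.gz.gpg to epoch seconds.
key_to_epoch() {
  local ts=${1%%.*}      # drop the file extensions -> 2024-05-02T03-00-00Z
  local day=${ts%%T*}    # 2024-05-02
  local clock=${ts#*T}   # 03-00-00Z
  clock=${clock%Z}       # 03-00-00
  date -u -d "$day ${clock//-/:}" +%s
}

# Succeeds if the backup named by key is at most rpo_seconds old at now_epoch.
within_rpo() {
  local key=$1 rpo_seconds=$2 now_epoch=$3
  local backup_epoch
  backup_epoch=$(key_to_epoch "$key")
  local age=$(( now_epoch - backup_epoch ))
  [ "$age" -le "$rpo_seconds" ]
}

# Usage (not run automatically), with a 24-hour RPO:
#   within_rpo "$key" 86400 "$(date -u +%s)" || echo "newest backup is stale"
```

With daily dumps alone, the achievable RPO is roughly 24 hours; a check like this only tells you when even that is being missed. An RPO of minutes needs PITR rather than more frequent dumps.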

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain RPO, RTO, PITR, and the difference between a logical dump (pg_dump) and a physical backup (pg_basebackup / managed snapshot). When is each one the right choice?

2. Why it works (the mechanism)

Walk me through what happens during a Postgres PITR: how does WAL archiving work, what does the DB do when you say 'restore to 2pm yesterday', and where can it go wrong?

3. Advanced — application & what's next

I want an RPO of 5 minutes and RTO of 15 minutes for a 200 GB Postgres on a managed provider. Cost is a real constraint. Design the backup + replica + restore-drill setup and estimate the monthly cost.