lukegbcom: add a long rambly post about my Vault setup

2022-04-08 01:42:43 +01:00 · 2022-04-08 01:42:43 +01:00 · ff665ab50f
commit ff665ab50f
parent b238831963
1 changed files with 225 additions and 0 deletions
--- a/web/lukegbcom/posts/2022-04-07-vault-and-me.md
+++ b/web/lukegbcom/posts/2022-04-07-vault-and-me.md
@@ -0,0 +1,225 @@
+---
+title: "Vault and Me: Taking It Too Far"
+date: 2022-04-07
+layout: Post
+---
+
+Recently I've been thinking about how I could distribute secrets to my NixOS
+machines in a... relatively... decent way.
+
+---
+
+At the same time, I've been wanting to move to using an SSH CA to issue myself
+credentials, rather than a variety of SSH public keys.
+
+In any case, my Vault setup ends up looking like this:
+
+- Vault instance, running on [Google Cloud Run](https://cloud.google.com/run)
+  with autounsealing from Cloud HSM, backed by Cloud Storage
+- Vault configuration using [Terranix](https://terranix.org/) (Terraform, but
+  with the config in Nix)
+- Vault AppRole credentials on each machine, with Vault Agent used to
+  authenticate automatically
+- `tokend` for issuing service credentials
+- `secretsmgr` for managing SSH CA host certificates and ACME certificates
+- `access` for issuing SSH CA user certificates
+
+Let's go over these one at a time from a relatively low-level perspective. I'll
+describe the mechanics of SSH CAs in more detail in a separate post. If you'd
+like a general introduction to how SSH CAs work, some colleagues of mine have
+written [good](https://www.lorier.net/docs/ssh-ca.html)
+[docs](https://blog.habets.se/2011/07/OpenSSH-certificates.html) on the
+subject, and I refer you to those instead, since I'll primarily be focusing on
+my specific setup with Vault.
+
+## Vault instance on GCP
+
+First off: why on GCP when I have a bunch of physical boxen I'm managing?
+
+The reason is simple: if I have a lot riding on it, I'd rather it didn't share
+a failure domain with the stuff I'm hosting. Recovery is much easier when
+bringing Vault back up doesn't depend on my own infrastructure already being
+up and running.
+
+On the other hand... it does mean I'm putting my root of trust "out of my
+hands". I'll take that risk.
+
+Broadly speaking, my setup roughly mirrors Kelsey Hightower's [Serverless Vault
+with Cloud
+Run](https://github.com/kelseyhightower/serverless-vault-with-cloud-run) -
+although I build the Docker container [using
+Nix](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/nix/docker/vault/default.nix).
+
+It's a relatively neat setup, although... it turns out to be expensive. Maybe
+I'll move it to Oracle Cloud's free tier running on one of their ARM64
+instances.
+
+### vault-acme
+
+One important thing to notice is that I install the
+[`vault-acme`](https://github.com/remilapeyre/vault-acme) secret engine for
+issuing TLS certificates using ACME.
+
+This allows me to just store the Let's Encrypt stuff within Vault and not have
+to distribute my DNS server's credentials to each individual server. I could
+run a separate service to do this, but it's super convenient to just have Vault
+do it, since everything already authenticates to it.
+
+## Vault configuration
+
+I use Terranix to manage the Vault configuration - this is because I have all
+my servers' configs in my repo, so I can actually introspect the NixOS
+configuration to determine how I want to build the Vault config.
+
+I have a helper script called `terraform` that acts like the normal Terraform
+binary after having compiled my Vault configuration, so I can run `./terraform
+apply` and have it just work the way you'd expect. At present it requires GCP
+credentials to be issued separately using gcloud, since I store my Terraform
+state in a GCS bucket. I'm hoping to grant access to that bucket using a
+Vault-managed GCP service account instead (still with the ability to use my
+"normal" Google account if needed, because obviously I need to be able to use
+it to fix Vault if Vault is broken).
+
+In particular, I generate identities per server defined in the config, and
+provide myself some useful hooks to make "app" policy configuration easier.
+
+### What do I mean by an "app"?
+
+My configuration basically defines an app as a separate Unix user; if you are
+running as a Linux user named `fooservice` on server `barserver` (and the Vault
+configuration says that `fooservice` is intended to exist on that server), then
+`tokend` will issue you a Vault token with the policies `app/fooservice` and
+`server/barserver/app/fooservice`, if those policies exist.
+
+This is super useful: for instance, in most cases an app is only deployed on
+one host, and if I move it to another host then it makes sense for it to keep
+access to the same secrets. I use [Pomerium](https://pomerium.io) as an
+authenticating proxy, and so there's an `app/pomerium` policy which grants
+access to secrets like `kv/apps/pomerium`.
+
+However, sometimes there are users that are deployed on more than one machine -
+such as `gitlab-runner` - and such a user should only get access to certain
+secrets on one specific host. I use this to allow `gitlab-runner` on a server
+called `clouvider-lon01` to deploy this blog! It [has
+access](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/cfg/lukegbcom-deployer.nix)
+to get an OAuth token to a specific GCP service account with permission to
+deploy to Firebase Hosting via the `server/clouvider-lon01/app/gitlab-runner`
+policy, but the `gitlab-runner` user anywhere else is not permitted to get
+access to this secret.
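
As a rough illustration of the naming scheme in this section, here is a minimal Python sketch. It is not the real `tokend` code; the function names and the `defined` set are made up for the example, and the additional check that the user is meant to exist on that server is left out.

```python
def candidate_policies(user: str, server: str) -> list[str]:
    """Policy names a service user could be granted, most general first."""
    return [f"app/{user}", f"server/{server}/app/{user}"]


def policies_for(user: str, server: str, existing: set[str]) -> list[str]:
    """Keep only the candidate policies that are actually defined in Vault."""
    return [p for p in candidate_policies(user, server) if p in existing]


# Hypothetical set of defined policies, reusing names from the post.
defined = {"app/pomerium", "server/clouvider-lon01/app/gitlab-runner"}

# pomerium gets its app-wide policy wherever it runs.
print(policies_for("pomerium", "barserver", defined))
# gitlab-runner only gets something on clouvider-lon01.
print(policies_for("gitlab-runner", "clouvider-lon01", defined))
print(policies_for("gitlab-runner", "barserver", defined))
```

Because missing policies are simply skipped, adding a host-specific policy later needs no change to the lookup itself.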
+
+### `server-user`
+
+There are some secrets which aren't super secret and should be generally
+accessible by users on servers, even if they don't have their own Vault token.
+`tokend` checks to see if the user talking to it (via a Unix domain socket) is
+a normal user rather than a service user, and if so will issue a token with the
+`server-user` policy instead.
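
One way a daemon can tell who is on the other end of a Unix domain socket on Linux is the `SO_PEERCRED` socket option. The Python sketch below shows that mechanism plus a toy classification step. Whether `tokend` does it this way is an assumption, and the `service_uids` set and the UID threshold of 1000 for "normal" users are illustrative choices, not taken from the real configuration.

```python
import socket
import struct


def peer_uid(conn: socket.socket) -> int:
    """Return the UID of the process at the other end of a Unix socket (Linux only)."""
    # struct ucred is three ints: pid, uid, gid.
    creds = conn.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize("3i")
    )
    _pid, uid, _gid = struct.unpack("3i", creds)
    return uid


def token_kind(uid: int, service_uids: set[int], normal_uid_min: int = 1000) -> str:
    """Decide which kind of token a caller would get (assumed rules)."""
    if uid in service_uids:
        return "app"          # a known service user gets its app policies
    if uid >= normal_uid_min:
        return "server-user"  # a normal login user gets the shared policy
    return "none"             # anything else gets nothing
```

The kernel fills in the credentials, so the caller cannot claim to be a different user.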
+
+This token really just grants access to a credential for my Nix binary cache on
+Google Cloud Storage, so it's not super confidential. There aren't really many
+instances where this is useful, and in general on "client" devices I expect to
+authenticate to Vault and get a more fully-fledged token as myself.
+
+It isn't permitted to, for instance, issue SSH user certificates. That
+power is restricted to "real" users who have authenticated directly with
+Vault.
+
+### `server/hostname`
+
+Servers are also permitted to have server-wide secrets. This is mostly just
+used for `secretsmgr` at the moment - arguably this could be its own app.
+
+By default, servers [have
+access](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/cfg/policies/server.hcl)
+to `kv/server/$HOSTNAME`, to ACME certificate issuance, and to the Nix binary
+cache credentials. They also have the power to issue child tokens with fewer
+powers than themselves.
+
+## Vault AppRole credentials
+
+I use Vault's AppRole auth method to provision secrets to servers; when setting
+a machine up (a process I have not yet automated), I run
+[`reissue-secret-id.sh`](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/reissue-secret-id.sh)
+which revokes all existing secret IDs for that host and dumps out a Vault
+[response wrapped
+token](https://www.vaultproject.io/docs/concepts/response-wrapping), which can
+be used one time only to get the secret ID for that host.
+
+There's a
+[`provision-secret-id`](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/default.nix)
+script installed on every machine which will then install the secret for me.
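
To make the one-time handover concrete, here is a hedged Python sketch of the two HTTP requests involved, assuming the secret IDs are AppRole secret IDs mounted at the default `approle` path. The requests are only built, not sent. `VAULT_ADDR` and the function names are hypothetical, and the real `provision-secret-id` script may work differently.

```python
import json
import urllib.request

VAULT_ADDR = "https://vault.example.com"  # hypothetical address


def unwrap_request(wrapping_token: str) -> urllib.request.Request:
    """Build the call that trades a response-wrapping token for its payload."""
    # The wrapping token authenticates the unwrap call and is consumed by it.
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/sys/wrapping/unwrap",
        method="POST",
        headers={"X-Vault-Token": wrapping_token},
    )


def approle_login_request(
    role_id: str, secret_id: str, mount: str = "approle"
) -> urllib.request.Request:
    """Build the AppRole login call that turns role ID + secret ID into a token."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/{mount}/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

A second attempt to unwrap the same token fails, which is what makes it reasonably safe to paste the wrapping token into a fresh machine.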
+
+Future work in this space for me is binding the secret to the TPM (e.g. using
+mTLS auth) so I don't have to stick the secret ID on disk... but then again I'm
+not a multinational corporation, and my secrets aren't worth _that_ much.
+
+## Vault Agent
+
+The Vault Agent is a daemon that serves a number of purposes:
+
+- It can act as a proxy which keeps an internal Vault token and automatically
+  refreshes it, attaching it to every request it receives on a Unix socket
+  and proxies to Vault (it can listen on a TCP socket too... but I don't use
+  it like that)
+- It can use templates to write secrets to disk (with drawbacks, hence the
+  creation of `secretsmgr` below)
+- Probably some other functionality I don't really use
+
+In my setup, only a few things have direct access to the Vault Agent socket,
+and in future I might get rid of it from my setup entirely. `tokend` and
+`secretsmgr` have access, and that's pretty much it. This is because the powers
+that its Vault token gets are a combination of all the policies granted to the
+machine, including all the apps running on it, so any app with access to its
+Unix socket effectively gets all the secrets shared to anything on the server.
+
+The secrets I use it to write to disk are strictly the plain KV type, rather
+than anything more sophisticated, but I do use some [relatively complicated
+Polkit
+rules](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/nixos/lib/vault-agent-secrets.nix)
+to allow it to reload/restart services when those secrets change.
+
+
+## `tokend`
+
+The user-based authentication I mentioned above (with the app policies and the
+`server-user` policy) is powered by
+[`tokend`](https://hg.lukegb.com/lukegb/depot/-/tree/branch/default/go/tokend),
+which is a daemon that listens on a Unix socket and proxies requests through
+the local Vault Agent, with a token issued that has a subset of the powers of
+the original server-wide token.
+
+The ACLs on talking to `tokend` are much more permissive than those for talking
+directly to the Vault Agent.
+
+## `secretsmgr`
+
+`secretsmgr` exists to solve some problems I was having with getting Vault
+Agent to write secrets that require more complex Vault requests, like the TLS
+certificates using ACME (which have rate limits imposed by Let's Encrypt!), and
+SSH host certificates.
+
+It's a pretty simple binary which runs using a systemd timer unit, starts up,
+checks the remaining lifetime of the certificates it's responsible for, and
+then reissues them if required.
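
The "check remaining lifetime, reissue if required" step could look something like the Python sketch below. The rule used here (reissue once a third or less of the total lifetime is left) is an assumption for illustration, not necessarily the threshold `secretsmgr` uses.

```python
from datetime import datetime, timedelta, timezone


def needs_reissue(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    renew_fraction: float = 1 / 3,  # assumed threshold
) -> bool:
    """True if the certificate has renew_fraction or less of its lifetime left."""
    lifetime = not_after - not_before
    remaining = not_after - now
    return remaining <= lifetime * renew_fraction


# Example: a 90-day certificate, checked 70 days after issuance.
issued = datetime(2022, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
print(needs_reissue(issued, expires, issued + timedelta(days=70)))
```

Scaling the threshold with the lifetime lets the same check cover both short-lived SSH host certificates and longer-lived ACME certificates, and it keeps reissuance infrequent enough to stay well inside rate limits.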
+
+Similar to the Vault Agent above, I use some [Polkit
+rules](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/nixos/lib/secretsmgr.nix)
+to allow it to restart the ACME certificate consumers (usually nginx or
+pomerium), and sshd.
+
+## `access`
+
+`access` checks to see if there's currently an active Vault token. If not, then
+it launches the `vault login` flow which in my case asks me to log in with my
+Google account. If that succeeds, or if I already had a token, then it
+generates a new Ed25519 key pair, asks Vault to sign the public key with a
+lifetime of about 24 hours, and inserts the key and certificate into the SSH
+agent. The private key never has to hit disk, since it just lives in the agent.
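
The flow above can be sketched in Python as a list of steps plus the request body for Vault's SSH secrets engine signing endpoint (`/v1/<mount>/sign/<role>`). The mount name `ssh` and role name `user` are placeholders, and the step descriptions paraphrase the post rather than the real `access` source.

```python
import json


def access_steps(has_token: bool) -> list[str]:
    """List what a run would do, skipping login when a token already exists."""
    steps = [] if has_token else ["vault login"]
    steps += [
        "generate Ed25519 key pair in memory",
        "ask Vault to sign the public key",
        "add private key and certificate to ssh-agent",
    ]
    return steps


def sign_request(
    public_key: str, ttl: str = "24h", mount: str = "ssh", role: str = "user"
) -> tuple[str, bytes]:
    """Build the API path and JSON body for signing a user certificate."""
    path = f"/v1/{mount}/sign/{role}"  # mount and role names are placeholders
    body = {"public_key": public_key, "cert_type": "user", "ttl": ttl}
    return path, json.dumps(body).encode()
```

Only the public key is ever sent to Vault; the private half stays on the client.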
+
+The token that this flow issues is a `user` token (i.e. not a `server-user`
+token, nor an `admin` token). It has permission to look at some specific
+secrets related to that user and at things which are generally shared -- like
+those Nix binary cache credentials -- but doesn't have general access to
+administer Vault. I issue `admin` tokens with a lifetime of 1h when I need
+them, and at some point will try to scope them down further -- although since I
+use them to deploy all the policies there is a limit to what I can feasibly do.