From ff665ab50fb0ec7e610865b600b15d5a3fc35651 Mon Sep 17 00:00:00 2001
From: Luke Granger-Brown
Date: Fri, 8 Apr 2022 01:42:43 +0100
Subject: [PATCH] lukegbcom: add a long rambly post about my Vault setup

---
 .../posts/2022-04-07-vault-and-me.md | 225 ++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 web/lukegbcom/posts/2022-04-07-vault-and-me.md

diff --git a/web/lukegbcom/posts/2022-04-07-vault-and-me.md b/web/lukegbcom/posts/2022-04-07-vault-and-me.md
new file mode 100644
index 0000000000..bd0c62d4c3
--- /dev/null
+++ b/web/lukegbcom/posts/2022-04-07-vault-and-me.md
@@ -0,0 +1,225 @@
+---
+title: Vault and Me: Taking It Too Far
+date: 2022-04-07
+layout: Post
+---
+
+Recently I've been thinking about how I could distribute secrets to my NixOS
+machines in a... relatively... decent way.
+
+---
+
+At the same time, I've been wanting to move to using an SSH CA to issue myself
+credentials, rather than a variety of SSH public keys.
+
+In any case, my Vault setup ends up looking like this:
+
+- Vault instance, running on [Google Cloud Run](https://cloud.google.com/run)
+  with autounsealing from Cloud HSM, backed by Cloud Storage
+- Vault configuration using [Terranix](https://terranix.org/) (Terraform, but
+  with the config in Nix)
+- Vault App ID credentials on each machine, with Vault Agent used to
+  automatically auth
+- `tokend` for issuing service credentials
+- `secretsmgr` for managing SSH CA host certificates and ACME certificates
+- `access` for issuing SSH CA user certificates
+
+Let's go over these one at a time from a relatively low-level perspective - and
+I'll describe in more detail the mechanics of SSH CAs in a separate post.
+If you're interested in some [good](https://www.lorier.net/docs/ssh-ca.html)
+[docs](https://blog.habets.se/2011/07/OpenSSH-certificates.html) on how SSH CAs
+work, written by some colleagues of mine, then I refer you to those instead,
+since I'll primarily be focusing on my specific setup with Vault rather than a
+more general introduction.
+
+## Vault instance on GCP
+
+First off: why on GCP when I have a bunch of physical boxen I'm managing?
+
+The reason is simple: if I have a lot riding on it, I'd rather that it doesn't
+have the same failure domain as the stuff I'm hosting. It's much easier to
+recover Vault if I don't have to rely on my own infrastructure being up in
+order to do so.
+
+On the other hand... it does mean I'm putting my root of trust "out of my
+hands". I'll take that risk.
+
+Broadly speaking, my setup mirrors Kelsey Hightower's [Serverless Vault
+with Cloud
+Run](https://github.com/kelseyhightower/serverless-vault-with-cloud-run) -
+although I build the Docker container [using
+Nix](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/nix/docker/vault/default.nix).
+
+It's a relatively neat setup, although... it turns out to be expensive. Maybe
+I'll move it to Oracle Cloud's free tier running on one of their ARM64
+instances.
+
+### vault-acme
+
+One important thing to notice is that I install the
+[`vault-acme`](https://github.com/remilapeyre/vault-acme) secrets engine for
+issuing TLS certificates using ACME.
+
+This allows me to just store the Let's Encrypt stuff within Vault and not have
+to distribute my DNS server's credentials to each individual server. I could
+run a separate service to do this, but it's super convenient to just have Vault
+do it, since everything already authenticates to it.
+
+## Vault configuration
+
+I use Terranix to manage the Vault configuration - this is because I have all
+my servers' configs in my repo, so I can actually introspect the NixOS
+configuration to determine how I want to build the Vault config.
+
+I have a helper script called `terraform` that acts like the normal Terraform
+binary after having compiled my Vault configuration, so I can run
+`./terraform apply` and have it just work the way you'd expect. At present it
+requires GCP credentials to be issued separately, using gcloud, since I store
+my Terraform state in a GCS bucket, but I'm hoping to grant access to this
+using a Vault-managed GCP service account instead (still with the ability to
+use my "normal" Google account though, if needed, because obviously I need to
+be able to use it to fix Vault if Vault is broken...).
+
+In particular, I generate identities per server defined in the config, and
+provide myself some useful hooks to make "app" policy configuration easier.
+
+### What do I mean by an "app"
+
+My configuration basically defines an app as a separate Unix user; if you are
+running as a Linux user named `fooservice` on server `barserver` (and the Vault
+configuration says that `fooservice` is intended to exist on that server), then
+`tokend` will issue you a Vault token with the policies `app/fooservice` and
+`server/barserver/app/fooservice`, if those policies exist.
+
+This is super useful: for instance, in the common case an app is only deployed
+on one host, and if I'm moving it around then it makes sense for it to keep
+access to the same secrets. I use [Pomerium](https://pomerium.io) as an
+authenticating proxy, and so there's an `app/pomerium` policy which grants
+access to secrets like `kv/apps/pomerium`.
+
+However, sometimes there are users which are deployed on more than one machine -
+such as `gitlab-runner` - and that user should only get access to secrets on
+one specific host.
+I use this concept to let `gitlab-runner` on a server called
+`clouvider-lon01` deploy to this blog! It [has
+access](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/cfg/lukegbcom-deployer.nix)
+to get an OAuth token for a specific GCP service account with permission to
+deploy to Firebase Hosting via the `server/clouvider-lon01/app/gitlab-runner`
+policy, but the `gitlab-runner` user anywhere else is not permitted to get
+access to this secret.
+
+### `server-user`
+
+There are some secrets which aren't super secret and should be generally
+accessible by users on servers, even if they don't have their own Vault token.
+`tokend` checks to see if the user talking to it (via a Unix domain socket) is
+a normal user rather than a service user, and if so will issue a token with the
+`server-user` policy instead.
+
+This token really just has a credential to get access to my Nix binary cache on
+Google Cloud Storage, so it's not super confidential. There aren't really many
+instances where this is useful, and in general on "client" devices I expect to
+authenticate to Vault and get a more fully-fledged token as myself.
+
+It doesn't have access to, for instance, issue SSH user certificates. That
+power is restricted to "real" authenticated users who have authenticated
+directly with Vault.
+
+### `server/hostname`
+
+Servers are also permitted to have server-wide secrets. This is mostly just
+used for `secretsmgr` at the moment - arguably this could be its own app.
+
+By default, servers [have
+access](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/cfg/policies/server.hcl)
+to `kv/server/$HOSTNAME` and to the Nix binary cache credentials, and can
+issue ACME certificates. They also have the power to issue subtokens with
+less power than themselves.
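
To make the policy selection from the sections above a bit more concrete, here is a minimal Python sketch of how a `tokend`-like daemon might choose policy names. It is an illustration, not the real `tokend` code: the function name `policies_for`, its parameters, and the way service users are told apart from normal users (a plain boolean here) are all assumptions.

```python
def policies_for(user, host, is_service_user, existing_policies, declared_apps):
    """Return the Vault policy names to attach to a token for `user` on `host`.

    All names here are hypothetical. `declared_apps` is a set of (host, user)
    pairs standing in for "the Vault configuration says this user is intended
    to exist on that server".
    """
    if not is_service_user:
        # Normal users only get the shared, low-privilege policy.
        candidates = ["server-user"]
    elif (host, user) in declared_apps:
        # Apps get a global policy and a host-specific one.
        candidates = [f"app/{user}", f"server/{host}/app/{user}"]
    else:
        # Service users that are not declared for this host get nothing.
        candidates = []
    # Only policies that actually exist are attached.
    return [name for name in candidates if name in existing_policies]


existing = {"server-user", "app/pomerium", "server/clouvider-lon01/app/gitlab-runner"}
declared = {("barserver", "pomerium"), ("clouvider-lon01", "gitlab-runner")}
print(policies_for("gitlab-runner", "clouvider-lon01", True, existing, declared))
```

Filtering against `existing_policies` is what lets `gitlab-runner` pick up a host-specific policy on one machine while receiving no policies at all on any other host.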
+
+## Vault App ID credentials
+
+I use Vault's AppRole auth method (which I loosely call "App ID" here) to
+provision secrets to servers; when setting a machine up (a process I have not
+yet automated), I run
+[`reissue-secret-id.sh`](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/reissue-secret-id.sh)
+which revokes all existing secret IDs for that host and dumps out a Vault
+[response wrapped
+token](https://www.vaultproject.io/docs/concepts/response-wrapping), which can
+be used one time only to get the secret ID for that host.
+
+There's a
+[`provision-secret-id`](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/default.nix)
+script installed on every machine which will then install the secret for me.
+
+Future work in this space for me is binding the secret to the TPM (e.g. using
+mTLS auth) so I don't have to stick the secret ID on disk... but then again I'm
+not a multinational corporation, and my secrets aren't worth _that_ much.
+
+## Vault Agent
+
+The Vault Agent is a daemon that serves a number of purposes:
+
+- It can act as a proxy which keeps an internal Vault token and automatically
+  refreshes it, and then attaches it to every request that it proxies to Vault
+  as received on a Unix socket (or a TCP socket... but I don't use it like
+  that)
+- It can use templates to write secrets to disk (with drawbacks, hence the
+  creation of `secretsmgr` below)
+- Probably some other functionality I don't really use
+
+In my setup, only a few things have direct access to the Vault Agent socket,
+and in future I might get rid of it from my setup entirely. `tokend` and
+`secretsmgr` have access, and that's pretty much it. This is because the powers
+that its Vault token gets are a combination of all the policies granted to the
+machine, including all the apps running on it, so any app with access to its
+Unix socket effectively gets all the secrets shared to anything on the server.
+
+The secrets I use it to write to disk are strictly the plain KV type, rather
+than anything more sophisticated, but I do use some [relatively complicated
+Polkit
+rules](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/nixos/lib/vault-agent-secrets.nix)
+to allow it to reload/restart services when those secrets change.
+
+
+## `tokend`
+
+The user-based authentication I mentioned above (with the app policies and the
+`server-user` policy) is powered by
+[`tokend`](https://hg.lukegb.com/lukegb/depot/-/tree/branch/default/go/tokend),
+which is a daemon that listens on a Unix socket and proxies requests through
+the local Vault Agent, with a token issued that has a subset of the powers of
+the original server-wide token.
+
+The ACLs on talking to `tokend` are much more permissive than those for talking
+directly to the Vault Agent.
+
+## `secretsmgr`
+
+`secretsmgr` exists to solve some problems I was having with getting Vault
+Agent to write secrets that require more complex Vault requests, like the TLS
+certificates using ACME (which have rate limits imposed by Let's Encrypt!), and
+SSH host certificates.
+
+It's a pretty simple binary which runs using a systemd timer unit, starts up,
+checks the remaining lifetime of the certificates it's responsible for, and
+then reissues them if required.
+
+Similar to the Vault Agent above, I use some [Polkit
+rules](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/nixos/lib/secretsmgr.nix)
+to allow it to restart the ACME certificate consumers (usually nginx or
+pomerium), and sshd.
+
+## `access`
+
+`access` checks to see if there's currently an active Vault token. If not, then
+it launches the `vault login` flow which in my case asks me to log in with my
+Google account. If that succeeds, or if I already had a token, then it
+generates a new Ed25519 key pair, asks Vault to sign the public key with a
+lifetime of about 24 hours, and then inserts the key and its certificate into
+the SSH agent.
+This means the key never has to hit disk, since it can just reside in the SSH
+agent.
+
+The token that this flow issues is a `user` token (i.e. not a `server-user`
+token, nor an `admin` token), which has permission to look at some specific
+secrets related to that user and at things which are generally shared (like
+those Nix binary cache credentials), but doesn't have general access to
+administer Vault. I issue `admin` tokens with a lifetime of 1h when I need
+them, and at some point will try to scope them down further -- although since I
+use them to deploy all the policies there is a limit to what I can feasibly do.
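
As a rough illustration of the signing step that `access` performs, the Python sketch below builds (but does not send) a request to the sign endpoint of Vault's SSH secrets engine. The mount path `ssh`, the role name `user`, the Vault address in the usage line, and the helper name `build_sign_request` are assumptions for illustration only; the post does not say what the real setup uses, and the real `access` tool may work quite differently.

```python
import json
import urllib.request


def build_sign_request(vault_addr, token, public_key, mount="ssh", role="user", ttl="24h"):
    """Build a POST request asking Vault to sign an SSH public key.

    `mount` and `role` are hypothetical names; substitute whatever the SSH
    secrets engine is actually mounted at and the role it defines.
    """
    payload = {
        "public_key": public_key,  # OpenSSH-format public key text
        "cert_type": "user",       # a user certificate, not a host certificate
        "ttl": ttl,                # requested certificate lifetime
    }
    return urllib.request.Request(
        url=f"{vault_addr}/v1/{mount}/sign/{role}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address, token, and key material, shown only to illustrate the call.
request = build_sign_request("https://vault.example.com", "example-token", "ssh-ed25519 AAAA... me")
print(request.full_url)
```

Only the public key is ever sent to Vault; the private half stays in memory on the client, which is what allows it to live solely in the SSH agent.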