lukegbcom: add a long rambly post about my Vault setup
parent
b238831963
commit
ff665ab50f
1 changed files with 225 additions and 0 deletions
225
web/lukegbcom/posts/2022-04-07-vault-and-me.md
Normal file
@@ -0,0 +1,225 @@
---
title: Vault and Me: Taking It Too Far
date: 2022-04-07
layout: Post
---

Recently I've been thinking about how I could distribute secrets to my NixOS
machines in a... relatively... decent way.

---

At the same time, I've been wanting to move to using an SSH CA to issue myself
credentials, rather than a variety of SSH public keys.

In any case, my Vault setup ends up looking like this:

- Vault instance, running on [Google Cloud Run](https://cloud.google.com/run)
  with autounsealing from Cloud HSM, backed by Cloud Storage
- Vault configuration using [Terranix](https://terranix.org/) (Terraform, but
  with the config in Nix)
- Vault App ID credentials on each machine, with Vault Agent used to
  automatically auth
- `tokend` for issuing service credentials
- `secretsmgr` for managing SSH CA host certificates and ACME certificates
- `access` for issuing SSH CA user certificates

Let's go over these one at a time from a relatively low-level perspective. I'll
describe the mechanics of SSH CAs in more detail in a separate post; if you want
some [good](https://www.lorier.net/docs/ssh-ca.html)
[docs](https://blog.habets.se/2011/07/OpenSSH-certificates.html) on how SSH CAs
work, written by some colleagues of mine, I refer you to those instead, since
I'll primarily be focusing on my specific setup with Vault rather than a more
general introduction.

## Vault instance on GCP

First off: why on GCP when I have a bunch of physical boxen I'm managing?

The reason is simple: if I have a lot riding on it, I'd rather that it doesn't
have the same failure domain as the stuff I'm hosting. It makes it much easier
to recover if I don't have to rely on having my own infrastructure up in order
to recover it.

On the other hand... it does mean I'm putting my root of trust "out of my
hands". I'll take that risk.

Broadly speaking, my setup roughly mirrors Kelsey Hightower's [Serverless Vault
with Cloud
Run](https://github.com/kelseyhightower/serverless-vault-with-cloud-run) -
although I build the Docker container [using
Nix](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/nix/docker/vault/default.nix).

It's a relatively neat setup, although... it turns out to be expensive. Maybe
I'll move it to Oracle Cloud's free tier running on one of their ARM64
instances.

### vault-acme

One important thing to notice is that I install the
[`vault-acme`](https://github.com/remilapeyre/vault-acme) secret engine for
issuing SSL certificates using ACME.

This allows me to just store the Let's Encrypt stuff within Vault and not have
to distribute my DNS server's credentials to each individual server. I could
run a separate service to do this, but it's super convenient to just have Vault
do it, since everything already authenticates to it.

## Vault configuration

I use Terranix to manage the Vault configuration - this is because I have all
my servers' configs in my repo, so I can actually introspect the NixOS
configuration to determine how I want to build the Vault config.

I have a helper script called `terraform` that acts like the normal Terraform
binary after having compiled my Vault configuration, so I can run
`./terraform apply` and have it just work the way you'd expect. At present it
requires GCP credentials to be issued separately, using gcloud, since I store
my Terraform state in a GCS bucket. I'm hoping to grant access to this using a
Vault-managed GCP service account instead (still with the ability to use my
"normal" Google account if needed, because obviously I need to be able to use
it to fix Vault if Vault is broken...).

In particular, I generate identities per server defined in the config, and
provide myself some useful hooks to make "app" policy configuration easier.

### What do I mean by an "app"

My configuration basically defines an app as a separate Unix user; if you are
running as a Linux user named `fooservice` on server `barserver` (and the Vault
configuration says that `fooservice` is intended to exist on that server), then
`tokend` will issue you a Vault token with the policies `app/fooservice` and
`server/barserver/app/fooservice`, if those policies exist.

This is super useful: for instance, in the common case an app is only deployed
on one host, and if I'm moving it around then it makes sense for it to have
access to the same secrets. I use [Pomerium](https://pomerium.io) as an
authenticating proxy, and so there's an `app/pomerium` policy which grants
access to secrets like `kv/apps/pomerium`.

However sometimes there are users that are deployed on more than one machine -
such as `gitlab-runner` - and that user should only get access to secrets on
one specific host. I use this concept for granting access to `gitlab-runner` on
a server called `clouvider-lon01` to be able to deploy to this blog! It [has
access](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/cfg/lukegbcom-deployer.nix)
to get an OAuth token to a specific GCP service account with permission to
deploy to Firebase Hosting via the `server/clouvider-lon01/app/gitlab-runner`
policy, but the `gitlab-runner` user anywhere else is not permitted to get
access to this secret.
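To make the policy naming rule concrete, here is a minimal Python sketch of how the two candidate policy names could be derived and filtered. It is an illustration only, not the actual `tokend` code: the function name `policies_for` and the `existing` set are made up for this example, and the check that the user is intended to exist on that server is left out.

```python
def policies_for(user: str, server: str, existing: set[str]) -> list[str]:
    """Return the app policies a token for `user` on `server` would carry.

    Hypothetical helper: `existing` stands in for the set of policy names
    that are actually defined in Vault.
    """
    candidates = [f"app/{user}", f"server/{server}/app/{user}"]
    # Only attach candidate policies that really exist.
    return [p for p in candidates if p in existing]


# Example data taken from the cases described in the text.
existing = {"app/pomerium", "server/clouvider-lon01/app/gitlab-runner"}

print(policies_for("pomerium", "barserver", existing))
print(policies_for("gitlab-runner", "clouvider-lon01", existing))
print(policies_for("gitlab-runner", "barserver", existing))
```

With this example data, `pomerium` picks up its app-wide policy on any host, while `gitlab-runner` only gets a policy on `clouvider-lon01` and nothing elsewhere.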

### `server-user`

There are some secrets which aren't super secret and should be generally
accessible by users on servers, even if they don't have their own Vault token.
`tokend` checks to see if the user talking to it (via a Unix domain socket) is
a normal user rather than a service user, and if so will issue a token with the
`server-user` policy instead.
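For readers wondering how a daemon can tell who is on the other end of a Unix domain socket: on Linux the kernel will report the peer's PID, UID and GID via the `SO_PEERCRED` socket option. The sketch below shows that mechanism in Python. How `tokend` actually separates normal users from service users isn't described here, so the UID range used is purely an assumption for illustration.

```python
import os
import socket
import struct

# Assumption for illustration only: "normal" users have UIDs in
# [1000, 65534) and service users sit below that range.
NORMAL_UID_MIN = 1000
NORMAL_UID_MAX = 65534  # exclusive


def peer_uid(conn: socket.socket) -> int:
    """Ask the kernel for the UID of the process at the other end (Linux only)."""
    # struct ucred is { pid_t pid; uid_t uid; gid_t gid; }
    fmt = "iII"
    creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(fmt))
    _pid, uid, _gid = struct.unpack(fmt, creds)
    return uid


def is_normal_user(uid: int) -> bool:
    """Classify a UID using the assumed range above."""
    return NORMAL_UID_MIN <= uid < NORMAL_UID_MAX


# Demo: a connected pair of Unix sockets within this one process, so the
# "peer" is ourselves.
ours, theirs = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    uid = peer_uid(ours)
    print("peer uid matches ours:", uid == os.getuid())
    print("treated as normal user:", is_normal_user(uid))
finally:
    ours.close()
    theirs.close()
```

The useful property is that the UID comes from the kernel rather than from anything the client says, so the client cannot simply claim to be a different user.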

This token really just has a credential to get access to my Nix binary cache on
Google Cloud Storage, so it's not super confidential. There aren't really many
instances where this is useful, and in general on "client" devices I expect to
authenticate to Vault and get a more fully-fledged token as myself.

It doesn't have access to, for instance, issue SSH user certificates. That
power is restricted to "real" authenticated users who have authenticated
directly with Vault.

### `server/hostname`

Servers are also permitted to have server-wide secrets. This is mostly just
used for `secretsmgr` at the moment - arguably this could be its own app.

By default, servers [have
access](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/cfg/policies/server.hcl)
to `kv/server/$HOSTNAME`, and to issue ACME certificates, and the Nix binary
cache credentials. They also have the power to issue subtokens with
less power than themselves.

## Vault App ID credentials

I use the "App ID" mode in Vault to provision secrets to servers; when setting
a machine up (a process I have not yet automated), I run
[`reissue-secret-id.sh`](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/reissue-secret-id.sh)
which revokes all existing secret IDs for that host and dumps out a Vault
[response wrapped
token](https://www.vaultproject.io/docs/concepts/response-wrapping), which can
be used one time only to get the secret ID for that host.
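As a rough sketch of what redeeming that wrapping token involves: Vault exposes an unwrap endpoint at `sys/wrapping/unwrap`, and the wrapping token itself is presented as the request's token. The Python below only builds the request and parses a response. It is not the linked script, the address and token are placeholders, and the sample response body is made up to show the shape of the `data.secret_id` field.

```python
import json
import urllib.request


def build_unwrap_request(vault_addr: str, wrapping_token: str) -> urllib.request.Request:
    """Build (but do not send) the request that redeems a wrapping token."""
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/wrapping/unwrap",
        method="POST",
        # The single-use wrapping token authenticates the unwrap call.
        headers={"X-Vault-Token": wrapping_token},
    )


def secret_id_from_response(body: bytes) -> str:
    """Pull the secret ID out of the unwrapped response body."""
    return json.loads(body)["data"]["secret_id"]


# Placeholder address and token, purely for illustration.
req = build_unwrap_request("https://vault.example.com/", "wrapping-token-placeholder")
print(req.get_method(), req.full_url)

# Made-up response body with the fields this sketch relies on.
sample = b'{"data": {"secret_id": "example-secret-id", "secret_id_accessor": "example"}}'
print(secret_id_from_response(sample))
```

Because the wrapping token can only be redeemed once, a second unwrap attempt fails, which makes it detectable if somebody else got to the token first.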

There's a
[`provision-secret-id`](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/vault/default.nix)
script installed on every machine which will then install the secret for me.

Future work in this space for me is binding the secret to the TPM (e.g. using
mTLS auth) so I don't have to stick the secret ID on disk... but then again I'm
not a multinational corporation, and my secrets aren't worth _that_ much.

## Vault Agent

The Vault Agent is a daemon that serves a number of purposes:

- It can act as a proxy which keeps an internal Vault token and automatically
  refreshes it, and then attaches it to every request that it proxies to Vault
  as received on a Unix socket (or a TCP socket... but I don't use it like
  that)
- It can use templates to write secrets to disk (with drawbacks, hence the
  creation of `secretsmgr` below)
- Probably some other functionality I don't really use

In my setup, only a few things have direct access to the Vault Agent socket,
and in future I might get rid of it from my setup entirely. `tokend` and
`secretsmgr` have access, and that's pretty much it. This is because the powers
that its Vault token gets are a combination of all the policies granted to the
machine, including all the apps running on it, so any app with access to its
Unix socket effectively gets all the secrets shared to anything on the server.

The secrets I use it to write to disk are strictly the plain KV type, rather
than anything more sophisticated, but I do use some [relatively complicated
Polkit
rules](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/nixos/lib/vault-agent-secrets.nix)
to allow it to reload/restart services when those secrets change.

## `tokend`

The user-based authentication I mentioned above (with the app policies and the
`server-user` policy) is powered by
[`tokend`](https://hg.lukegb.com/lukegb/depot/-/tree/branch/default/go/tokend),
which is a daemon that listens on a Unix socket and proxies requests through
the local Vault Agent, with a token issued that has a subset of the powers of
the original server-wide token.

The ACLs on talking to `tokend` are much more permissive than those for talking
directly to the Vault agent.

## `secretsmgr`

`secretsmgr` exists to solve some problems I was having with getting Vault
Agent to write secrets that require more complex Vault requests, like the TLS
certificates using ACME (which have rate limits imposed by Let's Encrypt!), and
SSH host certificates.

It's a pretty simple binary which runs using a systemd timer unit, starts up,
checks the remaining lifetime of the certificates it's responsible for, and
then reissues them if required.
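The core of that check-then-reissue loop is a comparison between a certificate's expiry and the current time. Here is a minimal Python sketch of the decision. The 30-day threshold and the name `needs_reissue` are assumptions for illustration, since the actual cut-off `secretsmgr` uses isn't stated here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: reissue once 30 days or less remain.
RENEW_BEFORE = timedelta(days=30)


def needs_reissue(not_after: datetime, now: datetime,
                  renew_before: timedelta = RENEW_BEFORE) -> bool:
    """True if the certificate expires within `renew_before` of `now`."""
    return not_after - now <= renew_before


now = datetime(2022, 4, 7, tzinfo=timezone.utc)

# 55 days of lifetime left: leave it alone.
print(needs_reissue(datetime(2022, 6, 1, tzinfo=timezone.utc), now))

# 13 days of lifetime left: time to reissue.
print(needs_reissue(datetime(2022, 4, 20, tzinfo=timezone.utc), now))
```

Running this on a timer rather than reissuing on every start keeps the number of ACME requests low, which matters given the rate limits mentioned above.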

Similar to the Vault Agent above, I use some [Polkit
rules](https://hg.lukegb.com/lukegb/depot/-/blob/branch/default/ops/nixos/lib/secretsmgr.nix)
to allow it to restart the ACME certificate consumers (usually nginx or
pomerium), and sshd.

## `access`

`access` checks to see if there's currently an active Vault token. If not, then
it launches the `vault login` flow which in my case asks me to log in with my
Google account. If that succeeds, or if I already had a token, then it
generates a new Ed25519 key pair and asks Vault to sign the public key with a
lifetime of about 24 hours, and then inserts the private key and certificate
into the SSH agent. This means the key never has to hit disk, since it can just
reside in the SSH agent.
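The signing step maps onto the `sign` endpoint of Vault's SSH secrets engine, which takes a `public_key`, a `ttl` and a `cert_type` and returns the certificate in `data.signed_key`. The Python sketch below only constructs the request and is not the `access` implementation: the mount path `ssh-user`, the role name `example-role`, the address, the token and the truncated public key are all placeholders invented for this example.

```python
import json
import urllib.request


def build_sign_request(vault_addr: str, token: str, mount: str, role: str,
                       public_key: str, ttl: str = "24h") -> urllib.request.Request:
    """Build (but do not send) a request asking Vault to sign an SSH public key."""
    payload = {
        "public_key": public_key,  # only the public half ever leaves the machine
        "ttl": ttl,                # roughly the 24 hour lifetime described above
        "cert_type": "user",       # a user certificate, not a host certificate
    }
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/{mount}/sign/{role}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# All values below are placeholders for illustration.
req = build_sign_request(
    "https://vault.example.com",
    "token-placeholder",
    mount="ssh-user",
    role="example-role",
    public_key="ssh-ed25519 AAAA... user@example",
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

The private key stays in memory on the client; Vault only ever sees the public key and hands back a short-lived certificate for it.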

The token that this flow issues is a `user` token (i.e. not a `server-user`
token, nor an `admin` token). It has permission to look at some specific
secrets related to that user, and at things which are generally shared, like
those Nix binary cache credentials, but doesn't have general access to
administrate Vault. I issue `admin` tokens with a lifetime of 1h when I need
them, and at some point will try to scope them down further - although since I
use them to deploy all the policies there is a limit to what I can feasibly do.