---
title: "Vault and Me: Taking It Too Far"
date: 2022-04-07
layout: Post
hero: https://images.unsplash.com/photo-1582139329536-e7284fece509
hero credit: https://unsplash.com/photos/3wPJxh-piRw
hero credit text: "Jason Dent"
classes:
  header: header-black-gradient
---
Recently I've been thinking about how I could distribute secrets to my NixOS
machines in a... relatively... decent way.
---
At the same time, I've been wanting to move to using an SSH CA to issue myself
credentials, rather than a variety of SSH public keys.
In any case, my Vault setup ends up looking like this:
- Vault instance, running on [Google Cloud Run](https://cloud.google.com/run)
with autounsealing from Cloud HSM, backed by Cloud Storage
- Vault configuration using [Terranix](https://terranix.org/) (Terraform, but
with the config in Nix)
- Vault App ID credentials on each machine, with Vault Agent used to
automatically auth
- `tokend` for issuing service credentials
- `secretsmgr` for managing SSH CA host certificates and ACME certificates
- `access` for issuing SSH CA user certificates
Let's go over these one at a time from a relatively low-level perspective. I'll
describe the mechanics of SSH CAs in more detail in a separate post; in the
meantime, some colleagues of mine have written some
[good](https://www.lorier.net/docs/ssh-ca.html)
[docs](https://blog.habets.se/2011/07/OpenSSH-certificates.html) on how SSH CAs
work, and I refer you to those, since I'll primarily be focusing on my specific
setup with Vault rather than a more general introduction.
## Vault instance on GCP
First off: why on GCP when I have a bunch of physical boxen I'm managing?
The reason is simple: if I have a lot riding on it, I'd rather that it doesn't
share a failure domain with the stuff I'm hosting. Recovery is much easier if
it doesn't depend on my own infrastructure already being up.
On the other hand... it does mean I'm putting my root of trust "out of my
hands". I'll take that risk.
Broadly speaking, my setup roughly mirrors Kelsey Hightower's [Serverless Vault
with Cloud
Run](https://github.com/kelseyhightower/serverless-vault-with-cloud-run) -
although I build the Docker container [using
Nix](https://git.lukegb.com/lukegb/depot/src/branch/canon/nix/docker/vault/default.nix).
It's a relatively neat setup, although... it turns out to be expensive. Maybe
I'll move it to Oracle Cloud's free tier running on one of their ARM64
instances.
### vault-acme
One important thing to notice is that I install the
[`vault-acme`](https://github.com/remilapeyre/vault-acme) secret engine for
issuing SSL certificates using ACME.
This allows me to just store the Let's Encrypt stuff within Vault and not have
to distribute my DNS server's credentials to each individual server. I could
run a separate service to do this, but it's super convenient to just have Vault
do it, since everything already authenticates to it.
## Vault configuration
I use Terranix to manage the Vault configuration - this is because I have all
my servers' configs in my repo, so I can actually introspect the NixOS
configuration to determine how I want to build the Vault config.
I have a helper script called `terraform` that acts like the normal Terraform
binary after having compiled my Vault configuration, so I can run `./terraform
apply` and have it just work the way you'd expect. At present it requires GCP
credentials to be issued separately, using gcloud, since I store my Terraform
state in a GCS bucket, but I'm hoping to grant access to this using a
Vault-managed GCP service account instead (still with the ability to use my
"normal" Google account though, if needed, because obviously I need to be able
to use it to fix Vault if Vault is broken...).
In particular, I generate identities per server defined in the config, and
provide myself some useful hooks to make "app" policy configuration easier.
### What do I mean by an "app"
My configuration basically defines an app as a separate Unix user; if you are
running as a Linux user named `fooservice` on server `barserver` (and the Vault
configuration says that `fooservice` is intended to exist on that server), then
`tokend` will issue you a Vault token with the policies `app/fooservice` and
`server/barserver/app/fooservice`, if those policies exist.
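
The naming scheme above can be sketched in a few lines. This is only an illustration of the rule as described, not `tokend`'s actual code: the function names, the `existing` set of policies, and the `deployed` set of users expected on the server are all hypothetical stand-ins for whatever the real configuration provides.

```python
def candidate_policies(user: str, server: str) -> list[str]:
    """Policy names an app user could receive under the naming scheme."""
    return [f"app/{user}", f"server/{server}/app/{user}"]


def granted_policies(
    user: str, server: str, existing: set[str], deployed: set[str]
) -> list[str]:
    """Keep only candidates that exist, and only for users meant to be on this server."""
    if user not in deployed:
        return []
    return [p for p in candidate_policies(user, server) if p in existing]


# Hypothetical data: policies defined in Vault, and users expected on each server.
existing = {
    "app/fooservice",
    "server/barserver/app/fooservice",
    "server/clouvider-lon01/app/gitlab-runner",
}

print(granted_policies("fooservice", "barserver", existing, {"fooservice"}))
# ['app/fooservice', 'server/barserver/app/fooservice']

# gitlab-runner on some other host: no matching policy exists, so nothing is granted.
print(granted_policies("gitlab-runner", "otherhost", existing, {"gitlab-runner"}))
# []
```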
This is super useful: for instance, in most cases an app is only deployed on
one host, and if I move it around then it makes sense for it to keep access
to the same secrets. I use [Pomerium](https://pomerium.io) as an authenticating
proxy, and so there's an `app/pomerium` policy which grants access to secrets
like `kv/apps/pomerium`.
However sometimes there are users which are deployed on more than one machine -
such as `gitlab-runner` - and that user should only get access to secrets on
one specific host. I use this concept for granting access to `gitlab-runner` on
a server called `clouvider-lon01` to be able to deploy to this blog! It [has
access](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/cfg/lukegbcom-deployer.nix)
to get an OAuth token to a specific GCP service account with permission to
deploy to Firebase Hosting via the `server/clouvider-lon01/app/gitlab-runner`
policy, but the `gitlab-runner` user anywhere else is not permitted to get
access to this secret.
### `server-user`
There are some secrets which aren't super secret and should be generally
accessible by users on servers, even if they don't have their own Vault token.
`tokend` checks to see if the user talking to it (via a Unix domain socket) is
a normal user rather than a service user, and if so will issue a token with the
`server-user` policy instead.
This token really just has a credential to get access to my Nix binary cache on
Google Cloud Storage, so it's not super confidential. There aren't really many
instances where this is useful, and in general on "client" devices I expect to
authenticate to Vault and get a more fully-fledged token as myself.
It doesn't have access to, for instance, issue SSH user certificates. That
power is restricted to "real" authenticated users who have authenticated
directly with Vault.
### `server/hostname`
Servers are also permitted to have server-wide secrets. This is mostly just
used for `secretsmgr` at the moment - arguably this could be its own app.
By default, servers [have
access](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/cfg/policies/server.hcl)
to `kv/server/$HOSTNAME`, and to issue ACME certificates, and the Nix binary
cache credentials. They also have the power to issue subtokens with
less power than themselves.
### ...how about as a diagram?
The description of the above might be a little confusing in terms of the Vault
policy hierarchy, so here's an example:
![Diagram illustrating token hierarchy](token-hierarchy.svg)
1. Vault issues the Vault Agent on `clouvider-lon01` a token. This token
includes the Vault policies `default`, `server`, `server-user`,
`server/clouvider-lon01`, `server/clouvider-lon01/app/gitlab-runner`, and
`app/deployer`. The app policies (`server/clouvider-lon01/app/gitlab-runner`
and `app/deployer`) are attached because the server configuration in the
repository states that those two applications are intended to be deployed on
that server.
2. `secretsmgr` on `clouvider-lon01` uses the token held by the Vault Agent
directly to refresh any TLS or SSH certificates needed by the server.
3. `tokend` on `clouvider-lon01` has no token of its own, but uses the one held
by the Vault Agent to issue app- or user-specific sub-tokens, with a subset
of the policies attached to the initial token.
4. `gitlab-runner` on `clouvider-lon01` talks to `tokend`, which issues it a
subtoken with **just** the `default` and
`server/clouvider-lon01/app/gitlab-runner` policies.
5. `deployer` on `clouvider-lon01` also talks to `tokend`, but it gets a
different subtoken which instead has the `default` and `app/deployer`
policies.
6. My own personal user account, `lukegb`, can also talk to `tokend` to get a
subtoken with the `default` and `server-user` policies. This token is very
limited compared to a standard `user`-policy token, which needs to be issued
by using the Vault API directly to authenticate as a user based on some
OpenID Connect credentials.
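
The six steps above boil down to one invariant: a subtoken only ever carries a subset of the policies on the Vault Agent's token. Here is a small model of that, using the policy names from the example. The `subtoken_policies` function and the `is_service_user` flag are hypothetical; how `tokend` really tells service users from normal users isn't covered here.

```python
# Policies on the Vault Agent's token for clouvider-lon01 (step 1).
AGENT_POLICIES = frozenset({
    "default",
    "server",
    "server-user",
    "server/clouvider-lon01",
    "server/clouvider-lon01/app/gitlab-runner",
    "app/deployer",
})


def subtoken_policies(user, server, is_service_user, parent=AGENT_POLICIES):
    """Pick the policies for a subtoken; always a subset of the parent token's."""
    if is_service_user:
        wanted = {f"app/{user}", f"server/{server}/app/{user}"}
    else:
        wanted = {"server-user"}
    granted = {"default"} | (wanted & parent)
    assert granted <= parent  # a subtoken can never gain policies
    return granted


print(sorted(subtoken_policies("gitlab-runner", "clouvider-lon01", True)))
# ['default', 'server/clouvider-lon01/app/gitlab-runner']
print(sorted(subtoken_policies("deployer", "clouvider-lon01", True)))
# ['app/deployer', 'default']
print(sorted(subtoken_policies("lukegb", "clouvider-lon01", False)))
# ['default', 'server-user']
```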
## Vault App ID credentials
I use the "App ID" mode in Vault to provision secrets to servers; when setting
a machine up (a process I have not yet automated), I run
[`reissue-secret-id.sh`](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/reissue-secret-id.sh)
which revokes all existing secret IDs for that host and dumps out a Vault
[response wrapped
token](https://www.vaultproject.io/docs/concepts/response-wrapping), which can
be used one time only to get the secret ID for that host.
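
For illustration, redeeming a response-wrapped token is a single call to Vault's `sys/wrapping/unwrap` endpoint, authenticated with the wrapping token itself. The sketch below is not the `provision-secret-id` script; the function names and the example address are made up, and it only builds the request unless `unwrap` is actually called.

```python
import json
import urllib.request


def build_unwrap_request(vault_addr: str, wrapping_token: str) -> urllib.request.Request:
    """Build the POST that redeems a response-wrapped token."""
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/wrapping/unwrap",
        method="POST",
        # The wrapping token is used as the client token for this one call.
        headers={"X-Vault-Token": wrapping_token},
    )


def unwrap(vault_addr: str, wrapping_token: str) -> dict:
    """Send the request and return the `data` field of the wrapped response."""
    with urllib.request.urlopen(build_unwrap_request(vault_addr, wrapping_token)) as resp:
        return json.load(resp)["data"]


# Hypothetical address and token, shown without touching the network.
req = build_unwrap_request("https://vault.example.com", "wrapping-token-here")
print(req.get_method(), req.full_url)
# POST https://vault.example.com/v1/sys/wrapping/unwrap
```

A second attempt to unwrap the same token fails, which is what makes a leaked or intercepted wrapping token detectable.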
There's a
[`provision-secret-id`](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/default.nix)
script installed on every machine which will then install the secret for me.
Future work in this space for me is binding the secret to the TPM (e.g. using
mTLS auth) so I don't have to stick the secret ID on disk... but then again I'm
not a multinational corporation, and my secrets aren't worth _that_ much.
## Vault Agent
The Vault Agent is a daemon that serves a number of purposes:
- It can act as a proxy which keeps an internal Vault token and automatically
refreshes it, and then attaches it to every request that it proxies to Vault
as received on a Unix socket (or a TCP socket... but I don't use it like
that)
- It can use templates to write secrets to disk (with drawbacks, hence the
creation of `secretsmgr` below)
- Probably some other functionality I don't really use
In my setup, only a few things have direct access to the Vault Agent socket,
and in future I might get rid of it from my setup entirely. `tokend` and
`secretsmgr` have access, and that's pretty much it. This is because the powers
that its Vault token gets are a combination of all the policies granted to the
machine, including all the apps running on it, so any app with access to its
Unix socket effectively gets all the secrets shared to anything on the server.
The secrets I use it to write to disk are strictly the plain KV type, rather
than anything more sophisticated, but I do use some [relatively complicated
Polkit
rules](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/nixos/lib/vault-agent-secrets.nix)
to allow it to reload/restart services when those secrets change.
## `tokend`
The user-based authentication I mentioned above (with the app policies and the
`server-user` policy) is powered by
[`tokend`](https://git.lukegb.com/lukegb/depot/src/branch/canon/go/tokend),
which is a daemon that listens on a Unix socket and proxies requests through
the local Vault Agent, with a token issued that has a subset of the powers of
the original server-wide token.
The ACLs on talking to `tokend` are much more permissive than those for talking
directly to the Vault Agent, because the token you get depends on your identity.
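
One common way for a daemon to learn the identity of whoever connected to its Unix socket is the Linux `SO_PEERCRED` socket option. Whether `tokend` (which is written in Go) does exactly this is an assumption on my part; the Python sketch below just demonstrates the mechanism, using a socket pair within a single process so it can run without a real daemon.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid;
UCRED_FORMAT = "iII"


def peer_credentials(conn: socket.socket) -> tuple[int, int, int]:
    """Return (pid, uid, gid) of the process on the other end of a Unix socket (Linux only)."""
    raw = conn.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(UCRED_FORMAT)
    )
    return struct.unpack(UCRED_FORMAT, raw)


# Both ends live in this process, so the peer is ourselves.
server_side, client_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    pid, uid, gid = peer_credentials(server_side)
    print(uid == os.getuid(), pid == os.getpid())
finally:
    server_side.close()
    client_side.close()
```

The kernel fills in these credentials, so a client can't lie about its UID, which is what makes it safe to hand out different tokens to different users over the same socket.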
## `secretsmgr`
`secretsmgr` exists to solve some problems I was having with getting Vault
Agent to write secrets that require more complex Vault requests, like the TLS
certificates using ACME (which have ratelimits imposed by Let's Encrypt!), and
SSH host certificates.
It's a pretty simple binary which runs using a systemd timer unit, starts up,
checks the remaining lifetime of the certificates it's responsible for, and
then reissues them if required.
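
The "check the remaining lifetime, reissue if required" decision could look roughly like this. It's a sketch rather than `secretsmgr`'s real logic: the function name and the 30-day threshold are assumptions picked for the example.

```python
from datetime import datetime, timedelta, timezone


def needs_reissue(not_after: datetime, now: datetime, min_remaining: timedelta) -> bool:
    """True if the certificate expires sooner than the minimum lifetime we want left."""
    return not_after - now < min_remaining


# Hypothetical values: renew anything with less than 30 days left.
now = datetime(2022, 4, 7, tzinfo=timezone.utc)
threshold = timedelta(days=30)

print(needs_reissue(now + timedelta(days=10), now, threshold))  # True
print(needs_reissue(now + timedelta(days=60), now, threshold))  # False
print(needs_reissue(now - timedelta(days=1), now, threshold))   # True (already expired)
```

Only reissuing when the remaining lifetime drops below a threshold is what keeps a frequently-firing timer from burning through Let's Encrypt's rate limits.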
Similar to the Vault Agent above, I use some [Polkit
rules](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/nixos/lib/secretsmgr.nix)
to allow it to restart the ACME certificate consumers (usually nginx or
pomerium), and sshd.
## `access`
`access` checks to see if there's currently an active Vault token. If not, then
it launches the `vault login` flow which in my case asks me to log in with my
Google account. If that succeeds, or if I already had a token, then it
generates a new Ed25519 keypair, asks Vault to sign the public key with a
lifetime of about 24 hours, and then inserts the key and its certificate into
the SSH agent. This means the private key never has to hit disk, since it can
just reside in the SSH agent.
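
As a rough sketch of the signing step: Vault's SSH secrets engine exposes a `sign/<role>` endpoint that takes a public key and returns a certificate. The mount path (`ssh-user`), role name (`me`), address, and function name below are all hypothetical, and key generation and the SSH agent interaction are left out entirely.

```python
import json
import urllib.request


def build_sign_request(
    vault_addr: str, token: str, mount: str, role: str, public_key: str, ttl: str = "24h"
) -> urllib.request.Request:
    """Build the POST asking Vault's SSH engine to sign a user public key."""
    body = {"public_key": public_key, "ttl": ttl, "cert_type": "user"}
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/{mount}/sign/{role}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# Hypothetical mount, role and key; nothing is sent over the network here.
req = build_sign_request(
    "https://vault.example.com", "user-token", "ssh-user", "me",
    "ssh-ed25519 AAAA... example",
)
print(req.full_url)
# https://vault.example.com/v1/ssh-user/sign/me
print(json.loads(req.data)["ttl"])
# 24h
```

On success, the response's `data.signed_key` field holds the certificate, which is the part that gets loaded into the agent alongside the private key.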
The token that this flow issues is a `user` token (i.e. not a `server-user`
token, nor an `admin` token). It has permission to look at some specific
secrets related to that user, plus things which are generally shared (like
those Nix binary cache credentials), but doesn't have general access to
administer Vault. I issue `admin` tokens with a lifetime of 1h when I need them, and at
some point will try to scope them down further -- although since I use them to
deploy all the policies there is a limit to what I can feasibly do.