---
title: "Vault and Me: Taking It Too Far"
date: 2022-04-07
layout: Post
hero: https://images.unsplash.com/photo-1582139329536-e7284fece509
hero credit: https://unsplash.com/photos/3wPJxh-piRw
hero credit text: "Jason Dent"
classes:
  header: header-black-gradient
---

Recently I've been thinking about how I could distribute secrets to my NixOS
machines in a... relatively... decent way.

---

At the same time, I've been wanting to move to using an SSH CA to issue myself
credentials, rather than a variety of SSH public keys.

In any case, my Vault setup ends up looking like this:

- Vault instance, running on [Google Cloud Run](https://cloud.google.com/run)
  with autounsealing from Cloud HSM, backed by Cloud Storage
- Vault configuration using [Terranix](https://terranix.org/) (Terraform, but
  with the config in Nix)
- Vault App ID credentials on each machine, with Vault Agent used to
  automatically auth
- `tokend` for issuing service credentials
- `secretsmgr` for managing SSH CA host certificates and ACME certificates
- `access` for issuing SSH CA user certificates

Let's go over these one at a time from a relatively low-level perspective. I'll
describe the mechanics of SSH CAs in more detail in a separate post. If
you're interested in some [good](https://www.lorier.net/docs/ssh-ca.html)
[docs](https://blog.habets.se/2011/07/OpenSSH-certificates.html) on how SSH CAs
work, written by some colleagues of mine, then I refer you to those instead,
since I'll primarily be focusing on my specific setup with Vault rather than a
more general introduction.

## Vault instance on GCP

First off: why on GCP when I have a bunch of physical boxen I'm managing?

The reason is simple: if I have a lot riding on it, I'd rather that it doesn't
have the same failure domain as the stuff I'm hosting. It makes it much easier
to recover if I don't have to rely on having my own infrastructure up in order
to recover it.

On the other hand... it does mean I'm putting my root of trust "out of my
hands". I'll take that risk.

Broadly speaking, my setup roughly mirrors Kelsey Hightower's [Serverless Vault
with Cloud
Run](https://github.com/kelseyhightower/serverless-vault-with-cloud-run) -
although I build the Docker container [using
Nix](https://git.lukegb.com/lukegb/depot/src/branch/canon/nix/docker/vault/default.nix).

It's a relatively neat setup, although... it turns out to be expensive. Maybe
I'll move it to Oracle Cloud's free tier running on one of their ARM64
instances.

### vault-acme

One important thing to notice is that I install the
[`vault-acme`](https://github.com/remilapeyre/vault-acme) secret engine for
issuing TLS certificates using ACME.

This allows me to just store the Let's Encrypt stuff within Vault and not have
to distribute my DNS server's credentials to each individual server. I could
run a separate service to do this, but it's super convenient to just have Vault
do it, since everything already authenticates to it.

## Vault configuration

I use Terranix to manage the Vault configuration - this is because I have all
my servers' configs in my repo, so I can actually introspect the NixOS
configuration to determine how I want to build the Vault config.

I have a helper script called `terraform` that acts like the normal Terraform
binary after having compiled my Vault configuration, so I can run `./terraform
apply` and have it just work the way you'd expect. At present it requires GCP
credentials to be issued separately, using gcloud, since I store my Terraform
state in a GCS bucket. I'm hoping to grant access to this using a
Vault-managed GCP service account instead (still with the ability to use my
"normal" Google account though, if needed, because obviously I need to be able
to use it to fix Vault if Vault is broken...)

In particular, I generate identities per server defined in the config, and
provide myself some useful hooks to make "app" policy configuration easier.

### What do I mean by an "app"

My configuration basically defines an app as a separate Unix user; if you are
running as a Linux user named `fooservice` on server `barserver` (and the Vault
configuration says that `fooservice` is intended to exist on that server), then
`tokend` will issue you a Vault token with the policies `app/fooservice` and
`server/barserver/app/fooservice`, if those policies exist.

This is super useful: for instance, in the common case an app is only deployed
on one host, and if I move it around then it makes sense for it to keep access
to the same secrets. I use [Pomerium](https://pomerium.io) as an authenticating
proxy, and so there's an `app/pomerium` policy which grants access to secrets
like `kv/apps/pomerium`.

However, sometimes there are users which are deployed on more than one machine -
such as `gitlab-runner` - and that user should only get access to secrets on
one specific host. I use this concept for granting access to `gitlab-runner` on
a server called `clouvider-lon01` to be able to deploy to this blog! It [has
access](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/cfg/lukegbcom-deployer.nix)
to get an OAuth token to a specific GCP service account with permission to
deploy to Firebase Hosting via the `server/clouvider-lon01/app/gitlab-runner`
policy, but the `gitlab-runner` user anywhere else is not permitted to get
access to this secret.

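
As a rough illustration of the naming rule above (this is a sketch, not the actual `tokend` code), the policy lookup could be expressed like this. The function names and the `defined` set are hypothetical, and the check that the user is actually intended to exist on the server is left out for brevity.

```python
def candidate_policies(user: str, host: str) -> list[str]:
    """Policy names an app user could receive: app-wide first, then host-specific."""
    return [f"app/{user}", f"server/{host}/app/{user}"]


def policies_for(user: str, host: str, existing: set[str]) -> list[str]:
    """Keep only the candidate policies that are actually defined in Vault."""
    return [p for p in candidate_policies(user, host) if p in existing]


# Hypothetical set of policies defined in the Vault configuration.
defined = {
    "app/pomerium",
    "server/clouvider-lon01/app/gitlab-runner",
}

print(policies_for("pomerium", "barserver", defined))
# ['app/pomerium']
print(policies_for("gitlab-runner", "clouvider-lon01", defined))
# ['server/clouvider-lon01/app/gitlab-runner']
print(policies_for("gitlab-runner", "barserver", defined))
# []
```

The last call shows the host-scoped case: the same Unix user on a different server ends up with no app policies at all.
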
### `server-user`

There are some secrets which aren't super secret and should be generally
accessible by users on servers, even if they don't have their own Vault token.
`tokend` checks to see if the user talking to it (via a Unix domain socket) is
a normal user rather than a service user, and if so will issue a token with the
`server-user` policy instead.

This token really just has a credential to get access to my Nix binary cache on
Google Cloud Storage, so it's not super confidential. There aren't really many
instances where this is useful, and in general on "client" devices I expect to
authenticate to Vault and get a more fully-fledged token as myself.

It doesn't have access to, for instance, issue SSH user certificates. That
power is restricted to "real" authenticated users who have authenticated
directly with Vault.

### `server/hostname`

Servers are also permitted to have server-wide secrets. This is mostly just
used for `secretsmgr` at the moment - arguably this could be its own app.

By default, servers [have
access](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/cfg/policies/server.hcl)
to `kv/server/$HOSTNAME`, and to issue ACME certificates, and the Nix binary
cache credentials. They also have the power to issue subtokens with
less power than themselves.

### ...how about as a diagram?

The description of the above might be a little confusing in terms of the Vault
policy hierarchy, so here's an example:

![Diagram illustrating token hierarchy](token-hierarchy.svg)

1. Vault issues the Vault Agent on `clouvider-lon01` a token. This token
   includes the Vault policies `default`, `server`, `server-user`,
   `server/clouvider-lon01`, `server/clouvider-lon01/app/gitlab-runner`, and
   `app/deployer`. The app policies (`server/clouvider-lon01/app/gitlab-runner`
   and `app/deployer`) are attached because the server configuration in the
   repository states that those two applications are intended to be deployed on
   that server.
2. `secretsmgr` on `clouvider-lon01` uses the token held by the Vault Agent
   directly to refresh any TLS or SSH certificates needed by the server.
3. `tokend` on `clouvider-lon01` has no token of its own, but uses the one held
   by the Vault Agent to issue app- or user-specific sub-tokens, with a subset
   of the policies attached to the initial token.
4. `gitlab-runner` on `clouvider-lon01` talks to `tokend`, which issues it a
   subtoken with **just** the `default` and
   `server/clouvider-lon01/app/gitlab-runner` policies.
5. `deployer` on `clouvider-lon01` also talks to `tokend`, but it gets a
   different subtoken which instead has the `default` and `app/deployer`
   policies.
6. My own personal user account, `lukegb`, can also talk to `tokend` to get a
   subtoken with the `default` and `server-user` policies. This token is very
   limited compared to a standard `user`-policy token, which needs to be issued
   by using the Vault API directly to authenticate as a user based on some
   OpenID Connect credentials.

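
The subset relationship in steps 3 to 6 can be modelled in a few lines. This is only a sketch of the rule that a subtoken never carries a policy its parent lacks; the function name is hypothetical, and the real enforcement happens inside Vault when the child token is created.

```python
# Policies on the Vault Agent token for clouvider-lon01 (from step 1 above).
AGENT_POLICIES = frozenset({
    "default",
    "server",
    "server-user",
    "server/clouvider-lon01",
    "server/clouvider-lon01/app/gitlab-runner",
    "app/deployer",
})


def subtoken_policies(parent: frozenset, requested: set) -> frozenset:
    """Return the requested policies, refusing any the parent token lacks."""
    wanted = frozenset(requested)
    extra = wanted - parent
    if extra:
        raise PermissionError(f"not held by parent token: {sorted(extra)}")
    return wanted


# Steps 4, 5 and 6 respectively.
gitlab_runner = subtoken_policies(
    AGENT_POLICIES, {"default", "server/clouvider-lon01/app/gitlab-runner"}
)
deployer = subtoken_policies(AGENT_POLICIES, {"default", "app/deployer"})
lukegb = subtoken_policies(AGENT_POLICIES, {"default", "server-user"})

print(sorted(deployer))
# ['app/deployer', 'default']
```

Asking for something like `admin` raises `PermissionError` here, because the server-wide token never held that policy in the first place.
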
## Vault App ID credentials

I use the "App ID" mode in Vault (the AppRole auth method) to provision secrets
to servers; when setting a machine up (a process I have not yet automated), I run
[`reissue-secret-id.sh`](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/reissue-secret-id.sh)
which revokes all existing secret IDs for that host and dumps out a Vault
[response wrapped
token](https://www.vaultproject.io/docs/concepts/response-wrapping), which can
be used one time only to get the secret ID for that host.

There's a
[`provision-secret-id`](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/vault/default.nix)
script installed on every machine which will then install the secret for me.

Future work in this space for me is binding the secret to the TPM (e.g. using
mTLS auth) so I don't have to stick the secret ID on disk... but then again I'm
not a multinational corporation, and my secrets aren't worth _that_ much.

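
For reference, the one-time unwrap step boils down to a single call to Vault's `sys/wrapping/unwrap` endpoint, authenticated with the wrapping token itself. The sketch below only builds the request and parses a response; it assumes the wrapped payload is an AppRole secret ID response (with the value under `data.secret_id`), and the helper names and example address are made up. It is not what `provision-secret-id` actually does.

```python
import json
import urllib.request


def build_unwrap_request(vault_addr: str, wrapping_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that redeems a response-wrapping token."""
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/wrapping/unwrap",
        method="POST",
        # The wrapping token is presented as the client token; it works once.
        headers={"X-Vault-Token": wrapping_token},
    )


def secret_id_from_response(body: bytes) -> str:
    """Pull the secret ID out of the unwrapped response body (assumed shape)."""
    return json.loads(body)["data"]["secret_id"]


# Hypothetical address and token, for illustration only.
req = build_unwrap_request("https://vault.example.com/", "hvs.wrapping-token")
print(req.get_method(), req.full_url)

# A made-up response body with the assumed shape.
sample = b'{"data": {"secret_id": "example-secret-id", "secret_id_accessor": "abc"}}'
print(secret_id_from_response(sample))
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`; a second attempt with the same wrapping token should fail, which is what makes the hand-off auditable.
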
## Vault Agent

The Vault Agent is a daemon that serves a number of purposes:

- It can act as a proxy which keeps an internal Vault token and automatically
  refreshes it, and then attaches it to every request that it proxies to Vault
  as received on a Unix socket (or a TCP socket... but I don't use it like
  that)
- It can use templates to write secrets to disk (with drawbacks, hence the
  creation of `secretsmgr` below)
- Probably some other functionality I don't really use

In my setup, only a few things have direct access to the Vault Agent socket,
and in future I might get rid of it from my setup entirely. `tokend` and
`secretsmgr` have access, and that's pretty much it. This is because the powers
that its Vault token gets are a combination of all the policies granted to the
machine, including all the apps running on it, so any app with access to its
Unix socket effectively gets all the secrets shared to anything on the server.

The secrets I use it to write to disk are strictly the plain KV type, rather
than anything more sophisticated, but I do use some [relatively complicated
Polkit
rules](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/nixos/lib/vault-agent-secrets.nix)
to allow it to reload/restart services when those secrets change.

## `tokend`

The user-based authentication I mentioned above (with the app policies and the
`server-user` policy) is powered by
[`tokend`](https://git.lukegb.com/lukegb/depot/src/branch/canon/go/tokend),
which is a daemon that listens on a Unix socket and proxies requests through
the local Vault Agent, with a token issued that has a subset of the powers of
the original server-wide token.

The ACLs on talking to `tokend` are much more permissive than those for talking
directly to the Vault agent, because the token you get depends on your identity.

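
One common way for a daemon to learn who is on the other end of a Unix socket on Linux is the `SO_PEERCRED` socket option. Whether `tokend` uses exactly this mechanism is an assumption on my part; the sketch below just shows the technique, with a hypothetical helper name.

```python
import os
import socket
import struct


def peer_credentials(conn: socket.socket) -> tuple:
    """Return (pid, uid, gid) of the peer on a connected Unix socket (Linux only)."""
    fmt = "iII"  # struct ucred: pid_t pid, uid_t uid, gid_t gid
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(fmt))
    return struct.unpack(fmt, raw)


# A socketpair stands in for a client connecting to the daemon's socket.
server_side, client_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pid, uid, gid = peer_credentials(server_side)
print(uid == os.getuid())
server_side.close()
client_side.close()
```

The kernel fills in these credentials, so the client cannot lie about its UID. The UID can then be mapped to a username and on to the policy names described earlier.
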
## `secretsmgr`

`secretsmgr` exists to solve some problems I was having with getting Vault
Agent to write secrets that require more complex Vault requests, like the TLS
certificates using ACME (which have rate limits imposed by Let's Encrypt!), and
SSH host certificates.

It's a pretty simple binary which runs using a systemd timer unit, starts up,
checks the remaining lifetime of the certificates it's responsible for, and
then reissues them if required.

Similar to the Vault Agent above, I use some [Polkit
rules](https://git.lukegb.com/lukegb/depot/src/branch/canon/ops/nixos/lib/secretsmgr.nix)
to allow it to restart the ACME certificate consumers (usually nginx or
pomerium), and sshd.

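
The "check the remaining lifetime, reissue if required" step might look something like the following. The one-third threshold and the function name are assumptions for illustration; the post doesn't say what rule `secretsmgr` actually applies.

```python
from datetime import datetime, timedelta, timezone


def needs_reissue(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    min_fraction_left: float = 1 / 3,  # assumed threshold, not from secretsmgr
) -> bool:
    """True once less than min_fraction_left of the total lifetime remains."""
    lifetime = not_after - not_before
    remaining = not_after - now
    return remaining <= lifetime * min_fraction_left


issued = datetime(2022, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)  # a Let's Encrypt-style 90 day certificate

print(needs_reissue(issued, expires, issued + timedelta(days=30)))  # False
print(needs_reissue(issued, expires, issued + timedelta(days=70)))  # True
```

Using a fraction rather than a fixed number of days lets the same check cover both long-lived ACME certificates and much shorter-lived SSH host certificates.
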
## `access`

`access` checks to see if there's currently an active Vault token. If not, then
it launches the `vault login` flow which in my case asks me to log in with my
Google account. If that succeeds, or if I already had a token, then it
generates a new Ed25519 key pair and asks Vault to sign the public key with a
lifetime of about 24 hours, and then inserts the key and its certificate into
the SSH agent. This means the key never has to hit disk, since it can just
reside in the SSH agent.

The token that this flow issues is a `user` token (i.e. not a `server-user`
token, nor an `admin` token). It has permission to look at some specific
secrets related to that user and at things which are generally shared, like
those Nix binary cache credentials, but it doesn't have general access to
administer Vault. I issue `admin` tokens with a lifetime of 1h when I need
them, and at some point will try to scope them down further -- although since I
use them to deploy all the policies there is a limit to what I can feasibly do.

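
To make the signing step at the start of this section a little more concrete, here is a sketch of the request that would go to Vault's SSH secrets engine. The mount path, role name, principal, and helper name are all hypothetical; key generation and the SSH agent hand-off are left out because they need more than the Python standard library.

```python
import json


def build_sign_request(mount: str, role: str, public_key: str,
                       principal: str, ttl: str = "24h"):
    """Return the (API path, JSON body) for an SSH user certificate request."""
    path = f"/v1/{mount}/sign/{role}"
    body = {
        "public_key": public_key,       # only the public half is ever sent
        "valid_principals": principal,  # the login name(s) the cert is valid for
        "cert_type": "user",
        "ttl": ttl,
    }
    return path, json.dumps(body)


# Hypothetical mount, role and key material, for illustration only.
path, payload = build_sign_request(
    mount="ssh-client-signer",
    role="user",
    public_key="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleOnly lukegb@example",
    principal="lukegb",
)
print(path)
print(payload)
```

Vault returns the certificate in the `signed_key` field of the response data, and that certificate is what gets loaded into the agent alongside the in-memory private key.
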