Deploying your application secrets: Hashicorp Vault and continuous delivery

2019/03/01

This is the text transcript of the presentation Déploiement de vos secrets applicatifs : Hashicorp Vault et la livraison en continue, made at Devoxx France 2019.

Everybody has secrets: people, machines and web applications. Pretty much any application needs to access external or protected resources: databases, web services, secure tunnels, etc. Today we’ll look into securing the resources that gives access to those external systems: application secrets.

Context

Web application secrets

We define a secret as information that can be used to access sensitive data. Pretty much any information that we cannot put on a public repository. That includes:

Insurer web service credentials (username, password)
Encryption keys and key passphrases
Database credentials (username, password)

We do not consider customer credentials and personally identifiable information (PII) to be secrets, since those things are already protected in our database.

Secret in Java file

Secrets in source code is pretty common, one example being directly in Java files. This examples shows a callback to connect to an external web service that shows cleartext username and password.

public class ClientPasswordCallback implements CallbackHandler {

  private static final String USERNAME = "lesfurets";
  private static final String PASSWORD = "hunter2";

  @Override
  public void handle(Callback[] callbacks) {
    final WSPasswordCallback pc = (WSPasswordCallback) callbacks[0];
    if (USERNAME.equals(pc.getIdentifier())) {
      pc.setPassword(PASSWORD);
    }
  }

}

Secret in Tomcat server.xml

They can also reside in infrastructure code, like this server.xml code declaring a new JDBC resource for a database connection, also with cleartext username and password.

<?xml version='1.0' encoding='utf-8'?>
<Server port="1234" shutdown="SHUTDOWN">
  <!-- ... -->
  <GlobalNamingResources>
    <Resource name="jdbc/b2b2cDatabase" 
                 username="dev"
                 password="hunter2"
                 url="localhost:2345"
                 type="javax.sql.DataSource"
                 driverClassName="org.mariadb.jdbc.Driver"
                 jdbcInterceptors="..."/>
  </GlobalNamingResources>
  <!-- ... -->
</Server>

Today’s objective

Remove secrets from code and production machines.

We’re putting emphasis on this, since your objective might be different, and the solution you’ll choose will vary greatly depending on it. You could for instance choose to remove secrets only in code and not on production machines, which is really easier.

Overview of password workflow

Our objective is to have secret life cycle that works like this:

A developer adds or uses a password key in code (ex: insurer_password). He removes the production value if necessary
The same developer puts a development value in code (ex: hunter2). This is useful for accessing a mock service or a test service.
A security admin adds the secret production value in secret management system.
A release manager deploys the application without seeing the production value.
The production app machine uses the secret

From code to production, different person with different access rights are handling secrets. What is interesting here is that for a small team of developer, all the roles can be filled by the same developer, but as team size grows, the process of enabling different security levels is already in place.

Prerequisite: Infrastructure as code

Some elements are prerequisite to our Vault and continuous delivery approach. We consider you won’t be able to do it without it.

If you do infratructure as code, you probably already have secrets in your source code, because your infrastructure configuration requires credentials, like database credentials. We want to keep infra as code, but remove the secrets in the source code.

Prerequisite: Infrastructure automation

Our machine provisioning and deployment is done with Ansible. It makes staging possible by facilitating the creation of new environment and enables disposable infrastructure (immutable infrastructure). Basically, it makes losing a machine or a production instance okay because you know you can redeploy it easily.

Logo Ansible Logo Chef Logo Puppet

Prerequisite: Continuous delivery

At LesFurets we deliver code to production at least daily. Continuous delivery means that it is easy to push a feature to production (like a new secret, or changing an existing one fast), and also easy to push an old version in case of emergency (or hot fix).

We use Jenkins with the jobs written in pipeline (that’s Groovy code), because we can commit the jobs description in our code source.

Logo Jenkins Logo Teamcity

Security

Choosing a tool

Many tools are available for secrets management, yet not all will fit your purpose. Making your own custom solution might not be a good idea given how hard it is.

Ansible, Chef, etc. do not remove secrets on production machine, so it was not an option for us, but might be totally reasonable for other use cases
Keywhiz very similar to Vault and could have been a good choice
Amazon KMS, Azure Key Vault, Google KMS somewhat similar to Vault but tied to specific ecosystem, so it was not an option for us

Buildtime secrets vs Runtime secrets

You can fetch the secrets at:

Buildtime which means the production machine will have a cleartext copy of the secret, because the build system will replace a secret placeholder (basically the key of the secret) with the secret value before sending the app to production
Runtime which means the production machine will dynamically get the secret, use it, then discard it, resulting in increased security (that’s what we want to do)

Hashicorp Vault

Lightweight, performant, open-source and battle hardened.

Seal and unseal makes your Vault safe. It is very useful to zip and send a Vault to another machine, since on startup it means the Vault will not respond to any command until it is unsealed with the encryption key generated on Vault initialisation.
Wrap secrets to distribute them safely. It is very useful to provide single use secrets, since a wrapped secret can be unwrapped only once by Vault (subsequent unwraps will fail) and can be restricted with lease and permissions.
Authenticate with different methods. We’re using token authentication (wrapped token authentication, enabling a single use, one-time token), but you could also use LDAP authentication, username / password, etc.
Audit log out of the box and easy to use. This is crucial for monitoring and alerting.

Logo Hashicorp Vault

Disclaimer

This is our field experience feedback, your mileage may vary. We did a very specific design for specific reasons.

Our Vault usage context

We’re deploying multiple copies of Vault instead of using it as a central database. This is very different from classic usage, but we’re doing it for specific reasons.

We can see in the following diagrams that on the left, all environments and JVM instances connects to the same Vault. That can be a problem is the network is bad, if the performance requirements are high, or if the risk of having a single point of failure is too high. We avoid that by deploying one Vault for each JVM instance, which has a lot of advantages for us. You can see the Vault on the right is local to the JVM.

Diagram of centralized Vault deployment Diagram of decentralized Vault deployment

Why use Vault decentralized?

We are looking for specific advantages:

Disposable infrastructure: replace instead of modify
Continuous delivery: easier to replace the secrets container
Version migration: don’t have to migrate secrets database
Attack surface: infrastructure wide system with all the info
Staging: deploying specific secrets for a specific env
SLA: single point of failure, network issues
Performance: one local Vault per JVM is super-fast

Team Password Manager

TPM is a password manager (like Vault) containing our secrets, but it is never used directly by the production servers.

Additional failsafe layer: if it fails, it doesn’t impact the system, and we can repair it at our own pace
Easier to migrate: since the production doesn’t depend on it (nobody likes to migrate a live system)
Can be any database system: any secure database system (ever another Vault)

Storage of key -> values

What we need in terms of password manager solution is basically a key -> value storage. The key is the same as the in the code, the value is the secret we’re protecting.

Screen capture of Team Password Manager 01

Uses permissions and audit logs

We also need permissions and audit logs for humans and machines as well. That should be necessary and default for any systems handling secrets.

Screen capture of Team Password Manager 02

Threat model

What is a threat model?

A process by which potential threats can be identified, enumerated, and prioritized.

Why a threat model?

To design a system with security in mind.

Threat model: STRIDE

There are many ways to do a threat model, today we’ll use the popular STRIDE method (see STRIDE (security)).

Threat model: Vault startup

This is the first of two threat model we’ll do, showing the potential threat of creating a Vault and starting it on the production machine.

Theat model of Vault startup risk 01

Threat model: Vault startup (Denial of service: TPM)

The first potential risk is a denial of service threat on TPM (either by malicious intent or bug). This risk is mitigated by two things: it doesn’t impact the production Vault (on the left) and can be alleviated by making TPM high availability.

Theat model of Vault startup risk 02

Vault startup (Information Disclosure: Unseal key)

On the second step the Vault is initialized so the key is kept in memory of the Ansible process. If that key is leaked it could be used to do information disclosure since an attacker could unseal the Vault. That risk is mitigated by two things: keeping the key only in memory (with a docker tmpfs volume) and making sure the Vault is properly protected by authentication (see next threat model).

Theat model of Vault startup risk 03

Vault startup (Information Disclosure: Unseal key)

On the third step the Vault is sealed then sent to the production machine. There’s an information disclosure problem if an attacker gets hold of the Vault, but it is mitigated if the key is well protected (see previous and next diagrams). We also assume that Vault’s threat model is sound.

Theat model of Vault startup risk 04

Vault startup (Information Disclosure: Key in transit)

On the fourth step the Vault is unsealed by the Ansible process. There’s a possible information disclosure on the key in transit since it could be intercepted and used to unseal the Vault. The risk is mitigated by passing the key by stdin in a SSH tunnel, which is fairly secure.

Theat model of Vault startup risk 05

Threat model: Tomcat startup

The second threat model is on the bootstrap of the Tomcat that will authenticate on the unsealed Vault (this step is just after the last threat model).

Theat model of Tomcat startup risk 01

Tomcat startup (Elevation of privilege: One time token)

On the second step of the Tomcat startup, the Ansible process will create a single use authentication token that wraps the session token that is authorized to query Vault. If that token is leaked an attacker might be able to elevate it’s privilege by unwrapping the token and using the session token to query Vault. We’re mitigating this attack by: keeping the token in memory only and by unwrapping it soon after on the production machine (validating if it was already unwrapped).

Theat model of Tomcat startup risk 02

Tomcat startup (Elevation of privilege: One time token)

On the third step, we deploy the JVM with the OTT (One Time Token, or single use token). There is a small window here to intercept the content and do a elevation of privilege by using the token. We mitigate that risk by making sure the next step happens quickly.

Theat model of Tomcat startup risk 03

Tomcat startup (Denial of service: One time token)

On the fourth step, the JVM will bootstrap itself and unwrap the OTT by calling unwrap on the Vault process, giving back the session token. This verifies the token wasn’t used by anybody else, but if it was, Vault will throw an exception and should be considered a security breach. The JVM won’t start without a connexion to Vault, so that’s a denial of service, and the security team should close the offending machine, effectively sealing the Vault and stopping the attack.

Theat model of Tomcat startup risk 04

Tomcat startup (Spoofing: Session token)

On the last step, the JVM keeps a session token in memory to query (calling read) Vault. This session token is stored in memory and if stolen an attacker could spoof itself on the Vault. This is mitigated by renewing the token often (with a time to live of 1 minute) and obfuscating the memory (this is harder in Java since memory dumps cannot be prevented).

Theat model of Tomcat startup risk 05

Security design implications: decryption key

There is only one decryption key that can unseal the Vault. It should never be written to disk. If the Vault is sealed (manually or not), it cannot be unsealed again.

If that happens, the application needs to be redeployed.

Security design implications: authentication

There is only one, single use, wrapped token than can provide the session token. Once the wrapped token is used, there is no other way of connecting to the Vault.

If the connexion to the Vault is lost for too long, the lease for the session token expires and the app cannot authenticate anymore.

If that happens, the application needs to be redeployed.

In practice

Overview of delivery pipeline

This is an overview of the steps necessary to package and deploy Vault and Tomcat on a production server.

Overview delivery pipeline diagram 01

The first step is to get the keys from the source code in git. The keys are generated during build time using annotations.

Overview delivery pipeline diagram 02

The second step creates an archive with the Vault configuration (policy, server conf, etc.) and the keys.

Overview delivery pipeline diagram 03

The third step uploads that artefact to Nexus. This is useful because the Jenkins deploy job might not come right after the packaging job.

Overview delivery pipeline diagram 04

The fourth step gets the Vault artefact from the Nexus.

Overview delivery pipeline diagram 05

The fifth step will iterate on every key and request the secret value to TPM. It is necessary that this process do not write the secrets to disk, but instead directly to Vault.

Overview delivery pipeline diagram 06

The sixth step will seal the Vault and zip it to prepare sending it to the production machine. At that point it has both the key and the OTT in memory.

Overview delivery pipeline diagram 07

The seventh step will copy the zipped Vaults (there is one Vault per Tomcat, so in production we’ll package 20 different Vault, each with it’s own key, OTT and certificate) to the production machine and start them.

Overview delivery pipeline diagram 08

The eight step will unseal the Vaults with the key, the Vaults become usable.

Overview delivery pipeline diagram 09

The ninth step will start the Tomcat with to OTT, which will make it’s bootstrap checks and connect to the Vault.

Overview delivery pipeline diagram 10

Jenkins pipeline (groovy)

Because we need to connect to TPM, we need to store the connexion credentials somewhere. We store them in Jenkins Credentials, a secure secrets container that is accessible by the agents. Each environment has its own credentials so we can monitor the jobs and alert if weird behaviors are detected.

The credentials are get using withCredentials in the agent process, then passed to the Docker process in environment variable using withEnv (this is one of the only safe way of passing secrets between process, also with stdin), then used by the Ansible process.

dir('scripts/ansible') {
  withCredentials([
        usernamePassword(credentialsId: 'teamPasswordJenkinsUser',
        usernameVariable: 'teamPasswordJenkinsUsername',
        passwordVariable: 'teamPasswordJenkinsPassword')]) {
    withEnv([
        "TEAM_PASSWORD_USERNAME=${teamPasswordJenkinsUsername}",
        "TEAM_PASSWORD_PASSWORD=${teamPasswordJenkinsPassword}"]) {
      sh """
        docker pull ansible/ansible
        docker run -w /playbooks \
                   -e TEAM_PASSWORD_USERNAME \
                   -e TEAM_PASSWORD_PASSWORD \
                   -v \$(pwd):/playbooks \
                   ansible/ansible ansible-playbook ./tomcat-b2c.yml \
                   --inventory=./hosts \
                   --extra-vars='version=${scmHash} target=${conf.nginxEnvName}'
      """
    }
  }
}

Deploying vault with Ansible

Using Ansible is important to have robust deployment automation. It handles network failures, monitors the time taken by each steps and designed in a way that deploying 1000 machines is the same as deploying 1.

...
[09:31:13] vault - localhost - download from nexus ... | localhost | SUCCESS | 52ms
[09:31:14] vault - localhost - unarchive artifact ... | localhost | CHANGED | 916ms
[09:31:15] vault - localhost - fetch secrets ... | localhost | SUCCESS | 59ms
[09:32:23] vault - localhost - create vault ... | localhost | CHANGED | 5225ms
[09:32:28] vault - localhost - create sealed vault archive ... | localhost | CHANGED | 553ms
[09:32:28] vault - localhost - set vault_key, vault_token ... | localhost | SUCCESS | 120ms
...

Build infrastructure implications

Jenkins Credentials storage is used to connect to TPM. Each environment should have it’s own monitored credentials. Handle secrets in memory only. Ansible is executed in a Docker container with tmpfs volumes, since by default the Docker volumes are written to disk and the key and OTT would leak. Deployment time is limited with a lease of 10 minutes on the wrapped token. Past that time, the application cannot unwrap the token and will not start.

Initializing Vault (bash)

We’re using vault client in Bash to bootstrap the Vault server. Important parts are:

Initializing vault with 1 key to unseal
Creating a one-time token with wrapping and TTL
Revoking the root token and sealing the Vault

# Start vault with data and log directory (in the background)
nohup vault server -config="conf/install.json" &> logfile &
# (check for startup code omitted)

# Init vault with 1 key, save it and save the root token
STDOUT=$( vault operator init -key-shares=1 -key-threshold=1 -format="json" )
KEY=$( echo "$STDOUT" | jq --raw-output ".unseal_keys_b64[0]" )
TOKEN=$( echo "$STDOUT" | jq --raw-output ".root_token" )

# Unseal vault and authenticate with root token (without ~/.vault-token file)
vault operator unseal "$KEY"
vault login -no-store "$TOKEN"

# setting vault token for next operations
export VAULT_TOKEN="$TOKEN"

# Add read only policy (dev, stage, etc.) for environment to vault
vault policy write "${ENVIRONMENT}" "policies/${ENVIRONMENT}.hcl"

# Mount a new key-value store for the environment (/dev, /stage, etc.)
vault secrets enable -path="${ENVIRONMENT}" kv

# Creates a wrapped token (needs to be unwraped before use), TTL is 10 minutes
vault token create \
-orphan \
-renewable="true" \
-policy="${ENVIRONMENT}" \
-ttl="${TOKEN_TTL}" \
-period="${TOKEN_TTL}" \
-wrap-ttl="${TOKEN_TTL}" \
-format="json" \
| jq --raw-output .wrap_info.token \
> wrap

# Write secrets in vault
# ... (for each key)
value=$( echo "${line}" | cut -d '=' -f 2- | sed 's/^@/\\@/g' )
vault write "${ENVIRONMENT}/${key}" "value=${value}"

# Enable audit log with syslog output (then goes to datadog)
vault audit enable syslog tag="vault-${ENVIRONMENT}-${INSTANCE}"

# Revoke root token, only otp tokens remains
vault token revoke "$TOKEN"

# Seal and close the vault
kill $( cat ${BUILD_DIR}/pid/vault.pid )

Querying Vault at runtime (Java)

Vault server has a simple REST API to query data. You can use it directly in Java or use a library like spring-vault (from Spring) or vault-java-driver (from BetterCloud)

JVM bootstrap vault client

The important part in bootstrapping the connexion between the JVM and Vault is to check the authenticity of the wrapped token. That is done by calling lookupWrap on the wrapped token and then by checking the creation path is equal to “auth/token/create”. That alone guaranties the token wasn’t forged by an attacker (the attacker doesn’t have authorization to create tokens on the Vault but it is recommended to test anyway).

You can then proceed call the unwrap method with the wrap token, then recreate a second connexion with the session token that will be kept in memory.

LOG.info("Vault using certificate");
SslConfig sslConfig = new SslConfig().pemFile(CERTIFICATE_PATH.toFile())
                                     .build();

LOG.info("Vault response wrapping token validation (wrapToken)");
Vault vault = new Vault(new VaultConfig().sslConfig(sslConfig)
                                               .address(ADDRESS)
                                               .token(wrapToken)
                                               .build());
LogicalResponse lookup = vault.auth().lookupWrap();
String creationPath = lookup.getData().get("creation_path");
if (!"auth/token/create".equals(creationPath)) {
  throw logAndThrow("vault wrong wrapping token path '"
                        + creationPath + "' token might be forged");
}

LOG.info("Vault unwrapping login token");
String sessionToken = vault.auth().unwrap().getAuthClientToken();

LOG.info("Vault starting secure connexion (sessionToken)");
return new Vault(new VaultConfig().sslConfig(sslConfig)
                                        .address(ADDRESS)
                                        .token(sessionToken)
                                        .build());

Flow diagram of wrap token lease

This is an overview of the calls between the JVM and Vault, from JVM bootstrap to reading the information.

The first step is calling /sys/wrapping/lookup with the wrapped token, it is described in the previous code snippet.

Vault token wrap lease diagram 01

The second step is calling /sys/wrapping/unwrap with the wrapped token, which must be done maximum 10 minutes after the creation of the wrapped token (or else the lease will have expired).

Vault token wrap lease diagram 02

The third step is calling /auth/token/renew-self at least once a minute. It is doesn’t happen on time, the lease of the session token will expire and the authentication is not possible anymore on the Vault.

Vault token wrap lease diagram 03

The fourth step is calling /v1/prod/mysecret to get the value of the “mysecret” secret in the “prod” environment. The v1 at the start denotes the secret version, see Versioned Key/Value Secret Engine for information on how versioned secrets work (we don’t use it).

Vault token wrap lease diagram 04

No more secrets in code!

Tada! The first example of this article is free of secrets, goal achieved. Note that we do not assign the secret value to a variable anymore, this is a good practice to reduce secret leakage.

public class ClientPasswordCallback implements CallbackHandler {

  private static final VaultService VAULT = CoreServiceFactory.getInstance().getVaultClient();

  @Override
  public void handle(Callback[] callbacks) {
    final WSPasswordCallback pc = (WSPasswordCallback) callbacks[0];
    if (VAULT.getSecret("insurer_username").equals(pc.getIdentifier())) {
      pc.setPassword(VAULT.getSecret("insurer_password"));
    }
  }

}

Java design implications

Read OWASP Secure Coding Practices and make sure it is known in the development team. A secure system needs a secure codebase.

Java isn’t a secure language but for our use case using short lived secrets (stack memory, not heap memory) is a good start.

Using a security static code analysis tool like Checkmarx is also recommended.

Operations

Performance / Scalability

Using Vault decentralized makes it easier to manage and performance is not an issue if each JVM has it own Vault. It’s also easier to scale by adding new Vaults and more resilient to network failures

Monitoring / Alert plan

Along your monitoring, it’s very important to know how to react when you have an alert. At the very least a list of developers should be notified in case of an error in the secret system.

Infrastructure (Datadog): monitor CPU usage, memory and process number (each tomcat instance has a Vault instance), this is classic stuff
Logs (Datadog): alert on any ERROR log from the Vault service (potential security breach), this is very important

Audit

The audit log is important for two separate reasons: in case of a breach you can diagnose what happened and you can detect a potential breach.

In an audit log, for each operation on Vault, a new entry is added containing information about the request and the response.

vault audit enable syslog tag="vault-${ENVIRONMENT}-${INSTANCE}"

Breach detection example

You know what requests (local only) your application does, then you can alert on anything else. Here the “rewew-self” path on the left is quite normal, but no so much on the right with “funky/looking/url” (or the address that should always be localhost).

{
  "time": "2019-03-19T13:51:46",
    "auth": {
      "client_token": "hmac-sha256:...",
      ...
    },
    "request": {
      "id": "...",
      "operation": "update",
      "client_token": "hmac-sha256:...",
      "path": "auth/token/renew-self",
      "remote_address": "127.0.0.1",
      ...
    },
    "response": {
      ...
    },
}

{
  "time": "2019-03-19T13:51:46",
    "auth": {
      "client_token": "hmac-sha256:...",
      ...
    },
    "request": {
      "id": "...",
      "operation": "update",
      "client_token": "hmac-sha256:...",
      "path": "funky/looking/url",
      "remote_address": "123.0.0.0",
      ...
    },
    "response": {
      ...
    },
}

Experience feedback & lessons learned

Security is hard

The harder you make it for an attacker, the harder it is for you to use your own system.

Secrets migration

True Story: all the insurer that had a “@” in their password crashed when we first deployed to production because of a failed char escape When migrating, how do you test?

Testing and mocking

Access to the secrets is different now, the most important part is being able to reproduce the production environment with proper authorization. You will also need to:

Automate migration: frown upon copy / paste
Refactor tests: to test on the secret key, not the value
Mock Vault: and keep development values in code

Build all the things

Meme build all the things

True Story: the SSL certificate for the Tomcat to Vault connexion expired.

Automate everything: even certificate renewal, self signed certificates are easy to produce at each deploy for each Tomcat and Vault pair
Small jobs: to redeploy small parts for emergencies

Human challenge

That’s a lot of changes for the development team, you need to make sure that:

they understand the new process,
they have the right tools to work,
they understand the system as a whole.

Experience tells us it’s easier to migrate the system part by part, so the teams can adapt progressively.

In retrospective: is it a good specific solution?

Remember our goals, mainly: disposable infrastructure, continuous delivery, version migration, reduced operation, performance (speed and network).

In retrospective: disadvantages

Complex solution: compared to a single Vault, this is more complicated to implement, but easier to maintain.
Requires strong automation: we had to port old Bash deployment to Ansible, but it is a healthy approach that benefits the whole system.
Impossible application restart: this is disposable infrastructure, it is not a problem if redeployment is fast.

In retrospective: advantages

Continuous deployment and disposable infrastructure: easier to replace than migrate
DevOps: no additional infrastructure, less work for the operations team, and more freedom for the devs
We have no network failures, no migration, excellent performance and easy staging for new environment.

Conclusion

Wrapping up

Security is hard! But it gets easier if your solution fits your system and your process. Make sure you keep your goals and risks in mind when you design your solution and choose your tools.