How licensing fails open and fails closed

mcpgw uses Ed25519-signed JWTs for licensing. The verification is entirely offline. There is no live revocation API, no phone-home, no analytics ping. This document explains the choices behind that posture and what they mean operationally.


The constraint

We wanted three properties simultaneously:

  1. mcpgw must run in environments with no outbound internet. Air-gapped, VPC-internal, on-prem regulated networks. Phone-home licensing would block these deployments.
  2. A misbehaving license server must not break customers. Any centralized check is a new SPOF for every running gateway. We don’t want our outage to become your outage.
  3. The license must be cheap to operate. Customers should be able to swap a JWT in seconds, not file a support ticket.

The cleanest design that satisfies all three is offline JWT verification with a baked-in public key.


How it works

The release binary is built with a -ldflags injection that bakes a hex-encoded 32-byte Ed25519 public key into the binary; the gateway hex-decodes it when it loads. At startup, the gateway:

  1. Reads the JWT from license.path.
  2. Verifies the signature against the baked-in public key.
  3. Validates aud == "mcpgw", nbf <= now+60s, and exp + grace_days >= now.

That’s it. No network calls. The same code path runs at startup, on SIGHUP, and on the hourly recheck.
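The signature step can be sketched with the Go standard library alone. This is a minimal illustration, not mcpgw's actual code (which uses golang-jwt/jwt/v5): the variable name publicKeyHex, the -X flag shown in the comment, and the helper names are all assumptions, and demoToken exists only so the sketch can be exercised end to end.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// publicKeyHex is assumed to be set at build time, for example:
//   go build -ldflags "-X main.publicKeyHex=<64 hex characters>"
// The variable name is hypothetical.
var publicKeyHex string

// loadPublicKey hex-decodes the baked-in key and checks that it is 32 bytes.
func loadPublicKey(hexKey string) (ed25519.PublicKey, error) {
	raw, err := hex.DecodeString(hexKey)
	if err != nil {
		return nil, fmt.Errorf("decode public key: %w", err)
	}
	if len(raw) != ed25519.PublicKeySize {
		return nil, fmt.Errorf("public key is %d bytes, want %d", len(raw), ed25519.PublicKeySize)
	}
	return ed25519.PublicKey(raw), nil
}

// verifyCompactJWS checks a header.payload.signature token against pub
// and returns the raw payload JSON. No network access is involved.
func verifyCompactJWS(token string, pub ed25519.PublicKey) ([]byte, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token must have three dot-separated parts")
	}
	headerJSON, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return nil, fmt.Errorf("decode header: %w", err)
	}
	var header struct {
		Alg string `json:"alg"`
	}
	if err := json.Unmarshal(headerJSON, &header); err != nil {
		return nil, fmt.Errorf("parse header: %w", err)
	}
	if header.Alg != "EdDSA" { // pin the algorithm; this also rejects alg "none"
		return nil, fmt.Errorf("unexpected alg %q", header.Alg)
	}
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return nil, fmt.Errorf("decode signature: %w", err)
	}
	if !ed25519.Verify(pub, []byte(parts[0]+"."+parts[1]), sig) {
		return nil, errors.New("signature verification failed")
	}
	return base64.RawURLEncoding.DecodeString(parts[1])
}

// demoToken signs payload with a throwaway key. Demo only: in production the
// private key never leaves the operator's secret manager.
func demoToken(payload string) (token, pubHex string) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	enc := base64.RawURLEncoding.EncodeToString
	signed := enc([]byte(`{"alg":"EdDSA","typ":"JWT"}`)) + "." + enc([]byte(payload))
	return signed + "." + enc(ed25519.Sign(priv, []byte(signed))), hex.EncodeToString(pub)
}

func main() {
	token, pubHex := demoToken(`{"aud":"mcpgw"}`)
	pub, err := loadPublicKey(pubHex)
	if err != nil {
		panic(err)
	}
	payload, err := verifyCompactJWS(token, pub)
	fmt.Println(string(payload), err)
}
```

Checking the key length before calling ed25519.Verify matters because Verify panics on a public key of the wrong size.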

The private key for signing lives in the mcpgw operator’s secret manager. mcpgw never sees it. Customers receive only the JWT.


Fail-open vs fail-closed

The interesting design choice is what happens when exp is in the past.

Fail-closed immediately at exp

Pros: clear contract, no ambiguity. Customers know to renew. Cons: any small operational mishap (forgot to rotate, JWT not yet propagated to all replicas, clock skew) takes the gateway down. License renewal becomes an emergency rather than a chore.

Fail-closed only after exp + grace_days

Pros: a week-or-month-long buffer for renewal. The gateway logs warnings during the grace period; operators have time to react before traffic is impacted. Cons: a customer who never renews keeps running for grace_days past their entitlement.

We chose fail-open during grace, fail-closed after. The argument:

  • The cost of failing closed too aggressively is “every customer suffers an outage from a clerical error.” That cost is borne by customers.
  • The cost of failing open during grace is “a small number of customers run for an extra month past their entitlement.” That cost is borne by us.
  • The first cost is product-quality damage. The second is collectible revenue.

We’re happy to bear the smaller cost.

The default grace is 30 days. It can be overridden per license via the grace_days claim. Compliance-sensitive customers can set it to 0 (immediate fail-closed) if they need that posture; everyone else gets the safety margin.
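As a rough sketch of how the claim checks and the grace window combine, the function below classifies a license at a given time. The struct, field names, and state names are hypothetical; times are Unix seconds, aud is assumed to be a single string, and the handling of the exact instant exp + grace_days is a guess.

```go
package main

import (
	"errors"
	"fmt"
)

type LicenseState int

const (
	Valid   LicenseState = iota // now < exp
	Grace                       // exp <= now < exp + grace_days
	Invalid                     // failed a claim check or expired beyond grace
)

const (
	defaultGraceDays = 30
	nbfSkewSeconds   = 60
	secondsPerDay    = 86400
)

// Claims holds the fields the gateway validates. GraceDays is a pointer so an
// absent claim (use the default) differs from an explicit 0 (no grace).
type Claims struct {
	Aud       string `json:"aud"`
	Nbf       int64  `json:"nbf"`
	Exp       int64  `json:"exp"`
	GraceDays *int64 `json:"grace_days"`
}

// evaluate applies the audience, not-before, and expiry-plus-grace checks.
func evaluate(c Claims, now int64) (LicenseState, error) {
	if c.Aud != "mcpgw" {
		return Invalid, errors.New("wrong audience")
	}
	if c.Nbf > now+nbfSkewSeconds { // tolerate 60s of clock skew
		return Invalid, errors.New("license not yet valid")
	}
	grace := int64(defaultGraceDays)
	if c.GraceDays != nil {
		grace = *c.GraceDays
	}
	switch {
	case now < c.Exp:
		return Valid, nil
	case now < c.Exp+grace*secondsPerDay:
		return Grace, nil // fail open, but the caller should log a warning
	default:
		return Invalid, errors.New("license expired beyond grace")
	}
}

func main() {
	exp := int64(1_700_000_000)
	state, err := evaluate(Claims{Aud: "mcpgw", Exp: exp}, exp+10*secondsPerDay)
	fmt.Println(state == Grace, err)
}
```

With grace_days set to 0 the Grace branch can never match, so the license goes straight from Valid to Invalid at exp.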

Behavior summary

  State                                  /healthz  /readyz  /mcp                 Logs               Span attr
  Valid (now < exp)                      200       200      normal               quiet              -
  Grace (exp <= now < exp + grace_days)  200       200      normal               warn every minute  mcp.license.grace = true
  Expired beyond grace                   200       503      503 license_invalid  warn every minute  -

/healthz stays 200 even when expired, so your orchestrator does not aggressively restart-loop a binary that can’t help. /readyz returning 503 removes the pod from the LB rotation. The gateway keeps accepting connections in this state and answers /mcp requests immediately with 503 license_invalid, so callers and health checks see an explicit, consistent failure.
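The probe behavior can be sketched as a small mapping from license state to HTTP status. The type, function names, and mux wiring below are illustrative assumptions, not mcpgw's actual handlers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type LicenseState int

const (
	Valid LicenseState = iota
	Grace
	ExpiredBeyondGrace
)

// probeStatus returns the status code for the two probe endpoints.
func probeStatus(state LicenseState, path string) int {
	switch path {
	case "/healthz":
		return http.StatusOK // always alive: a restart cannot fix a license
	case "/readyz":
		if state == ExpiredBeyondGrace {
			return http.StatusServiceUnavailable // drop out of LB rotation
		}
		return http.StatusOK
	default:
		return http.StatusNotFound
	}
}

// mcpAllowed reports whether /mcp traffic should reach the real handler.
func mcpAllowed(state LicenseState) bool {
	return state != ExpiredBeyondGrace
}

// newMux wires the probes and gates the MCP handler on the current state.
func newMux(current func() LicenseState, mcp http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	for _, p := range []string{"/healthz", "/readyz"} {
		p := p // capture per iteration for older Go versions
		mux.HandleFunc(p, func(w http.ResponseWriter, r *http.Request) {
			w.WriteHeader(probeStatus(current(), p))
		})
	}
	mux.HandleFunc("/mcp", func(w http.ResponseWriter, r *http.Request) {
		if !mcpAllowed(current()) {
			http.Error(w, "license_invalid", http.StatusServiceUnavailable)
			return
		}
		mcp.ServeHTTP(w, r)
	})
	return mux
}

func main() {
	mux := newMux(func() LicenseState { return ExpiredBeyondGrace }, http.NotFoundHandler())
	for _, path := range []string{"/healthz", "/readyz", "/mcp"} {
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
		fmt.Println(path, rec.Code)
	}
}
```

Reading the state through a function rather than a fixed value lets a SIGHUP or hourly recheck change the answer without rebuilding the mux.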


What revocation looks like

There is no online revocation. To revoke a license:

  1. Issue all production licenses with short exp windows (e.g. 90 days).
  2. Establish a regular rotation cadence (e.g. every 60 days).
  3. To revoke, simply do not re-issue. The gateway will fail-closed at exp + grace_days.

For faster revocation, reduce grace_days (per-license) and shorten exp (at issuance).
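A small worked calculation may help when picking these values. The function below is a hypothetical helper, assuming the customer always holds the most recently issued JWT and that the decision to stop re-issuing can fall anywhere within a rotation interval.

```go
package main

import "fmt"

// revocationWindowDays estimates how long a gateway keeps running after the
// issuer decides to stop re-issuing. The newest JWT in the customer's hands
// was issued between 0 and rotationDays ago, so its remaining lifetime is
// between lifetimeDays-rotationDays and lifetimeDays; grace is added on top.
func revocationWindowDays(lifetimeDays, rotationDays, graceDays int) (shortest, longest int) {
	longest = lifetimeDays + graceDays
	shortest = longest - rotationDays
	if shortest < 0 {
		shortest = 0
	}
	return shortest, longest
}

func main() {
	s, l := revocationWindowDays(90, 60, 30)
	fmt.Println(s, l) // prints: 60 120

	s, l = revocationWindowDays(7, 1, 0)
	fmt.Println(s, l) // prints: 6 7
}
```

The upper bound depends only on the license lifetime and grace_days, which is why shortening exp and reducing grace_days are the two levers.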

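The trade-off is real: a long-issued license cannot be revoked before its exp + grace_days. The mitigation is to issue short licenses. We recommend automated rotation, but note that the revocation window is set by exp and grace_days, not by the rotation cadence: a 90-day license rotated weekly can still run for up to 90 days plus grace after its last re-issue. For a maximum revocation window of about a week, issue licenses that expire about a week out, rotate them automatically more often than that, and set grace_days to 0.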

A live revocation API would solve this faster. We don’t ship one because:

  • It would phone home, breaking constraint #1 (air-gapped deployments).
  • It would create a new SPOF, breaking constraint #2.
  • It would be a new attack surface — the revocation API is interesting to attackers because compromising it is functionally equivalent to issuing licenses.

The right way to think about this: mcpgw’s licensing is eventual revocation with operator-tunable urgency. If you need immediate kill-switch behavior, you build it at a different layer (firewall rule, network ACL, removing the license file).


Why a single baked-in public key

Each release binary embeds exactly one public key. Re-keying requires a new release.

The alternative — accept any of N keys, or fetch the key set at startup — has worse failure modes:

  • Multi-key acceptance means a compromised private key extends its damage until every binary is replaced. Single-key + binary release is a clear cut: the compromised key affects only releases that included it.
  • Fetching the key set introduces a runtime dependency on a key-distribution endpoint, defeating constraint #1 (offline operation).

The downside is that a key compromise is a release event. We’re willing to do that — releases are cheap, and the security argument is strong.


Why JWT specifically

JWT is not a great format. It has well-known footguns (alg: none, key confusion, JKU header tricks). We use it anyway because:

  • The footguns are well-known, which means there are well-tested libraries that defend against them. We use golang-jwt/jwt/v5 with WithValidMethods([]string{"EdDSA"}) and WithoutClaimsValidation() (we do explicit claim validation ourselves).
  • Operators understand JWTs. They know how to inspect them with jq. They know how to roll them.
  • The format is portable. The same JWT works for offline verification, for any future online checks if we ever add them, and for any external auditing tooling.

The alternative we considered was a custom binary format. It would have been simpler to verify but harder for operators to inspect, and we judged inspectability more valuable than format minimalism.
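To illustrate the inspectability point: a JWT payload is base64url-encoded JSON, so reading the claims takes a few lines in any language. The Go sketch below decodes the payload without verifying the signature, which is fine for inspection and never for authorization. The helper names and the claim values in the demo are made up.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// inspectClaims decodes the payload segment of a compact JWT.
// It does NOT verify the signature; use it for inspection only.
func inspectClaims(token string) (map[string]any, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token must have three dot-separated parts")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, fmt.Errorf("decode payload: %w", err)
	}
	var claims map[string]any
	if err := json.Unmarshal(raw, &claims); err != nil {
		return nil, fmt.Errorf("parse payload: %w", err)
	}
	return claims, nil
}

// demoUnsignedToken builds a token-shaped string with a placeholder
// signature so the sketch can run without a real license.
func demoUnsignedToken(payload string) string {
	enc := base64.RawURLEncoding.EncodeToString
	return enc([]byte(`{"alg":"EdDSA","typ":"JWT"}`)) + "." + enc([]byte(payload)) + ".unsigned"
}

func main() {
	token := demoUnsignedToken(`{"aud":"mcpgw","sub":"tenant-a","grace_days":30}`)
	claims, err := inspectClaims(token)
	if err != nil {
		panic(err)
	}
	fmt.Println(claims["sub"], claims["grace_days"])
}
```

Numeric claims decode as float64 when unmarshalled into a generic map, which is worth remembering when comparing exp or grace_days in tooling.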


What you should not do

Three anti-patterns we’ve seen in early customer deployments:

  • Don’t share a single JWT across many tenants. Each tenant should get its own JWT, even when several tenants belong to one organization. The sub claim is your audit identity; collapsing tenants makes audit useless.
  • Don’t commit the JWT to source control. The JWT is bearer-equivalent. Anyone with the file can run a gateway as your tenant. Treat it like an API key — secrets manager, mounted at runtime, never in repos.
  • Don’t disable readiness probes “because the license keeps expiring.” If the license keeps expiring, the operational fix is automated rotation, not silenced probes. The probes are doing their job.