Health endpoints

mcpgw exposes two endpoints for orchestrators. They serve distinct purposes — do not point both your liveness and readiness probes at the same one.

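| Endpoint | Probe type | Returns 503 when… |
|----------|------------|-------------------|
| /healthz | Liveness | Never (the only failure mode is a dead or hung process that does not answer at all) |
| /readyz | Readiness | License is invalid or expired beyond the grace window |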

/healthz

GET /healthz
| Response | Meaning |
|----------|---------|
| 200 OK, body ok | Process is responsive |
| No response | Process is dead, hung, or the network is broken; your liveness probe should restart it |

Use as your liveness probe. Kubernetes example:

livenessProbe:
  httpGet:
    path: /healthz
    port: 7332
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

/healthz deliberately does not check the license, the upstreams, or whether the audit file is writable. Its only job is to prove the process is responsive. Coupling liveness to license state would turn a slow license-renewal incident into a thundering-herd restart, which only makes the incident worse.


/readyz

GET /readyz
| Response | Meaning |
|----------|---------|
| 200 OK, body ready | License is valid (within exp + grace_days) |
| 503 Service Unavailable, body license_expired | License is expired beyond the grace window |
| 503 Service Unavailable, body license_invalid | Signature or claim validation failed |

Use as your readiness probe. When /readyz returns 503, the orchestrator should remove the pod from the LB rotation but not restart it.

readinessProbe:
  httpGet:
    path: /readyz
    port: 7332
  initialDelaySeconds: 2
  periodSeconds: 30
  failureThreshold: 2

/readyz does not check upstream health. An upstream MCP server being down does not make the gateway “not ready” — the gateway can still receive requests and return useful errors (502 upstream_unreachable). Coupling readiness to upstream health collapses both layers’ availability into one number.
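The /readyz responses listed above can also drive a check that runs outside the orchestrator, such as a smoke test or an alerting script. The following is a minimal Python sketch, not part of mcpgw: the function names and the default base URL are hypothetical, and it assumes the documented bodies (ready, license_expired, license_invalid) arrive as plain text.

```python
import urllib.error
import urllib.request


def classify_readyz(status: int, body: str) -> str:
    """Map a /readyz response to a state name, following the response table."""
    token = body.strip()
    if status == 200:
        return "ready"
    if status == 503 and token in ("license_expired", "license_invalid"):
        return token
    # Anything else is not documented for /readyz; report the status as-is.
    return f"unexpected:{status}"


def check_readyz(base_url: str = "http://localhost:7332", timeout: float = 2.0) -> str:
    """Fetch /readyz and classify the result. base_url is an assumed address."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return classify_readyz(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the 503 body is still readable from the error.
        return classify_readyz(err.code, err.read().decode("utf-8", "replace"))
    except OSError:
        # Connection refused, timeout, DNS failure: no HTTP answer at all.
        return "unreachable"
```

Keeping classify_readyz separate from the network call makes the mapping easy to test, and keeps "unreachable" (no answer) distinct from the two documented 503 states.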


Why split these?

The Kubernetes-style liveness/readiness split exists because the two probes map to different remediations.

  • Liveness fail = restart me. The process is broken and only a restart will fix it.
  • Readiness fail = remove me from rotation. The process is fine but currently can’t serve traffic correctly.

License expiry is a readiness failure: you do not want the orchestrator to restart-loop the binary while you are renewing the JWT. A hung process is a liveness failure: restart it immediately.
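The mapping from probe outcome to remediation can be sketched as a small decision function. This illustrates the orchestrator's side of the contract and is not mcpgw code; the names Action and remediation are hypothetical.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE = "keep in rotation"
    REMOVE_FROM_ROTATION = "remove from rotation, do not restart"
    RESTART = "restart"


def remediation(healthz_answered: bool, readyz_status: Optional[int]) -> Action:
    """Pick a remediation from one round of probe results.

    healthz_answered: whether /healthz responded before the probe timeout.
    readyz_status: HTTP status from /readyz, or None if it did not respond.
    """
    if not healthz_answered:
        # Liveness failure: the process is dead or hung, only a restart helps.
        return Action.RESTART
    if readyz_status == 200:
        return Action.SERVE
    # Readiness failure (for example 503 license_expired): the process is
    # fine, so restarting it would not fix anything.
    return Action.REMOVE_FROM_ROTATION
```

A real orchestrator acts only after several consecutive failures (failureThreshold in the Kubernetes examples), so this function describes a single decision, not the full probe loop.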

If your orchestrator only supports one probe type, use /readyz. The downside is slightly slower recovery from process hangs (one fewer fast-fail signal); the upside is correct behavior on license expiry.
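To put a rough number on the slower recovery: a probe that runs every periodSeconds and needs failureThreshold consecutive failures notices a hang after at most about periodSeconds × failureThreshold seconds, ignoring the per-probe timeout. The sketch below applies that approximation to the probe settings shown earlier on this page, assuming the single probe keeps the readiness timings. The function name is hypothetical.

```python
def max_detection_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate upper bound on the time to notice a hung process.

    Ignores the per-probe timeout, so real detection can take slightly longer.
    """
    return period_seconds * failure_threshold


# Settings from the livenessProbe example: every 10 s, 3 failures.
liveness = max_detection_seconds(period_seconds=10, failure_threshold=3)
# Settings from the readinessProbe example: every 30 s, 2 failures.
readiness = max_detection_seconds(period_seconds=30, failure_threshold=2)

print(f"liveness settings: about {liveness} s")    # about 30 s
print(f"readiness settings: about {readiness} s")  # about 60 s
```

If the gap matters, a shorter period on the single probe narrows it, at the cost of more health-check traffic.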


Fronting /readyz with a load balancer

LB health checks behave like readiness probes: a failing backend is taken out of rotation, not restarted. Point them at /readyz. If you need the LB to detect process death faster, add a TCP-only check on port 7332.
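A TCP-only check amounts to opening a connection to the port and treating success as "something is still listening". Most load balancers have this built in; the Python sketch below only illustrates what such a check does. The function name and default timeout are assumptions.

```python
import socket


def tcp_check(host: str, port: int = 7332, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: nothing is accepting connections.
        return False
```

Because it sends no HTTP request, this check is cheap enough to run at a short interval and quickly catches a dead process or closed port. It says nothing about license state, and it may not notice a hung process whose socket is still listening, which is why it supplements the /readyz check and does not replace it.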

/readyz is excluded from rate-limit identity and policy evaluation. It does not appear in the audit log or in OTLP spans. This is intentional — health-check traffic should not pollute observability data.