Health endpoints
mcpgw exposes two endpoints for orchestrators. They serve distinct purposes — do not point both your liveness and readiness probes at the same one.
| Endpoint | Probe type | Returns 503 when… |
|---|---|---|
| /healthz | Liveness | Never (the only failure mode is no response at all, from a dead or hung process) |
| /readyz | Readiness | License is invalid or expired beyond grace |
/healthz
GET /healthz
| Response | Meaning |
|---|---|
| 200 OK, body ok | Process is responsive |
| no response | Process is dead, hung, or the network is broken; your liveness probe should restart the process |
Use as your liveness probe. Kubernetes example:
livenessProbe:
  httpGet:
    path: /healthz
    port: 7332
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
/healthz deliberately does not check the license, the upstreams, or whether the audit file is writable. Its only job is to prove the process is responsive. Coupling liveness to license state turns a slow license-renewal incident into a thundering-herd restart that makes everything worse.
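To make the probe settings above concrete, here is a minimal sketch of how an orchestrator turns individual /healthz results into a restart decision. It mirrors the Kubernetes rule that a probe must fail `failureThreshold` times in a row. The function name `needs_restart` is hypothetical and is not part of mcpgw or Kubernetes.

```python
from typing import Iterable


def needs_restart(probe_results: Iterable[bool], failure_threshold: int = 3) -> bool:
    """Return True once `failure_threshold` consecutive probes have failed.

    Each item is True if /healthz answered 200 in time, and False if the
    probe timed out, was refused, or got any other answer.
    """
    streak = 0
    for ok in probe_results:
        # A single success resets the count; only consecutive failures add up.
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return True
    return False
```

With periodSeconds: 10 and failureThreshold: 3, a hang is acted on after at most about 30 seconds plus the probe timeout. An isolated failed probe between successes never triggers a restart.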
/readyz
GET /readyz
| Response | Meaning |
|---|---|
| 200 OK, body ready | License is valid (within exp + grace_days) |
| 503 Service Unavailable, body license_expired | License is expired beyond the grace window |
| 503 Service Unavailable, body license_invalid | Signature or claim validation failed |
Use as your readiness probe. When /readyz returns 503, the orchestrator should remove the pod from the LB rotation but not restart it.
readinessProbe:
  httpGet:
    path: /readyz
    port: 7332
  initialDelaySeconds: 2
  periodSeconds: 30
  failureThreshold: 2
/readyz does not check upstream health. An upstream MCP server being down does not make the gateway “not ready” — the gateway can still receive requests and return useful errors (502 upstream_unreachable). Coupling readiness to upstream health collapses both layers’ availability into one number.
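The /readyz response table can be read as a small decision function. The sketch below is illustrative only: mcpgw's actual check is not shown here, and the name `readyz_response`, the `validation_ok` flag, whole-day `grace_days`, and the order of the two checks are all assumptions. Upstream health is deliberately absent from the inputs.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def readyz_response(
    validation_ok: bool,
    exp: datetime,
    grace_days: int,
    now: datetime,
) -> Tuple[int, str]:
    """Map license state to the (status, body) pairs listed for /readyz.

    Hypothetical helper, not mcpgw's implementation. Upstream state is
    not a parameter, so a dead upstream cannot change the result.
    """
    # Assumption: a failed signature or claim check wins over expiry.
    if not validation_ok:
        return 503, "license_invalid"
    # Assumption: the instant exp + grace_days itself still counts as valid.
    if now > exp + timedelta(days=grace_days):
        return 503, "license_expired"
    return 200, "ready"


exp = datetime(2025, 1, 1, tzinfo=timezone.utc)
inside_grace = datetime(2025, 1, 5, tzinfo=timezone.utc)
print(readyz_response(True, exp, 7, inside_grace))  # (200, 'ready')
```

A license that is past exp but still inside the grace window keeps the pod in rotation, which is what gives you time to renew.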
Why split these?
The Kubernetes-style liveness/readiness split exists because the two probes correspond to different remediations.
- Liveness fail = restart me. The process is broken and only a restart will fix it.
- Readiness fail = remove me from rotation. The process is fine but currently can’t serve traffic correctly.
License expiry is a readiness failure: you do not want the orchestrator to restart-loop a binary while you are renewing the JWT. A hung process is a liveness failure: restart immediately.
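The mapping from probe results to remediation can be sketched as follows. This is a simplified model, assuming each probe result is either an HTTP status code or `None` for no response. `Action` and `remediation` are hypothetical names, and a real orchestrator would apply its failure thresholds before acting.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE = "serve"
    REMOVE_FROM_ROTATION = "remove_from_rotation"
    RESTART = "restart"


def remediation(healthz_status: Optional[int], readyz_status: Optional[int]) -> Action:
    """Pick a remediation from the latest /healthz and /readyz results.

    Each argument is an HTTP status code, or None if the probe got no
    response at all.
    """
    # Liveness comes first: an unresponsive process cannot be fixed by
    # taking it out of rotation.
    if healthz_status != 200:
        return Action.RESTART
    # The process is alive but should not receive traffic, for example
    # because the license has expired beyond grace.
    if readyz_status != 200:
        return Action.REMOVE_FROM_ROTATION
    return Action.SERVE
```

For example, `remediation(200, 503)` yields `Action.REMOVE_FROM_ROTATION`: the pod stays up while the license is renewed, and it rejoins the rotation once /readyz returns 200 again.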
If your orchestrator only supports one probe type, use /readyz. The downside is slightly slower recovery from process hangs (one fewer fast-fail signal); the upside is correct behavior on license expiry.
Fronting /readyz with a load balancer
LB health checks are typically configured as readiness probes. Point them at /readyz. If you need the LB to detect process death faster, use a TCP-only check on :7332 in addition.
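For reference, a TCP-only check amounts to opening and closing a connection, as in the sketch below. The helper name `tcp_alive` and the one-second default timeout are assumptions; most load balancers provide this check natively, so you would rarely write it yourself.

```python
import socket


def tcp_alive(host: str, port: int = 7332, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connect only shows that something is listening on the port. It says nothing about license state, and a hung process may still accept connections, so treat it as a supplement to /readyz rather than a replacement.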
/readyz is excluded from rate-limit identity and policy evaluation. It does not appear in the audit log or in OTLP spans. This is intentional — health-check traffic should not pollute observability data.
Related
- CLI reference: server — exit codes and signals
- License JWT reference — what makes a license invalid