Commit c527bf1

wip
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
1 parent 9fa9dc5 commit c527bf1

8 files changed, with 44 additions and 91 deletions

docs/content/en/docs/documentation/operations/health-probes.md

Lines changed: 25 additions & 31 deletions
@@ -17,72 +17,69 @@ API.
 | `isStarted()` | `true` once the operator and all its controllers have fully started |
 | `allEventSourcesAreHealthy()` | `true` when every registered event source (informers, polling sources, etc.) reports a healthy status |
 | `unhealthyEventSources()` | returns a map of controller name → unhealthy event sources, useful for diagnostics |
+| `unhealthyInformerWrappingEventSourceHealthIndicator()` | returns a map of controller name → unhealthy informer-wrapping event sources, each exposing per-informer details via `InformerHealthIndicator` (`hasSynced()`, `isWatching()`, `isRunning()`, `getTargetNamespace()`) |

-These map naturally to Kubernetes probes:
+In most cases a single readiness probe backed by `allEventSourcesAreHealthy()` is sufficient: before the
+operator has fully started the informers will not have synced yet, so the check naturally covers the startup
+case as well. Once running, it detects runtime degradation such as a lost watch connection.

-- **Startup probe** `isStarted()` — fails until all informers have synced and the operator is ready to
-  reconcile.
-- **Readiness probe** `allEventSourcesAreHealthy()` — fails if an informer loses its watch connection
-  or any event source reports an unhealthy status.
+### Fine-Grained Informer Diagnostics

-## Setting Up Probe Endpoints
+For advanced use cases — such as exposing per-informer health in a diagnostic endpoint or logging which
+specific namespace lost its watch — `unhealthyInformerWrappingEventSourceHealthIndicator()` gives access to
+individual `InformerHealthIndicator` instances. Each indicator exposes `hasSynced()`, `isWatching()`,
+`isRunning()`, and `getTargetNamespace()`. This is typically not needed for a standard health probe but can
+be valuable for operational dashboards or troubleshooting.

-The example below uses [Jetty](https://eclipse.dev/jetty/) to expose health probe endpoints. Any HTTP
+## Setting Up a Probe Endpoint
+
+The example below uses [Jetty](https://eclipse.dev/jetty/) to expose a `/healthz` endpoint. Any HTTP
 server library works — the key is calling the `RuntimeInfo` methods to determine the response code.

 ```java
 import org.eclipse.jetty.server.Server;
 import org.eclipse.jetty.server.handler.ContextHandler;
-import org.eclipse.jetty.server.handler.ContextHandlerCollection;

 Operator operator = new Operator();
 operator.register(new MyReconciler());
-operator.start();

-var startup = new ContextHandler(new StartupHandler(operator), "/startup");
-var readiness = new ContextHandler(new ReadinessHandler(operator), "/ready");
+// start the health server before the operator so probes can be queried during startup
+var health = new ContextHandler(new HealthHandler(operator), "/healthz");
 Server server = new Server(8080);
-server.setHandler(new ContextHandlerCollection(startup, readiness));
+server.setHandler(health);
 server.start();
+
+operator.start();
 ```

-Where `StartupHandler` and `ReadinessHandler` extend `org.eclipse.jetty.server.Handler.Abstract` and
-check `operator.getRuntimeInfo().isStarted()` and
-`operator.getRuntimeInfo().allEventSourcesAreHealthy()` respectively.
+Where `HealthHandler` extends `org.eclipse.jetty.server.Handler.Abstract` and checks
+`operator.getRuntimeInfo().allEventSourcesAreHealthy()`.

 See the
 [`operations` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/operations)
 for a complete working example.

 ## Kubernetes Deployment Configuration

-Once your operator exposes probe endpoints, configure them in your Deployment manifest:
+Once your operator exposes the probe endpoint, configure a readiness probe in your Deployment manifest:

 ```yaml
 containers:
   - name: operator
     ports:
       - name: probes
         containerPort: 8080
-    startupProbe:
-      httpGet:
-        path: /startup
-        port: probes
-      initialDelaySeconds: 1
-      periodSeconds: 3
-      failureThreshold: 20
     readinessProbe:
       httpGet:
-        path: /ready
+        path: /healthz
         port: probes
       initialDelaySeconds: 5
       periodSeconds: 5
       failureThreshold: 3
 ```

-The startup probe gives the operator time to start (up to ~60 s with the settings above). Once the startup
-probe succeeds, the readiness probe takes over and will mark the pod as not-ready if any event source
-becomes unhealthy.
+The readiness probe will mark the pod as not-ready until all informers have synced. After that, it
+continues to monitor event source health at runtime.

 ## Helm Chart Support

@@ -92,12 +89,9 @@ Enable them in your `values.yaml`:
 ```yaml
 probes:
   port: 8080
-  startup:
-    enabled: true
-    path: /startup
   readiness:
     enabled: true
-    path: /ready
+    path: /healthz
 ```

 All probe timing parameters (`initialDelaySeconds`, `periodSeconds`, `failureThreshold`) have sensible
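
The per-informer diagnostics described in the docs change above could be turned into log lines roughly as follows. This is a minimal JDK-only sketch, not the SDK's API: `InformerStatus` is a hypothetical record standing in for `InformerHealthIndicator` (its fields mirror `getTargetNamespace()`, `hasSynced()`, `isWatching()` and `isRunning()`), and the flat controller-name-to-list map is a simplifying assumption about the shape returned by `unhealthyInformerWrappingEventSourceHealthIndicator()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: formats one diagnostic line per informer that is not fully healthy.
final class InformerDiagnosticsSketch {

  // Stand-in for InformerHealthIndicator; field names are assumptions for this sketch.
  record InformerStatus(String targetNamespace, boolean synced, boolean watching, boolean running) {}

  static List<String> describeUnhealthy(Map<String, List<InformerStatus>> byController) {
    List<String> lines = new ArrayList<>();
    // Copy into a TreeMap so controllers are reported in a stable, alphabetical order.
    Map<String, List<InformerStatus>> sorted = new TreeMap<>(byController);
    sorted.forEach((controller, informers) -> {
      for (InformerStatus s : informers) {
        if (s.synced() && s.watching() && s.running()) {
          continue; // fully healthy informers are not reported
        }
        lines.add(String.format(
            "controller=%s namespace=%s synced=%b watching=%b running=%b",
            controller, s.targetNamespace(), s.synced(), s.watching(), s.running()));
      }
    });
    return lines;
  }
}
```

In a real operator the input would be built from the indicators exposed by `RuntimeInfo`; the formatting step is the only part this sketch claims to show.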

helm/generic-helm-chart/tests/deployment_test.yaml

Lines changed: 2 additions & 2 deletions
@@ -305,7 +305,7 @@ tests:
     asserts:
       - equal:
           path: spec.template.spec.containers[0].startupProbe.httpGet.path
-          value: /startup
+          value: /healthz
       - equal:
           path: spec.template.spec.containers[0].startupProbe.httpGet.port
           value: probes
@@ -325,7 +325,7 @@ tests:
     asserts:
       - equal:
           path: spec.template.spec.containers[0].readinessProbe.httpGet.path
-          value: /ready
+          value: /healthz
       - equal:
           path: spec.template.spec.containers[0].readinessProbe.httpGet.port
           value: probes

helm/generic-helm-chart/values.yaml

Lines changed: 2 additions & 2 deletions
@@ -134,13 +134,13 @@ probes:
   port: 8080
   startup:
     enabled: false
-    path: /startup
+    path: /healthz
     initialDelaySeconds: 1
     periodSeconds: 3
     failureThreshold: 20
   readiness:
     enabled: false
-    path: /ready
+    path: /healthz
     initialDelaySeconds: 5
     periodSeconds: 5
     failureThreshold: 3

sample-operators/operations/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@
     <dependency>
       <groupId>org.eclipse.jetty</groupId>
       <artifactId>jetty-server</artifactId>
-      <version>12.1.0</version>
+      <version>12.1.8</version>
     </dependency>
     <dependency>
       <groupId>io.javaoperatorsdk</groupId>

sample-operators/operations/src/main/java/io/javaoperatorsdk/operator/sample/metrics/StartupHandler.java renamed to sample-operators/operations/src/main/java/io/javaoperatorsdk/operator/sample/metrics/HealthHandler.java

Lines changed: 11 additions & 5 deletions
@@ -25,20 +25,26 @@

 import io.javaoperatorsdk.operator.Operator;

-public class StartupHandler extends Handler.Abstract {
+/**
+ * Combined health endpoint that checks whether all event sources (informers, polling sources, etc.)
+ * are healthy. Before the operator has fully started the informers will not have synced yet, so
+ * this endpoint naturally covers the startup case as well.
+ */
+public class HealthHandler extends Handler.Abstract {

   private final Operator operator;

-  public StartupHandler(Operator operator) {
+  public HealthHandler(Operator operator) {
     this.operator = operator;
   }

   @Override
   public boolean handle(Request request, Response response, Callback callback) {
-    if (operator.getRuntimeInfo().isStarted()) {
-      sendMessage(response, 200, "started", callback);
+    var runtimeInfo = operator.getRuntimeInfo();
+    if (runtimeInfo.isStarted() && runtimeInfo.allEventSourcesAreHealthy()) {
+      sendMessage(response, 200, "healthy", callback);
     } else {
-      sendMessage(response, 400, "not started yet", callback);
+      sendMessage(response, 503, "not healthy", callback);
     }
     return true;
   }
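
The docs in this commit say any HTTP server library works, so the handler above could plausibly be expressed without Jetty. The sketch below uses the JDK's built-in `com.sun.net.httpserver.HttpServer` and mirrors the handler's decision (200 only when the operator is started and all event sources are healthy, 503 otherwise). The class name `HealthzSketch` and the two `BooleanSupplier` parameters are hypothetical; they stand in for `RuntimeInfo.isStarted()` and `RuntimeInfo.allEventSourcesAreHealthy()` so the example compiles without the SDK.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

// Hypothetical JDK-only stand-in for the Jetty-based HealthHandler.
final class HealthzSketch {

  // Same decision as the handler in the diff: healthy only if started AND all sources healthy.
  static int statusFor(boolean started, boolean allEventSourcesHealthy) {
    return started && allEventSourcesHealthy ? 200 : 503;
  }

  // Starts a server exposing /healthz; port 0 asks the OS for any free port.
  static HttpServer start(int port, BooleanSupplier started, BooleanSupplier healthy) {
    try {
      HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
      server.createContext("/healthz", exchange -> {
        // Evaluate the suppliers on every request so the probe reflects current state.
        int status = statusFor(started.getAsBoolean(), healthy.getAsBoolean());
        byte[] body = (status == 200 ? "healthy" : "not healthy").getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, body.length);
        try (var out = exchange.getResponseBody()) {
          out.write(body);
        }
      });
      server.start();
      return server;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Wired to an operator, this would look something like `HealthzSketch.start(8080, () -> operator.getRuntimeInfo().isStarted(), () -> operator.getRuntimeInfo().allEventSourcesAreHealthy())`, called before `operator.start()` so the probe can be queried during startup.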

sample-operators/operations/src/main/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingSampleOperator.java

Lines changed: 2 additions & 4 deletions
@@ -25,7 +25,6 @@

 import org.eclipse.jetty.server.Server;
 import org.eclipse.jetty.server.handler.ContextHandler;
-import org.eclipse.jetty.server.handler.ContextHandlerCollection;
 import org.jspecify.annotations.NonNull;
 import org.jspecify.annotations.Nullable;
 import org.slf4j.Logger;
@@ -79,10 +78,9 @@ public static void main(String[] args) throws Exception {
     operator.register(
         new MetricsHandlingReconciler2(),
         configLoader.applyControllerConfigs(MetricsHandlingReconciler2.NAME));
-    var startup = new ContextHandler(new StartupHandler(operator), "/startup");
-    var readiness = new ContextHandler(new ReadinessHandler(operator), "/ready");
+    var health = new ContextHandler(new HealthHandler(operator), "/healthz");
     Server server = new Server(8080);
-    server.setHandler(new ContextHandlerCollection(startup, readiness));
+    server.setHandler(health);
     server.start();
     log.info("Health probe server started on port 8080");

sample-operators/operations/src/main/java/io/javaoperatorsdk/operator/sample/metrics/ReadinessHandler.java

Lines changed: 0 additions & 44 deletions
This file was deleted.

sample-operators/operations/src/test/resources/helm-values.yaml

Lines changed: 1 addition & 2 deletions
@@ -34,8 +34,7 @@ primaryResources:
   - metricshandlingcustomresource2s

 probes:
-  startup:
-    enabled: true
   readiness:
     enabled: true
+    path: /healthz
