🛰️ Pulsar Sentinel
Pulsar Sentinel is a lightweight, self-contained cluster watchdog for Apache Pulsar brokers.
It ensures high availability and safe failover by exchanging UDP-based heartbeats, monitoring broker health, and applying fencing logic when quorum or priority conditions are violated.
🚀 Features
- 🧭 Heartbeat-based cluster coordination via UDP broadcast or multicast
- ❤️ Local broker health checks (HTTP endpoint returning `ok`)
- 🧩 Priority group logic (e.g., active/passive or multi-tier clusters)
- 🧠 Quorum evaluation to detect network partitions or node loss
- 🪓 Automatic fencing with grace and cooldown periods
- 🔒 Stateless, configuration-free runtime (no external dependencies)
- 📡 Built-in HTTP health endpoint for monitoring (used by load balancers)
⚙️ How It Works
Each node runs a pulsar-sentinel process.
It periodically sends a heartbeat containing:
{
  "type": "heartbeat",
  "node": "broker-1-prio1",
  "ts": 1699999999,
  "data": {
    "priority": 1,
    "healthy": true
  }
}
All other nodes receive these via UDP.
Sentinel then:
- Tracks which peers are healthy (based on age and broker health).
- Counts healthy nodes per priority group.
- Decides whether this node should remain active, or self-fence.
⚡ Fencing Rules
| Condition | Action |
|---|---|
| Local priority group loses quorum | Trigger fencing after grace period |
| Higher-priority group regains quorum | Trigger fencing immediately |
| Quorum stable, no higher priority active | Remain active |
| Condition clears before grace expires | Cancel fencing |
Fencing runs a configurable shell command (FENCING_COMMAND), e.g.:
export FENCING_COMMAND="ss --kill state established sport = :6651"
By default, it runs true (no-op).
💡 Info: In setups where an external load balancer distributes traffic across nodes of multiple priority groups but cannot actively drop connections on role change, the fencing action must ensure connection consistency. The load balancer stops routing new connections once the healthcheck fails, so the fencing command’s job is to terminate existing TCP sessions on the broker port (e.g., `6651` for Pulsar TLS). This can be done once or periodically using the Linux `ss` tool:
ss --kill state established sport = :6651
This guarantees that no stale client connections remain open toward a fenced (inactive) node.
🧩 HTTP Endpoints
/
Health endpoint for external systems.
Returns ok if:
- The local broker healthcheck returns `ok` (HTTP 200 + body `ok`), and
- The cluster quorum and priority rules allow this node to stay active.
Otherwise returns fail.
Example:
curl http://localhost:3000/
# ok
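A handler with this behaviour might be sketched as below. The function names are hypothetical, and the HTTP status used for `fail` (503) is an assumption, since only the response body is documented here.

```go
package main

import (
	"io"
	"net/http"
)

// healthResponse maps the two conditions to a status code and body.
// The 503 status for "fail" is an assumption of this sketch.
func healthResponse(brokerOK, allowActive bool) (int, string) {
	if brokerOK && allowActive {
		return http.StatusOK, "ok"
	}
	return http.StatusServiceUnavailable, "fail"
}

// healthHandler builds the "/" handler from two callbacks that report
// the current broker health and the quorum/priority decision.
func healthHandler(brokerOK, allowActive func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		code, body := healthResponse(brokerOK(), allowActive())
		w.WriteHeader(code)
		io.WriteString(w, body)
	}
}
```

It could be served with `http.ListenAndServe(":3000", healthHandler(brokerOK, allowActive))`, where both callbacks are supplied by the caller.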
🧠 Command-line Flags
| Flag | Default | Description |
|---|---|---|
| `-priority` | (required) | Node priority (1 = highest) |
| `-port` | `3001` | UDP port for heartbeats |
| `-http-port` | `3000` | Port for Sentinel’s HTTP health endpoint |
| `-healthcheck` | `http://localhost:8080/admin/v2/brokers/health` | Broker healthcheck URL |
| `-heartbeat-interval` | `5s` | Interval between heartbeats |
| `-timeout` | `15s` | Maximum heartbeat age before node is considered dead |
| `-grace-period` | `20s` | Delay before fencing after a failure is detected |
| `-fencing-cooldown` | `60s` | Cooldown time between fencing executions |
| `-quorum-size` | `2` | Minimum number of healthy nodes required per group |
| `-multicast` | (empty) | Multicast group address (e.g., 239.0.0.1), defaults to broadcast |
| `-debug` | `false` | Enable detailed logging |
🌍 Environment Variables
| Variable | Purpose |
|---|---|
| `HEALTHCHECK_URL` | Overrides -healthcheck flag |
| `JWT_TOKEN` | Adds Bearer token to the broker healthcheck request |
| `FENCING_COMMAND` | Command executed when fencing is triggered |
🩺 Example Setup
Node A (priority 1)
./pulsar-sentinel -priority 1 -healthcheck http://localhost:8080/health
Node B (priority 2)
./pulsar-sentinel -priority 2 -healthcheck http://localhost:8080/health
Sample Log Output
INFO[0000] Listening for broadcast on 0.0.0.0:3001
INFO[0000] Starting HTTP health server on :3000
INFO[0010] Current group counts: map[1:2 2:1]
WARN[0010] Condition detected: Higher priority group 1 restored quorum (2/2)
ERROR[0030] Fencing triggered: Higher priority group 1 restored quorum (2/2)
INFO[0030] Executing fencing command: "ss --kill state established sport = :6651"
🧪 Testing the Local Healthcheck
To simulate a simple broker health endpoint locally, you can use this Python snippet:
python3 -c "import http.server as s; H = type('H', (s.BaseHTTPRequestHandler,), {'do_GET': lambda self: (self.send_response(200), self.end_headers(), self.wfile.write(b'ok'))}); s.HTTPServer(('', 8080), H).serve_forever()"
Then run Sentinel and query:
curl http://localhost:3000/
# ok
🛠️ Build & Run
Build
go build -o pulsar-sentinel main.go
Run
./pulsar-sentinel -priority 1 -healthcheck http://localhost:8080/health
Enable debug logs
./pulsar-sentinel -priority 1 -debug
🔒 Example Systemd Unit
[Unit]
Description=Pulsar Sentinel
After=network.target
[Service]
ExecStart=/usr/local/bin/pulsar-sentinel -priority 1
Restart=always
Environment="HEALTHCHECK_URL=http://localhost:8080/admin/v2/brokers/health"
Environment="FENCING_COMMAND=ss --kill state established sport = :6651"
[Install]
WantedBy=multi-user.target
📜 License
MIT License © 2025 Matthias Petermann
🧭 Summary
Pulsar Sentinel is a minimalistic cluster manager designed to prevent data corruption and split-brain scenarios in Pulsar deployments. It’s simple, dependency-free, and designed to play nicely with systemd.