# 🛰️ Pulsar Sentinel

Pulsar Sentinel is a lightweight high-availability (HA) cluster watchdog for Apache Pulsar and similar distributed systems. It continuously monitors broker health, exchanges heartbeat signals across nodes, and triggers fencing (automated failover or recovery actions) when quorum or health conditions are violated.
## 🚀 Features
- ✅ Cluster Heartbeat System (UDP-based, multicast or broadcast)
- ✅ Broker Health Checks via HTTP endpoint
- ✅ Automatic Fencing with configurable grace and cooldown periods
- ✅ Prometheus Metrics endpoint for observability
- ✅ Quorum Detection based on node priorities
- ✅ Designed for Geo-Replicated Pulsar Deployments
- ✅ Asynchronous runtime powered by Tokio
## ⚙️ Usage

### Command-line Options

| Flag | Description | Default |
|---|---|---|
| `--priority` | Node priority (1 = highest). Required. | none |
| `--port` | UDP port for heartbeat traffic | `3001` |
| `--http-port` | HTTP port for Prometheus metrics | `3000` |
| `--heartbeat-interval` | Time between heartbeat messages | `5s` |
| `--timeout` | Maximum allowed heartbeat age before a node is considered dead | `15s` |
| `--grace-period` | Duration a failure must persist before fencing starts | `20s` |
| `--fencing-cooldown` | Minimum time between fencing actions | `60s` |
| `--quorum-size` | Minimum number of healthy nodes required per priority group | `2` |
| `--multicast` | Multicast group IP (e.g. `239.0.0.1`). If empty, broadcast is used. | `""` |
| `--healthcheck` | HTTP health endpoint (e.g. Pulsar’s `/admin/v2/brokers/health`) | `http://localhost:8080/admin/v2/brokers/health` |
| `--debug` | Enable debug-level logging | `false` |
### Example

```sh
pulsar-sentinel --priority 2 --port 3001 --http-port 3000 \
  --heartbeat-interval 5s --timeout 15s --grace-period 20s \
  --fencing-cooldown 60s --quorum-size 2 --multicast 239.0.0.1 \
  --healthcheck http://localhost:8080/admin/v2/brokers/health
```
## 🔐 Environment Variables

| Variable | Description | Example |
|---|---|---|
| `JWT_TOKEN` | Optional authentication token for the Pulsar healthcheck endpoint | `eyJhbGciOiJIUzI1NiIsInR...` |
| `FENCING_COMMAND` | Command executed when fencing is triggered | `podman restart pulsar-proxy` |
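For illustration, here is roughly how `JWT_TOKEN` could be attached to the healthcheck request. This is a minimal sketch using the `reqwest` crate; the crate choice and the function name `check_broker_health` are assumptions, not taken from the actual source:

```rust
use std::env;
use std::time::Duration;

/// Hypothetical healthcheck probe: GETs the broker health endpoint,
/// adding JWT_TOKEN as a bearer token when the variable is set.
/// Returns true only for a 2xx response.
async fn check_broker_health(url: &str) -> bool {
    let client = reqwest::Client::new();
    let mut request = client.get(url).timeout(Duration::from_secs(5));
    if let Ok(token) = env::var("JWT_TOKEN") {
        request = request.bearer_auth(token);
    }
    match request.send().await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}
```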
## 📊 Metrics (Prometheus)

Sentinel exposes metrics at `http://<host>:<http_port>/metrics`.

| Metric | Description |
|---|---|
| `sentinel_broker_healthy` | Local broker health (1 = ok, 0 = fail) |
| `sentinel_group_nodes{priority}` | Count of healthy nodes per priority group |
| `sentinel_fencing_active` | Indicates whether fencing is currently active (1/0) |
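As a rough illustration of how one of these gauges might be produced, here is a minimal sketch using the `prometheus` crate (an assumption; the actual metrics code may differ):

```rust
use prometheus::{Encoder, IntGauge, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Register the broker-health gauge described in the table above.
    let registry = Registry::new();
    let broker_healthy = IntGauge::new(
        "sentinel_broker_healthy",
        "Local broker health (1=ok, 0=fail)",
    )?;
    registry.register(Box::new(broker_healthy.clone()))?;

    // Update from the latest healthcheck result (1 = healthy).
    broker_healthy.set(1);

    // Render the registry in the text exposition format that a
    // /metrics handler would serve.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}
```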
## 🧩 How It Works

- Each node periodically checks its local broker’s health via HTTP.
- The node broadcasts (or multicasts) a heartbeat message containing:
  - its name
  - its priority
  - its health status
- Every node listens for peer heartbeats and maintains a local registry of cluster state.
- The quorum checker runs periodically (see the sketch after this list):
  - If the current priority group loses quorum → start fencing after the grace period.
  - If a higher-priority group regains quorum → stop fencing.
- Fencing executes a user-defined shell command (e.g. to restart a proxy or broker).
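A minimal sketch of the heartbeat payload and the quorum rule, assuming a serde-serialized message and an in-memory peer registry (field names and types here are illustrative, not the actual wire format):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

use serde::{Deserialize, Serialize};

/// Illustrative heartbeat payload: name, priority, health status.
#[derive(Debug, Serialize, Deserialize)]
struct Heartbeat {
    node: String,
    priority: u32,
    healthy: bool,
}

/// Last known state per peer, keyed by node name in the registry.
struct PeerState {
    priority: u32,
    healthy: bool,
    last_seen: Instant,
}

/// A priority group has quorum when enough of its members are healthy
/// and their last heartbeat is younger than the configured timeout.
fn group_has_quorum(
    peers: &HashMap<String, PeerState>,
    priority: u32,
    timeout: Duration,
    quorum_size: usize,
) -> bool {
    peers
        .values()
        .filter(|p| p.priority == priority && p.healthy)
        .filter(|p| p.last_seen.elapsed() < timeout)
        .count()
        >= quorum_size
}
```

Keying the registry by node name means a restarting node simply overwrites its stale entry with its next heartbeat.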
## 🌍 Failover Behavior in Geo-Replicated Environments

In asynchronous geo-replication setups (e.g. Pulsar clusters spread across regions), failover scenarios can be complex:

- When a region becomes unhealthy or isolated, some clients may remain connected to local brokers or proxies.
- These clients may not detect the failover immediately, especially if the network partition is partial (e.g. only the inter-cluster replication path fails).
- As a result, the "unhealthy" site can continue to serve stale or diverging data.
### 💡 Solution: Controlled Fencing
To enforce a clean failover, Sentinel can force-disconnect clients from unhealthy regions by restarting their proxy layer.
This ensures clients reconnect to a healthy cluster once failover is triggered.
Example fencing action:

```sh
export FENCING_COMMAND="podman restart pulsar-proxy"
```
When quorum is lost or the healthcheck fails persistently, Sentinel will execute this command locally.
This can be extended to run node-level fencing scripts or container orchestrator commands if desired.
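Conceptually, running the configured command through a shell looks roughly like this; a minimal sketch using `tokio::process` (illustrative, not the project's actual code):

```rust
use tokio::process::Command;

/// Hypothetical fencing runner: executes the configured command
/// (e.g. the value of FENCING_COMMAND) via a shell and reports
/// whether it exited successfully.
async fn run_fencing_command(command: &str) -> bool {
    match Command::new("sh").arg("-c").arg(command).status().await {
        Ok(status) => status.success(),
        Err(err) => {
            eprintln!("fencing command failed to start: {err}");
            false
        }
    }
}
```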
The fencing process:

1. A failure is detected (loss of quorum or an unhealthy broker).
2. The grace period starts (to avoid flapping).
3. If the issue persists, the fencing command runs and forcibly restarts the proxy container.
4. Clients are disconnected and automatically reconnect to the other Pulsar region.
This mechanism prevents split-brain behavior and ensures that only one region serves active clients at a time.
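The grace period and cooldown reduce to two timestamps; a minimal sketch of that debouncing logic, assuming the semantics of the `--grace-period` and `--fencing-cooldown` flags above (the type and method names are illustrative):

```rust
use std::time::{Duration, Instant};

/// Tracks when the current failure started and when fencing last ran.
struct FencingState {
    failure_since: Option<Instant>,
    last_fenced: Option<Instant>,
}

impl FencingState {
    /// Decide whether to fence now, given the latest health verdict.
    fn should_fence(&mut self, unhealthy: bool, grace: Duration, cooldown: Duration) -> bool {
        if !unhealthy {
            // Recovery resets the grace-period timer (avoids flapping).
            self.failure_since = None;
            return false;
        }
        let since = *self.failure_since.get_or_insert_with(Instant::now);
        let grace_elapsed = since.elapsed() >= grace;
        let cooled_down = self
            .last_fenced
            .map_or(true, |t| t.elapsed() >= cooldown);
        if grace_elapsed && cooled_down {
            self.last_fenced = Some(Instant::now());
            return true;
        }
        false
    }
}
```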
## 🧠 Design Overview

```text
+-----------------------------+
|  Region A (Active Cluster)  |
|  +-----------------------+  |
|  |    Pulsar Sentinel    |  |
|  |  - Heartbeats         |  |
|  |  - Fencing Control    |  |
|  +-----------+-----------+  |
|              |              |
|   UDP / HTTP |              |
+--------------v--------------+
               ^
               |
+--------------+--------------+
| Region B (Passive Cluster)  |
|  +-----------------------+  |
|  |    Pulsar Sentinel    |  |
|  |  - Heartbeats         |  |
|  |  - Proxy Restart      |  |
|  +-----------------------+  |
+-----------------------------+
```
## 🧰 Build & Run

### Requirements

- Rust 1.74+
- Tokio runtime
- Network access for UDP broadcast or multicast

### Build

```sh
cargo build --release
```

### Run

```sh
./target/release/pulsar-sentinel --priority 1
```
## 📘 License
MIT License © 2025 Matthias Petermann / Petermann Digital