
🛰️ Pulsar Sentinel

Pulsar Sentinel is a lightweight high-availability (HA) and cluster watchdog for Apache Pulsar or similar distributed systems. It continuously monitors broker health, exchanges heartbeat signals across nodes, and triggers fencing (automated failover or recovery actions) when quorum or health conditions are violated.


🚀 Features

  • Cluster Heartbeat System (UDP-based, multicast or broadcast)
  • Broker Health Checks via HTTP endpoint
  • Automatic Fencing with configurable grace and cooldown periods
  • Prometheus Metrics endpoint for observability
  • Quorum Detection based on node priorities
  • Designed for Geo-Replicated Pulsar Deployments
  • Asynchronous runtime powered by Tokio

⚙️ Usage

Command-line Options

Flag                   Description                                                      Default
--priority             Node priority (1 = highest). Required.                           none
--port                 UDP port for heartbeat traffic                                   3001
--http-port            HTTP port for Prometheus metrics                                 3000
--heartbeat-interval   Time between heartbeat messages                                  5s
--timeout              Maximum allowed heartbeat age before a node is considered dead   15s
--grace-period         Duration a failure must persist before fencing starts            20s
--fencing-cooldown     Minimum time between fencing actions                             60s
--quorum-size          Minimum number of healthy nodes required per priority group     2
--multicast            Multicast group IP (e.g. 239.0.0.1). If empty, broadcast is used.   ""
--healthcheck          HTTP health endpoint (e.g. Pulsar's /admin/v2/brokers/health)   http://localhost:8080/admin/v2/brokers/health
--debug                Enable debug-level logging                                       false

Example

pulsar-sentinel \
  --priority 2 \
  --port 3001 \
  --http-port 3000 \
  --heartbeat-interval 5s \
  --timeout 15s \
  --grace-period 20s \
  --fencing-cooldown 60s \
  --quorum-size 2 \
  --multicast 239.0.0.1 \
  --healthcheck http://localhost:8080/admin/v2/brokers/health

🔐 Environment Variables

Variable          Description                                                          Example
JWT_TOKEN         Optional authentication token for the Pulsar healthcheck endpoint   eyJhbGciOiJIUzI1NiIsInR...
FENCING_COMMAND   Command executed when fencing is triggered                           podman restart pulsar-proxy
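
A minimal sketch of how JWT_TOKEN might be attached to the healthcheck request, assuming the reqwest crate; the helper shown is illustrative, not Sentinel's actual code:

use std::env;

// Illustrative sketch of the broker healthcheck (assumes the reqwest
// crate): GET the endpoint and attach JWT_TOKEN as a bearer token
// when it is set; any 2xx response counts as healthy.
async fn probe_health(url: &str) -> Result<bool, reqwest::Error> {
    let client = reqwest::Client::new();
    let mut request = client.get(url);
    if let Ok(token) = env::var("JWT_TOKEN") {
        request = request.bearer_auth(token); // "Authorization: Bearer <token>"
    }
    let response = request.send().await?;
    Ok(response.status().is_success())
}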

📊 Metrics (Prometheus)

Sentinel exposes metrics at http://<host>:<http_port>/metrics

Metric                           Description
sentinel_broker_healthy          Local broker health (1 = ok, 0 = fail)
sentinel_group_nodes{priority}   Count of healthy nodes per priority group
sentinel_fencing_active          Whether fencing is currently active (1 = yes, 0 = no)
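
A scrape of the endpoint might then return output along these lines (values illustrative):

sentinel_broker_healthy 1
sentinel_group_nodes{priority="1"} 2
sentinel_group_nodes{priority="2"} 1
sentinel_fencing_active 0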

🧩 How It Works

  1. Each node periodically checks its local broker's health via HTTP.
  2. The node broadcasts (or multicasts) a heartbeat message (see the sketch after this list) containing:
    • its name
    • priority
    • health status
  3. Every node listens for peer heartbeats and maintains a local registry of cluster state.
  4. The quorum checker runs periodically:
    • If the current priority group loses quorum → start fencing after grace period.
    • If a higher-priority group regains quorum → stop fencing.
  5. Fencing executes a user-defined shell command (e.g. to restart a proxy or broker).
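
A rough sketch of step 2, assuming serde_json for the payload and Tokio's UDP socket; the field names and wire format here are illustrative, not necessarily what Sentinel actually sends:

use serde::{Deserialize, Serialize};
use tokio::net::UdpSocket;

// Illustrative heartbeat payload; the real wire format may differ.
#[derive(Serialize, Deserialize)]
struct Heartbeat {
    name: String,   // node name
    priority: u32,  // 1 = highest
    healthy: bool,  // result of the local broker healthcheck
}

// Sends one heartbeat via UDP broadcast (step 2). A multicast setup
// would target the configured group address instead.
async fn send_heartbeat(hb: &Heartbeat, port: u16) -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0").await?;
    socket.set_broadcast(true)?;
    let buf = serde_json::to_vec(hb).expect("heartbeat serializes");
    socket.send_to(&buf, ("255.255.255.255", port)).await?;
    Ok(())
}

Receivers (step 3) would deserialize the same structure and update their local registry of peer state.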

🌍 Failover Behavior in Geo-Replicated Environments

In asynchronous Geo-Replication setups (e.g. Pulsar clusters spread across regions), failover scenarios can be complex:

  • When a region becomes unhealthy or isolated, some clients may remain connected to local brokers or proxies.
  • These clients may not immediately detect the failover — especially if the network partition is partial (e.g. only the inter-cluster replication path fails).
  • As a result, the "unhealthy" site can continue to serve stale or diverging data.

💡 Solution: Controlled Fencing

To enforce a clean failover, Sentinel can force-disconnect clients from unhealthy regions by restarting their proxy layer.
This ensures clients reconnect to a healthy cluster once failover is triggered.

Example fencing action:

export FENCING_COMMAND="podman restart pulsar-proxy"

When quorum is lost or the healthcheck fails persistently, Sentinel will execute this command locally.
This can be extended to run node-level fencing scripts or container orchestrator commands if desired.
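
A minimal sketch of how that execution could look, assuming tokio::process and running the command through a shell so compound commands also work:

use tokio::process::Command;

// Illustrative sketch: run FENCING_COMMAND through a shell;
// error handling is deliberately simple.
async fn run_fencing_action() {
    let Ok(cmd) = std::env::var("FENCING_COMMAND") else {
        eprintln!("FENCING_COMMAND not set; skipping fencing action");
        return;
    };
    match Command::new("sh").arg("-c").arg(&cmd).status().await {
        Ok(status) => println!("fencing command exited with {status}"),
        Err(e) => eprintln!("failed to spawn fencing command: {e}"),
    }
}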

The fencing process:

  1. Failure detected (loss of quorum or unhealthy broker).
  2. Grace period starts (to avoid flapping).
  3. If issue persists, fencing command runs → forcibly restarts the proxy container.
  4. Clients are disconnected and automatically reconnect to the other Pulsar region.

This mechanism prevents split-brain behavior and ensures that only one region serves active clients at a time.
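
The grace-period and cooldown logic of steps 2 and 3 might look roughly like the following sketch; names and structure are illustrative:

use std::time::{Duration, Instant};

// Illustrative state machine: fence only after a failure has persisted
// through the grace period, and never more often than the cooldown allows.
struct FencingState {
    failing_since: Option<Instant>, // when the current failure started
    last_fencing: Option<Instant>,  // when fencing last ran
}

impl FencingState {
    fn should_fence(&mut self, healthy: bool, grace: Duration, cooldown: Duration) -> bool {
        if healthy {
            self.failing_since = None; // failure cleared, reset grace timer
            return false;
        }
        let since = *self.failing_since.get_or_insert_with(Instant::now);
        let grace_elapsed = since.elapsed() >= grace;
        let cooled_down = self
            .last_fencing
            .map_or(true, |t| t.elapsed() >= cooldown);
        if grace_elapsed && cooled_down {
            self.last_fencing = Some(Instant::now());
            return true;
        }
        false
    }
}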


🧠 Design Overview

+----------------------------+
| Region A (Active Cluster)  |
|  +----------------------+  |
|  | Pulsar Sentinel      |  |
|  |  - Heartbeats        |  |
|  |  - Fencing Control   |  |
|  +----------+-----------+  |
|             |              |
+-------------+--------------+
              ^
              | UDP / HTTP
              v
+-------------+--------------+
| Region B (Passive Cluster) |
|  +----------------------+  |
|  | Pulsar Sentinel      |  |
|  |  - Heartbeats        |  |
|  |  - Proxy Restart     |  |
|  +----------------------+  |
+----------------------------+

🧰 Build & Run

Requirements

  • Rust 1.74+
  • Tokio runtime
  • Network access for UDP broadcast or multicast

Build

cargo build --release

Run

./target/release/pulsar-sentinel --priority 1

📘 License

MIT License © 2025 Matthias Petermann / Petermann Digital