🛰️ Pulsar Sentinel

Pulsar Sentinel is a lightweight, self-contained cluster watchdog for Apache Pulsar brokers.
It ensures high availability and safe failover by exchanging UDP-based heartbeats, monitoring broker health, and applying fencing logic when quorum or priority conditions are violated.


🚀 Features

  • 🧭 Heartbeat-based cluster coordination via UDP broadcast or multicast
  • ❤️ Local broker health checks (HTTP endpoint returning ok)
  • 🧩 Priority group logic (e.g., active/passive or multi-tier clusters)
  • 🧠 Quorum evaluation to detect network partitions or node loss
  • 🪓 Automatic fencing with grace and cooldown periods
  • 🔒 Stateless, configuration-free runtime (no external dependencies)
  • 📡 Built-in HTTP health endpoint for monitoring (used by load balancers)

⚙️ How It Works

Each node runs a pulsar-sentinel process.
It periodically sends a heartbeat containing:

{
  "type": "heartbeat",
  "node": "broker-1-prio1",
  "ts": 1699999999,
  "data": {
    "priority": 1,
    "healthy": true
  }
}
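
For illustration, the sender side might look like this in Go (a minimal sketch; names like Heartbeat and broadcastHeartbeat are assumptions, not necessarily the code in main.go):

package main

import (
    "encoding/json"
    "fmt"
    "net"
    "time"
)

// Heartbeat mirrors the JSON message shown above.
type Heartbeat struct {
    Type string `json:"type"`
    Node string `json:"node"`
    TS   int64  `json:"ts"`
    Data struct {
        Priority int  `json:"priority"`
        Healthy  bool `json:"healthy"`
    } `json:"data"`
}

// broadcastHeartbeat marshals one heartbeat and sends it to the local
// broadcast address on the given UDP port. Go enables SO_BROADCAST on
// UDP sockets by default, so no extra socket options are required.
func broadcastHeartbeat(node string, priority int, healthy bool, port int) error {
    hb := Heartbeat{Type: "heartbeat", Node: node, TS: time.Now().Unix()}
    hb.Data.Priority = priority
    hb.Data.Healthy = healthy

    payload, err := json.Marshal(hb)
    if err != nil {
        return err
    }
    conn, err := net.Dial("udp", fmt.Sprintf("255.255.255.255:%d", port))
    if err != nil {
        return err
    }
    defer conn.Close()
    _, err = conn.Write(payload)
    return err
}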

All other nodes receive these via UDP.
Sentinel then:

  1. Tracks which peers are healthy (based on age and broker health).
  2. Counts healthy nodes per priority group.
  3. Decides whether this node should remain active or self-fence.
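
Steps 1 and 2 could be sketched like this (illustrative only; whether a node counts itself toward its own group is an assumption):

package main

import "time"

// Peer is the tracked state for one node (assumed shape).
type Peer struct {
    Priority int
    Healthy  bool
    LastSeen time.Time
}

// countHealthyPerGroup covers steps 1 and 2: a node counts as healthy
// when its broker reported healthy and its last heartbeat is younger
// than -timeout. The result corresponds to the "group counts" map in
// the sample log output, e.g. map[1:2 2:1].
func countHealthyPerGroup(peers map[string]Peer, timeout time.Duration) map[int]int {
    counts := make(map[int]int)
    now := time.Now()
    for _, p := range peers {
        if p.Healthy && now.Sub(p.LastSeen) <= timeout {
            counts[p.Priority]++
        }
    }
    return counts
}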

Fencing Rules

Condition                                         Action
Local priority group loses quorum                 Trigger fencing after grace period
Higher-priority group regains quorum              Trigger fencing immediately
Quorum stable, no higher-priority group active    Remain active
Condition clears before grace expires             Cancel fencing
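
As code, the table above roughly corresponds to a decision function like the following sketch (shouldFence and its signature are assumed, not taken from main.go):

package main

// shouldFence maps the table above onto code (assumed shape). ownPrio
// is this node's -priority, quorum is -quorum-size, and counts comes
// from the heartbeat tracking step.
func shouldFence(ownPrio, quorum int, counts map[int]int) (fence, immediate bool) {
    // A higher-priority group (lower number) holding quorum fences immediately.
    for prio, n := range counts {
        if prio < ownPrio && n >= quorum {
            return true, true
        }
    }
    // The local group losing quorum fences after the grace period.
    if counts[ownPrio] < quorum {
        return true, false
    }
    // Otherwise remain active; a pending grace timer would be cancelled.
    return false, false
}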

Fencing runs a configurable shell command (FENCING_COMMAND), e.g.:

export FENCING_COMMAND="ss --kill state established sport = :6651"

By default, it runs true (no-op).
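
This presumably amounts to handing the string to a shell; a sketch under that assumption (runFencing is a hypothetical name):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// runFencing hands FENCING_COMMAND to the shell, falling back to the
// documented no-op "true" when the variable is unset.
func runFencing() error {
    cmd := os.Getenv("FENCING_COMMAND")
    if cmd == "" {
        cmd = "true"
    }
    out, err := exec.Command("sh", "-c", cmd).CombinedOutput()
    if err != nil {
        return fmt.Errorf("fencing command failed: %w (output: %s)", err, out)
    }
    return nil
}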

💡 Info: In setups where an external load balancer distributes traffic across nodes of multiple priority groups but cannot actively drop connections on role change, the fencing action must ensure connection consistency. Since the load balancer stops routing new connections once the healthcheck fails, the fencing command's job is to terminate existing TCP sessions on the broker port (e.g., 6651 for Pulsar TLS). This can be done once or periodically using the Linux ss tool:

ss --kill state established sport = :6651

This guarantees that no stale client connections remain open toward a fenced (inactive) node.


🧩 HTTP Endpoints

/

Health endpoint for external systems.
Returns ok if:

  • The local broker healthcheck returns ok (HTTP 200 + body ok), and
  • The cluster quorum and priority rules allow this node to stay active.

Otherwise returns fail.

Example:

curl http://localhost:3000/
# ok
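
The handler behind / might look like this sketch (the 503 status for the fail case is an assumption; only the response body is documented):

package main

import "net/http"

// healthHandler combines the two conditions above: local broker health
// and the cluster-level decision. Returning 503 for "fail" is an
// assumption; the README only specifies the response body.
func healthHandler(brokerOK, clusterOK func() bool) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if brokerOK() && clusterOK() {
            w.Write([]byte("ok")) // implicit HTTP 200
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("fail"))
    }
}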

🧠 Command-line Flags

Flag                 Default                                          Description
-priority            (required)                                       Node priority (1 = highest)
-port                3001                                             UDP port for heartbeats
-http-port           3000                                             Port for Sentinel's HTTP health endpoint
-healthcheck         http://localhost:8080/admin/v2/brokers/health    Broker healthcheck URL
-heartbeat-interval  5s                                               Interval between heartbeats
-timeout             15s                                              Maximum heartbeat age before a node is considered dead
-grace-period        20s                                              Delay before fencing after a failure is detected
-fencing-cooldown    60s                                              Cooldown time between fencing executions
-quorum-size         2                                                Minimum number of healthy nodes required per group
-multicast           (empty)                                          Multicast group address (e.g., 239.0.0.1); defaults to broadcast
-debug               false                                            Enable detailed logging
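
The two receive modes behind -multicast can be sketched with the standard library alone (the actual socket setup in main.go may differ):

package main

import "net"

// listen picks the receive mode implied by -multicast: join the given
// multicast group, or fall back to plain UDP (broadcast) on all
// interfaces.
func listen(multicast string, port int) (*net.UDPConn, error) {
    if multicast != "" {
        group := &net.UDPAddr{IP: net.ParseIP(multicast), Port: port}
        return net.ListenMulticastUDP("udp", nil, group) // nil = system-chosen interface
    }
    return net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: port})
}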

🌍 Environment Variables

Variable          Purpose
HEALTHCHECK_URL   Overrides the -healthcheck flag
JWT_TOKEN         Adds a Bearer token to the broker healthcheck request
FENCING_COMMAND   Command executed when fencing is triggered
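
Putting HEALTHCHECK_URL and JWT_TOKEN together, the broker check might be implemented roughly like this (assumed shape; brokerHealthy is a hypothetical helper):

package main

import (
    "io"
    "net/http"
    "os"
    "strings"
)

// brokerHealthy performs the healthcheck described above: healthy only
// on HTTP 200 with body "ok". JWT_TOKEN, when set, is attached as a
// Bearer token.
func brokerHealthy(url string) bool {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return false
    }
    if tok := os.Getenv("JWT_TOKEN"); tok != "" {
        req.Header.Set("Authorization", "Bearer "+tok)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    return resp.StatusCode == http.StatusOK && strings.TrimSpace(string(body)) == "ok"
}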

🩺 Example Setup

Node A (priority 1)

./pulsar-sentinel -priority 1 -healthcheck http://localhost:8080/health

Node B (priority 2)

./pulsar-sentinel -priority 2 -healthcheck http://localhost:8080/health

Sample Log Output

INFO[0000] Listening for broadcast on 0.0.0.0:3001
INFO[0000] Starting HTTP health server on :3000
INFO[0010] Current group counts: map[1:2 2:1]
WARN[0010] Condition detected: Higher priority group 1 restored quorum (2/2)
ERROR[0030] Fencing triggered: Higher priority group 1 restored quorum (2/2)
INFO[0030] Executing fencing command: "ss --kill state established sport = :6651"

🧪 Testing the Local Healthcheck

To simulate a simple broker health endpoint locally, you can use the bundled test.py or this equivalent Python snippet:

python3 -c "
from http.server import BaseHTTPRequestHandler, HTTPServer
class H(BaseHTTPRequestHandler):
    def do_GET(self): self.send_response(200); self.end_headers(); self.wfile.write(b'ok')
HTTPServer(('', 8080), H).serve_forever()"

Then run Sentinel and query:

curl http://localhost:3000/
# ok

🛠️ Build & Run

Build

go build -o pulsar-sentinel main.go

Run

./pulsar-sentinel -priority 1 -healthcheck http://localhost:8080/health

Enable debug logs

./pulsar-sentinel -priority 1 -debug

🔒 Example Systemd Unit

[Unit]
Description=Pulsar Sentinel
After=network.target

[Service]
ExecStart=/usr/local/bin/pulsar-sentinel -priority 1
Restart=always
Environment="HEALTHCHECK_URL=http://localhost:8080/admin/v2/brokers/health"
Environment="FENCING_COMMAND=ss --kill state established sport = :6651"

[Install]
WantedBy=multi-user.target

📜 License

MIT License © 2025 Matthias Petermann


🧭 Summary

Pulsar Sentinel is a minimalistic cluster watchdog designed to prevent data corruption and split-brain scenarios in Pulsar deployments. It's simple, dependency-free, and plays nicely with systemd.