🛰️ Pulsar Sentinel

Pulsar Sentinel is a lightweight, self-contained cluster watchdog for Apache Pulsar brokers.
It ensures high availability and safe failover by exchanging UDP-based heartbeats, monitoring broker health, and applying fencing logic when quorum or priority conditions are violated.


🚀 Features

  • 🧭 Heartbeat-based cluster coordination via UDP broadcast or multicast
  • ❤️ Local broker health checks (HTTP endpoint returning ok)
  • 🧩 Priority group logic (e.g., active/passive or multi-tier clusters)
  • 🧠 Quorum evaluation to detect network partitions or node loss
  • 🪓 Automatic fencing with grace and cooldown periods
  • 🔒 Stateless, configuration-free runtime (no external dependencies)
  • 📡 Built-in HTTP health endpoint for monitoring (used by load balancers)

⚙️ How It Works

Each node runs a pulsar-sentinel process.
It periodically sends a heartbeat containing:

{
  "type": "heartbeat",
  "node": "broker-1-prio1",
  "ts": 1699999999,
  "data": {
    "priority": 1,
    "healthy": true
  }
}
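
For illustration, the sender side might look like this in Go (a minimal sketch; names like Heartbeat and broadcastHeartbeat are assumptions, not necessarily the code in main.go):

package main

import (
    "encoding/json"
    "fmt"
    "net"
    "time"
)

// Heartbeat mirrors the JSON message shown above.
type Heartbeat struct {
    Type string `json:"type"`
    Node string `json:"node"`
    TS   int64  `json:"ts"`
    Data struct {
        Priority int  `json:"priority"`
        Healthy  bool `json:"healthy"`
    } `json:"data"`
}

// broadcastHeartbeat marshals one heartbeat and sends it to the local
// broadcast address on the given UDP port. Go enables SO_BROADCAST on
// UDP sockets by default, so no extra socket options are required.
func broadcastHeartbeat(node string, priority int, healthy bool, port int) error {
    hb := Heartbeat{Type: "heartbeat", Node: node, TS: time.Now().Unix()}
    hb.Data.Priority = priority
    hb.Data.Healthy = healthy

    payload, err := json.Marshal(hb)
    if err != nil {
        return err
    }
    conn, err := net.Dial("udp", fmt.Sprintf("255.255.255.255:%d", port))
    if err != nil {
        return err
    }
    defer conn.Close()
    _, err = conn.Write(payload)
    return err
}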

All other nodes receive these via UDP.
Sentinel then:

  1. Tracks which peers are healthy (based on age and broker health).
  2. Counts healthy nodes per priority group.
  3. Decides whether this node should remain active or self-fence.
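
Steps 1 and 2 could be sketched like this (illustrative only; whether a node counts itself toward its own group is an assumption):

package main

import "time"

// Peer is the tracked state for one node (assumed shape).
type Peer struct {
    Priority int
    Healthy  bool
    LastSeen time.Time
}

// countHealthyPerGroup covers steps 1 and 2: a node counts as healthy
// when its broker reported healthy and its last heartbeat is younger
// than -timeout. The result corresponds to the "group counts" map in
// the sample log output, e.g. map[1:2 2:1].
func countHealthyPerGroup(peers map[string]Peer, timeout time.Duration) map[int]int {
    counts := make(map[int]int)
    now := time.Now()
    for _, p := range peers {
        if p.Healthy && now.Sub(p.LastSeen) <= timeout {
            counts[p.Priority]++
        }
    }
    return counts
}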

Fencing Rules

Condition                                         Action
Local priority group loses quorum                 Trigger fencing after grace period
Higher-priority group regains quorum              Trigger fencing immediately
Quorum stable, no higher-priority group active    Remain active
Condition clears before grace expires             Cancel fencing
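
As code, the table above roughly corresponds to a decision function like the following sketch (shouldFence and its signature are assumed, not taken from main.go):

package main

// shouldFence maps the table above onto code (assumed shape). ownPrio
// is this node's -priority, quorum is -quorum-size, and counts comes
// from the heartbeat tracking step.
func shouldFence(ownPrio, quorum int, counts map[int]int) (fence, immediate bool) {
    // A higher-priority group (lower number) holding quorum fences immediately.
    for prio, n := range counts {
        if prio < ownPrio && n >= quorum {
            return true, true
        }
    }
    // The local group losing quorum fences after the grace period.
    if counts[ownPrio] < quorum {
        return true, false
    }
    // Otherwise remain active; a pending grace timer would be cancelled.
    return false, false
}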

Fencing runs a configurable shell command (FENCING_COMMAND), e.g.:

export FENCING_COMMAND="ss --kill state established sport = :6651"

By default, it runs true (no-op).
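
This presumably amounts to handing the string to a shell; a sketch under that assumption (runFencing is a hypothetical name):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// runFencing hands FENCING_COMMAND to the shell, falling back to the
// documented no-op "true" when the variable is unset.
func runFencing() error {
    cmd := os.Getenv("FENCING_COMMAND")
    if cmd == "" {
        cmd = "true"
    }
    out, err := exec.Command("sh", "-c", cmd).CombinedOutput()
    if err != nil {
        return fmt.Errorf("fencing command failed: %w (output: %s)", err, out)
    }
    return nil
}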

💡 Info: In setups where an external load balancer distributes traffic across nodes of multiple priority groups but cannot actively drop connections on role change, the fencing action must ensure connection consistency. Since the load balancer stops routing new connections once the healthcheck fails, the fencing command's job is to terminate existing TCP sessions on the broker port (e.g., 6651 for Pulsar TLS). This can be done once or periodically using the Linux ss tool:

ss --kill state established sport = :6651

This guarantees that no stale client connections remain open toward a fenced (inactive) node.


🧩 HTTP Endpoints

/

Health endpoint for external systems.
Returns ok if:

  • The local broker healthcheck returns ok (HTTP 200 + body ok), and
  • The cluster quorum and priority rules allow this node to stay active.

Otherwise returns fail.

Example:

curl http://localhost:3000/
# ok
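
The handler behind / might look like this sketch (the 503 status for the fail case is an assumption; only the response body is documented):

package main

import "net/http"

// healthHandler combines the two conditions above: local broker health
// and the cluster-level decision. Returning 503 for "fail" is an
// assumption; the README only specifies the response body.
func healthHandler(brokerOK, clusterOK func() bool) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if brokerOK() && clusterOK() {
            w.Write([]byte("ok")) // implicit HTTP 200
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("fail"))
    }
}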

🧠 Command-line Flags

Flag                 Default                                          Description
-priority            (required)                                       Node priority (1 = highest)
-port                3001                                             UDP port for heartbeats
-http-port           3000                                             Port for Sentinel's HTTP health endpoint
-healthcheck         http://localhost:8080/admin/v2/brokers/health    Broker healthcheck URL
-heartbeat-interval  5s                                               Interval between heartbeats
-timeout             15s                                              Maximum heartbeat age before a node is considered dead
-grace-period        20s                                              Delay before fencing after a failure is detected
-fencing-cooldown    60s                                              Cooldown time between fencing executions
-quorum-size         2                                                Minimum number of healthy nodes required per group
-multicast           (empty)                                          Multicast group address (e.g., 239.0.0.1); defaults to broadcast
-debug               false                                            Enable detailed logging
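
The two receive modes behind -multicast can be sketched with the standard library alone (the actual socket setup in main.go may differ):

package main

import "net"

// listen picks the receive mode implied by -multicast: join the given
// multicast group, or fall back to plain UDP (broadcast) on all
// interfaces.
func listen(multicast string, port int) (*net.UDPConn, error) {
    if multicast != "" {
        group := &net.UDPAddr{IP: net.ParseIP(multicast), Port: port}
        return net.ListenMulticastUDP("udp", nil, group) // nil = system-chosen interface
    }
    return net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: port})
}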

🌍 Environment Variables

Variable          Purpose
HEALTHCHECK_URL   Overrides the -healthcheck flag
JWT_TOKEN         Adds a Bearer token to the broker healthcheck request
FENCING_COMMAND   Command executed when fencing is triggered
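
Putting HEALTHCHECK_URL and JWT_TOKEN together, the broker check might be implemented roughly like this (assumed shape; brokerHealthy is a hypothetical helper):

package main

import (
    "io"
    "net/http"
    "os"
    "strings"
)

// brokerHealthy performs the healthcheck described above: healthy only
// on HTTP 200 with body "ok". JWT_TOKEN, when set, is attached as a
// Bearer token.
func brokerHealthy(url string) bool {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return false
    }
    if tok := os.Getenv("JWT_TOKEN"); tok != "" {
        req.Header.Set("Authorization", "Bearer "+tok)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    return resp.StatusCode == http.StatusOK && strings.TrimSpace(string(body)) == "ok"
}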

🩺 Example Setup

Node A (priority 1)

./pulsar-sentinel -priority 1 -healthcheck http://localhost:8080/health

Node B (priority 2)

./pulsar-sentinel -priority 2 -healthcheck http://localhost:8080/health

Sample Log Output

INFO[0000] Listening for broadcast on 0.0.0.0:3001
INFO[0000] Starting HTTP health server on :3000
INFO[0010] Current group counts: map[1:2 2:1]
WARN[0010] Condition detected: Higher priority group 1 restored quorum (2/2)
ERROR[0030] Fencing triggered: Higher priority group 1 restored quorum (2/2)
INFO[0030] Executing fencing command: "ss --kill state established sport = :6651"

🧪 Testing the Local Healthcheck

To simulate a simple broker health endpoint locally, you can use the bundled test.py or this equivalent Python snippet:

python3 -c "
from http.server import BaseHTTPRequestHandler, HTTPServer
class H(BaseHTTPRequestHandler):
    def do_GET(self): self.send_response(200); self.end_headers(); self.wfile.write(b'ok')
HTTPServer(('', 8080), H).serve_forever()"

Then run Sentinel and query:

curl http://localhost:3000/
# ok

🛠️ Build & Run

Build

go build -o pulsar-sentinel main.go

Run

./pulsar-sentinel -priority 1 -healthcheck http://localhost:8080/health

Enable debug logs

./pulsar-sentinel -priority 1 -debug

🔒 Example Systemd Unit

[Unit]
Description=Pulsar Sentinel
After=network.target

[Service]
ExecStart=/usr/local/bin/pulsar-sentinel -priority 1
Restart=always
Environment="HEALTHCHECK_URL=http://localhost:8080/admin/v2/brokers/health"
Environment="FENCING_COMMAND=ss --kill state established sport = :6651"

[Install]
WantedBy=multi-user.target

📜 License

MIT License © 2025 Matthias Petermann


🧭 Summary

Pulsar Sentinel is a minimalistic cluster watchdog designed to prevent data corruption and split-brain scenarios in Pulsar deployments. It's simple, dependency-free, and plays nicely with systemd.