Documentation ¶
Overview ¶
Package daemonkit provides a reusable kernel for long-running daemons: a supervised-monitor restart loop, a Unix-socket HTTP control plane, and sd_notify integration. It depends only on the standard library, errorx, and golang.org/x/sync/errgroup, and intentionally imports nothing under internal/... or cmd/... so it can be shared across daemons (e.g. the solo-provisioner daemon and a future solo-operator daemon).
Index ¶
- func NotifyReady() error
- func NotifyStopping() error
- func SupervisedMonitor(ctx context.Context, m MonitorRunner, opts SupervisorOptions)
- func Watchdog(ctx context.Context, opts WatchdogOptions)
- func WatchdogInterval() (time.Duration, bool)
- type ComponentHandler
- type ComponentProbe
- type CompositeProbe
- type ConnectivityMonitor
- type DiskOwnershipProbe
- type DiskPermissionProbe
- type DiskWriteTestProbe
- type MonitorRunner
- type MonitorState
- type ProbableMonitor
- type Probe
- type ProbeError
- type Server
- type ServerConfig
- type ServerOptions
- type StatusError
- type StatusTracker
- type SupervisorOptions
- type TaggedProbe
- type WatchdogOptions
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func NotifyReady ¶
func NotifyReady() error
NotifyReady sends READY=1 to systemd, signalling that the daemon has finished startup and its socket is serving. No-op when NOTIFY_SOCKET is unset.
func NotifyStopping ¶
func NotifyStopping() error
NotifyStopping sends STOPPING=1 to systemd, signalling that the daemon has begun a graceful shutdown. No-op when NOTIFY_SOCKET is unset.
func SupervisedMonitor ¶
func SupervisedMonitor(ctx context.Context, m MonitorRunner, opts SupervisorOptions)
SupervisedMonitor runs m in a restart loop. When m.Run returns a non-nil error the supervisor waits for a back-off delay and then restarts it. Clean shutdown (nil return or ctx cancellation) exits the loop immediately without restarting.
Back-off strategy:
- Starts at supervisedBackoffInitial (5 s).
- Doubles on each crash up to supervisedBackoffCap (5 min).
- Resets to supervisedBackoffInitial when the monitor runs stably for at least supervisedStableThreshold (60 s) before the next crash.
Degradation alerting:
- Tracks consecutive crashes (resets after a stable run).
- Emits a MonitorDegraded Error log every supervisedDegradedThreshold consecutive crashes (at crash #5, #10, #15, …) so operators keep seeing the alert for as long as the monitor remains degraded.
Heartbeats:
- When opts.HeartbeatInterval > 0, a MonitorHeartbeat Info record is emitted on that interval while the monitor is running, so remote observability can alert on the absence of heartbeats.
This function has no return value: it absorbs crashes and keeps restarting the monitor until ctx is cancelled or Run returns nil.
See SupervisorOptions for the meaning of each field; the zero value is valid.
func Watchdog ¶
func Watchdog(ctx context.Context, opts WatchdogOptions)
Watchdog runs the systemd watchdog keepalive loop until ctx is cancelled.
It is a no-op (returns immediately) when the watchdog is not enabled for this process — see WatchdogInterval — so it is always safe to call unconditionally; enable it by setting WatchdogSec= in the unit file. When enabled it sends WATCHDOG=1 to NOTIFY_SOCKET every interval/2 (the conventional safety margin), optionally gated by opts.IsAlive.
This is opt-in: nothing else in the kit calls it. Invoke it (typically in its own goroutine) only if you want systemd to kill and restart the daemon when it stops pinging. Pair it with Restart=on-failure in the unit so that a watchdog timeout actually triggers a restart.
func WatchdogInterval ¶
func WatchdogInterval() (time.Duration, bool)
WatchdogInterval reports the systemd watchdog interval for this process and whether the watchdog is enabled, mirroring sd_watchdog_enabled(3). It reads WATCHDOG_USEC and, when WATCHDOG_PID is set, honours it so that a value inherited by a child process is ignored. When the watchdog is not enabled for this process it returns (0, false), so callers can branch without parsing the environment themselves.
Types ¶
type ComponentHandler ¶
ComponentHandler is implemented by each component to register its own HTTP route sub-tree on the daemon control plane.
Convention: all routes registered by a handler must be prefixed with /<component_name>/ (e.g. /consensus_node/..., /block_node/...) to keep the API namespace partitioned. Process-level routes (/health, /status) are registered by the Server itself and must not be claimed by any ComponentHandler.
type ComponentProbe ¶
type ComponentProbe interface {
Probe(ctx context.Context) error
// ComponentName returns the component identifier used in structured log
// entries (e.g. "consensus-node").
ComponentName() string
}
ComponentProbe is the component-boundary interface seen by the supervisor. A component with no external dependencies sets its probe field to nil and is treated as immediately ready by the composite probe runner.
func BuildComponentProbe ¶
func BuildComponentProbe(componentName string, monitors []MonitorRunner) ComponentProbe
BuildComponentProbe collects RequiredProbe() from every ProbableMonitor in monitors and wraps them in a CompositeProbe named componentName. Returns nil when no monitor declares a prerequisite (host-only component); the supervisor treats a nil probe as immediately ready.
type CompositeProbe ¶
type CompositeProbe struct {
// contains filtered or unexported fields
}
CompositeProbe implements ComponentProbe at the component boundary. It fans out to a set of leaf Probe instances concurrently and returns nil only when every sub-probe passes. The first failure cancels sibling probes via errgroup context cancellation so the composite exits as fast as possible.
Sub-probes may themselves be CompositeProbe instances — since CompositeProbe satisfies the Probe interface, probes can be nested to arbitrary depth.
Use NewCompositeProbe to construct.
func NewCompositeProbe ¶
func NewCompositeProbe(componentName string, leafProbes ...Probe) *CompositeProbe
NewCompositeProbe returns a CompositeProbe that runs all provided leaf probes concurrently under the given component name.
func (*CompositeProbe) ComponentName ¶
func (c *CompositeProbe) ComponentName() string
ComponentName implements ComponentProbe.
type ConnectivityMonitor ¶
type ConnectivityMonitor interface {
MonitorRunner
// ConnectivityError returns the current connectivity failure, or nil when
// the monitor's last operation completed successfully. Recovery (a
// successful list + watch cycle) must clear the error within one cycle.
//
// Concurrency: implementations MUST make this safe for concurrent read.
// The daemon's HTTP server goroutine calls ConnectivityError (via
// statusSnapshot) while the monitor's own Run goroutine is writing the
// underlying field. Guard the field with an atomic (e.g.
// atomic.Pointer[StatusError]) or a mutex; a plain field read/written
// from both goroutines is a data race.
ConnectivityError() *StatusError
}
ConnectivityMonitor is optionally implemented by monitors that maintain an in-process record of their last connectivity error (e.g. a K8s watch failure). statusSnapshot overlays ConnectivityError onto the tracker state so failures are visible via /status even while the goroutine is alive and retrying inside Run() — a goroutine in a retry loop is "running" by the supervisor's definition, but operators need to see the connectivity problem.
type DiskOwnershipProbe ¶
type DiskOwnershipProbe struct {
// Path is the file or directory to inspect.
Path string
// User is the expected owner username (e.g. "hedera"). Empty = skip.
User string
// Group is the expected owning group name (e.g. "hedera", "weaver"). Empty = skip.
Group string
// Permission is the set of mode bits that must all be present (e.g. 0o755).
// Zero = skip.
Permission os.FileMode
}
DiskOwnershipProbe verifies that Path exists and matches the declared owner, group, and/or permission bits. Any field left at its zero value is skipped:
- User == "" → owner username not checked
- Group == "" → owning group not checked
- Permission == 0 → permission bits not checked
Example — ensure /opt/hgcapp is owned by hedera:hedera with rwxr-xr-x:
&DiskOwnershipProbe{
Path: "/opt/hgcapp",
User: "hedera",
Group: "hedera",
Permission: 0o755,
}
Note: ownership is read from the inode via syscall.Stat_t. This probe does not check whether the current process has access — use DiskWriteTestProbe for that.
type DiskPermissionProbe ¶
type DiskPermissionProbe struct {
// Path is the file or directory to inspect.
Path string
// Permission is the set of mode bits that must all be present.
// Examples: 0o400 (owner-read), 0o600 (owner read+write), 0o700 (owner rwx).
Permission os.FileMode
}
DiskPermissionProbe verifies that Path exists and has at least the declared permission bits set on the inode. It checks the file mode returned by os.Stat — i.e. declared permissions, not actual process-level access.
Use DiskWriteTestProbe when you need to confirm that the running process can actually write to a directory; it accounts for everything that affects real access (ownership, ACLs, mount flags, and so on), not just the mode bits.
type DiskWriteTestProbe ¶
type DiskWriteTestProbe struct {
// Dir is the directory to test write access in.
Dir string
}
DiskWriteTestProbe verifies that the running process can actually write to Dir by creating and immediately removing a temporary file. Unlike DiskPermissionProbe it exercises real process-level access — ownership, ACLs, mount flags, and SELinux/AppArmor policies are all tested implicitly.
Use this when the daemon must write to a directory at runtime and you want a startup guarantee that the write will succeed (e.g. the upgrade staging dir).
type MonitorRunner ¶
type MonitorRunner interface {
// Run starts the monitor and blocks until ctx is cancelled or the monitor
// encounters an unrecoverable error. A nil return means clean shutdown; a
// non-nil return triggers a supervised restart with back-off.
Run(ctx context.Context) error
// Name returns a stable, human-readable identifier for the monitor used
// in structured log entries (e.g. "upgrade-monitor", "migration-monitor").
Name() string
}
MonitorRunner is the interface that each long-running monitor goroutine must implement so it can be managed by SupervisedMonitor.
Implementations must:
- Return nil when ctx is cancelled (clean shutdown, no restart).
- Return a non-nil error only on unexpected failure (triggers supervised restart).
- Be safe to call again after returning an error (the supervisor calls Run again).
type MonitorState ¶
type MonitorState struct {
State string `json:"state"`
Error *StatusError `json:"error,omitempty"`
}
MonitorState describes the runtime state of a single supervised monitor. State values:
- "running" — monitor is executing normally
- "degraded" — monitor is running but its last operation failed; see Error for details; the monitor continues retrying automatically
- "backoff:<dur>" — monitor crashed (Run returned non-nil) and is waiting before restart
- "stopped" — monitor exited cleanly (ctx cancelled or nil return)
type ProbableMonitor ¶
type ProbableMonitor interface {
MonitorRunner
RequiredProbe() Probe
}
ProbableMonitor is optionally implemented by monitors that require external resources to be verified before they run. RequiredProbe returns a single Probe representing everything the monitor needs.
The component automatically collects RequiredProbe() from every enabled ProbableMonitor and combines them into its CompositeProbe via BuildComponentProbe.
type Probe ¶
Probe is the minimal leaf interface for a single prerequisite check. Concrete implementations (e.g. a disk-permission or RBAC probe) satisfy this interface. Probe should block and retry internally until success or ctx cancellation; returning ctx.Err() on cancellation is the expected exit path.
type ProbeError ¶
type ProbeError struct {
// Reason is a stable, machine-readable key (e.g. "UpgradeDirOwnershipCheckFailed").
Reason string
// Resolution is an actionable command or instruction the operator should run.
Resolution string
// Message is the human-readable error string. When empty, Error() falls back
// to the wrapped error's message.
Message string
// Err is the underlying error that triggered this failure, if any.
Err error
}
ProbeError is a kit-native error carrying an operator-facing Reason code and Resolution hint as plain struct fields. It mirrors StatusError so that the daemon boundary can build a rich StatusError without reaching for an errorx property registry — keeping daemonkit free of any consumer-model coupling.
Callers that want doctor-layer styling re-wrap ProbeError into errorx with their own property keys at the consumer boundary; the kit itself stays dependency-light.
func (*ProbeError) Error ¶
func (e *ProbeError) Error() string
Error implements error. It prefers Message, falling back to the wrapped error.
func (*ProbeError) Unwrap ¶
func (e *ProbeError) Unwrap() error
Unwrap exposes the underlying error for errors.Is / errors.As traversal.
type Server ¶
type Server struct {
// contains filtered or unexported fields
}
Server is the Unix socket HTTP control plane for a daemon.
func NewServer ¶
func NewServer(sockPath string, opts ServerOptions, cfg ServerConfig) *Server
NewServer constructs a Server and registers all routes.
Process-level routes (/health, /status) are always registered. Component routes are registered by calling RegisterRoutes on each entry in opts.ComponentHandlers.
Route scheme: /<component>/<monitor>/<sub-resource>/<verb>
type ServerConfig ¶
type ServerConfig struct {
// ReadHeaderTimeout is the maximum time to read request headers.
// Defaults to 5 s if zero. Set to a shorter value in tests.
ReadHeaderTimeout time.Duration
}
ServerConfig holds tunable parameters for Server. Zero values use defaults.
type ServerOptions ¶
type ServerOptions struct {
// StatusFn returns the full daemon status for GET /status. The returned
// value is serialised to JSON verbatim, so the concrete status payload type
// stays in the consuming daemon. When nil, GET /status is still served but
// returns an empty JSON object.
StatusFn func() any
// ComponentHandlers registers per-component route sub-trees.
// Each entry owns its own /<component>/ prefix.
ComponentHandlers []ComponentHandler
// Logger is the structured logger the server logs through. When nil the
// server logs to a no-op discard logger, so it stays silent until a logger
// is injected (it never writes to the global slog default implicitly).
//
// To route output through zerolog + lumberjack, pass a logx-backed logger:
//
// opts.Logger = slog.New(logx.NewSlogHandler()) // github.com/automa-saga/logx
Logger *slog.Logger
}
ServerOptions groups all injectable dependencies for NewServer.
type StatusError ¶
type StatusError struct {
// Reason is a stable, machine-readable key matching the log reason field
// (e.g. "UpgradeMonitorListError", "UpgradeDirOwnershipCheckFailed").
Reason string `json:"reason"`
// Message is the human-readable error string.
Message string `json:"message"`
// Resolution is an actionable command or instruction the operator should
// run to resolve the issue. Empty when no specific remediation is known.
Resolution string `json:"resolution,omitempty"`
// Since is the RFC 3339 timestamp of when this error was first observed.
Since string `json:"since"`
}
StatusError is a rich, operator-facing error descriptor used in /status for both monitor connectivity failures and component probe (disk prerequisite) failures. Together, the populated fields are meant to give the operator enough context to act without opening journalctl.
type StatusTracker ¶
type StatusTracker struct {
// contains filtered or unexported fields
}
StatusTracker holds the latest observed state for a set of monitors. It is safe for concurrent use; SupervisedMonitor updates it on each state transition.
func NewStatusTracker ¶
func NewStatusTracker() *StatusTracker
NewStatusTracker returns an empty StatusTracker.
func (*StatusTracker) Snapshot ¶
func (t *StatusTracker) Snapshot() map[string]MonitorState
Snapshot returns a copy of all monitor states at the time of the call.
type SupervisorOptions ¶
type SupervisorOptions struct {
// Tracker, when non-nil, is updated on every monitor state transition so the
// /status endpoint can report per-monitor state without polling. May be nil.
Tracker *StatusTracker
// Logger is the structured logger the supervisor logs through (crash,
// back-off, degradation, heartbeat, clean exit). When nil the supervisor
// logs to a no-op discard logger and stays silent — it never writes to the
// global slog default implicitly. Inject a logger
// (e.g. slog.New(logx.NewSlogHandler())) to route diagnostics to your
// logging backend.
Logger *slog.Logger
// HeartbeatInterval, when greater than zero, makes the supervisor emit a
// periodic MonitorHeartbeat Info record (with the monitor name and its
// current uptime) while the monitor is in the running state. This lets a
// remote observability backend detect an alive-but-wedged monitor — one
// blocked inside Run, never crashing and never logging — by the ABSENCE of
// heartbeats. Zero (the default) disables heartbeats entirely.
HeartbeatInterval time.Duration
}
SupervisorOptions groups the optional dependencies and tunables for SupervisedMonitor. The zero value is valid: no status tracking, a silent (discard) logger, and no heartbeat.
type TaggedProbe ¶
TaggedProbe wraps a leaf Probe and attaches an operator-facing Reason code and Resolution hint to any error it returns. Use it inside RequiredProbe() implementations so that every prerequisite failure carries context-specific guidance for the operator.
On failure it returns a *ProbeError carrying Reason and Resolution as plain struct fields (no errorx property registry). The daemon boundary reads those fields directly to build a StatusError.
Example:
&daemonkit.TaggedProbe{
Inner: &daemonkit.DiskOwnershipProbe{Path: upgradeRoot, ...},
Reason: "UpgradeRootOwnershipCheckFailed",
Resolution: "sudo chown hedera:hedera " + upgradeRoot,
}
type WatchdogOptions ¶
type WatchdogOptions struct {
// Logger, when non-nil, logs watchdog lifecycle and ping failures. When nil
// the loop is silent (discard logger), consistent with the rest of the kit.
Logger *slog.Logger
// IsAlive, when non-nil, gates each keepalive: WATCHDOG=1 is sent only when
// IsAlive() returns true. When it returns false the ping is withheld, so
// systemd's WatchdogSec timer eventually fires and restarts the PROCESS.
//
// Use this ONLY when a process restart is the correct response to the
// monitored condition — most cleanly a single-monitor daemon where the
// process IS the monitor, so a restart has no healthy-monitor collateral.
// Leave it nil for an unconditional keepalive that guards only against a
// total process freeze. A multi-monitor daemon should generally leave this
// nil: withholding pings bounces the whole process and resets every healthy
// monitor too, which is rarely what you want.
IsAlive func() bool
}
WatchdogOptions configures the optional systemd watchdog keepalive loop.