daemonkit

package module
v0.1.0 (Latest)
Published: Jun 19, 2026 | License: Apache-2.0 | Imports: 17 | Imported by: 0

README

daemonkit

Small, dependency-light building blocks for long-running Go daemons: a supervised monitor restart loop, a Unix-socket control-plane HTTP server with health/status routes, a composite probe framework, systemd sd_notify helpers, an append-only JSONL event logger, and a retention-based file pruner.

The module deliberately keeps a minimal production dependency surface.

Packages

Import                                        Purpose
github.com/automa-saga/daemonkit              Control-plane HTTP server, supervised monitor, probe framework, sd_notify
github.com/automa-saga/daemonkit/eventlog     Append-only JSONL structured event logger
github.com/automa-saga/daemonkit/filepruner   Retention-based file pruning

The three packages are mutually independent — import only what you need.

Install

go get github.com/automa-saga/daemonkit@latest

Requires Go 1.26 or newer.

Quick start

See the User Guide for probes, component routes, the event logger, and the file pruner.

Documentation

  • Architecture — design, the daemon kernel model, and concurrency contracts.
  • User Guide — runnable examples for every package.

License

Apache-2.0. See LICENSE.

Documentation

Overview

Package daemonkit provides a reusable kernel for long-running daemons: a supervised-monitor restart loop, a Unix-socket HTTP control plane, and sd_notify integration. It depends only on the standard library, errorx, and golang.org/x/sync/errgroup, and intentionally imports nothing under internal/... or cmd/... so it can be shared across daemons (e.g. the solo-provisioner daemon and a future solo-operator daemon).

Functions

func NotifyReady

func NotifyReady() error

NotifyReady sends READY=1 to systemd, signalling that the daemon has finished startup and its socket is serving. No-op when NOTIFY_SOCKET is unset.

func NotifyStopping

func NotifyStopping() error

NotifyStopping sends STOPPING=1 to systemd, signalling that the daemon has begun a graceful shutdown. No-op when NOTIFY_SOCKET is unset.
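
For readers unfamiliar with the sd_notify wire protocol, the following is a minimal standard-library sketch of what such helpers typically do: write a single datagram such as READY=1 or STOPPING=1 to the Unix datagram socket named by NOTIFY_SOCKET. The names sdNotifyTo and sdNotify are hypothetical and are not part of daemonkit; the package's own implementation may differ.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// sdNotifyTo sends state (e.g. "READY=1") as one datagram to the Unix
// datagram socket at path. An empty path is treated as "not supervised"
// and is a no-op. Hypothetical helper, not daemonkit's implementation.
func sdNotifyTo(path, state string) error {
	if path == "" {
		return nil
	}
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: path, Net: "unixgram"})
	if err != nil {
		return fmt.Errorf("dial notify socket: %w", err)
	}
	defer conn.Close()
	_, err = conn.Write([]byte(state))
	return err
}

// sdNotify reads the socket path from NOTIFY_SOCKET, as set by systemd.
func sdNotify(state string) error {
	return sdNotifyTo(os.Getenv("NOTIFY_SOCKET"), state)
}

func main() {
	// Stand-in for systemd: a datagram socket in a temporary directory.
	dir, err := os.MkdirTemp("", "sdn")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	sock := dir + "/n.sock"
	ln, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	os.Setenv("NOTIFY_SOCKET", sock)
	if err := sdNotify("READY=1"); err != nil {
		panic(err)
	}
	buf := make([]byte, 64)
	n, _, err := ln.ReadFromUnix(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("received:", string(buf[:n]))
}
```

In a real daemon the listening side is systemd itself; the temporary socket here only makes the sketch runnable outside a unit.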

func SupervisedMonitor

func SupervisedMonitor(ctx context.Context, m MonitorRunner, opts SupervisorOptions)

SupervisedMonitor runs m in a restart loop. When m.Run returns a non-nil error the supervisor waits for a back-off delay and then restarts it. Clean shutdown (nil return or ctx cancellation) exits the loop immediately without restarting.

Back-off strategy:

  • Starts at supervisedBackoffInitial (5 s).
  • Doubles on each crash up to supervisedBackoffCap (5 min).
  • Resets to supervisedBackoffInitial when the monitor runs stably for at least supervisedStableThreshold (60 s) before the next crash.

Degradation alerting:

  • Tracks consecutive crashes (resets after a stable run).
  • Emits a MonitorDegraded Error log every supervisedDegradedThreshold consecutive crashes (at crash #5, #10, #15, …) so ops keeps seeing the alert as long as the monitor remains degraded.

Heartbeats:

  • When opts.HeartbeatInterval > 0, a MonitorHeartbeat Info record is emitted on that interval while the monitor is running, so remote observability can alert on the absence of heartbeats.

This function never returns an error — it absorbs crashes and restarts the monitor indefinitely until ctx is cancelled.

See SupervisorOptions for the meaning of each field; the zero value is valid.

func Watchdog

func Watchdog(ctx context.Context, opts WatchdogOptions)

Watchdog runs the systemd watchdog keepalive loop until ctx is cancelled.

It is a no-op (returns immediately) when the watchdog is not enabled for this process — see WatchdogInterval — so it is always safe to call unconditionally; enable it by setting WatchdogSec= in the unit file. When enabled it sends WATCHDOG=1 to NOTIFY_SOCKET every interval/2 (the conventional safety margin), optionally gated by opts.IsAlive.

This is opt-in: nothing else in the kit calls it. Invoke it (typically in its own goroutine) only if you want systemd to kill+restart the daemon when it stops pinging. Pair it with Restart=on-failure in the unit.

func WatchdogInterval

func WatchdogInterval() (time.Duration, bool)

WatchdogInterval reports the systemd watchdog interval for this process and whether the watchdog is enabled, mirroring sd_watchdog_enabled(3). It reads WATCHDOG_USEC and, when WATCHDOG_PID is set, honours it so a value inherited by a child process is ignored. When the watchdog is not enabled for this process it returns (0, false), so callers can branch without env parsing.

Types

type ComponentHandler

type ComponentHandler interface {
	RegisterRoutes(mux *http.ServeMux)
}

ComponentHandler is implemented by each component to register its own HTTP route sub-tree on the daemon control plane.

Convention: all routes registered by a handler must be prefixed with /<component_name>/ (e.g. /consensus_node/..., /block_node/...) to keep the API namespace partitioned. Process-level routes (/health, /status) are registered by the Server itself and must not be claimed by any ComponentHandler.

type ComponentProbe

type ComponentProbe interface {
	Probe(ctx context.Context) error

	// ComponentName returns the component identifier used in structured log
	// entries (e.g. "consensus-node").
	ComponentName() string
}

ComponentProbe is the component-boundary interface seen by the supervisor. A component with no external dependencies sets its probe field to nil and is treated as immediately ready by the composite probe runner.

func BuildComponentProbe

func BuildComponentProbe(componentName string, monitors []MonitorRunner) ComponentProbe

BuildComponentProbe collects RequiredProbe() from every ProbableMonitor in monitors and wraps them in a CompositeProbe named componentName. Returns nil when no monitor declares a prerequisite (host-only component); the supervisor treats a nil probe as immediately ready.

type CompositeProbe

type CompositeProbe struct {
	// contains filtered or unexported fields
}

CompositeProbe implements ComponentProbe at the component boundary. It fans out to a set of leaf Probe instances concurrently and returns nil only when every sub-probe passes. The first failure cancels sibling probes via errgroup context cancellation so the composite exits as fast as possible.

Sub-probes may themselves be CompositeProbe instances — since CompositeProbe satisfies the Probe interface, probes can be nested to arbitrary depth.

Use NewCompositeProbe to construct.

func NewCompositeProbe

func NewCompositeProbe(componentName string, leafProbes ...Probe) *CompositeProbe

NewCompositeProbe returns a CompositeProbe that runs all provided leaf probes concurrently under the given component name.

func (*CompositeProbe) ComponentName

func (c *CompositeProbe) ComponentName() string

ComponentName implements ComponentProbe.

func (*CompositeProbe) Probe

func (c *CompositeProbe) Probe(ctx context.Context) error

Probe implements ComponentProbe (and the Probe interface). It fans out to all sub-probes concurrently; the first failure cancels the rest via the errgroup context.

type ConnectivityMonitor

type ConnectivityMonitor interface {
	MonitorRunner
	// ConnectivityError returns the current connectivity failure, or nil when
	// the monitor's last operation completed successfully. Recovery (a
	// successful list + watch cycle) must clear the error within one cycle.
	//
	// Concurrency: implementations MUST make this safe for concurrent read.
	// The daemon's HTTP server goroutine calls ConnectivityError (via
	// statusSnapshot) while the monitor's own Run goroutine is writing the
	// underlying field. Guard the field with an atomic (e.g.
	// atomic.Pointer[StatusError]) or a mutex; a plain field read/written
	// from both goroutines is a data race.
	ConnectivityError() *StatusError
}

ConnectivityMonitor is optionally implemented by monitors that maintain an in-process record of their last connectivity error (e.g. a K8s watch failure). statusSnapshot overlays ConnectivityError onto the tracker state so failures are visible via /status even while the goroutine is alive and retrying inside Run() — a goroutine in a retry loop is "running" by the supervisor's definition, but operators need to see the connectivity problem.

type DiskOwnershipProbe

type DiskOwnershipProbe struct {
	// Path is the file or directory to inspect.
	Path string

	// User is the expected owner username (e.g. "hedera"). Empty = skip.
	User string

	// Group is the expected owning group name (e.g. "hedera", "weaver"). Empty = skip.
	Group string

	// Permission is the set of mode bits that must all be present (e.g. 0o755).
	// Zero = skip.
	Permission os.FileMode
}

DiskOwnershipProbe verifies that Path exists and matches the declared owner, group, and/or permission bits. Any field left at its zero value is skipped:

  • User == "" → owner username not checked
  • Group == "" → owning group not checked
  • Permission == 0 → permission bits not checked

Example — ensure /opt/hgcapp is owned by hedera:hedera with rwxr-xr-x:

&DiskOwnershipProbe{
    Path:       "/opt/hgcapp",
    User:       "hedera",
    Group:      "hedera",
    Permission: 0o755,
}

Note: ownership is read from the inode via syscall.Stat_t. This probe does not check whether the current process has access — use DiskWriteTestProbe for that.

func (*DiskOwnershipProbe) Probe

Probe implements Probe.

type DiskPermissionProbe

type DiskPermissionProbe struct {
	// Path is the file or directory to inspect.
	Path string

	// Permission is the set of mode bits that must all be present.
	// Examples: 0o400 (owner-read), 0o600 (owner read+write), 0o700 (owner rwx).
	Permission os.FileMode
}

DiskPermissionProbe verifies that Path exists and has at least the declared permission bits set on the inode. It checks the file mode returned by os.Stat — i.e. declared permissions, not actual process-level access.

Use DiskWriteTestProbe when you need to confirm the running process can actually write to a directory; it accounts for everything that affects effective access (ownership, ACLs, etc.).

func (*DiskPermissionProbe) Probe

Probe implements Probe. Returns nil when Path exists and its permission bits include all bits in Permission. Returns an error immediately on any failure — callers supply their own retry loop if needed.

type DiskWriteTestProbe

type DiskWriteTestProbe struct {
	// Dir is the directory to test write access in.
	Dir string
}

DiskWriteTestProbe verifies that the running process can actually write to Dir by creating and immediately removing a temporary file. Unlike DiskPermissionProbe it exercises real process-level access — ownership, ACLs, mount flags, and SELinux/AppArmor policies are all tested implicitly.

Use this when the daemon must write to a directory at runtime and you want a startup guarantee that the write will succeed (e.g. the upgrade staging dir).

func (*DiskWriteTestProbe) Probe

Probe implements Probe. Creates a temporary file in Dir and removes it immediately. Returns nil on success, an error if the write fails for any reason.

type MonitorRunner

type MonitorRunner interface {
	// Run starts the monitor and blocks until ctx is cancelled or the monitor
	// encounters an unrecoverable error. A nil return means clean shutdown; a
	// non-nil return triggers a supervised restart with back-off.
	Run(ctx context.Context) error

	// Name returns a stable, human-readable identifier for the monitor used
	// in structured log entries (e.g. "upgrade-monitor", "migration-monitor").
	Name() string
}

MonitorRunner is the interface that each long-running monitor goroutine must implement so it can be managed by SupervisedMonitor.

Implementations must:

  • Return nil when ctx is cancelled (clean shutdown, no restart).
  • Return a non-nil error only on unexpected failure (triggers supervised restart).
  • Be safe to call again after returning an error (the supervisor calls Run again).

type MonitorState

type MonitorState struct {
	State string       `json:"state"`
	Error *StatusError `json:"error,omitempty"`
}

MonitorState describes the runtime state of a single supervised monitor. State values:

  • "running" — monitor is executing normally
  • "degraded" — monitor is running but its last operation failed; see Error for details; the monitor continues retrying automatically
  • "backoff:<dur>" — monitor crashed (Run returned non-nil) and is waiting before restart
  • "stopped" — monitor exited cleanly (ctx cancelled or nil return)

type ProbableMonitor

type ProbableMonitor interface {
	MonitorRunner
	RequiredProbe() Probe
}

ProbableMonitor is optionally implemented by monitors that require external resources to be verified before they run. RequiredProbe returns a single Probe representing everything the monitor needs.

The component automatically collects RequiredProbe() from every enabled ProbableMonitor and combines them into its CompositeProbe via BuildComponentProbe.

type Probe

type Probe interface {
	Probe(ctx context.Context) error
}

Probe is the minimal leaf interface for a single prerequisite check. Concrete implementations (e.g. a disk-permission or RBAC probe) satisfy this interface. Probe should block and retry internally until success or ctx cancellation; returning ctx.Err() on cancellation is the expected exit path.

type ProbeError

type ProbeError struct {
	// Reason is a stable, machine-readable key (e.g. "UpgradeDirOwnershipCheckFailed").
	Reason string

	// Resolution is an actionable command or instruction the operator should run.
	Resolution string

	// Message is the human-readable error string. When empty, Error() falls back
	// to the wrapped error's message.
	Message string

	// Err is the underlying error that triggered this failure, if any.
	Err error
}

ProbeError is a kit-native error carrying an operator-facing Reason code and Resolution hint as plain struct fields. It mirrors StatusError so that the daemon boundary can build a rich StatusError without reaching for an errorx property registry — keeping daemonkit free of any consumer-model coupling.

Callers that want doctor-layer styling re-wrap ProbeError into errorx with their own property keys at the consumer boundary; the kit itself stays dependency-light.

func (*ProbeError) Error

func (e *ProbeError) Error() string

Error implements error. It prefers Message, falling back to the wrapped error.

func (*ProbeError) Unwrap

func (e *ProbeError) Unwrap() error

Unwrap exposes the underlying error for errors.Is / errors.As traversal.

type Server

type Server struct {
	// contains filtered or unexported fields
}

Server is the Unix socket HTTP control plane for a daemon.

func NewServer

func NewServer(sockPath string, opts ServerOptions, cfg ServerConfig) *Server

NewServer constructs a Server and registers all routes.

Process-level routes (/health, /status) are always registered. Component routes are registered by calling RegisterRoutes on each entry in opts.ComponentHandlers.

Route scheme: /<component>/<monitor>/<sub-resource>/<verb>

func (*Server) Start

func (s *Server) Start(ctx context.Context) error

Start removes any stale socket file, listens on the Unix socket, serves requests, and shuts down cleanly when ctx is cancelled.

type ServerConfig

type ServerConfig struct {
	// ReadHeaderTimeout is the maximum time to read request headers.
	// Defaults to 5 s if zero. Set to a shorter value in tests.
	ReadHeaderTimeout time.Duration
}

ServerConfig holds tunable parameters for Server. Zero values use defaults.

type ServerOptions

type ServerOptions struct {
	// StatusFn returns the full daemon status for GET /status. The returned
	// value is serialised to JSON verbatim, so the concrete status payload type
	// stays in the consuming daemon. Nil disables the endpoint (returns an
	// empty JSON object).
	StatusFn func() any

	// ComponentHandlers registers per-component route sub-trees.
	// Each entry owns its own /<component>/ prefix.
	ComponentHandlers []ComponentHandler

	// Logger is the structured logger the server logs through. When nil the
	// server logs to a no-op discard logger, so it stays silent until a logger
	// is injected (it never writes to the global slog default implicitly).
	//
	// To route output through zerolog + lumberjack, pass a logx-backed logger:
	//
	//	opts.Logger = slog.New(logx.NewSlogHandler()) // github.com/automa-saga/logx
	Logger *slog.Logger
}

ServerOptions groups all injectable dependencies for NewServer.

type StatusError

type StatusError struct {
	// Reason is a stable, machine-readable key matching the log reason field
	// (e.g. "UpgradeMonitorListError", "UpgradeDirOwnershipCheckFailed").
	Reason string `json:"reason"`

	// Message is the human-readable error string.
	Message string `json:"message"`

	// Resolution is an actionable command or instruction the operator should
	// run to resolve the issue. Empty when no specific remediation is known.
	Resolution string `json:"resolution,omitempty"`

	// Since is the RFC 3339 timestamp of when this error was first observed.
	Since string `json:"since"`
}

StatusError is a rich, operator-facing error descriptor used in /status for both monitor connectivity failures and component probe (disk prerequisite) failures. Every populated field gives the operator enough context to act without opening journalctl.

type StatusTracker

type StatusTracker struct {
	// contains filtered or unexported fields
}

StatusTracker holds the latest observed state for a set of monitors. It is safe for concurrent use; SupervisedMonitor updates it on each state transition.

func NewStatusTracker

func NewStatusTracker() *StatusTracker

NewStatusTracker returns an empty StatusTracker.

func (*StatusTracker) Snapshot

func (t *StatusTracker) Snapshot() map[string]MonitorState

Snapshot returns a copy of all monitor states at the time of the call.

type SupervisorOptions

type SupervisorOptions struct {
	// Tracker, when non-nil, is updated on every monitor state transition so the
	// /status endpoint can report per-monitor state without polling. May be nil.
	Tracker *StatusTracker

	// Logger is the structured logger the supervisor logs through (crash,
	// back-off, degradation, heartbeat, clean exit). When nil the supervisor
	// logs to a no-op discard logger and stays silent — it never writes to the
	// global slog default implicitly. Inject a logger
	// (e.g. slog.New(logx.NewSlogHandler())) to route diagnostics to your
	// logging backend.
	Logger *slog.Logger

	// HeartbeatInterval, when greater than zero, makes the supervisor emit a
	// periodic MonitorHeartbeat Info record (with the monitor name and its
	// current uptime) while the monitor is in the running state. This lets a
	// remote observability backend detect an alive-but-wedged monitor — one
	// blocked inside Run, never crashing and never logging — by the ABSENCE of
	// heartbeats. Zero (the default) disables heartbeats entirely.
	HeartbeatInterval time.Duration
}

SupervisorOptions groups the optional dependencies and tunables for SupervisedMonitor. The zero value is valid: no status tracking, a silent (discard) logger, and no heartbeat.

type TaggedProbe

type TaggedProbe struct {
	Inner      Probe
	Reason     string
	Resolution string
}

TaggedProbe wraps a leaf Probe and attaches an operator-facing Reason code and Resolution hint to any error it returns. Use it inside RequiredProbe() implementations so that every prerequisite failure carries context-specific guidance for the operator.

On failure it returns a *ProbeError carrying Reason and Resolution as plain struct fields (no errorx property registry). The daemon boundary reads those fields directly to build a StatusError.

Example:

&daemonkit.TaggedProbe{
    Inner:      &daemonkit.DiskOwnershipProbe{Path: upgradeRoot, ...},
    Reason:     "UpgradeRootOwnershipCheckFailed",
    Resolution: "sudo chown hedera:hedera " + upgradeRoot,
}

func (*TaggedProbe) Probe

func (p *TaggedProbe) Probe(ctx context.Context) error

Probe implements Probe. Delegates to Inner; on failure wraps the error in a *ProbeError carrying Reason and Resolution so that the daemon boundary can build a rich StatusError without errorx property extraction.

type WatchdogOptions

type WatchdogOptions struct {
	// Logger, when non-nil, logs watchdog lifecycle and ping failures. When nil
	// the loop is silent (discard logger), consistent with the rest of the kit.
	Logger *slog.Logger

	// IsAlive, when non-nil, gates each keepalive: WATCHDOG=1 is sent only when
	// IsAlive() returns true. When it returns false the ping is withheld, so
	// systemd's WatchdogSec timer eventually fires and restarts the PROCESS.
	//
	// Use this ONLY when a process restart is the correct response to the
	// monitored condition — most cleanly a single-monitor daemon where the
	// process IS the monitor, so a restart has no healthy-monitor collateral.
	// Leave it nil for an unconditional keepalive that guards only against a
	// total process freeze. A multi-monitor daemon should generally leave this
	// nil: withholding pings bounces the whole process and resets every healthy
	// monitor too, which is rarely what you want.
	IsAlive func() bool
}

WatchdogOptions configures the optional systemd watchdog keepalive loop.
