daemonkit

Version: v0.1.0 (latest) | Published: Jun 19, 2026 | License: Apache-2.0 | Imports: 17 | Imported by: 0

Repository

github.com/automa-saga/daemonkit

README ¶

daemonkit

Small, dependency-light building blocks for long-running Go daemons: a supervised monitor restart loop, a Unix-socket control-plane HTTP server with health/status routes, a composite probe framework, systemd sd_notify helpers, an append-only JSONL event logger, and a retention-based file pruner.

The module deliberately keeps a minimal production dependency surface:

github.com/joomcode/errorx — typed errors
golang.org/x/sync/errgroup — goroutine groups
the Go standard library (log/slog, net/http, encoding/json, ...)

Packages

Import	Purpose
`github.com/automa-saga/daemonkit`	Control-plane HTTP server, supervised monitor, probe framework, sd_notify
`github.com/automa-saga/daemonkit/eventlog`	Append-only JSONL structured event logger
`github.com/automa-saga/daemonkit/filepruner`	Retention-based file pruning

The three packages are mutually independent — import only what you need.

Install

go get github.com/automa-saga/daemonkit@latest

Requires Go 1.26 or newer.

Quick start

See the User Guide for probes, component routes, the event logger, and the file pruner.

Documentation

Architecture — design, the daemon kernel model, and concurrency contracts.
User Guide — runnable examples for every package.

License

Apache-2.0. See LICENSE.

Documentation ¶

Overview ¶

Package daemonkit provides a reusable kernel for long-running daemons: a supervised-monitor restart loop, a Unix-socket HTTP control plane, and sd_notify integration. It depends only on the standard library, errorx, and golang.org/x/sync/errgroup, and intentionally imports nothing under internal/... or cmd/... so it can be shared across daemons (e.g. the solo-provisioner daemon and a future solo-operator daemon).

Index ¶

func NotifyReady() error
func NotifyStopping() error
func SupervisedMonitor(ctx context.Context, m MonitorRunner, opts SupervisorOptions)
func Watchdog(ctx context.Context, opts WatchdogOptions)
func WatchdogInterval() (time.Duration, bool)
type ComponentHandler
type ComponentProbe
- func BuildComponentProbe(componentName string, monitors []MonitorRunner) ComponentProbe
type CompositeProbe
- func NewCompositeProbe(componentName string, leafProbes ...Probe) *CompositeProbe
- func (c *CompositeProbe) ComponentName() string
- func (c *CompositeProbe) Probe(ctx context.Context) error
type ConnectivityMonitor
type DiskOwnershipProbe
- func (p *DiskOwnershipProbe) Probe(_ context.Context) error
type DiskPermissionProbe
- func (p *DiskPermissionProbe) Probe(_ context.Context) error
type DiskWriteTestProbe
- func (p *DiskWriteTestProbe) Probe(_ context.Context) error
type MonitorRunner
type MonitorState
type ProbableMonitor
type Probe
type ProbeError
- func (e *ProbeError) Error() string
- func (e *ProbeError) Unwrap() error
type Server
- func NewServer(sockPath string, opts ServerOptions, cfg ServerConfig) *Server
- func (s *Server) Start(ctx context.Context) error
type ServerConfig
type ServerOptions
type StatusError
type StatusTracker
- func NewStatusTracker() *StatusTracker
- func (t *StatusTracker) Snapshot() map[string]MonitorState
type SupervisorOptions
type TaggedProbe
- func (p *TaggedProbe) Probe(ctx context.Context) error
type WatchdogOptions

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func NotifyReady ¶

func NotifyReady() error

NotifyReady sends READY=1 to systemd, signalling that the daemon has finished startup and its socket is serving. No-op when NOTIFY_SOCKET is unset.

func NotifyStopping ¶

func NotifyStopping() error

NotifyStopping sends STOPPING=1 to systemd, signalling that the daemon has begun a graceful shutdown. No-op when NOTIFY_SOCKET is unset.

func SupervisedMonitor ¶

func SupervisedMonitor(ctx context.Context, m MonitorRunner, opts SupervisorOptions)

SupervisedMonitor runs m in a restart loop. When m.Run returns a non-nil error the supervisor waits for a back-off delay and then restarts it. Clean shutdown (nil return or ctx cancellation) exits the loop immediately without restarting.

Back-off strategy:

Starts at supervisedBackoffInitial (5 s).
Doubles on each crash up to supervisedBackoffCap (5 min).
Resets to supervisedBackoffInitial when the monitor runs stably for at least supervisedStableThreshold (60 s) before the next crash.

Degradation alerting:

Tracks consecutive crashes (resets after a stable run).
Emits a MonitorDegraded Error log every supervisedDegradedThreshold consecutive crashes (at crash #5, #10, #15, …) so ops keeps seeing the alert as long as the monitor remains degraded.

Heartbeats:

When opts.HeartbeatInterval > 0, a MonitorHeartbeat Info record is emitted on that interval while the monitor is running, so remote observability can alert on the absence of heartbeats.

This function never returns an error: it absorbs crashes and keeps restarting the monitor until ctx is cancelled or m.Run returns nil.

See SupervisorOptions for the meaning of each field; the zero value is valid.

func Watchdog ¶

func Watchdog(ctx context.Context, opts WatchdogOptions)

Watchdog runs the systemd watchdog keepalive loop until ctx is cancelled.

It is a no-op (returns immediately) when the watchdog is not enabled for this process — see WatchdogInterval — so it is always safe to call unconditionally; enable it by setting WatchdogSec= in the unit file. When enabled it sends WATCHDOG=1 to NOTIFY_SOCKET every interval/2 (the conventional safety margin), optionally gated by opts.IsAlive.

This is opt-in: nothing else in the kit calls it. Invoke it (typically in its own goroutine) only if you want systemd to kill+restart the daemon when it stops pinging. Pair it with Restart=on-failure in the unit.

func WatchdogInterval ¶

func WatchdogInterval() (time.Duration, bool)

WatchdogInterval reports the systemd watchdog interval for this process and whether the watchdog is enabled, mirroring sd_watchdog_enabled(3). It reads WATCHDOG_USEC and, when WATCHDOG_PID is set, honours it so a value inherited by a child process is ignored. When the watchdog is not enabled for this process it returns (0, false), so callers can branch without env parsing.

Types ¶

type ComponentHandler ¶

type ComponentHandler interface {
	RegisterRoutes(mux *http.ServeMux)
}

ComponentHandler is implemented by each component to register its own HTTP route sub-tree on the daemon control plane.

Convention: all routes registered by a handler must be prefixed with /<component_name>/ (e.g. /consensus_node/..., /block_node/...) to keep the API namespace partitioned. Process-level routes (/health, /status) are registered by the Server itself and must not be claimed by any ComponentHandler.

type ComponentProbe ¶

type ComponentProbe interface {
	Probe(ctx context.Context) error

	// ComponentName returns the component identifier used in structured log
	// entries (e.g. "consensus-node").
	ComponentName() string
}

ComponentProbe is the component-boundary interface seen by the supervisor. A component with no external dependencies sets its probe field to nil and is treated as immediately ready by the composite probe runner.

func BuildComponentProbe ¶

func BuildComponentProbe(componentName string, monitors []MonitorRunner) ComponentProbe

BuildComponentProbe collects RequiredProbe() from every ProbableMonitor in monitors and wraps them in a CompositeProbe named componentName. Returns nil when no monitor declares a prerequisite (host-only component); the supervisor treats a nil probe as immediately ready.

type CompositeProbe ¶

type CompositeProbe struct {
	// contains filtered or unexported fields
}

CompositeProbe implements ComponentProbe at the component boundary. It fans out to a set of leaf Probe instances concurrently and returns nil only when every sub-probe passes. The first failure cancels sibling probes via errgroup context cancellation so the composite exits as fast as possible.

Sub-probes may themselves be CompositeProbe instances — since CompositeProbe satisfies the Probe interface, probes can be nested to arbitrary depth.

Use NewCompositeProbe to construct.

func NewCompositeProbe ¶

func NewCompositeProbe(componentName string, leafProbes ...Probe) *CompositeProbe

NewCompositeProbe returns a CompositeProbe that runs all provided leaf probes concurrently under the given component name.

func (*CompositeProbe) ComponentName ¶

func (c *CompositeProbe) ComponentName() string

ComponentName implements ComponentProbe.

func (*CompositeProbe) Probe ¶

func (c *CompositeProbe) Probe(ctx context.Context) error

Probe implements ComponentProbe (and the Probe interface). It fans out to all sub-probes concurrently; the first failure cancels the rest via the errgroup context.

type ConnectivityMonitor ¶

type ConnectivityMonitor interface {
	MonitorRunner
	// ConnectivityError returns the current connectivity failure, or nil when
	// the monitor's last operation completed successfully. Recovery (a
	// successful list + watch cycle) must clear the error within one cycle.
	//
	// Concurrency: implementations MUST make this safe for concurrent read.
	// The daemon's HTTP server goroutine calls ConnectivityError (via
	// statusSnapshot) while the monitor's own Run goroutine is writing the
	// underlying field. Guard the field with an atomic (e.g.
	// atomic.Pointer[StatusError]) or a mutex; a plain field read/written
	// from both goroutines is a data race.
	ConnectivityError() *StatusError
}

ConnectivityMonitor is optionally implemented by monitors that maintain an in-process record of their last connectivity error (e.g. a K8s watch failure). statusSnapshot overlays ConnectivityError onto the tracker state so failures are visible via /status even while the goroutine is alive and retrying inside Run() — a goroutine in a retry loop is "running" by the supervisor's definition, but operators need to see the connectivity problem.

type DiskOwnershipProbe ¶

type DiskOwnershipProbe struct {
	// Path is the file or directory to inspect.
	Path string

	// User is the expected owner username (e.g. "hedera"). Empty = skip.
	User string

	// Group is the expected owning group name (e.g. "hedera", "weaver"). Empty = skip.
	Group string

	// Permission is the set of mode bits that must all be present (e.g. 0o755).
	// Zero = skip.
	Permission os.FileMode
}

DiskOwnershipProbe verifies that Path exists and matches the declared owner, group, and/or permission bits. Any field left at its zero value is skipped:

User == "" → owner username not checked
Group == "" → owning group not checked
Permission == 0 → permission bits not checked

Example — ensure /opt/hgcapp is owned by hedera:hedera with rwxr-xr-x:

&DiskOwnershipProbe{
    Path:       "/opt/hgcapp",
    User:       "hedera",
    Group:      "hedera",
    Permission: 0o755,
}

Note: ownership is read from the inode via syscall.Stat_t. This probe does not check whether the current process has access — use DiskWriteTestProbe for that.

func (*DiskOwnershipProbe) Probe ¶

func (p *DiskOwnershipProbe) Probe(_ context.Context) error

Probe implements Probe.

type DiskPermissionProbe ¶

type DiskPermissionProbe struct {
	// Path is the file or directory to inspect.
	Path string

	// Permission is the set of mode bits that must all be present.
	// Examples: 0o400 (owner-read), 0o600 (owner read+write), 0o700 (owner rwx).
	Permission os.FileMode
}

DiskPermissionProbe verifies that Path exists and has at least the declared permission bits set on the inode. It checks the file mode returned by os.Stat — i.e. declared permissions, not actual process-level access.

Use DiskWriteTestProbe when you need to confirm that the running process can actually write to a directory; it reflects effective access, including ownership, ACLs, and similar factors.

func (*DiskPermissionProbe) Probe ¶

func (p *DiskPermissionProbe) Probe(_ context.Context) error

Probe implements Probe. Returns nil when Path exists and its permission bits include all bits in Permission. Returns an error immediately on any failure — callers supply their own retry loop if needed.

type DiskWriteTestProbe ¶

type DiskWriteTestProbe struct {
	// Dir is the directory to test write access in.
	Dir string
}

DiskWriteTestProbe verifies that the running process can actually write to Dir by creating and immediately removing a temporary file. Unlike DiskPermissionProbe it exercises real process-level access — ownership, ACLs, mount flags, and SELinux/AppArmor policies are all tested implicitly.

Use this when the daemon must write to a directory at runtime and you want a startup guarantee that the write will succeed (e.g. the upgrade staging dir).

func (*DiskWriteTestProbe) Probe ¶

func (p *DiskWriteTestProbe) Probe(_ context.Context) error

Probe implements Probe. Creates a temporary file in Dir and removes it immediately. Returns nil on success, an error if the write fails for any reason.

type MonitorRunner ¶

type MonitorRunner interface {
	// Run starts the monitor and blocks until ctx is cancelled or the monitor
	// encounters an unrecoverable error. A nil return means clean shutdown; a
	// non-nil return triggers a supervised restart with back-off.
	Run(ctx context.Context) error

	// Name returns a stable, human-readable identifier for the monitor used
	// in structured log entries (e.g. "upgrade-monitor", "migration-monitor").
	Name() string
}

MonitorRunner is the interface that each long-running monitor goroutine must implement so it can be managed by SupervisedMonitor.

Implementations must:

Return nil when ctx is cancelled (clean shutdown, no restart).
Return a non-nil error only on unexpected failure (triggers supervised restart).
Be safe to call again after returning an error (the supervisor calls Run again).

type MonitorState ¶

type MonitorState struct {
	State string       `json:"state"`
	Error *StatusError `json:"error,omitempty"`
}

MonitorState describes the runtime state of a single supervised monitor. State values:

"running" — monitor is executing normally
"degraded" — monitor is running but its last operation failed; see Error for details; the monitor continues retrying automatically
"backoff:<dur>" — monitor crashed (Run returned non-nil) and is waiting before restart
"stopped" — monitor exited cleanly (ctx cancelled or nil return)

type ProbableMonitor ¶

type ProbableMonitor interface {
	MonitorRunner
	RequiredProbe() Probe
}

ProbableMonitor is optionally implemented by monitors that require external resources to be verified before they run. RequiredProbe returns a single Probe representing everything the monitor needs.

The component automatically collects RequiredProbe() from every enabled ProbableMonitor and combines them into its CompositeProbe via BuildComponentProbe.

type Probe ¶

type Probe interface {
	Probe(ctx context.Context) error
}

Probe is the minimal leaf interface for a single prerequisite check. Concrete implementations (e.g. a disk-permission or RBAC probe) satisfy this interface. Probe should block and retry internally until success or ctx cancellation; returning ctx.Err() on cancellation is the expected exit path.

type ProbeError ¶

type ProbeError struct {
	// Reason is a stable, machine-readable key (e.g. "UpgradeDirOwnershipCheckFailed").
	Reason string

	// Resolution is an actionable command or instruction the operator should run.
	Resolution string

	// Message is the human-readable error string. When empty, Error() falls back
	// to the wrapped error's message.
	Message string

	// Err is the underlying error that triggered this failure, if any.
	Err error
}

ProbeError is a kit-native error carrying an operator-facing Reason code and Resolution hint as plain struct fields. It mirrors StatusError so that the daemon boundary can build a rich StatusError without reaching for an errorx property registry — keeping daemonkit free of any consumer-model coupling.

Callers that want doctor-layer styling re-wrap ProbeError into errorx with their own property keys at the consumer boundary; the kit itself stays dependency-light.

func (*ProbeError) Error ¶

func (e *ProbeError) Error() string

Error implements error. It prefers Message, falling back to the wrapped error.

func (*ProbeError) Unwrap ¶

func (e *ProbeError) Unwrap() error

Unwrap exposes the underlying error for errors.Is / errors.As traversal.

type Server ¶

type Server struct {
	// contains filtered or unexported fields
}

Server is the Unix socket HTTP control plane for a daemon.

func NewServer ¶

func NewServer(sockPath string, opts ServerOptions, cfg ServerConfig) *Server

NewServer constructs a Server and registers all routes.

Process-level routes (/health, /status) are always registered. Component routes are registered by calling RegisterRoutes on each entry in opts.ComponentHandlers.

Route scheme: /<component>/<monitor>/<sub-resource>/<verb>

func (*Server) Start ¶

func (s *Server) Start(ctx context.Context) error

Start removes any stale socket file, listens on the Unix socket, serves requests, and shuts down cleanly when ctx is cancelled.

type ServerConfig ¶

type ServerConfig struct {
	// ReadHeaderTimeout is the maximum time to read request headers.
	// Defaults to 5 s if zero. Set to a shorter value in tests.
	ReadHeaderTimeout time.Duration
}

ServerConfig holds tunable parameters for Server. Zero values use defaults.

type ServerOptions ¶

type ServerOptions struct {
	// StatusFn returns the full daemon status for GET /status. The returned
	// value is serialised to JSON verbatim, so the concrete status payload type
	// stays in the consuming daemon. Nil disables the endpoint (returns an
	// empty JSON object).
	StatusFn func() any

	// ComponentHandlers registers per-component route sub-trees.
	// Each entry owns its own /<component>/ prefix.
	ComponentHandlers []ComponentHandler

	// Logger is the structured logger the server logs through. When nil the
	// server logs to a no-op discard logger, so it stays silent until a logger
	// is injected (it never writes to the global slog default implicitly).
	//
	// To route output through zerolog + lumberjack, pass a logx-backed logger:
	//
	//	opts.Logger = slog.New(logx.NewSlogHandler()) // github.com/automa-saga/logx
	Logger *slog.Logger
}

ServerOptions groups all injectable dependencies for NewServer.

type StatusError ¶

type StatusError struct {
	// Reason is a stable, machine-readable key matching the log reason field
	// (e.g. "UpgradeMonitorListError", "UpgradeDirOwnershipCheckFailed").
	Reason string `json:"reason"`

	// Message is the human-readable error string.
	Message string `json:"message"`

	// Resolution is an actionable command or instruction the operator should
	// run to resolve the issue. Empty when no specific remediation is known.
	Resolution string `json:"resolution,omitempty"`

	// Since is the RFC 3339 timestamp of when this error was first observed.
	Since string `json:"since"`
}

StatusError is a rich, operator-facing error descriptor used in /status for both monitor connectivity failures and component probe (disk prerequisite) failures. Every populated field gives the operator enough context to act without opening journalctl.
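
To show the wire shape these struct tags produce, the sketch below redeclares MonitorState and StatusError with the documented tags and marshals two states. The field values are invented for the example, and `encode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, including their JSON tags.
type StatusError struct {
	Reason     string `json:"reason"`
	Message    string `json:"message"`
	Resolution string `json:"resolution,omitempty"`
	Since      string `json:"since"`
}

type MonitorState struct {
	State string       `json:"state"`
	Error *StatusError `json:"error,omitempty"`
}

// encode returns the JSON form of s, or the marshalling error as text.
func encode(s MonitorState) string {
	b, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// A healthy monitor omits the error object entirely.
	fmt.Println(encode(MonitorState{State: "running"}))
	// An empty Resolution is dropped because of omitempty.
	fmt.Println(encode(MonitorState{
		State: "degraded",
		Error: &StatusError{
			Reason:  "UpgradeMonitorListError",
			Message: "list failed",
			Since:   "2026-06-19T10:00:00Z",
		},
	}))
}
```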

type StatusTracker ¶

type StatusTracker struct {
	// contains filtered or unexported fields
}

StatusTracker holds the latest observed state for a set of monitors. It is safe for concurrent use; SupervisedMonitor updates it on each state transition.

func NewStatusTracker ¶

func NewStatusTracker() *StatusTracker

NewStatusTracker returns an empty StatusTracker.

func (*StatusTracker) Snapshot ¶

func (t *StatusTracker) Snapshot() map[string]MonitorState

Snapshot returns a copy of all monitor states at the time of the call.
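
Returning a copy means callers can read or modify the result without holding a lock or affecting the tracker. A minimal sketch of that pattern, assuming a mutex-guarded map; `tracker` and `set` are hypothetical and the local MonitorState is simplified to a single field.

```go
package main

import (
	"fmt"
	"maps"
	"sync"
)

// MonitorState is a simplified local mirror (the Error field is omitted).
type MonitorState struct{ State string }

// tracker is a hypothetical stand-in for StatusTracker.
type tracker struct {
	mu     sync.RWMutex
	states map[string]MonitorState
}

// set records the latest state for one monitor.
func (t *tracker) set(name string, s MonitorState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.states == nil {
		t.states = make(map[string]MonitorState)
	}
	t.states[name] = s
}

// Snapshot returns a copy of the map taken under the read lock.
func (t *tracker) Snapshot() map[string]MonitorState {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return maps.Clone(t.states)
}

func main() {
	t := &tracker{}
	t.set("upgrade-monitor", MonitorState{State: "running"})

	snap := t.Snapshot()
	snap["upgrade-monitor"] = MonitorState{State: "tampered"} // local change only

	fmt.Println(t.Snapshot()["upgrade-monitor"].State)
}
```

Note that maps.Clone is a shallow copy; a value type that holds pointers would still share the pointed-to data with the tracker.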

type SupervisorOptions ¶

type SupervisorOptions struct {
	// Tracker, when non-nil, is updated on every monitor state transition so the
	// /status endpoint can report per-monitor state without polling. May be nil.
	Tracker *StatusTracker

	// Logger is the structured logger the supervisor logs through (crash,
	// back-off, degradation, heartbeat, clean exit). When nil the supervisor
	// logs to a no-op discard logger and stays silent — it never writes to the
	// global slog default implicitly. Inject a logger
	// (e.g. slog.New(logx.NewSlogHandler())) to route diagnostics to your
	// logging backend.
	Logger *slog.Logger

	// HeartbeatInterval, when greater than zero, makes the supervisor emit a
	// periodic MonitorHeartbeat Info record (with the monitor name and its
	// current uptime) while the monitor is in the running state. This lets a
	// remote observability backend detect an alive-but-wedged monitor — one
	// blocked inside Run, never crashing and never logging — by the ABSENCE of
	// heartbeats. Zero (the default) disables heartbeats entirely.
	HeartbeatInterval time.Duration
}

SupervisorOptions groups the optional dependencies and tunables for SupervisedMonitor. The zero value is valid: no status tracking, a silent (discard) logger, and no heartbeat.

type TaggedProbe ¶

type TaggedProbe struct {
	Inner      Probe
	Reason     string
	Resolution string
}

TaggedProbe wraps a leaf Probe and attaches an operator-facing Reason code and Resolution hint to any error it returns. Use it inside RequiredProbe() implementations so that every prerequisite failure carries context-specific guidance for the operator.

On failure it returns a *ProbeError carrying Reason and Resolution as plain struct fields (no errorx property registry). The daemon boundary reads those fields directly to build a StatusError.

Example:

&daemonkit.TaggedProbe{
    Inner:      &daemonkit.DiskOwnershipProbe{Path: upgradeRoot, ...},
    Reason:     "UpgradeRootOwnershipCheckFailed",
    Resolution: "sudo chown hedera:hedera " + upgradeRoot,
}

func (*TaggedProbe) Probe ¶

func (p *TaggedProbe) Probe(ctx context.Context) error

Probe implements Probe. Delegates to Inner; on failure wraps the error in a *ProbeError carrying Reason and Resolution so that the daemon boundary can build a rich StatusError without errorx property extraction.

type WatchdogOptions ¶

type WatchdogOptions struct {
	// Logger, when non-nil, logs watchdog lifecycle and ping failures. When nil
	// the loop is silent (discard logger), consistent with the rest of the kit.
	Logger *slog.Logger

	// IsAlive, when non-nil, gates each keepalive: WATCHDOG=1 is sent only when
	// IsAlive() returns true. When it returns false the ping is withheld, so
	// systemd's WatchdogSec timer eventually fires and restarts the PROCESS.
	//
	// Use this ONLY when a process restart is the correct response to the
	// monitored condition — most cleanly a single-monitor daemon where the
	// process IS the monitor, so a restart has no healthy-monitor collateral.
	// Leave it nil for an unconditional keepalive that guards only against a
	// total process freeze. A multi-monitor daemon should generally leave this
	// nil: withholding pings bounces the whole process and resets every healthy
	// monitor too, which is rarely what you want.
	IsAlive func() bool
}

WatchdogOptions configures the optional systemd watchdog keepalive loop.

Directories ¶

Path	Synopsis
eventlog
filepruner
