streamhash

package module
v0.0.0-...-9c240c3 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 3, 2026 License: Apache-2.0 Imports: 19 Imported by: 0

README

StreamHash

A Go library for building and querying Minimal Perfect Hash Function (MPHF) indexes over billions of keys, using bounded RAM and streaming construction.

An MPHF maps N keys to N consecutive integers [0, N) with no collisions. This enables compact, read-only lookup tables where every key has a unique position — O(1) lookups without storing the keys themselves.

Features

  • Streaming construction — build indexes over 1B+ keys with ~1–75 MB of RAM, regardless of dataset size
  • Sorted and unsorted input — pre-sorted keys build with no temp disk; unsorted keys use per-writer temp files
  • Parallel builds — block-independent construction scales near-linearly with worker count
  • Concurrent writers — unsorted mode supports multiple concurrent writers for parallel key ingestion
  • Two algorithms — Bijection (compact, low RAM) and PTRHash (fast queries)
  • Optional payloads — store 1–8 byte fixed-size values alongside keys
  • Optional fingerprints — detect non-member keys with configurable false-positive rates
  • Single-file output — one mmap'd file for queries, no external dependencies

Performance

Index size: Bijection produces the most compact indexes (~2.5 bits/key MPHF overhead). PTRHash is slightly larger (~2.7 bits/key) but offers significantly faster queries.

Query latency: PTRHash queries are O(1) via direct pilot lookup, roughly 20× faster than Bijection's O(128) checkpoint-based decoding. Query latency is the same regardless of build mode. Each query requires one metadata read from disk; payload mode adds a second read.

Build throughput: Both algorithms build at similar speeds. Sorted input is faster than unsorted at higher worker counts because blocks are streamed directly to workers without an intermediate partition-read phase. Unsorted builds match sorted throughput at 1 worker and scale well with more workers but plateau below sorted peak throughput. Adding more workers via WithWorkers(n) scales build throughput near-linearly up to the number of blocks.

Memory: Sorted builds use very little heap (single-digit MB at 1 worker). Unsorted builds use more due to per-writer flush buffers (~12 MB per writer) and the read-phase pipeline, but remain bounded regardless of dataset size. With AddKeys(N, fn), the AddKey phase parallelizes across N concurrent writers.

Run go run ./cmd/bench on your hardware for concrete numbers (see Benchmarking).

Installation

go get github.com/tamirms/streamhash

Requires Go 1.26+.

Usage

Building an index (sorted input)

Keys must be at least 16 bytes and uniformly distributed. Use PreHash for non-uniform keys (strings, integers, etc.).

builder, err := streamhash.NewSortedBuilder(ctx, "index.idx", totalKeys,
    streamhash.WithPayload(4),
    streamhash.WithWorkers(4),
)
if err != nil {
    log.Fatal(err)
}
defer builder.Close()

for _, key := range sortedKeys {
    if err := builder.AddKey(key, payload); err != nil {
        log.Fatal(err)
    }
}
if err := builder.Finish(); err != nil {
    log.Fatal(err)
}
Building an index (unsorted input)
builder, err := streamhash.NewUnsortedBuilder(ctx, "index.idx", totalKeys, "/tmp/myindex",
    streamhash.WithWorkers(4),
)
if err != nil {
    log.Fatal(err)
}
defer builder.Close()

for _, key := range keys {
    if err := builder.AddKey(key, payload); err != nil {
        log.Fatal(err)
    }
}
if err := builder.Finish(); err != nil {
    log.Fatal(err)
}
Building with concurrent writers (unsorted input)
builder, err := streamhash.NewUnsortedBuilder(ctx, "index.idx", totalKeys, "/tmp/myindex",
    streamhash.WithWorkers(16),
)
if err != nil {
    log.Fatal(err)
}
defer builder.Close()

// AddKeys calls Finish internally.
if err := builder.AddKeys(8, func(writerID int, addKey func([]byte, uint64) error) error {
    for key, payload := range myPartition(writerID) {
        if err := addKey(key, payload); err != nil {
            return err
        }
    }
    return nil
}); err != nil {
    log.Fatal(err)
}
Querying
idx, err := streamhash.Open("index.idx")
if err != nil {
    log.Fatal(err)
}
defer idx.Close()

// MPHF mode: get the rank (0-based index) for a key
rank, err := idx.QueryRank(key)

// Payload mode: get the stored payload for a key
pi, err := idx.WithPayload()
rank, payload, err := pi.QueryPayload(key)
Pre-hashing non-uniform keys

If your keys are not already uniformly random (e.g., strings, sequential integers, UUIDs), pre-hash them before building and querying:

// Pre-hash and sort
hashedKeys := make([][]byte, len(keys))
for i, key := range keys {
    hashedKeys[i] = streamhash.PreHash(key)
}
sort.Slice(hashedKeys, func(i, j int) bool {
    return bytes.Compare(hashedKeys[i], hashedKeys[j]) < 0
})

// Build
builder, err := streamhash.NewSortedBuilder(ctx, "index.idx", uint64(len(hashedKeys)))
if err != nil {
    log.Fatal(err)
}
defer builder.Close()
for _, hk := range hashedKeys {
    if err := builder.AddKey(hk, 0); err != nil {
        log.Fatal(err)
    }
}
if err := builder.Finish(); err != nil {
    log.Fatal(err)
}

// Query: pre-hash the lookup key
rank, err := idx.QueryRank(streamhash.PreHash(originalKey))

Build Options

Option Description Default
WithWorkers(n) Parallel build workers 1
WithPayload(sizeBytes) Payload size in bytes (0-8) 0 (MPHF only)
WithFingerprint(sizeBytes) Fingerprint size in bytes (0-4) 0 (disabled)
WithAlgorithm(algo) AlgoBijection or AlgoPTRHash AlgoBijection
WithGlobalSeed(seed) Hash seed (change on build failure) fixed default
WithMetadata(data) Arbitrary metadata stored in the file none

Choosing an Algorithm

Bijection (default) — best for most use cases. Smallest indexes and lowest RAM usage. Query decoding is O(128) via checkpoint-based Elias-Fano/Golomb-Rice decoding.

PTRHash — best when query speed is critical. O(1) queries via direct pilot byte lookup, at the cost of slightly larger indexes and more RAM during construction.

Design

StreamHash partitions keys into fixed-size blocks by prefix, routes each key to its block, and delegates MPHF construction to a pluggable algorithm. This block-partitioning architecture enables:

  • Bounded RAM: each block is solved independently, using only O(block) memory
  • Parallelism: workers solve blocks concurrently while a coordinator sequences output
  • Locality: all metadata for a block is contiguous on disk — one read per query

The framework handles key routing, file layout, parallel coordination, and payload/fingerprint storage. Algorithms only need to implement a build-time solver and a query-time decoder.

See streamhash-spec.md for the full technical specification.

Benchmarking

# MPHF-only mode (core hash function performance)
go run ./cmd/bench -keys 10000000 -payload 0 -fp 0 -algo bijection

# With payload and fingerprint
go run ./cmd/bench -keys 10000000 -payload 4 -fp 1 -algo ptrhash -workers 4

License

Apache License 2.0 — see LICENSE.

Documentation

Overview

Package streamhash implements a Minimal Perfect Hash Function (MPHF) library with streaming build support and bounded RAM usage.

StreamHash is designed for building large-scale indexes (1B+ keys) efficiently. See the README for performance characteristics.

Basic Usage

Building an index (sorted input):

builder, err := streamhash.NewSortedBuilder(ctx, "index.idx", totalKeys)
if err != nil {
    log.Fatal(err)
}
defer builder.Close()
for key, payload := range sortedKeys {
    if err := builder.AddKey(key, payload); err != nil {
        log.Fatal(err)
    }
}
if err := builder.Finish(); err != nil {
    log.Fatal(err)
}

Building an index (unsorted input):

builder, err := streamhash.NewUnsortedBuilder(ctx, "index.idx", totalKeys, "/tmp")
if err != nil {
    log.Fatal(err)
}
defer builder.Close()
for key, payload := range keys {
    if err := builder.AddKey(key, payload); err != nil {
        log.Fatal(err)
    }
}
if err := builder.Finish(); err != nil {
    log.Fatal(err)
}

Querying an index:

idx, err := streamhash.Open("index.idx")
if err != nil {
    log.Fatal(err)
}
defer idx.Close()

rank, err := idx.QueryRank(streamhash.PreHash([]byte("mykey")))
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Key rank: %d\n", rank)

Package Structure

The implementation is organized as follows:

  • Public API: builder.go (NewSortedBuilder), builder_unsorted.go (NewUnsortedBuilder), index.go (Open, QueryRank, PayloadIndex)
  • Configuration: builder_options.go (BuildOption, With* functions)
  • Serialization: header.go (header, footer, ramIndexEntry), index_writer.go
  • Key routing & hashing: key.go (fingerprint extraction, payload packing), prehash.go (PreHash); block routing is FastRange32 over the big-endian prefix, inline in builder.go/index.go
  • Algorithm dispatch: algorithm.go (blockBuilder/blockDecoder interfaces, factory functions)
  • Block algorithms: internal/bijection/ (EF/GR), internal/ptrhash/ (Cuckoo)
  • Platform: platform_*.go (OS-specific: fallocate)

Index

Constants

View Source
const MinKeySize = 16

MinKeySize is the minimum key length required for routing and hash computation. Keys must have at least 16 bytes to extract k0 (bytes 0-7) and k1 (bytes 8-15).

Variables

View Source
var (
	ErrBuilderClosed    = sherr.ErrBuilderClosed
	ErrEmptyIndex       = sherr.ErrEmptyIndex
	ErrTooManyKeys      = sherr.ErrTooManyKeys
	ErrKeyTooShort      = sherr.ErrKeyTooShort
	ErrKeyTooLong       = sherr.ErrKeyTooLong
	ErrPayloadOverflow  = sherr.ErrPayloadOverflow
	ErrDuplicateKey     = sherr.ErrDuplicateKey
	ErrUnsortedInput    = sherr.ErrUnsortedInput
	ErrKeyCountMismatch = sherr.ErrKeyCountMismatch
)

Build errors.

View Source
var (
	ErrPayloadTooLarge             = sherr.ErrPayloadTooLarge
	ErrFingerprintTooLarge         = sherr.ErrFingerprintTooLarge
	ErrSplitBucketSeedSearchFailed = sherr.ErrSplitBucketSeedSearchFailed
	ErrIndistinguishableHashes     = sherr.ErrIndistinguishableHashes
)

Construction errors.

View Source
var (
	ErrInvalidMagic   = sherr.ErrInvalidMagic
	ErrInvalidVersion = sherr.ErrInvalidVersion
	ErrChecksumFailed = sherr.ErrChecksumFailed
	ErrTruncatedFile  = sherr.ErrTruncatedFile
	ErrCorruptedIndex = sherr.ErrCorruptedIndex
)

Index errors.

View Source
var (
	ErrIndexClosed = sherr.ErrIndexClosed
	ErrNoPayload   = sherr.ErrNoPayload
	ErrNotFound    = sherr.ErrNotFound
)

Query errors.

Functions

func PreHash

func PreHash(key []byte) []byte

PreHash applies xxHash3-128 to a key, returning 16 bytes.

Use this function when your keys are not uniformly distributed (e.g., strings, URLs, sequential integers, JSON). Pre-hashing transforms arbitrary input into uniformly random 128-bit values required by the MPHF construction.

Usage

For unsorted builds, prehash keys before building:

hashedKeys := make([][]byte, len(keys))
for i, key := range keys {
    hashedKeys[i] = streamhash.PreHash(key)
}

For sorted builds (Builder), you must sort by the HASHED key, not the original key. The original keys are discarded; only hashed keys are indexed:

// 1. Prehash all keys (original keys are no longer needed)
hashedKeys := make([][]byte, len(keys))
for i, key := range keys {
    hashedKeys[i] = streamhash.PreHash(key)
}

// 2. Sort by hashed key bytes
sort.Slice(hashedKeys, func(i, j int) bool {
    return bytes.Compare(hashedKeys[i], hashedKeys[j]) < 0
})

// 3. Build using Builder
builder, _ := streamhash.NewSortedBuilder(ctx, path, uint64(len(hashedKeys)))
for _, hk := range hashedKeys {
    builder.AddKey(hk, 0)
}
err := builder.Finish()

When to use

  • Strings, URLs, file paths: highly non-uniform prefix distribution
  • Sequential integers: all keys share common high bits
  • UUIDs with common prefixes: e.g., all start with same version nibble
  • Any key where the first 16 bytes are not uniformly random

When NOT to use

  • Keys are already random 128-bit values (e.g., random UUIDs, crypto hashes)
  • Keys are already uniformly distributed across their prefix bits

Querying: If you prehash keys during build, you must also prehash during query:

rank, err := idx.QueryRank(streamhash.PreHash(originalKey))

func PreHashInPlace

func PreHashInPlace(dst []byte, key []byte)

PreHashInPlace applies xxHash3-128 to a key, writing the result to dst. dst must be at least 16 bytes. This avoids allocation when processing many keys in a loop.

Types

type Algorithm

type Algorithm uint16

Algorithm identifies the MPHF algorithm used for block construction. This is stored in the file header.

const (
	// AlgoBijection uses EF/GR encoding with O(128) query.
	AlgoBijection Algorithm = 0

	// AlgoPTRHash uses PTRHash-style Cuckoo with 8-bit pilots.
	AlgoPTRHash Algorithm = 1
)

func (Algorithm) String

func (a Algorithm) String() string

String returns the algorithm name.

type BuildOption

type BuildOption func(*buildConfig)

BuildOption is a functional option for configuring builds.

func WithAlgorithm

func WithAlgorithm(algo Algorithm) BuildOption

WithAlgorithm sets the block construction algorithm. Default is AlgoBijection.

func WithFingerprint

func WithFingerprint(sizeBytes int) BuildOption

WithFingerprint enables fingerprint verification. Size must be 0-4 bytes. 0 means no fingerprints (the default). With fingerprints enabled, QueryRank and QueryPayload return ErrNotFound for non-member keys.

func WithGlobalSeed

func WithGlobalSeed(seed uint64) BuildOption

WithGlobalSeed sets the global hash seed.

func WithMetadata

func WithMetadata(data []byte) BuildOption

WithMetadata sets the variable-length user metadata. The metadata is copied, so the caller can reuse the slice after this call.

func WithPayload

func WithPayload(sizeBytes int) BuildOption

WithPayload configures payload storage. Size must be 0-8 bytes. 0 means no payload (MPHF-only mode, the default).

func WithWorkers

func WithWorkers(n int) BuildOption

WithWorkers sets the number of parallel workers for block building during Finish.

type Index

type Index struct {
	// contains filtered or unexported fields
}

Index is a read-only StreamHash index for querying.

Thread Safety: - QueryRank, PayloadIndex.QueryPayload, and other read methods are safe for concurrent use - Close is NOT safe to call concurrently with queries - Close must only be called after all queries have completed - After Close returns, no methods may be called on the Index

func Open

func Open(path string) (*Index, error)

Open opens a StreamHash index file for querying. It opens the file, memory-maps it, and closes the file descriptor.

func OpenBytes

func OpenBytes(data []byte) (*Index, error)

OpenBytes creates a StreamHash index from an in-memory byte slice. No file is opened or memory-mapped; Close is a no-op. The caller must ensure data is not modified while the Index is in use.

func OpenFile

func OpenFile(f *os.File) (*Index, error)

OpenFile opens a StreamHash index by memory-mapping the given file. The caller is responsible for closing f. Per POSIX mmap(2), f may be closed immediately after OpenFile returns.

func (*Index) Close

func (idx *Index) Close() error

Close closes the index and releases resources.

func (*Index) NumBlocks

func (idx *Index) NumBlocks() uint32

NumBlocks returns the number of blocks in the index.

func (*Index) NumKeys

func (idx *Index) NumKeys() uint64

NumKeys returns the total number of keys in the index.

func (*Index) PayloadSize

func (idx *Index) PayloadSize() int

PayloadSize returns the payload size per key.

func (*Index) QueryRank

func (idx *Index) QueryRank(key []byte) (uint64, error)

QueryRank returns the rank (0-based index) for a key. This is the core MPHF operation. Returns ErrKeyTooShort if key is less than 16 bytes.

func (*Index) Stats

func (idx *Index) Stats() *Stats

Stats returns statistics for the index.

func (*Index) UserMetadata

func (idx *Index) UserMetadata() []byte

UserMetadata returns the variable-length user-defined metadata. The returned slice is backed by the memory-mapped file data.

func (*Index) Verify

func (idx *Index) Verify() error

Verify checks the integrity of the entire index. For separated layout, this verifies: 1. PayloadRegionHash (hash-of-hashes: H(H(b0) || H(b1) || ...)) 2. MetadataRegionHash (streaming hash of metadata region)

The footer (last 32 bytes) is decoded on each Verify call rather than at Open time, so Open() only touches the contiguous prefix and avoids a scattered page fault.

The hash-of-hashes approach matches the streaming hash computation during build, where workers compute per-block payload hashes that are folded in order.

func (*Index) WithPayload

func (idx *Index) WithPayload() (*PayloadIndex, error)

WithPayload returns a PayloadIndex if the index has payload data. Returns an error if the index was built without WithPayload.

type PayloadIndex

type PayloadIndex struct {
	*Index
}

PayloadIndex extends Index with payload query capability. Obtained via OpenPayload, OpenPayloadFile, OpenPayloadBytes, or Index.WithPayload().

func OpenPayload

func OpenPayload(path string) (*PayloadIndex, error)

OpenPayload opens a StreamHash index that has payload data. Returns an error if the index was built without WithPayload.

func OpenPayloadBytes

func OpenPayloadBytes(data []byte) (*PayloadIndex, error)

OpenPayloadBytes creates a payload-capable index from an in-memory byte slice. Returns an error if the index was built without WithPayload.

func OpenPayloadFile

func OpenPayloadFile(f *os.File) (*PayloadIndex, error)

OpenPayloadFile opens a payload-capable index from an open file. Returns an error if the index was built without WithPayload.

func (*PayloadIndex) QueryPayload

func (pi *PayloadIndex) QueryPayload(key []byte) (rank uint64, payload uint64, err error)

QueryPayload returns the rank and payload for a key. Returns ErrKeyTooShort if key is less than 16 bytes.

type SortedBuilder

type SortedBuilder struct {
	// contains filtered or unexported fields
}

SortedBuilder builds an index from keys that arrive in block-sorted order.

Usage:

builder, err := streamhash.NewSortedBuilder(ctx, "index.idx", totalKeys, opts...)
if err != nil { return err }
defer builder.Close()

for key, payload := range sortedData {
    if err := builder.AddKey(key, payload); err != nil { return err }
}
return builder.Finish()

func NewSortedBuilder

func NewSortedBuilder(ctx context.Context, output string, totalKeys uint64, opts ...BuildOption) (*SortedBuilder, error)

NewSortedBuilder creates a builder for sorted input. Keys must be added in block-sorted order via AddKey. Use WithWorkers(N) to parallelize block building during Finish.

func (*SortedBuilder) AddKey

func (sb *SortedBuilder) AddKey(key []byte, payload uint64) error

AddKey adds a key-payload pair. Keys must be in block-sorted order.

func (*SortedBuilder) Close

func (sb *SortedBuilder) Close() error

Close aborts the build and cleans up resources. Safe to call after Finish.

func (*SortedBuilder) Finish

func (sb *SortedBuilder) Finish() error

Finish completes the index and writes it to disk.

type Stats

type Stats struct {
	NumKeys         uint64
	NumBlocks       uint32
	BitsPerKey      float64
	PayloadSize     int
	FingerprintSize int // bytes per key (0 if disabled)
	FileSize        int64
	OverheadBPK     float64 // MPHF overhead bits per key (excludes payload + fingerprint)
	Algorithm       Algorithm
}

Stats holds index statistics.

func GetStats

func GetStats(path string) (*Stats, error)

GetStats returns statistics for an index file.

type UnsortedBuilder

type UnsortedBuilder struct {
	// contains filtered or unexported fields
}

UnsortedBuilder builds an index from keys in arbitrary order. Keys are buffered to per-writer temp files, partitioned, and built during Finish.

Two ingestion modes (mutually exclusive):

  • AddKey: single-threaded sequential addition, then call Finish
  • AddKeys: concurrent multi-writer callback, calls Finish internally

Usage (single-threaded):

builder, _ := streamhash.NewUnsortedBuilder(ctx, path, totalKeys, tempDir, opts...)
defer builder.Close()
for key, payload := range data {
    builder.AddKey(key, payload)
}
return builder.Finish()

Usage (concurrent):

builder, _ := streamhash.NewUnsortedBuilder(ctx, path, totalKeys, tempDir, opts...)
defer builder.Close()
return builder.AddKeys(8, func(writerID int, addKey func([]byte, uint64) error) error {
    for key, payload := range myPartition(writerID) {
        if err := addKey(key, payload); err != nil { return err }
    }
    return nil
})

func NewUnsortedBuilder

func NewUnsortedBuilder(ctx context.Context, output string, totalKeys uint64, tempDir string, opts ...BuildOption) (*UnsortedBuilder, error)

NewUnsortedBuilder creates a builder for unsorted input. Keys can be added in any order via AddKey or AddKeys.

tempDir specifies where partition files are created. Pass "" to use the system default (os.TempDir). The directory must exist and be on a local filesystem (ext4, xfs, btrfs). NFS is not supported. tmpfs works but is not recommended at scale since it stores data in RAM/swap.

Use WithWorkers(N) to parallelize block building during Finish.

func (*UnsortedBuilder) AddKey

func (ub *UnsortedBuilder) AddKey(key []byte, payload uint64) error

AddKey adds a key-payload pair. Keys can be in any order. Cannot be used after AddKeys has been called.

func (*UnsortedBuilder) AddKeys

func (ub *UnsortedBuilder) AddKeys(numWriters int, fn func(writerID int, addKey func(key []byte, payload uint64) error) error) error

AddKeys ingests keys in parallel using numWriters concurrent writers. The callback fn is invoked once per writer in its own goroutine. Each invocation receives a writerID (0-based) and an addKey function. AddKeys calls Finish internally — do not call Finish separately.

Cannot be used after AddKey has been called.

Example:

err := builder.AddKeys(8, func(writerID int, addKey func([]byte, uint64) error) error {
    for key, payload := range myPartition(writerID) {
        if err := addKey(key, payload); err != nil { return err }
    }
    return nil
})

func (*UnsortedBuilder) Close

func (ub *UnsortedBuilder) Close() error

Close aborts the build and cleans up resources. Safe to call after Finish.

func (*UnsortedBuilder) Finish

func (ub *UnsortedBuilder) Finish() error

Finish completes the index. Called automatically by AddKeys. Only call this directly when using AddKey (not AddKeys).

Directories

Path Synopsis
cmd
bench command
Bench is a benchmarking and data generation tool for StreamHash.
Bench is a benchmarking and data generation tool for StreamHash.
internal
bijection
Package bijection implements the Bijection algorithm for StreamHash.
Package bijection implements the Bijection algorithm for StreamHash.
bits
Package bits provides low-level bit manipulation primitives.
Package bits provides low-level bit manipulation primitives.
encoding
Package encoding provides serialization utilities for fingerprints and payloads.
Package encoding provides serialization utilities for fingerprints and payloads.
ptrhash
Package ptrhash implements the PTRHash algorithm for StreamHash.
Package ptrhash implements the PTRHash algorithm for StreamHash.
sherr
Package sherr defines error sentinels shared across the streamhash library.
Package sherr defines error sentinels shared across the streamhash library.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL