Documentation
¶
Overview ¶
Package streamhash implements a Minimal Perfect Hash Function (MPHF) library with streaming build support and bounded RAM usage.
StreamHash is designed for building large-scale indexes (1B+ keys) efficiently. See the README for performance characteristics.
Basic Usage ¶
Building an index (sorted input):
builder, err := streamhash.NewSortedBuilder(ctx, "index.idx", totalKeys)
if err != nil {
log.Fatal(err)
}
defer builder.Close()
for key, payload := range sortedKeys {
if err := builder.AddKey(key, payload); err != nil {
log.Fatal(err)
}
}
if err := builder.Finish(); err != nil {
log.Fatal(err)
}
Building an index (unsorted input):
builder, err := streamhash.NewUnsortedBuilder(ctx, "index.idx", totalKeys, "/tmp")
if err != nil {
log.Fatal(err)
}
defer builder.Close()
for key, payload := range keys {
if err := builder.AddKey(key, payload); err != nil {
log.Fatal(err)
}
}
if err := builder.Finish(); err != nil {
log.Fatal(err)
}
Querying an index:
idx, err := streamhash.Open("index.idx")
if err != nil {
log.Fatal(err)
}
defer idx.Close()
rank, err := idx.QueryRank(streamhash.PreHash([]byte("mykey")))
if err != nil {
log.Fatal(err)
}
fmt.Printf("Key rank: %d\n", rank)
Package Structure ¶
The implementation is organized as follows:
- Public API: builder.go (NewSortedBuilder), builder_unsorted.go (NewUnsortedBuilder), index.go (Open, QueryRank, PayloadIndex)
- Configuration: builder_options.go (BuildOption, With* functions)
- Serialization: header.go (header, footer, ramIndexEntry), index_writer.go
- Key routing & hashing: key.go (fingerprint extraction, payload packing), prehash.go (PreHash); block routing is FastRange32 over the big-endian prefix, inline in builder.go/index.go
- Algorithm dispatch: algorithm.go (blockBuilder/blockDecoder interfaces, factory functions)
- Block algorithms: internal/bijection/ (EF/GR), internal/ptrhash/ (Cuckoo)
- Platform: platform_*.go (OS-specific: fallocate)
Index ¶
- Constants
- Variables
- func PreHash(key []byte) []byte
- func PreHashInPlace(dst []byte, key []byte)
- type Algorithm
- type BuildOption
- type Index
- func (idx *Index) Close() error
- func (idx *Index) NumBlocks() uint32
- func (idx *Index) NumKeys() uint64
- func (idx *Index) PayloadSize() int
- func (idx *Index) QueryRank(key []byte) (uint64, error)
- func (idx *Index) Stats() *Stats
- func (idx *Index) UserMetadata() []byte
- func (idx *Index) Verify() error
- func (idx *Index) WithPayload() (*PayloadIndex, error)
- type PayloadIndex
- type SortedBuilder
- type Stats
- type UnsortedBuilder
Constants ¶
const MinKeySize = 16
MinKeySize is the minimum key length required for routing and hash computation. Keys must have at least 16 bytes to extract k0 (bytes 0-7) and k1 (bytes 8-15).
Variables ¶
var ( ErrBuilderClosed = sherr.ErrBuilderClosed ErrEmptyIndex = sherr.ErrEmptyIndex ErrTooManyKeys = sherr.ErrTooManyKeys ErrKeyTooShort = sherr.ErrKeyTooShort ErrKeyTooLong = sherr.ErrKeyTooLong ErrPayloadOverflow = sherr.ErrPayloadOverflow ErrDuplicateKey = sherr.ErrDuplicateKey ErrUnsortedInput = sherr.ErrUnsortedInput ErrKeyCountMismatch = sherr.ErrKeyCountMismatch )
Build errors.
var ( ErrPayloadTooLarge = sherr.ErrPayloadTooLarge ErrFingerprintTooLarge = sherr.ErrFingerprintTooLarge ErrSplitBucketSeedSearchFailed = sherr.ErrSplitBucketSeedSearchFailed ErrIndistinguishableHashes = sherr.ErrIndistinguishableHashes )
Construction errors.
var ( ErrInvalidMagic = sherr.ErrInvalidMagic ErrInvalidVersion = sherr.ErrInvalidVersion ErrChecksumFailed = sherr.ErrChecksumFailed ErrTruncatedFile = sherr.ErrTruncatedFile ErrCorruptedIndex = sherr.ErrCorruptedIndex )
Index errors.
var ( ErrIndexClosed = sherr.ErrIndexClosed ErrNoPayload = sherr.ErrNoPayload ErrNotFound = sherr.ErrNotFound )
Query errors.
Functions ¶
func PreHash ¶
PreHash applies xxHash3-128 to a key, returning 16 bytes.
Use this function when your keys are not uniformly distributed (e.g., strings, URLs, sequential integers, JSON). Pre-hashing transforms arbitrary input into uniformly random 128-bit values required by the MPHF construction.
Usage ¶
For unsorted builds, prehash keys before building:
hashedKeys := make([][]byte, len(keys))
for i, key := range keys {
hashedKeys[i] = streamhash.PreHash(key)
}
For sorted builds (Builder), you must sort by the HASHED key, not the original key. The original keys are discarded; only hashed keys are indexed:
// 1. Prehash all keys (original keys are no longer needed)
hashedKeys := make([][]byte, len(keys))
for i, key := range keys {
hashedKeys[i] = streamhash.PreHash(key)
}
// 2. Sort by hashed key bytes
sort.Slice(hashedKeys, func(i, j int) bool {
return bytes.Compare(hashedKeys[i], hashedKeys[j]) < 0
})
// 3. Build using Builder
builder, _ := streamhash.NewSortedBuilder(ctx, path, uint64(len(hashedKeys)))
for _, hk := range hashedKeys {
builder.AddKey(hk, 0)
}
err := builder.Finish()
When to use ¶
- Strings, URLs, file paths: highly non-uniform prefix distribution
- Sequential integers: all keys share common high bits
- UUIDs with common prefixes: e.g., all start with same version nibble
- Any key where the first 16 bytes are not uniformly random
When NOT to use ¶
- Keys are already random 128-bit values (e.g., random UUIDs, crypto hashes)
- Keys are already uniformly distributed across their prefix bits
Querying: If you prehash keys during build, you must also prehash during query:
rank, err := idx.QueryRank(streamhash.PreHash(originalKey))
func PreHashInPlace ¶
PreHashInPlace applies xxHash3-128 to a key, writing the result to dst. dst must be at least 16 bytes. This avoids allocation when processing many keys in a loop.
Types ¶
type Algorithm ¶
type Algorithm uint16
Algorithm identifies the MPHF algorithm used for block construction. This is stored in the file header.
type BuildOption ¶
type BuildOption func(*buildConfig)
BuildOption is a functional option for configuring builds.
func WithAlgorithm ¶
func WithAlgorithm(algo Algorithm) BuildOption
WithAlgorithm sets the block construction algorithm. Default is AlgoBijection.
func WithFingerprint ¶
func WithFingerprint(sizeBytes int) BuildOption
WithFingerprint enables fingerprint verification. Size must be 0-4 bytes. 0 means no fingerprints (the default). With fingerprints enabled, QueryRank and QueryPayload return ErrNotFound for non-member keys.
func WithGlobalSeed ¶
func WithGlobalSeed(seed uint64) BuildOption
WithGlobalSeed sets the global hash seed.
func WithMetadata ¶
func WithMetadata(data []byte) BuildOption
WithMetadata sets the variable-length user metadata. The metadata is copied, so the caller can reuse the slice after this call.
func WithPayload ¶
func WithPayload(sizeBytes int) BuildOption
WithPayload configures payload storage. Size must be 0-8 bytes. 0 means no payload (MPHF-only mode, the default).
func WithWorkers ¶
func WithWorkers(n int) BuildOption
WithWorkers sets the number of parallel workers for block building during Finish.
type Index ¶
type Index struct {
// contains filtered or unexported fields
}
Index is a read-only StreamHash index for querying.
Thread Safety: - QueryRank, PayloadIndex.QueryPayload, and other read methods are safe for concurrent use - Close is NOT safe to call concurrently with queries - Close must only be called after all queries have completed - After Close returns, no methods may be called on the Index
func Open ¶
Open opens a StreamHash index file for querying. It opens the file, memory-maps it, and closes the file descriptor.
func OpenBytes ¶
OpenBytes creates a StreamHash index from an in-memory byte slice. No file is opened or memory-mapped; Close is a no-op. The caller must ensure data is not modified while the Index is in use.
func OpenFile ¶
OpenFile opens a StreamHash index by memory-mapping the given file. The caller is responsible for closing f. Per POSIX mmap(2), f may be closed immediately after OpenFile returns.
func (*Index) PayloadSize ¶
PayloadSize returns the payload size per key.
func (*Index) QueryRank ¶
QueryRank returns the rank (0-based index) for a key. This is the core MPHF operation. Returns ErrKeyTooShort if key is less than 16 bytes.
func (*Index) UserMetadata ¶
UserMetadata returns the variable-length user-defined metadata. The returned slice is backed by the memory-mapped file data.
func (*Index) Verify ¶
Verify checks the integrity of the entire index. For separated layout, this verifies: 1. PayloadRegionHash (hash-of-hashes: H(H(b0) || H(b1) || ...)) 2. MetadataRegionHash (streaming hash of metadata region)
The footer (last 32 bytes) is decoded on each Verify call rather than at Open time, so Open() only touches the contiguous prefix and avoids a scattered page fault.
The hash-of-hashes approach matches the streaming hash computation during build, where workers compute per-block payload hashes that are folded in order.
func (*Index) WithPayload ¶
func (idx *Index) WithPayload() (*PayloadIndex, error)
WithPayload returns a PayloadIndex if the index has payload data. Returns an error if the index was built without WithPayload.
type PayloadIndex ¶
type PayloadIndex struct {
*Index
}
PayloadIndex extends Index with payload query capability. Obtained via OpenPayload, OpenPayloadFile, OpenPayloadBytes, or Index.WithPayload().
func OpenPayload ¶
func OpenPayload(path string) (*PayloadIndex, error)
OpenPayload opens a StreamHash index that has payload data. Returns an error if the index was built without WithPayload.
func OpenPayloadBytes ¶
func OpenPayloadBytes(data []byte) (*PayloadIndex, error)
OpenPayloadBytes creates a payload-capable index from an in-memory byte slice. Returns an error if the index was built without WithPayload.
func OpenPayloadFile ¶
func OpenPayloadFile(f *os.File) (*PayloadIndex, error)
OpenPayloadFile opens a payload-capable index from an open file. Returns an error if the index was built without WithPayload.
func (*PayloadIndex) QueryPayload ¶
func (pi *PayloadIndex) QueryPayload(key []byte) (rank uint64, payload uint64, err error)
QueryPayload returns the rank and payload for a key. Returns ErrKeyTooShort if key is less than 16 bytes.
type SortedBuilder ¶
type SortedBuilder struct {
// contains filtered or unexported fields
}
SortedBuilder builds an index from keys that arrive in block-sorted order.
Usage:
builder, err := streamhash.NewSortedBuilder(ctx, "index.idx", totalKeys, opts...)
if err != nil { return err }
defer builder.Close()
for key, payload := range sortedData {
if err := builder.AddKey(key, payload); err != nil { return err }
}
return builder.Finish()
func NewSortedBuilder ¶
func NewSortedBuilder(ctx context.Context, output string, totalKeys uint64, opts ...BuildOption) (*SortedBuilder, error)
NewSortedBuilder creates a builder for sorted input. Keys must be added in block-sorted order via AddKey. Use WithWorkers(N) to parallelize block building during Finish.
func (*SortedBuilder) AddKey ¶
func (sb *SortedBuilder) AddKey(key []byte, payload uint64) error
AddKey adds a key-payload pair. Keys must be in block-sorted order.
func (*SortedBuilder) Close ¶
func (sb *SortedBuilder) Close() error
Close aborts the build and cleans up resources. Safe to call after Finish.
func (*SortedBuilder) Finish ¶
func (sb *SortedBuilder) Finish() error
Finish completes the index and writes it to disk.
type Stats ¶
type Stats struct {
NumKeys uint64
NumBlocks uint32
BitsPerKey float64
PayloadSize int
FingerprintSize int // bytes per key (0 if disabled)
FileSize int64
OverheadBPK float64 // MPHF overhead bits per key (excludes payload + fingerprint)
Algorithm Algorithm
}
Stats holds index statistics.
type UnsortedBuilder ¶
type UnsortedBuilder struct {
// contains filtered or unexported fields
}
UnsortedBuilder builds an index from keys in arbitrary order. Keys are buffered to per-writer temp files, partitioned, and built during Finish.
Two ingestion modes (mutually exclusive):
- AddKey: single-threaded sequential addition, then call Finish
- AddKeys: concurrent multi-writer callback, calls Finish internally
Usage (single-threaded):
builder, _ := streamhash.NewUnsortedBuilder(ctx, path, totalKeys, tempDir, opts...)
defer builder.Close()
for key, payload := range data {
builder.AddKey(key, payload)
}
return builder.Finish()
Usage (concurrent):
builder, _ := streamhash.NewUnsortedBuilder(ctx, path, totalKeys, tempDir, opts...)
defer builder.Close()
return builder.AddKeys(8, func(writerID int, addKey func([]byte, uint64) error) error {
for key, payload := range myPartition(writerID) {
if err := addKey(key, payload); err != nil { return err }
}
return nil
})
func NewUnsortedBuilder ¶
func NewUnsortedBuilder(ctx context.Context, output string, totalKeys uint64, tempDir string, opts ...BuildOption) (*UnsortedBuilder, error)
NewUnsortedBuilder creates a builder for unsorted input. Keys can be added in any order via AddKey or AddKeys.
tempDir specifies where partition files are created. Pass "" to use the system default (os.TempDir). The directory must exist and be on a local filesystem (ext4, xfs, btrfs). NFS is not supported. tmpfs works but is not recommended at scale since it stores data in RAM/swap.
Use WithWorkers(N) to parallelize block building during Finish.
func (*UnsortedBuilder) AddKey ¶
func (ub *UnsortedBuilder) AddKey(key []byte, payload uint64) error
AddKey adds a key-payload pair. Keys can be in any order. Cannot be used after AddKeys has been called.
func (*UnsortedBuilder) AddKeys ¶
func (ub *UnsortedBuilder) AddKeys(numWriters int, fn func(writerID int, addKey func(key []byte, payload uint64) error) error) error
AddKeys ingests keys in parallel using numWriters concurrent writers. The callback fn is invoked once per writer in its own goroutine. Each invocation receives a writerID (0-based) and an addKey function. AddKeys calls Finish internally — do not call Finish separately.
Cannot be used after AddKey has been called.
Example:
err := builder.AddKeys(8, func(writerID int, addKey func([]byte, uint64) error) error {
for key, payload := range myPartition(writerID) {
if err := addKey(key, payload); err != nil { return err }
}
return nil
})
func (*UnsortedBuilder) Close ¶
func (ub *UnsortedBuilder) Close() error
Close aborts the build and cleans up resources. Safe to call after Finish.
func (*UnsortedBuilder) Finish ¶
func (ub *UnsortedBuilder) Finish() error
Finish completes the index. Called automatically by AddKeys. Only call this directly when using AddKey (not AddKeys).
Source Files
¶
Directories
¶
| Path | Synopsis |
|---|---|
|
cmd
|
|
|
bench
command
Bench is a benchmarking and data generation tool for StreamHash.
|
Bench is a benchmarking and data generation tool for StreamHash. |
|
internal
|
|
|
bijection
Package bijection implements the Bijection algorithm for StreamHash.
|
Package bijection implements the Bijection algorithm for StreamHash. |
|
bits
Package bits provides low-level bit manipulation primitives.
|
Package bits provides low-level bit manipulation primitives. |
|
encoding
Package encoding provides serialization utilities for fingerprints and payloads.
|
Package encoding provides serialization utilities for fingerprints and payloads. |
|
ptrhash
Package ptrhash implements the PTRHash algorithm for StreamHash.
|
Package ptrhash implements the PTRHash algorithm for StreamHash. |
|
sherr
Package sherr defines error sentinels shared across the streamhash library.
|
Package sherr defines error sentinels shared across the streamhash library. |