Documentation
¶
Overview ¶
YogaDB is an embedded persistent ordered key-value store built on the FlexSpace log-structured storage engine architecture. A three-layer architecture is deployed: at the lowest level is the FlexTree (a B-tree extent index with shift-propagation); at the middle level is the FlexSpace (log-structured file layer with GC); and at the top sits the FlexDB (the KV store with memtables, WAL, sparse index, and interval cache).
The Iterator Optimization History: ¶
Getting 42x faster at iteration: from 340 ns/key to 8 ns/key.
The Iter type went through a series of CPU-profile-driven optimizations to bring forward iteration throughput from ~340 ns/key down to ~8 ns/key, beating bbolt (~12 ns/key) and roughly 25x faster than Pebble (~210 ns/key). Each optimization was benchmarked individually using Benchmark_Iter_YogaDB_Ascend with 100K keys. The progression:
1. Stateful FlexSpace Cursor (flexCursor)
Replaced stateless re-seeking (O(log n) binary search per Next()) with a persistent flexCursor that maintains a pointer to the current sparse-index leaf node, anchor index, and position within the cached interval. Sequential Next() calls advance the cursor in O(1) amortized time by incrementing kvIdx within the current interval, only doing tree traversal at interval boundaries. The cursor holds a refcounted interval cache entry so KV data is accessed zero-copy.
2. HLC-Based Version Detection
Instead of re-acquiring topMutRW.RLock on every Next() call, the iterator snapshots the database's Hybrid Logical Clock (HLC) at seek time. On Next(), a single atomic load compares the current HLC against the snapshot. If they match, no mutations have occurred and the cursor remains valid - the entire lock acquisition and merged re-seek are skipped. When mutations are detected, the iterator falls back to a full re-seek. This turns the common case (iteration without concurrent writes) into an atomic load instead of a mutex round-trip.
3. Prefetch Buffer
Added a prefetch buffer that fills multiple KV entries per lock acquisition. A single RLock fills up to iterPreFetchKeyCount entries; subsequent Next() calls serve from the buffer with only the atomic HLC check. This amortized the per-key lock overhead to near zero, since one lock acquisition covers hundreds of keys.
4. Pointer-Based KV Access (*KV instead of embedded KV)
Changed the Iter struct from embedding a full KV (with Key/Value byte slices copied per entry) to holding a *KV pointer directly into the interval cache's kvs[] slice. On the fast path, Next() just updates a pointer - no struct copy, no byte slice copy. Key() and Value() read through the pointer. This eliminated ~80 bytes of copying per key. (~18 ns/key to ~16 ns/key.)
5. Batched Interval Recording in prefetchFill
Restructured prefetchFillFlexSpaceOnly to process entries in batches per interval rather than one at a time. Each interval contributes a contiguous block of entries, avoiding repeated per-entry function calls and pointer chasing across interval boundaries. (~16 ns/key to ~14 ns/key.)
6. Split Advance/Retreat into Fast Path + Boundary Crossing
Split flexCursorAdvance into a two-instruction fast path (kvIdx++ and bounds check) plus a separate flexCursorNextInterval function for the rare interval boundary crossing. The fast path is fully inlineable by the Go compiler. flexCursorRetreat was split similarly into flexCursorPrevInterval. This eliminated function call overhead from the hot loop. (~14 ns/key to ~13.5 ns/key.)
7. Span-Based Prefetch (O(intervals) not O(keys))
Replaced the per-key []*KV prefetch buffer with an array of prefetchSpan descriptors. Each span records a (kvs slice, pos, end) triple referencing a contiguous range within one interval's kvs[]. The fill function (prefetchFillFlexSpaceOnly) runs under the lock and is now O(intervals): it just records ~3 slice boundaries, touching no per-KV data. The per-KV work (tombstone checking, pointer dereferencing) is deferred to servePrefetch which runs outside the lock. The kvs slice is pre-sliced to [:count] as a bounds-check elimination (BCE) hint so the Go compiler can prove index safety within the inner loop. This reduced lock hold time proportional to the number of intervals (~3) rather than keys (~512). (~13.5 ns/key to ~13 ns/key, with the main benefit being reduced lock contention under concurrent access.)
8. Increased Prefetch Count + Eliminated Redundant Atomic Loads
Raised iterPreFetchKeyCount from 90 to 512. With span-based fill being O(intervals), larger prefetch counts add negligible fill cost while reducing the frequency of lock acquisitions for refills. Also eliminated a redundant db.hlc.Aload() in the Next() refill path by caching the HLC value read during the initial check. (~13-14 ns/key, reduced variance.)
9. Atomic Reference Counting in Interval Cache
Converted the interval cache entry's refcnt and access fields from mutex-protected int to atomic int32 operations. releaseEntry (called on every interval boundary crossing) went from mutex lock/unlock to a single atomic.AddInt32. The getEntry cache-hit path was reduced from two mutex lock/unlock pairs to one (for the anchor.fce read) plus an atomic store for the access field. calibrate and allocEntryForNewAnchor use atomic.LoadInt32/AddInt32/StoreInt32. This eliminated ~660ms of mutex overhead from releaseEntry and reduced getEntry's lock contention.
10. Tuning the overwrite in place slotted page sizes SLOTTED_PAGE_KB=4 and the flexdbUnsortedWriteQuota=6 brought iteration down to about 8 nsec/key. Using SLOTTED_PAGE_KB=128 we could cut that in half again, but we are already beating Bolt by 2x (why bother with 3x; it is available if you need it though) and we want to balance reads against insertion efficiency/on-disk space consumption.
Index ¶
- Constants
- Variables
- func FirstDiff(dbA, dbB *FlexDB) string
- type Batch
- type CommonNode
- func (z *CommonNode) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *CommonNode) EncodeMsg(en *msgp.Writer) (err error)
- func (z *CommonNode) Gstring() (r string)
- func (z *CommonNode) MarshalMsg(b []byte) (o []byte, err error)
- func (z *CommonNode) Msgsize() (s int)
- func (z *CommonNode) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *CommonNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type Config
- type FlexDB
- func (db *FlexDB) BeginUpdate() *WriteTx
- func (db *FlexDB) BeginView() *ReadOnlyTx
- func (db *FlexDB) CheckIntegrity() []IntegrityError
- func (db *FlexDB) Clear(includeLarge bool) (allGone bool, err error)
- func (db *FlexDB) Close() *Metrics
- func (db *FlexDB) CumulativeMetrics() *Metrics
- func (db *FlexDB) Delete(key string) error
- func (db *FlexDB) DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)
- func (db *FlexDB) FetchLarge(kv *KV) ([]byte, error)
- func (db *FlexDB) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)
- func (db *FlexDB) Get(key string) (value []byte, found bool, err error)
- func (db *FlexDB) GetKV(key string) (kv *KVcloser, err error)
- func (db *FlexDB) Len() int64
- func (db *FlexDB) LenBigSmall() (big int64, small int64)
- func (db *FlexDB) Merge(key string, ...) error
- func (db *FlexDB) NewBatch() (b *Batch)
- func (db *FlexDB) Put(key string, value []byte) error
- func (db *FlexDB) SessionMetrics() *Metrics
- func (db *FlexDB) Sync() error
- func (db *FlexDB) Update(fn func(rw *WriteTx) error) (err error)
- func (db *FlexDB) VacuumKV() (*VacuumKVStats, error)
- func (db *FlexDB) VacuumVLOG() (*VacuumVLOGStats, error)
- func (db *FlexDB) View(fn func(ro *ReadOnlyTx) error) (err error)
- type FlexSpace
- func (ff *FlexSpace) Close() (kvBlocksOnDiskFootprintBytes int64)
- func (ff *FlexSpace) Collapse(loff, length uint64) error
- func (ff *FlexSpace) Defrag(buf []byte, loff, length uint64) error
- func (ff *FlexSpace) Fallocate(loff, size uint64) error
- func (ff *FlexSpace) Ftruncate(size uint64) error
- func (ff *FlexSpace) GC()
- func (ff *FlexSpace) GetHandler(loff uint64) FlexSpaceHandler
- func (ff *FlexSpace) GetTag(loff uint64) (uint16, error)
- func (ff *FlexSpace) Insert(buf []byte, loff, length uint64) (int, error)
- func (ff *FlexSpace) Overwrite(buf []byte, loff uint64, length uint64) error
- func (ff *FlexSpace) Read(buf []byte, loff, length uint64) (int, error)
- func (ff *FlexSpace) ReadFragmentation(buf []byte, loff, length uint64) (int, uint64, error)
- func (ff *FlexSpace) SetTag(loff uint64, tag uint16) error
- func (ff *FlexSpace) Size() uint64
- func (ff *FlexSpace) Sync()
- func (ff *FlexSpace) Update(buf []byte, loff, length, olen uint64) (int, error)
- func (ff *FlexSpace) Write(buf []byte, loff, length uint64) (int, error)
- type FlexSpaceHandler
- func (fh *FlexSpaceHandler) Backward(step uint64)
- func (fh *FlexSpaceHandler) Forward(step uint64)
- func (fh *FlexSpaceHandler) ForwardExtent()
- func (fh *FlexSpaceHandler) GetTag() (uint16, error)
- func (fh *FlexSpaceHandler) Loff() uint64
- func (fh *FlexSpaceHandler) Poff() uint64
- func (fh *FlexSpaceHandler) Read(buf []byte, length uint64) (int, error)
- func (fh *FlexSpaceHandler) Valid() bool
- type FlexTree
- func (t *FlexTree) AllocInternal() (ie *InternalNode)
- func (t *FlexTree) AllocLeaf() (le *LeafNode)
- func (tree *FlexTree) Close()
- func (t *FlexTree) CloseCoW() error
- func (z *FlexTree) DecodeMsg(dc *msgp.Reader) (err error)
- func (tree *FlexTree) Delete(loff, length uint64) int
- func (tree *FlexTree) Diff(b *FlexTree) (diff string)
- func (z *FlexTree) EncodeMsg(en *msgp.Writer) (err error)
- func (t *FlexTree) FreeNode(id NodeID)
- func (t *FlexTree) GetInternal(id NodeID) *InternalNode
- func (t *FlexTree) GetLeaf(id NodeID) *LeafNode
- func (tree *FlexTree) GetMaxLoff() uint64
- func (t *FlexTree) GetNodeString(id NodeID) string
- func (tree *FlexTree) GetTag(loff uint64) (uint16, int)
- func (tree *FlexTree) GrandparentIdx(path *FlexTreePath) (index uint32)
- func (tree *FlexTree) GrandparentNode(path *FlexTreePath) (ie *InternalNode)
- func (z *FlexTree) Gstring() (r string)
- func (tree *FlexTree) Insert(loff, poff uint64, len uint32) int
- func (tree *FlexTree) InsertWTag(loff, poff uint64, len uint32, tag uint16) int
- func (t *FlexTree) MarkAllInternalsDirty()
- func (z *FlexTree) MarshalMsg(b []byte) (o []byte, err error)
- func (z *FlexTree) Msgsize() (s int)
- func (t *FlexTree) NodeSlotID(id NodeID) int64
- func (tree *FlexTree) PDelete(loff uint64) int
- func (tree *FlexTree) PQuery(loff uint64) uint64
- func (tree *FlexTree) ParentIdx(path *FlexTreePath) (index uint32, ok bool)
- func (tree *FlexTree) ParentNode(path *FlexTreePath) (ie *InternalNode, nodeID NodeID, ok bool)
- func (tree *FlexTree) ParentNodeIdx(path *FlexTreePath) (ie *InternalNode, index uint32, ok bool)
- func (tree *FlexTree) PosGet(loff uint64) Pos
- func (tree *FlexTree) Print()
- func (tree *FlexTree) Query(loff, length uint64) *FlextreeQueryResult
- func (tree *FlexTree) SetTag(loff uint64, tag uint16) int
- func (tree *FlexTree) String() string
- func (tree *FlexTree) Sync()
- func (t *FlexTree) SyncCoW() error
- func (z *FlexTree) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *FlexTree) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type FlexTreeExtent
- func (e *FlexTreeExtent) Address() uint64
- func (z *FlexTreeExtent) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *FlexTreeExtent) EncodeMsg(en *msgp.Writer) (err error)
- func (z *FlexTreeExtent) Gstring() (r string)
- func (e *FlexTreeExtent) IsHole() bool
- func (z *FlexTreeExtent) MarshalMsg(b []byte) (o []byte, err error)
- func (z *FlexTreeExtent) Msgsize() (s int)
- func (e *FlexTreeExtent) Poff() uint64
- func (e *FlexTreeExtent) SetAddress(addr uint64)
- func (e *FlexTreeExtent) SetHole(isHole bool)
- func (e *FlexTreeExtent) SetPoff(poff uint64)
- func (e *FlexTreeExtent) SetTag(tag uint16)
- func (e *FlexTreeExtent) Tag() uint16
- func (z *FlexTreeExtent) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *FlexTreeExtent) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type FlexTreePath
- func (z *FlexTreePath) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *FlexTreePath) EncodeMsg(en *msgp.Writer) (err error)
- func (z *FlexTreePath) Gstring() (r string)
- func (z *FlexTreePath) MarshalMsg(b []byte) (o []byte, err error)
- func (z *FlexTreePath) Msgsize() (s int)
- func (z *FlexTreePath) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *FlexTreePath) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type FlextreeQueryResult
- func (z *FlextreeQueryResult) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *FlextreeQueryResult) EncodeMsg(en *msgp.Writer) (err error)
- func (z *FlextreeQueryResult) Gstring() (r string)
- func (z *FlextreeQueryResult) MarshalMsg(b []byte) (o []byte, err error)
- func (z *FlextreeQueryResult) Msgsize() (s int)
- func (z *FlextreeQueryResult) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *FlextreeQueryResult) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type HLC
- func (hlc *HLC) Aload() (r HLC)
- func (hlc *HLC) Count() int64
- func (hlc *HLC) CreateAndNow() (r HLC, now time.Time)
- func (hlc *HLC) CreateSendOrLocalEvent() (r HLC)
- func (hlc *HLC) LC() int64
- func (hlc *HLC) ReceiveMessageWithHLC(m HLC) (r HLC)
- func (hlc *HLC) String() string
- func (hlc *HLC) ToTime() time.Time
- func (hlc *HLC) ToTime48() (r time.Time)
- type HLCInterval
- type IntegrityError
- type InternalChild
- func (z *InternalChild) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *InternalChild) EncodeMsg(en *msgp.Writer) (err error)
- func (z *InternalChild) Gstring() (r string)
- func (z *InternalChild) MarshalMsg(b []byte) (o []byte, err error)
- func (z *InternalChild) Msgsize() (s int)
- func (z *InternalChild) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *InternalChild) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type InternalNode
- func (z *InternalNode) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *InternalNode) EncodeMsg(en *msgp.Writer) (err error)
- func (z *InternalNode) Gstring() (r string)
- func (z *InternalNode) MarshalMsg(b []byte) (o []byte, err error)
- func (z *InternalNode) Msgsize() (s int)
- func (ie *InternalNode) StringOnTree(tree *FlexTree) (r string)
- func (z *InternalNode) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *InternalNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type Iter
- func (it *Iter) Close()
- func (it *Iter) FetchV() ([]byte, error)
- func (it *Iter) GetAnySize() (key string, val []byte, found bool, err error)
- func (it *Iter) Hlc() HLC
- func (it *Iter) KV() *KV
- func (it *Iter) Key() string
- func (it *Iter) Large() bool
- func (it *Iter) Next()
- func (it *Iter) Prev()
- func (it *Iter) Seek(target string)
- func (it *Iter) SeekFirst()
- func (it *Iter) SeekLast()
- func (it *Iter) Valid() bool
- func (it *Iter) Vel() (val []byte, empty, large bool)
- func (it *Iter) Vin() (val []byte)
- type KV
- type KVcloser
- type LeafNode
- func (z *LeafNode) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *LeafNode) EncodeMsg(en *msgp.Writer) (err error)
- func (z *LeafNode) Gstring() (r string)
- func (z *LeafNode) MarshalMsg(b []byte) (o []byte, err error)
- func (z *LeafNode) Msgsize() (s int)
- func (le *LeafNode) String() (r string)
- func (z *LeafNode) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *LeafNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type Metrics
- type NodeID
- func (z *NodeID) DecodeMsg(dc *msgp.Reader) (err error)
- func (z NodeID) EncodeMsg(en *msgp.Writer) (err error)
- func (id NodeID) Index() int32
- func (id NodeID) IsIllegal() bool
- func (id NodeID) IsInternal() bool
- func (id NodeID) IsLeaf() bool
- func (z NodeID) MarshalMsg(b []byte) (o []byte, err error)
- func (z NodeID) Msgsize() (s int)
- func (z *NodeID) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *NodeID) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type Offlen
- func (z *Offlen) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *Offlen) EncodeMsg(en *msgp.Writer) (err error)
- func (z *Offlen) Gstring() (r string)
- func (z *Offlen) MarshalMsg(b []byte) (o []byte, err error)
- func (z *Offlen) Msgsize() (s int)
- func (z *Offlen) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *Offlen) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- type PiggybackGCStats
- type Pos
- func (p *Pos) Backward(step uint64)
- func (z *Pos) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *Pos) EncodeMsg(en *msgp.Writer) (err error)
- func (p *Pos) Forward(step uint64)
- func (p *Pos) ForwardExtent()
- func (p *Pos) GetLoff() uint64
- func (p *Pos) GetPoff() uint64
- func (p *Pos) GetTag() (uint16, bool)
- func (z *Pos) Gstring() (r string)
- func (z *Pos) MarshalMsg(b []byte) (o []byte, err error)
- func (z *Pos) Msgsize() (s int)
- func (p *Pos) Rewind()
- func (z *Pos) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (z *Pos) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
- func (p *Pos) Valid() bool
- type ReadOnlyDB
- type ReadOnlyTx
- func (roTx *ReadOnlyTx) Ascend(pivot string, iter func(key string, value []byte) bool)
- func (roTx *ReadOnlyTx) AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)
- func (rtx *ReadOnlyTx) Close()
- func (roTx *ReadOnlyTx) Descend(pivot string, iter func(key string, value []byte) bool)
- func (roTx *ReadOnlyTx) DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)
- func (roTx *ReadOnlyTx) FetchLarge(kv *KV) ([]byte, error)
- func (roTx *ReadOnlyTx) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)
- func (roTx *ReadOnlyTx) FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)
- func (roTx *ReadOnlyTx) Get(key string) (value []byte, found bool, err error)
- func (roTx *ReadOnlyTx) GetKV(key string) (kv *KVcloser, err error)
- func (roTx *ReadOnlyTx) Len() int64
- func (roTx *ReadOnlyTx) LenBigSmall() (big, small int64)
- func (roTx *ReadOnlyTx) NewIter() *Iter
- type SearchModifier
- type VPtr
- type VacuumKVStats
- type VacuumVLOGStats
- type WritableDB
- type WriteTx
- func (tx *WriteTx) Ascend(pivot string, iter func(key string, value []byte) bool)
- func (tx *WriteTx) AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)
- func (tx *WriteTx) Clear(includeLarge bool) (allGone bool, err error)
- func (wtx *WriteTx) Close()
- func (tx *WriteTx) Delete(key string) error
- func (tx *WriteTx) DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)
- func (tx *WriteTx) Descend(pivot string, iter func(key string, value []byte) bool)
- func (tx *WriteTx) DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)
- func (tx *WriteTx) FetchLarge(kv *KV) ([]byte, error)
- func (tx *WriteTx) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)
- func (tx *WriteTx) FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)
- func (tx *WriteTx) Get(key string) (value []byte, found bool, err error)
- func (tx *WriteTx) GetKV(key string) (kv *KVcloser, err error)
- func (tx *WriteTx) Len() int64
- func (tx *WriteTx) LenBigSmall() (big, small int64)
- func (tx *WriteTx) Merge(key string, ...) error
- func (tx *WriteTx) NewIter() *Iter
- func (tx *WriteTx) Put(key string, value []byte) error
- func (tx *WriteTx) Sync() error
Constants ¶
const ( // the address space that flexspace.go manages FLEXSPACE_MAX_OFFSET = 800 << 30 // 800 GB logical address space // block config FLEXSPACE_BLOCK_BITS = 22 // 4 MB blocks FLEXSPACE_BLOCK_SIZE = 1 << FLEXSPACE_BLOCK_BITS // 4_194_304 bytes FLEXSPACE_BLOCK_COUNT = FLEXSPACE_MAX_OFFSET >> FLEXSPACE_BLOCK_BITS // 204800 blocks FLEXSPACE_MAX_EXTENT_BIT = 5 FLEXSPACE_MAX_EXTENT_SIZE = FLEXSPACE_BLOCK_SIZE >> FLEXSPACE_MAX_EXTENT_BIT // 131_072 bytes == 128 KB (1/32) // logical logging FLEXSPACE_LOG_MEM_CAP = 8 << 20 // 8 MB in-memory log buffer FLEXSPACE_LOG_MAX_SIZE = 2 << 30 // 2 GB max on-disk log // garbage collector FLEXSPACE_GC_QUEUE_DEPTH = 8192 FLEXSPACE_GC_THRESHOLD = 64 // free blocks below this triggers GC // block manager FLEXSPACE_BM_BLKDIST_BITS = 16 FLEXSPACE_BM_BLKDIST_SIZE = (FLEXSPACE_BLOCK_SIZE >> FLEXSPACE_BM_BLKDIST_BITS) + 1 // 65 buckets )
const ( // SLOTTED_PAGE_KB is the target page size for slotted page intervals. // Tune this for different workloads. Larger pages amortize overhead // better but increase rewrite cost. At 4 KB pages we are 2x faster at // load than pebble and 2x faster at read than Bolt. At 64 KB pages // our load time is the same as pebble, but we are 3x faster than Bolt. // See go test -v -tags memfs -run=xxx -bench Iter // And go test -v -tags memfs -run=xxx -bench BigRandomRWBatch // // 10 is a nice middle ground: 2x write vs Pebble,2x read vs Bolt. With tight splits (Config.PaddedSplits = false, the default). SLOTTED_PAGE_KB = 10 // 10:5.214 8:5.84 12:7.874 // up from 2 to 10 seems to help scans alot. 10: (9.984, 10.01, 9.817 ns/key); 2: (23.72 ns/key); 20:14.84 128: 4ns/key sequential scan, nice. 8: 5.347 ns/key. 4: very fast insert. 7.6 ns/key full scan. 32:4.473 ns/key. choice for now: keep at 4 for a litle balance between insert and scan through. but 64:4.168 ns/key, but random rw slows 2x. // MAX_KEY_BYTES is the maximum key size in bytes that a slotted page // can hold. Derived from SLOTTED_PAGE_KB to ensure at least one key // always fits in a page. MAX_KEY_BYTES = SLOTTED_PAGE_KB * 512 )
const DefaultFlexTreeMaxExtentSizeLimit = (64 << 20) // 64 MB
const FLEXTREE_HOLE = (1 << 47) // highest bit set to indicate a hole
const FLEXTREE_INTERNAL_CAP = 30
const FLEXTREE_INTERNAL_CAP_PLUS_ONE = FLEXTREE_INTERNAL_CAP + 1
const FLEXTREE_LEAF_CAP = 60
Tunable parameters
const FLEXTREE_MAX_EXTENT_SIZE_LIMIT = 64 << 20
const FLEXTREE_PATH_DEPTH = 7 // at most 7 levels
const FLEXTREE_POFF_MASK = 0xffffffffffff // 48 bits
const (
// MaxKeySize is actually 16 bytes smaller (really the limit is 4080 bytes).
MaxKeySize = 4096
)
Variables ¶
var ErrKeyEmpty = fmt.Errorf("key cannot be the empty string")
Functions ¶
Types ¶
type Batch ¶
type Batch struct {
// contains filtered or unexported fields
}
Batch submits a set of writes all together at once for load efficiency and/or atomic change to the database.
func (*Batch) Close ¶
func (s *Batch) Close()
Close forgets any existing queued up puts, and frees any other resources associated with the Batch.
func (*Batch) Commit ¶
func (s *Batch) Commit(doFsync bool) (interv HLCInterval, err error)
Commit flushes the batch atomically all the way to disk but does not fsync unless set doFsync true.
After Commit the batch is empty and can be re-used immediately.
Returns the half-open HLC interval [Begin, Endx) assigned to this batch.
Again, with doFsync false, we do not wait for the data to be fdatasynced to disk. Set doFsync true to fsync into a MEMWAL log, or do multiple batches and then db.Sync() if you need durability across power restarts. Usually if performance is required this is done once after all your batches are loaded.
Metrics are useful, but relatively expensive as we must scan all of the FlexSpace blocks linearly; use CommitGetMetrics() to view them. Commit() itself now skips them for speed.
func (*Batch) CommitGetMetrics ¶
func (s *Batch) CommitGetMetrics(doFsync bool) (HLCInterval, *Metrics, error)
CommitGetMetrics does Commit, and then returns metrics on the flex space for garbage collection and write-amplification study purposes; hence it is slower. It does a linear scan through all the FLEXSPACE.KV.SLOT_BLOCKS to see how much free space could be reclaimed.
type CommonNode ¶
type CommonNode struct {
// SlotID is the on-disk page slot for CoW persistence; -1 = unassigned
SlotID int64 `zid:"0"`
// know thyself.
NodeID NodeID `zid:"1"`
Count uint32 `zid:"2"`
IsLeaf bool `zid:"3"`
Dirty bool `zid:"4"`
Freed bool `zid:"5"`
}
CommonNode is embedded in both LeafNode and InternalNode. 20 bytes at the moment.
func (*CommonNode) DecodeMsg ¶
func (z *CommonNode) DecodeMsg(dc *msgp.Reader) (err error)
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*CommonNode) EncodeMsg ¶
func (z *CommonNode) EncodeMsg(en *msgp.Writer) (err error)
EncodeMsg implements msgp.Encodable
func (*CommonNode) Gstring ¶
func (z *CommonNode) Gstring() (r string)
func (*CommonNode) MarshalMsg ¶
func (z *CommonNode) MarshalMsg(b []byte) (o []byte, err error)
MarshalMsg implements msgp.Marshaler
func (*CommonNode) Msgsize ¶
func (z *CommonNode) Msgsize() (s int)
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*CommonNode) UnmarshalMsg ¶
func (z *CommonNode) UnmarshalMsg(bts []byte) (o []byte, err error)
UnmarshalMsg implements msgp.Unmarshaler
func (*CommonNode) UnmarshalMsgWithCfg ¶
func (z *CommonNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
type Config ¶
type Config struct {
CacheMB uint64 // default 32 (for 32 MB)
// NoDisk runs the entire database in-memory using MemVFS.
// No files are created on disk. Useful for testing.
NoDisk bool
// FS overrides the filesystem implementation. When nil, RealVFS{}
// is used (or MemVFS if NoDisk is true). Allows injecting a
// custom VFS for testing (e.g. fault injection).
FS vfs.FS
// DisableVLOG disables the value log. When true, all values are stored
// inline in FlexSpace regardless of size (original behavior).
DisableVLOG bool
// OmitFlexSpaceOpsRedoLog skips FlexSpace redo-log writes and instead
// calls SyncCoW() on every Sync(). This eliminates ~0.86x write
// amplification from the redo log at the cost of slightly more CoW
// tree page writes. The net effect is lower total write amp (~3.2x
// vs ~4.1x).
OmitFlexSpaceOpsRedoLog bool
// LowBlockUtilizationPct sets the threshold (0.0–1.0) for counting
// blocks as "low utilization" in Metrics.BlocksWithLowUtilization.
// A block whose live bytes / FLEXSPACE_BLOCK_SIZE is below this
// fraction is counted. Default 0.25 (25%) when zero.
LowBlockUtilizationPct float64
// OmitMemWalFsync true means we do not durably fdatasync the MEMWAL1/2 files.
// This is useful for batch loading a lot of data quickly, and then doing
// one fsync at the end for durability. The proviso of course is that
// if your process crashes you have no intermediate state and have to
// start again at the beginning; which may be fine. True also disables
// the LARGE.VLOG fsyncs until db.Sync() is called.
OmitMemWalFsync bool
// PiggybackGC_on_SyncOrFlush enables automatic GC at the end of
// every Sync() and flush operation. GC runs only if the garbage
// fraction (wasted bytes / total bytes in used blocks) exceeds
// GCGarbagePct. Default false (disabled).
PiggybackGC_on_SyncOrFlush bool
// GCGarbagePct is the minimum fraction of wasted bytes in used
// blocks (garbage / (garbage + live)) required to trigger piggyback GC.
// Value between 0.0 and 1.0. Default 0.50 (50%) when zero and
// PiggybackGC_on_SyncOrFlush is true.
GCGarbagePct float64
// DisableBackgroundFlush disables the background flush worker goroutine.
// When true, memtable flushes only happen on explicit Sync() or Close() calls.
// This is useful for fuzz testing where background goroutines can crash
// the entire fuzz worker subprocess if they panic (the test's recover()
// only catches panics on the test goroutine, not background goroutines).
DisableBackgroundFlush bool
// PaddedSplits controls whether treeInsertAnchor pads both split
// halves to slottedPageMaxSize. Default false uses tight encoding,
// which cuts space amplification substantially. When true, the old
// padded behavior is used (useful for A/B comparison).
// Padding allows in-place additions to a slotted page to
// occur, making key updates more efficient. The trade-off
// is pre-allocating space for new additions.
PaddedSplits bool
}
Config allows configuration of a FlexDB.
type FlexDB ¶
type FlexDB struct {
Path string
// Write-byte counters (accessed atomically)
MemWALBytesWritten int64 // WAL (FLEXDB.MEMWAL) bytes written
LogicalBytesWritten int64 // user payload bytes (key+value)
// contains filtered or unexported fields
}
FlexDB is a persistent ordered key-value store backed by FlexSpace. It is thread-safe, except for iteration via Ascend/Descend--which allows deletions and updates on the fly.
func OpenFlexDB ¶
OpenFlexDB opens or creates a FlexDB at the given directory path. cacheMB is the cache capacity in megabytes.
func (*FlexDB) BeginUpdate ¶ added in v0.9.8
func (*FlexDB) BeginView ¶ added in v0.9.8
func (db *FlexDB) BeginView() *ReadOnlyTx
func (*FlexDB) CheckIntegrity ¶
func (db *FlexDB) CheckIntegrity() []IntegrityError
CheckIntegrity performs a read-only consistency check of the FlexDB. It flushes the memtable first, then acquires a read lock on FlexSpace.
Checks performed:
- FlexTree leaf linked list: no cycles, prev/next consistency
- Extent validity: every non-hole extent has poff + len within file bounds
- Extent readability: data at every extent can be read from disk
- Block usage: recomputed from FlexTree matches the block manager's state
- Sparse index: every anchor interval is readable and kv128-decodable
- Sorted keys: keys within each decoded interval are in sorted order
- Anchor coverage: anchor loff+psize spans tile the FlexSpace without gaps/overlaps
- VLOG blake3: for every KV with a VPtr, read the VLOG entry and verify hdrCRC, valCRC, and blake3 checksum of value bytes
Returns nil if no errors found.
func (*FlexDB) Clear ¶
Clear deletes all keys in the database.
When includeLarge is true, the entire database is wiped and re-initialized (fast path). When false, only keys with inline (small) values are deleted; keys with large values stored in the VLOG survive.
Returns allGone=true when the database was re-initialized. In that case, ALL previously held iterators, cursors, and pointers into the database are invalid and must be re-acquired.
Goroutine safe. Acquires the database write lock for the duration of the call, serializing against all other operations.
func (*FlexDB) CumulativeMetrics ¶
CumulativeMetrics reports file sizes on disk, reflecting the cumulative history of all sessions. Each physical metric is the current file size (via Stat), so it captures bytes written by previous sessions as well. LogicalBytesWritten uses FlexSpace's MaxLoff as an approximation of total user payload stored (it includes kv128 encoding overhead of ~10-20 bytes per entry; values separated to VLOG are represented by 16-byte VPtrs).
func (*FlexDB) DeleteRange ¶
func (db *FlexDB) DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)
DeleteRange deletes all keys in the range [begKey, endKey] with configurable inclusivity on each bound.
Returns:
- n: number of tombstones written (0 when allGone is true)
- allGone: true if the entire database was wiped and re-initialized. When true, ALL previously held iterators, cursors, and pointers into the database are invalid and must be re-acquired.
- err: non-nil on failure
When includeLarge is false, keys whose values are stored in the VLOG (large values, > 64 bytes) are skipped and survive the deletion.
The begInclusive and endInclusive parameters control whether the bounds are inclusive or exclusive:
DeleteRange(true, a, z, true, true) // [a, z] - both inclusive, include large values DeleteRange(true, a, z, true, false) // [a, z) - half-open, include large values DeleteRange(false, a, z, true, true) // [a, z] - both inclusive, skip large values
Goroutine safe. Concurrent reads and writes are serialized via the database write lock. However, when allGone is returned true, all previously held iterators, cursors, and references are invalidated.
func (*FlexDB) FetchLarge ¶ added in v0.7.0
FetchLarge retrieves the value bytes for a KV whose value is stored in the VLOG (kv.Large() returns true). For inline values, it simply returns kv.Value. The returned bytes are a fresh copy safe to retain.
Goroutine safe. Acquires the read lock internally.
func (*FlexDB) Find ¶ added in v0.6.2
Find allows GTE, GT, LTE, LT, and Exact searches.
GTE: find the smallest key greater-than-or-equal to key.
GT: find the smallest key strictly greater-than key.
LTE: find the largest key less-than-or-equal to key.
LT: find the largest key strictly less-than key.
Exact: find a matching key exactly.
If key is the empty string, then GTE and GT return the first key in the tree, while LTE and LT return the last key.
Any of the LAZY* set of flag can be bitwise-OR-ed with the smod to request that large/small/all values not be returned unless and until we decide we want them with an explicit FetchLarge() call. For example: Find(Exact|LAZY, "needle")
The returned *KVcloser contains the found key and its value; unless laziness was requested.
The returned bool, 'exact', indicates an exact match to the query key.
If the returned kvc *KVcloser is nil, this means that the key was not found, or there was an I/O error. The caller should always check the returned error (err) first rule out I/O error before concluding the key was not found from a nil kvc.
A typical call sequence would be:
kvc, _, err := dbHaystack.Find(Exact, "needle")
if err != nil {
return err
}
if kvc != nil {
// found exact match! (we know, because Exact was
// requested; if this was a GTE search we would need
// to check the 'exact' bool return to know if we found
// our "needle", or went past it).
// Here all value sizes are automatically pulled in, since
// none of the (LAZY_SMALL, LAZY_LARGE, LAZY) smod were requested
// For performance, we do not copy kvc.Value for you.
// So you must copy kvc.Value, if you need it later,
// before doing kvc.Close().
processKeyAndValueAtHlcTimestamp(kvc.Key, kvc.Value, kvc.Hlc)
kvc.Close() // unpin from internal caches. Allows zero-copy reads.
}
The returned iterator is a locked iterator (holds the exclusive Find looks up the first key matching the SearchModifier and returns an owned copy of the KV (safe to retain indefinitely). For scanning beyond the found key, use Find inside a View or Update transaction.
Goroutine safe. Acquires the read lock internally.
Warning: if kvc != nil, the user must call Close() on the returned kvc *KVcloser when done copying any Value out, or else memory and resource leaks will ensue.
The kvc.Close() can be skipped if kvc is nil (key not found). However it is always fine to do the Close() even then, as kvc.Close() is a no-op if kvc is nil.
func (*FlexDB) Get ¶
Get retrieves the value for key. Returns nil, false if not found. Get can return nil, true if a nil value was stored with the key. Get is value size agnostic. It returns large and small values immediately. This is tested at, for example, gc_test.go Test_GC1K_write_1k_keys_with_large_values.
func (*FlexDB) GetKV ¶ added in v0.9.1
GetKV is like Get but allows lazy loading of Large values; they are not fetched automatically. If the user sees kv.Large() true, then db.FetchLarge(kv) will return the large value. GetKV is equivalent to db.Find(Exact, key).
func (*FlexDB) Len ¶
Len returns the total number of live (non-tombstone) keys in the database. O(1) - reads a pre-maintained counter. Goroutine safe.
func (*FlexDB) LenBigSmall ¶
LenBigSmall returns the live key count partitioned by storage location. big: keys whose values are stored in the VLOG (> 64 bytes). small: keys whose values are stored inline. O(1) - reads pre-maintained counters. Goroutine safe.
func (*FlexDB) Merge ¶
func (db *FlexDB) Merge(key string, fn func(oldVal []byte, exists bool) (newVal []byte, write bool, doDelete bool)) error
Merge performs an atomic read-modify-write on key. fn is always called: when the key exists, oldVal is its current value and exists=true; when the key is absent or deleted, oldVal=nil and exists=false (allowing conditional creation). Return write=false to skip the write, or doDelete=true to delete the key.
This mirrors C's flexdb_merge: it looks up the old value across all layers (active memtable, inactive memtable, FlexSpace), applies the user function, and writes the result atomically.
Q: When should the callback return doWrite=false, doDelete=false? A: This means "do nothing." The main scenarios:
Conditional creation: The callback inspects exists and decides not to create the key. E.g., "only increment if the key already exists" - if exists=false, return a bare return (all zeros = no-op).
Conditional update: The callback inspects the old value and decides no change is needed. E.g., "set to X only if current value isn't already X."
Read-only peek: The callback just wants to see the current value (though Get is simpler for that).
.
func (*FlexDB) Put ¶
Put writes key -> value. len(value) == 0 is fine, if desired. Call Delete instead of Put to delete a key and any associated value.
Values of any size are accepted. Values > vlogInlineThreshold (64 bytes) are stored in the VLOG file; smaller values are stored inline in the FLEXSPACE.KV.SLOT_BLOCKS file with the keys.
Large values are written exactly once: to the VLOG. The WAL stores only the VPtr (16 bytes), not the full value.
Puts are not durably on disk until after the user has also completed a db.Sync() call. This allows the user to control the rate of fsyncs and trade that against their durability requirements.
func (*FlexDB) SessionMetrics ¶
Metrics returns a snapshot of write-byte counters aggregated from all layers.
func (*FlexDB) Sync ¶
Sync flushes all in-memory data in the active memtable to disk in FLEXSPACE.KV128.BLOCKS and fsyncs it. Users must call Sync after Puts for them to be durable.
func (*FlexDB) Update ¶
Update runs fn inside an exclusive write transaction. The write lock (topMutRW.Lock()) is held for the duration of fn, blocking all other readers and writers including the flush worker.
All iterators created within fn via rwDB.NewIter() or rwDB.FindIt() are automatically closed when fn returns.
Writes via rwDB.Put/rwDB.Delete are applied immediately to the database (no buffering, no Commit needed).
Do NOT call db.Put/db.Get/db.Delete/db.Sync inside fn - use rwDB methods instead (deadlock).
func (*FlexDB) VacuumKV ¶
func (db *FlexDB) VacuumKV() (*VacuumKVStats, error)
VacuumKV reclaims dead FLEXSPACE.KV.SLOT_BLOCKS space by rewriting all live extents sequentially to a new file and replacing the old file. This is an exclusive operation that acquires topMutRW.
Crash safety: if the process crashes before the rename completes, the old FLEXSPACE.KV.SLOT_BLOCKS and old FlexTree remain intact. The stale .vacuum file (if present) is harmless and will be overwritten on the next vacuum.
VacuumKV does a one-time compaction of already-bloated databases. Algorithm: 1. Flush memtable, acquire exclusive locks 1b. Compact slotted page padding: decode each page, compute tight size,
Collapse the zero-padding in reverse loff order
2. Walk FlexTree leaf linked list, read/rewrite all live extents sequentially to a .vacuum file 3. Close old fd, rename .vacuum -> FLEXSPACE.KV.SLOT_BLOCKS, reopen 4. Rebuild block manager, checkpoint FlexTree 5. Rebuild anchor tree from FlexTree tags (replaces stale anchor loffs) 6. Clean up stale .vacuum files on OpenFlexSpaceCoW
See the tests: TestFlexDB_VacuumKV_Basic - overwrites 200 keys, vacuums, verifies data integrity across reopen TestFlexDB_VacuumKV_WithDeletes - deletes half of 100 keys, vacuums, verifies correct keys survive .
func (*FlexDB) VacuumVLOG ¶
func (db *FlexDB) VacuumVLOG() (*VacuumVLOGStats, error)
VacuumVLOG reclaims dead LARGE.VLOG space by copying live values to a new VLOG file and rewriting their VPtrs in FlexSpace. This is an exclusive operation that acquires topMutRW.
Crash safety: if the process crashes before the rename completes, the old VLOG and old intervals remain intact. The stale VLOG.new file (if present) is harmless and will be overwritten on the next vacuum.
func (*FlexDB) View ¶
func (db *FlexDB) View(fn func(ro *ReadOnlyTx) error) (err error)
View runs fn inside a read-only transaction. The read lock (topMutRW.RLock()) is held for the duration of fn, blocking writers (including the flush worker) but allowing concurrent readers.
All iterators created within fn via roDB.NewIter() or roDB.FindIt() are automatically closed when fn returns.
Do NOT call db.Get inside fn - use roDB methods instead (deadlock).
type FlexSpace ¶
type FlexSpace struct {
Path string
// Write-byte counters (accessed atomically)
KV128BytesWritten int64 // FLEXSPACE.KV.SLOT_BLOCKS file bytes written
REDOLogBytesWritten int64 // redo LOG bytes written
// contains filtered or unexported fields
}
FlexSpace is a log-structured file-like address space providing insert-range, collapse-range, read, write, and GC operations. Corresponds to struct flexfile in flexfile.h/flexfile.c.
FlexSpace itself does no internal locking. Clients must guarantee single user at a time; e.g. via FlexDB.ffMu.
The main data file within the FLEXSPACE directory, the file that holds all the keys + small values is "FLEXSPACE.KV.SLOT_BLOCKS".
Here's how the block management works:
The FLEXSPACE.KV.SLOT_BLOCKS file is divided into 4 MB blocks. The block manager (blockManager) keeps one 4 MB in-memory write buffer and appends data into it sequentially:
1. Writing: When FlexDB flushes a memtable, it calls putPassthrough which calls ff.Insert(). Insert copies the kv128-encoded bytes into the block manager's buffer at the current offset (bm.buf[blkoff:]). If the current block can't fit the data, nextBlock() writes the filled buffer to disk at blkid × 4MB and moves to the next empty block.
2. Mapping: Each chunk written gets a physical offset (poff = blkid × 4MB + blkoff). That poff is inserted into the FlexTree, which maps logical offset -> physical offset. So the FlexTree is the indirection layer that lets you read kv128 intervals by logical position even though they're scattered across 4 MB blocks on disk.
3. GC: The block manager tracks per-block usage (blkusage[]). When data is deleted (Collapse), the block's usage count decreases. GC reclaims blocks with low utilization by rewriting their live extents into the current write block, then freeing the old block.
So "block-managed" means the FLEXSPACE.KV.SLOT_BLOCKS file is a pool of 4 MB blocks with an append-only write cursor, a FlexTree for logical -> physical mapping, and a GC that reclaims fragmented blocks. It's essentially a log-structured store at the block level - writes always go to the current block, never overwrite in place.
func OpenFlexSpaceCoW ¶
OpenFlexSpaceCoW opens or creates a FlexSpace using CoW page-based persistence for the FlexTree instead of greenpack full-tree serialization. The directory will contain: FLEXSPACE.KV.SLOT_BLOCKS, FLEXTREE.COMMIT, FLEXTREE.PAGES, FLEXSPACE.REDO.LOG. When omitRedoLog is true, redo log writes are skipped and SyncCoW is called on every Sync.
func (*FlexSpace) Defrag ¶
Defrag rewrites the len bytes at loff as a fresh contiguous physical extent.
func (*FlexSpace) Fallocate ¶
Fallocate pre-allocates size bytes starting at loff by inserting zero data. Bug fix: C code used FLEXSPACE_MAX_EXTENT_BIT (5) instead of FLEXSPACE_MAX_EXTENT_SIZE (131072).
func (*FlexSpace) Ftruncate ¶
Ftruncate truncates the FlexSpace to size bytes by collapsing the tail.
func (*FlexSpace) GC ¶
func (ff *FlexSpace) GC()
GC runs garbage collection if free blocks are below the threshold. Iterates up to 4 rounds of decreasing aggressiveness.
func (*FlexSpace) GetHandler ¶
func (ff *FlexSpace) GetHandler(loff uint64) FlexSpaceHandler
GetHandler returns a handler positioned at loff.
func (*FlexSpace) Overwrite ¶
Overwrite writes buf directly to the physical location backing the extent at loff, without mutating the FlexTree or writing redo log entries. The extent at loff must already exist and have length == len(buf). This creates zero garbage - the same physical blocks are reused in-place.
func (*FlexSpace) Read ¶
Read reads len bytes from loff into buf. Returns bytes read or -1 on error.
func (*FlexSpace) ReadFragmentation ¶
ReadFragmentation reads and also returns the number of physical extents (frag).
type FlexSpaceHandler ¶
type FlexSpaceHandler struct {
// contains filtered or unexported fields
}
FlexSpaceHandler is a stateful read-only cursor over a FlexSpace. Corresponds to struct flexfile_handler.
func (*FlexSpaceHandler) Backward ¶
func (fh *FlexSpaceHandler) Backward(step uint64)
Backward moves the handler's position backward by step bytes.
func (*FlexSpaceHandler) Forward ¶
func (fh *FlexSpaceHandler) Forward(step uint64)
Forward advances the handler's position by step bytes.
func (*FlexSpaceHandler) ForwardExtent ¶
func (fh *FlexSpaceHandler) ForwardExtent()
ForwardExtent advances the handler to the start of the next extent.
func (*FlexSpaceHandler) GetTag ¶
func (fh *FlexSpaceHandler) GetTag() (uint16, error)
GetTag returns the tag at the handler's current position.
func (*FlexSpaceHandler) Loff ¶
func (fh *FlexSpaceHandler) Loff() uint64
Loff returns the current logical offset.
func (*FlexSpaceHandler) Poff ¶
func (fh *FlexSpaceHandler) Poff() uint64
Poff returns the current physical offset.
func (*FlexSpaceHandler) Read ¶
func (fh *FlexSpaceHandler) Read(buf []byte, length uint64) (int, error)
Read reads len bytes from the handler's current position into buf.
func (*FlexSpaceHandler) Valid ¶
func (fh *FlexSpaceHandler) Valid() bool
Valid returns true if the handler points to a valid extent.
type FlexTree ¶
type FlexTree struct {
MaxLoff uint64 `zid:"0"`
MaxExtentSize uint32 `zid:"1"`
Path string `zid:"2"`
NodeCount uint64 `zid:"3"`
PersistentVersion uint64 `zid:"4"`
Root NodeID `zid:"5"`
LeafHead NodeID `zid:"6"` // linked list head
// the arenas that hold all nodes.
InternalArena []InternalNode `zid:"7"`
LeafArena []LeafNode `zid:"8"`
// Stack of available internal node indices
FreeInternals []int32 `zid:"9"`
// Stack of available leaf node indices
FreeLeaves []int32 `zid:"10"`
// Write-byte counters (accessed atomically)
FlexTreePagesBytesWritten int64 // FLEXTREE.PAGES + FLEXTREE.COMMIT bytes written to disk
MaxHLC int64 // highest HLC ever assigned; restored on reopen
// contains filtered or unexported fields
}
FlexTree acts as our Arena allocator and entry point.
func NewFlexTree ¶
NewTree initializes the tree and pre-allocates some capacity to avoid early slice reallocations.
Warning: after getting a new node with AllocLeaf() or AllocInternal() be aware that the Go allocator may have just re-sized the underlying arena to grow it, which means any existing address (rather than index) will need to be re-converted from index into pointer before use. We had several early bugs from this. You have been warned!
We could and maybe should just allocate the arenas with malloc directly from the operating system... but meh. It's usually fragile to get into the business of creating your own memory manager.
func OpenFlexTreeCoW ¶
OpenFlexTreeCoW opens an existing CoW FlexTree from dirPath, or creates a new one if the directory does not contain FLEXTREE.COMMIT/FLEXTREE.PAGES files.
func (*FlexTree) AllocInternal ¶
func (t *FlexTree) AllocInternal() (ie *InternalNode)
AllocInternal mirrors the leaf logic for internal nodes.
func (*FlexTree) DecodeMsg ¶
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*FlexTree) FreeNode ¶
FreeNode releases a node's index back to the pool, strictly checking for double-free corruption.
func (*FlexTree) GetInternal ¶
func (t *FlexTree) GetInternal(id NodeID) *InternalNode
GetInternal retrieves a pointer to an internal node.
func (*FlexTree) GetMaxLoff ¶
func (*FlexTree) GetNodeString ¶
func (*FlexTree) GrandparentIdx ¶
func (tree *FlexTree) GrandparentIdx(path *FlexTreePath) (index uint32)
func (*FlexTree) GrandparentNode ¶
func (tree *FlexTree) GrandparentNode(path *FlexTreePath) (ie *InternalNode)
func (*FlexTree) InsertWTag ¶
func (*FlexTree) MarkAllInternalsDirty ¶
func (t *FlexTree) MarkAllInternalsDirty()
MarkAllInternalsDirty sets Dirty=true on every allocated internal node. This is needed before SyncCoW when leaf nodes have been modified (e.g., by VacuumKV) without going through the normal insert/delete path that propagates dirty flags up the tree.
func (*FlexTree) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*FlexTree) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*FlexTree) NodeSlotID ¶
NodeSlotID returns the SlotID for the node identified by id.
func (*FlexTree) ParentIdx ¶
func (tree *FlexTree) ParentIdx(path *FlexTreePath) (index uint32, ok bool)
func (*FlexTree) ParentNode ¶
func (tree *FlexTree) ParentNode(path *FlexTreePath) (ie *InternalNode, nodeID NodeID, ok bool)
func (*FlexTree) ParentNodeIdx ¶
func (tree *FlexTree) ParentNodeIdx(path *FlexTreePath) (ie *InternalNode, index uint32, ok bool)
func (*FlexTree) PosGet ¶
PosGet returns a Pos cursor positioned at loff in the tree. If loff is not within any extent, returns an invalid (node==nil) Pos.
func (*FlexTree) Query ¶
func (tree *FlexTree) Query(loff, length uint64) *FlextreeQueryResult
func (*FlexTree) SyncCoW ¶
SyncCoW writes only dirty nodes to disk using copy-on-write. The FLEXTREE.COMMIT file is the commit point: written only after FLEXTREE.PAGES is synced.
func (*FlexTree) UnmarshalMsg ¶
UnmarshalMsg implements msgp.Unmarshaler
func (*FlexTree) UnmarshalMsgWithCfg ¶
type FlexTreeExtent ¶
type FlexTreeExtent struct {
Loff uint32 `zid:"0"`
Len uint32 `zid:"1"`
// Combines tag (16 bits) and poff (48 bits)
TagPoff uint64 `zid:"2"`
}
FlexTreeExtent is a foundation piece.
Q: FlexTreeExtent is defined with the logical offset Loff uint32. How it is possible that the small 32-bit uint32 is correct here, rather than a larger uint64? It seems that the logical address space needs to span 62 or 63 bits, does it not? Please help me understand the design here.
A: Great question. Here's the key insight.
The Loff in each FlexTreeExtent is not a global logical offset. It's a leaf-local relative offset. The global 64-bit logical address is reconstructed on-the-fly during tree traversal by accumulating Shift values down the path.
Here's how it works:
1. Internal nodes carry Shift int64 per child
type InternalChild struct {
NodeID NodeID
Shift int64 // <- this is the 64-bit accumulator
}
When traversing from root to leaf (line 697), the global loff uint64 is reduced at each level: loff -= uint64(ie.Children[target].Shift)
By the time you reach a leaf, loff has been transformed into a small, leaf-local coordinate. The extents in that leaf use uint32 offsets relative to this local frame.
2. Inserts propagate shifts upward
When data is inserted at a logical offset, FlexTree.shiftUpPropagate() (line 752) adds the inserted length to the Shift of sibling children at each level -- but only for siblings to the right of the insertion point. This means only the path from root to the affected leaf is updated; all other subtrees lazily absorb the shift when they are next traversed.
3. rebase() prevents uint32 overflow
Line 736, in FlexTree.rebase():
if le.Extents[le.Count-1].Loff >= ^uint32(0) - tree.MaxExtentSize*2 {
newBase := le.Extents[0].Loff
parent.Children[pIdx].Shift += int64(newBase)
for i := uint32(0); i < le.Count; i++ {
le.Extents[i].Loff -= newBase
}
}
When a leaf's local offsets approach uint32 max (~4 GB), rebase() shifts the entire leaf's coordinates back to near zero by absorbing the base offset into the parent's Shift. This is O(leaf capacity) = O(60) work -- cheap.
Summary: The 64-bit logical address space lives in the tree edges (Shift int64 on each InternalChild), not in the leaf extents. Each leaf only needs to represent offsets within its own local window, which never exceeds ~4 GB thanks to rebase().
This is the paper's key trick -- it keeps the FlexTreeExtent struct at exactly 16 bytes (matching a cache line quarter), which is critical for the B-tree's I/O efficiency. Storing a uint64 Loff would bloat each extent to 20+ bytes and reduce the leaf fan-out from 60 to ~48. .
func (*FlexTreeExtent) Address ¶
func (e *FlexTreeExtent) Address() uint64
Address extracts the 47-bit physical offset address
func (*FlexTreeExtent) DecodeMsg ¶
func (z *FlexTreeExtent) DecodeMsg(dc *msgp.Reader) (err error)
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*FlexTreeExtent) EncodeMsg ¶
func (z *FlexTreeExtent) EncodeMsg(en *msgp.Writer) (err error)
EncodeMsg implements msgp.Encodable
func (*FlexTreeExtent) Gstring ¶
func (z *FlexTreeExtent) Gstring() (r string)
func (*FlexTreeExtent) IsHole ¶
func (e *FlexTreeExtent) IsHole() bool
IsHole returns true if the 48th bit of poff is set
func (*FlexTreeExtent) MarshalMsg ¶
func (z *FlexTreeExtent) MarshalMsg(b []byte) (o []byte, err error)
MarshalMsg implements msgp.Marshaler
func (*FlexTreeExtent) Msgsize ¶
func (z *FlexTreeExtent) Msgsize() (s int)
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*FlexTreeExtent) Poff ¶
func (e *FlexTreeExtent) Poff() uint64
Poff extracts the upper 48 bits
func (*FlexTreeExtent) SetAddress ¶
func (e *FlexTreeExtent) SetAddress(addr uint64)
SetAddress sets the 47-bit address without altering the hole flag
func (*FlexTreeExtent) SetHole ¶
func (e *FlexTreeExtent) SetHole(isHole bool)
SetHole sets or clears the 1-bit hole flag without altering the address
func (*FlexTreeExtent) SetPoff ¶
func (e *FlexTreeExtent) SetPoff(poff uint64)
SetPoff replaces the upper 48 bits while keeping the lower 16 bits intact
func (*FlexTreeExtent) SetTag ¶
func (e *FlexTreeExtent) SetTag(tag uint16)
SetTag replaces the lower 16 bits while keeping the upper 48 bits intact
func (*FlexTreeExtent) UnmarshalMsg ¶
func (z *FlexTreeExtent) UnmarshalMsg(bts []byte) (o []byte, err error)
UnmarshalMsg implements msgp.Unmarshaler
func (*FlexTreeExtent) UnmarshalMsgWithCfg ¶
func (z *FlexTreeExtent) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
type FlexTreePath ¶
type FlexTreePath struct {
Level uint8 `zid:"0"`
Path [FLEXTREE_PATH_DEPTH]uint8 `zid:"1"`
Nodes [FLEXTREE_PATH_DEPTH]NodeID `zid:"2"`
}
func (*FlexTreePath) DecodeMsg ¶
func (z *FlexTreePath) DecodeMsg(dc *msgp.Reader) (err error)
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*FlexTreePath) EncodeMsg ¶
func (z *FlexTreePath) EncodeMsg(en *msgp.Writer) (err error)
EncodeMsg implements msgp.Encodable
func (*FlexTreePath) Gstring ¶
func (z *FlexTreePath) Gstring() (r string)
func (*FlexTreePath) MarshalMsg ¶
func (z *FlexTreePath) MarshalMsg(b []byte) (o []byte, err error)
MarshalMsg implements msgp.Marshaler
func (*FlexTreePath) Msgsize ¶
func (z *FlexTreePath) Msgsize() (s int)
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*FlexTreePath) UnmarshalMsg ¶
func (z *FlexTreePath) UnmarshalMsg(bts []byte) (o []byte, err error)
UnmarshalMsg implements msgp.Unmarshaler
func (*FlexTreePath) UnmarshalMsgWithCfg ¶
func (z *FlexTreePath) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
type FlextreeQueryResult ¶
type FlextreeQueryResult struct {
Loff uint64 `zid:"0"`
Len uint64 `zid:"1"`
Count uint64 `zid:"2"`
V []Offlen `zid:"3"`
}
func (*FlextreeQueryResult) DecodeMsg ¶
func (z *FlextreeQueryResult) DecodeMsg(dc *msgp.Reader) (err error)
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*FlextreeQueryResult) EncodeMsg ¶
func (z *FlextreeQueryResult) EncodeMsg(en *msgp.Writer) (err error)
EncodeMsg implements msgp.Encodable
func (*FlextreeQueryResult) Gstring ¶
func (z *FlextreeQueryResult) Gstring() (r string)
func (*FlextreeQueryResult) MarshalMsg ¶
func (z *FlextreeQueryResult) MarshalMsg(b []byte) (o []byte, err error)
MarshalMsg implements msgp.Marshaler
func (*FlextreeQueryResult) Msgsize ¶
func (z *FlextreeQueryResult) Msgsize() (s int)
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*FlextreeQueryResult) UnmarshalMsg ¶
func (z *FlextreeQueryResult) UnmarshalMsg(bts []byte) (o []byte, err error)
UnmarshalMsg implements msgp.Unmarshaler
func (*FlextreeQueryResult) UnmarshalMsgWithCfg ¶
func (z *FlextreeQueryResult) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
type HLC ¶
type HLC int64
HLC is a hybrid logical/physical clock, see the literature references in the README for citations.
Goroutine safety: all HLC routines are goroutine safe except for the argument m to ReceiveMessageWithHLC(m) by default, since m is typically from a network message that is exclusive owned. If m could be shared, read its HLC with Aload() first before passing it to ReceiveMessageWithHLC().
Warning about 32-bit CPU systems: this package uses atomic.LoadInt64() which is safe and fine on 64-bit systems but which can be buggy on 32-bit systems if the compiler happens to not 64-bit align your data in memory.
So: if you need this package on a 32-bit system, ensure your HLCs are always the very first field(s) in your struct, so they are 64-bit aligned.
See https://pkg.go.dev/sync/atomic#pkg-note-BUG where it discusses 32-bit systems (i.e. ARM not ARM64):
"On ARM, 386, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically via the primitive atomic functions (types atomic.Int64 and atomic.Uint64 are automatically aligned). The first word in an allocated struct, array, or slice; in a global variable; or in a local variable (because on 32-bit architectures, the subject of 64-bit atomic operations will escape to the heap) can be relied upon to be 64-bit aligned."
func AssembleHLC ¶
AssembleHLC does the simple addition, but takes care of the type conversation too. For safety, it masks off the low 16 bits of lc that should always be 0 anyway before doing the addition.
func PhysicalTime48 ¶
func PhysicalTime48() HLC
PhysicalTime48 rounds up to the 16th bit the UnixNano() of the current time, as requested by the Hybrid-Logical-Clock algorithm. The low order 16 bits are used for a logical counter rather than nanoseconds. The low 16 bits are always zero on return from this function.
func (*HLC) Aload ¶
Aload does an atomic load of hlc and returns it. Callers of ReceiveMessageWithHLC, for instance, should use Aload() to read their HLC atomically (for the argument m) before calling ReceiveMessageWithHLC(m), if the possibility of data races exists (if more than one goroutine can see m).
func (*HLC) Count ¶
Count atomically reads the hlc and returns the lower 16 bits. The naming reflects the terminology in the original paper.
func (*HLC) CreateAndNow ¶
CreateAndNow is the same as CreateSendOrLocalEvent but also returns the raw time.Time before hlc conversion of the low 16 bits.
func (*HLC) CreateSendOrLocalEvent ¶
CreateSendOrLocalEvent updates the local hybrid clock j based on PhysicalTime48. POST: r == *hlc
func (*HLC) LC ¶
LC atomically reads the hlc and returns the upper 48 bits. The naming reflects the terminology in the original paper.
func (*HLC) ReceiveMessageWithHLC ¶
ReceiveMessageWithHLC updates the local hybrid clock hlc based on the received message m's hybrid clock.
PRE: to avoid data races, m should be owned exclusively by the calling goroutine, or if shared, m should be the result of an atomic load with m := sourceHLC.Aload().
POST: r == *hlc
type HLCInterval ¶
type HLCInterval struct {
Begin HLC // first HLC assigned
Endx HLC // exclusive upper bound (one past last)
}
HLCInterval represents a half-open interval [Begin, Endx) of HLC timestamps assigned during a Batch.Commit.
type IntegrityError ¶
type IntegrityError struct {
Check string // which check failed
Detail string // human-readable details
Fatal bool // if true, subsequent checks may be unreliable
}
IntegrityError describes a single integrity violation.
func (IntegrityError) Error ¶
func (e IntegrityError) Error() string
type InternalChild ¶
func (*InternalChild) DecodeMsg ¶
func (z *InternalChild) DecodeMsg(dc *msgp.Reader) (err error)
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*InternalChild) EncodeMsg ¶
func (z *InternalChild) EncodeMsg(en *msgp.Writer) (err error)
EncodeMsg implements msgp.Encodable
func (*InternalChild) Gstring ¶
func (z *InternalChild) Gstring() (r string)
func (*InternalChild) MarshalMsg ¶
func (z *InternalChild) MarshalMsg(b []byte) (o []byte, err error)
MarshalMsg implements msgp.Marshaler
func (*InternalChild) Msgsize ¶
func (z *InternalChild) Msgsize() (s int)
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*InternalChild) UnmarshalMsg ¶
func (z *InternalChild) UnmarshalMsg(bts []byte) (o []byte, err error)
UnmarshalMsg implements msgp.Unmarshaler
func (*InternalChild) UnmarshalMsgWithCfg ¶
func (z *InternalChild) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
type InternalNode ¶
type InternalNode struct {
CommonNode `zid:"0"` // 20 bytes
// given n pivots,
Pivots [FLEXTREE_INTERNAL_CAP]uint64 `zid:"1"` // 8 bytes * 30 = 240 bytes (260 total)
// we need n+1 children.
Children [FLEXTREE_INTERNAL_CAP_PLUS_ONE]InternalChild `zid:"2"` // 12 bytes * 31 == 372 bytes (632 total)
// On-disk slot IDs for CoW persistence: tracks which page slot
// each child occupies on disk. During CoW sync, dirty nodes are
// written to freshly allocated slots, and the parent's
// ChildSlotIDs[idx] records the child's new slot location.
ChildSlotIDs [FLEXTREE_INTERNAL_CAP_PLUS_ONE]int64 `zid:"3"` // 8 bytes * 31 == 248 bytes (880 total)
}
InternalNode contains keys and offsets to children. Size is fixed (arrays not slices).
func (*InternalNode) DecodeMsg ¶
func (z *InternalNode) DecodeMsg(dc *msgp.Reader) (err error)
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*InternalNode) EncodeMsg ¶
func (z *InternalNode) EncodeMsg(en *msgp.Writer) (err error)
EncodeMsg implements msgp.Encodable
func (*InternalNode) Gstring ¶
func (z *InternalNode) Gstring() (r string)
func (*InternalNode) MarshalMsg ¶
func (z *InternalNode) MarshalMsg(b []byte) (o []byte, err error)
MarshalMsg implements msgp.Marshaler
func (*InternalNode) Msgsize ¶
func (z *InternalNode) Msgsize() (s int)
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*InternalNode) StringOnTree ¶
func (ie *InternalNode) StringOnTree(tree *FlexTree) (r string)
func (*InternalNode) UnmarshalMsg ¶
func (z *InternalNode) UnmarshalMsg(bts []byte) (o []byte, err error)
UnmarshalMsg implements msgp.Unmarshaler
func (*InternalNode) UnmarshalMsgWithCfg ¶
func (z *InternalNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)
type Iter ¶
type Iter struct {
// contains filtered or unexported fields
}
func (*Iter) Close ¶
func (it *Iter) Close()
Close marks the iterator as closed and releases cursor state. The transaction manages lock lifetime, not the iterator.
func (*Iter) FetchV ¶
FetchV fetches the current iterator value from the VLOG if the iterator points to a large value. For inline values, this is equivalent to Vin(). Returns the value bytes and any error from the VLOG read. If the value is inline and not large, it will still be returned (and the error will be nil).
func (*Iter) GetAnySize ¶ added in v0.9.7
GetAnySize returns values large or small, if available.
func (*Iter) KV ¶ added in v0.7.1
KV returns the current *KV with any inline value resolved (lazy-copied from cache memory). It does not fetch large values from the VLOG. The returned pointer is owned by the iterator and must not be retained past the next Next()/Prev()/Seek() call. Copy Key and Value if you need them to outlive the iterator step.
func (*Iter) Key ¶
Key returns the current key. On the fast path this is a direct pointer into cache memory (zero-copy). Call dupBytes(it.Key) if you need to keep a copy beyond the next Next()/Prev() call.
func (*Iter) Large ¶
Large returns false if the current value is stored inline (small value). When true, the value is stored in the VLOG and must be fetched via FetchV(). Iteration remains cheap until you really need to see the large value.
func (*Iter) SeekFirst ¶ added in v0.11.0
func (it *Iter) SeekFirst()
SeekFirst positions the iterator at the first key.
func (*Iter) SeekLast ¶ added in v0.11.0
func (it *Iter) SeekLast()
SeekLast positions the iterator at the last key.
func (*Iter) Vel ¶
Vel can be more ergonomic than Vin. Vel returns the current inline Value as well as the two important flags to tell you how to interpret the zero len(val) situation: whether the value is truly empty, or just large and sitting in the VLOG waiting for a separate FetchV call.
Vel is a mnemonic for the returns: value, empty, large.
Plus it's fun to say with your best Young Frankenstein style faux German accent (Gene Wilder, Mel Brooks 1974)
"Vel, vel, vel... vhat have ve here? A large value??"
func (*Iter) Vin ¶
Vin returns the current value for _inline_ values. For large values stored in VLOG, Vin returns nil. If you get back nil, you must use Large() to check if the value is actually large, and then FetchV() to fetch it on demand if so. Or just use Vel() below to find out what you've got in one call.
On the fast path (empty memtable), the first call copies from the cache reference into valBuf. Subsequent calls return valBuf directly.
type KV ¶
type KV struct {
Key string
Value []byte
Vptr VPtr // Vptr.Length==1 means tombstone; Length>1: VLOG pointer; Length==0: inline/nil
Hlc HLC // hybrid logical clock timestamp. LSN like per mini batch, but has big gaps.
}
KV is a key-value pair. Tombstones are marked by Vptr.Length == tombstoneVPtrLength == 1 A nil Value with Vptr.Length == 0 is a live key with nil value (just a key that is present but has no value; this is fine).
When Vptr.Length > tombstoneVPtrLength (== 1), the value is stored in the VLOG file and Vptr contains the location. Use kv.HasVPtr() to test this.
KV is currently 64 bytes, a cache line on most systems. Be very wary of making it any bigger, as this could really slow things down.
The Key is just a string. There is no loss of generality over []byte, just advantages: a) being immutable, we can avoid slow copies; and being smaller (2 words, not 3) than a []byte our KV now fits in one 64B cache line. Moreover users cannot corrupt the Key, so it is safe to return from our internal caches. The technical reason memory-mapped systems must use []byte keys is that they are read straight from a read-only memory map. Since we do not use a memory-mapped design we can get compile time safety, cache speed, and the benefits of immutability by making Key a string.
func (*KV) HasVPtr ¶
HasVPtr returns true if the value is stored in the VLOG file. Real VLOG entries always have Length > tombstoneVPtrLength.
type KVcloser ¶ added in v0.9.25
type KVcloser struct {
KV
// contains filtered or unexported fields
}
KVcloser is the result of a Find or GetKV call. The user must call Close() on the KVcloser when done copying any value out, or else memory and resource leaks will ensue.
func (*KVcloser) Close ¶ added in v0.9.25
func (s *KVcloser) Close()
Close must be called when done with the non-nil *KVcloser result of a GetKV or Find call. Otherwise memory and resource leaks will ensue. Close() is a no-op if called on a nil *KVcloser.
func (*KVcloser) Fetch ¶ added in v0.9.25
Fetch retrieves the large value from the VLOG if this KV has a VPtr (kvc.Large() == true). For inline values, Fetch is a no-op. After Fetch returns nil error, kvc.Value holds the bytes. Fetch is only needed when LAZY_LARGE was used in the Find call.
Much more detail: after Find() returns, kvc.Value is always populated for inline (small) values, regardless of whether LAZY_SMALL was used. Only SKIP_VALUES will result in kvc.Values always being nil (even for non-nil Values in the db; SKIP_VALUES means we do not retrieve them). What "lazy" means in each case:
LAZY_LARGE: Value is not fetched from VLOG. kvc.Value == nil, kvc.Large() == true. You must call Fetch() to get bytes.
LAZY_SMALL: Value is present in kvc.Value, but it's a zero-copy alias into cache memory instead of an owned copy. The "lazy" here means "lazy about copying", not "lazy about providing the value".
So after Find() with LAZY_SMALL:
- Inline values: kvc.Value points into cache memory (or a copy if it fell back). Ready to use. - Large values (no LAZY_LARGE): auto-fetched, kvc.Value populated. Ready to use. - Large values (with LAZY_LARGE): kvc.Value == nil, need Fetch().
Requiring Fetch() only applies to the VLOG/large-value case. It doesn't need a LAZY_SMALL path because the small value is already there, it is just borrowed rather than copied.
The only obligation LAZY_SMALL imposes is that you must call Close() to release the cache pin (so the entry can be evicted).
Currently without LAZY_SMALL, Close() is a no-op (the value is an owned copy); but we reserve the right to alter this, and so require that users properly use Close() if kvc != nil.
type LeafNode ¶
type LeafNode struct {
CommonNode `zid:"0"` //
Extents [FLEXTREE_LEAF_CAP]FlexTreeExtent `zid:"1"`
// leaf linked list
Prev NodeID `zid:"2"`
Next NodeID `zid:"3"`
}
LeafNode contains keys and values. Size is fixed.
func (*LeafNode) DecodeMsg ¶
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*LeafNode) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*LeafNode) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*LeafNode) UnmarshalMsg ¶
UnmarshalMsg implements msgp.Unmarshaler
func (*LeafNode) UnmarshalMsgWithCfg ¶
type Metrics ¶
type Metrics struct {
Session bool
LiveKeyCount int64
KV128BytesWritten int64 // FlexSpace FLEXSPACE.KV.SLOT_BLOCKS file
MemWALBytesWritten int64 // FlexDB WAL (FLEXDB.MEMWAL)
REDOLogBytesWritten int64 // FLEXSPACE.REDO.LOG
FlexTreePagesBytesWritten int64 // CoW FLEXTREE.PAGES + FLEXTREE.COMMIT
VLOGBytesWritten int64 // VLOG value log
LogicalBytesWritten int64 // user payload (key + value)
TotalBytesWritten int64 // sum of all physical writes
// WriteAmp returns the write amplification factor (total physical / logical).
// Returns 0 if no logical bytes have been written.
WriteAmp float64 // TotalBytesWritten / LogicalBytesWritten
CumulativeWriteAmp float64 // totalPhysicalBytesWrit / totalLogicalBytesWrit
// Garbage metrics computed from FlexSpace block usage tracking.
// TotalFreeBytesInBlocks is the sum of dead (unused) bytes across all
// non-empty blocks. A block with 1000 live bytes out of 4 MB has
// 4_193_304 garbage bytes. Completely empty blocks are not counted
// (they are already free for reuse).
TotalFreeBytesInBlocks int64
// BlocksInUse shows how many 4MB blocks FLEXSPACE.KV.SLOT_BLOCKS is using.
BlocksInUse int64
// BlocksWithLowUtilization is the count of non-empty blocks whose
// utilization (live bytes / block size) is below the configured
// LowBlockUtilizationPct threshold (default 25%).
BlocksWithLowUtilization int64
// KVBlocksTotalLiveBytes is the sum of live (used) bytes across all blocks,
// as tracked by the block manager.
KVBlocksTotalLiveBytes int64
// KVBlocksOnDiskFootprintBytes is the on-disk footprint of bytes for FLEXSPACE.KV.SLOT_BLOCKS.
KVBlocksOnDiskFootprintBytes int64
// VlogOnDiskFootprintBytes is the on-disk footprints for LARGE.VLOG.
VlogOnDiskFootprintBytes int64
// LowBlockUtilizationPct we used; copied from Config
// or what default we used if not set.
LowBlockUtilizationPct float64
// PiggybackGCRuns is the number of piggyback GC runs during this session.
PiggybackGCRuns int64
// PiggybackGCLastDurMs is the duration of the last piggyback GC run in milliseconds.
PiggybackGCLastDurMs int64
// contains filtered or unexported fields
}
Metrics holds byte-level write counters for computing write amplification.
type NodeID ¶
type NodeID int32
NodeID acts as our union pointer.
A FlexTree LeafNode will have a NodeID > 0. A FlexTree InternalNode will have a NodeID < 0.
A zero NodeID is reserved as IllegalID so we can terminate our linked list of NodeIDs properly. It is naturally neither positive nor negative, so neither Leaf nor Internal.
const IllegalID NodeID = 0
IllegalID is used instead of nil pointers to represent an uninitialized NodeID.
func NewInternalID ¶
func (*NodeID) DecodeMsg ¶
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (NodeID) IsInternal ¶
func (NodeID) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (NodeID) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*NodeID) UnmarshalMsg ¶
UnmarshalMsg implements msgp.Unmarshaler
func (*NodeID) UnmarshalMsgWithCfg ¶
type Offlen ¶
func (*Offlen) DecodeMsg ¶
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*Offlen) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*Offlen) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*Offlen) UnmarshalMsg ¶
UnmarshalMsg implements msgp.Unmarshaler
func (*Offlen) UnmarshalMsgWithCfg ¶
type PiggybackGCStats ¶ added in v0.6.2
type PiggybackGCStats struct {
LastGCTime time.Time
LastGCDuration time.Duration
TotalGCRuns int64
}
PiggybackGCStats tracks statistics for piggyback GC runs.
type Pos ¶
type Pos struct {
Loff uint64 `zid:"0"` // logical offset of the position
Idx uint32 `zid:"1"` // index of the entry
Diff uint32 `zid:"2"` // the current position within the entry
// contains filtered or unexported fields
}
low level position on leaf nodes
func (*Pos) DecodeMsg ¶
DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.
func (*Pos) ForwardExtent ¶
func (p *Pos) ForwardExtent()
ForwardExtent advances the position to the end of the current extent.
func (*Pos) GetTag ¶
GetTag returns the tag of the current extent. Returns (0, false) if not at the start of an extent.
func (*Pos) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*Pos) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*Pos) Rewind ¶
func (p *Pos) Rewind()
Rewind moves the position to the start of the current leaf node. Bug fix: C code used i < fp->idx-1 which missed the last extent when idx==1.
func (*Pos) UnmarshalMsg ¶
UnmarshalMsg implements msgp.Unmarshaler
func (*Pos) UnmarshalMsgWithCfg ¶
type ReadOnlyDB ¶ added in v0.7.0
type ReadOnlyDB interface {
Get(key string) ([]byte, bool, error)
GetKV(key string) (kv *KVcloser, err error)
Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)
FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)
FetchLarge(kv *KV) ([]byte, error)
NewIter() *Iter
Ascend(pivot string, iter func(key string, value []byte) bool)
Descend(pivot string, iter func(key string, value []byte) bool)
AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)
DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)
Len() int64
LenBigSmall() (big, small int64)
}
ReadOnlyDB provides read-only access to the database within a View transaction. Methods must not be used after the callback returns. ReadOnlyDB is implemented by ReadOnlyTx.
To avoid indirect call overhead, this interface is not actually used in the View API. It is nice for documention purposes though; to get an overview of the available methods.
type ReadOnlyTx ¶ added in v0.9.0
type ReadOnlyTx struct {
// contains filtered or unexported fields
}
ReadOnlyTx provides read-only access to the database within a View transaction. Methods must not be used after the callback returns.
func (*ReadOnlyTx) Ascend ¶ added in v0.9.0
func (roTx *ReadOnlyTx) Ascend(pivot string, iter func(key string, value []byte) bool)
Ascend iterates keys >= pivot in ascending order until iter returns false. Use pivot="" to start from the first key.
func (*ReadOnlyTx) AscendRange ¶ added in v0.9.0
func (roTx *ReadOnlyTx) AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)
AscendRange iterates keys in [greaterOrEqual, lessThan) in ascending order. Use "" for either bound to leave it open.
func (*ReadOnlyTx) Close ¶ added in v0.9.8
func (rtx *ReadOnlyTx) Close()
func (*ReadOnlyTx) Descend ¶ added in v0.9.0
func (roTx *ReadOnlyTx) Descend(pivot string, iter func(key string, value []byte) bool)
Descend iterates keys <= pivot in descending order until iter returns false. Use pivot="" to start from the last key.
func (*ReadOnlyTx) DescendRange ¶ added in v0.9.0
func (roTx *ReadOnlyTx) DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)
DescendRange iterates keys in (greaterThan, lessOrEqual] in descending order. Use "" for either bound to leave it open.
func (*ReadOnlyTx) FetchLarge ¶ added in v0.9.0
func (roTx *ReadOnlyTx) FetchLarge(kv *KV) ([]byte, error)
FetchLarge retrieves the full value for a KV. For VLOG-stored values it reads from disk; for inline values it returns kv.Value directly.
func (*ReadOnlyTx) Find ¶ added in v0.9.0
func (roTx *ReadOnlyTx) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)
Find seeks to a key relative to the given key per smod (Exact, GTE, GT, LTE, LT) and returns a KVcloser. nil KVcloser means not found. exact is true when the returned key equals the query. Supports LAZY_SMALL, LAZY_LARGE, and LAZY flags.
func (*ReadOnlyTx) FindIt ¶ added in v0.9.0
func (roTx *ReadOnlyTx) FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)
FindIt is like Find but also returns an iterator positioned at the result. The returned KVcloser is independent of the iterator - the user must call kvc.Close() when done with the initial result, and use the iterator for continued scanning. The iterator is auto-closed when the transaction ends.
func (*ReadOnlyTx) Get ¶ added in v0.9.0
func (roTx *ReadOnlyTx) Get(key string) (value []byte, found bool, err error)
Get retrieves the value for key. Returns (nil, false, nil) if not found or deleted. The returned []byte is a copy, safe to retain.
func (*ReadOnlyTx) GetKV ¶ added in v0.9.1
func (roTx *ReadOnlyTx) GetKV(key string) (kv *KVcloser, err error)
GetKV is equivalent to roTx.Find(Exact, key).
func (*ReadOnlyTx) Len ¶ added in v0.9.0
func (roTx *ReadOnlyTx) Len() int64
Len returns the total number of live keys in the database. See also LenBigSmall to get the count partitioned by the size class (small inline, large in the VLOG).
func (*ReadOnlyTx) LenBigSmall ¶ added in v0.9.0
func (roTx *ReadOnlyTx) LenBigSmall() (big, small int64)
LenBigSmall returns live key counts partitioned by size class. The returned count big counts the keys in VLOG, while small gives the number of keys with inline values. See also Len to get the total directly.
func (*ReadOnlyTx) NewIter ¶ added in v0.9.0
func (roTx *ReadOnlyTx) NewIter() *Iter
NewIter returns a new iterator over the database. It is automatically closed when the transaction ends. It is legal to Close it sooner if you want to release resources early.
type SearchModifier ¶ added in v0.6.2
type SearchModifier int
SearchModifier controls the matching behavior of Find.
const ( // Exact matches only; like a hash table. Exact SearchModifier = 0 // GTE finds the smallest key greater-than-or-equal to the query. GTE SearchModifier = 1 // LTE finds the largest key less-than-or-equal to the query. LTE SearchModifier = 2 // GT finds the smallest key strictly greater-than the query. GT SearchModifier = 3 // LT finds the largest key strictly less-than the query. LT SearchModifier = 4 // SKIP_VALUES returns KV.Values = nil; we make // no effort to retieve values, only keys. This is // useful for very fast full-table scans of just the keys, // when the user knows they will not inspect values // at all. In contrast, LAZY keeps open the option to // look at the values, but will pay some time in overhead. SKIP_VALUES SearchModifier = 16 // LAZY_SMALL requests zero-copy return of inline values. // The returned KVcloser.Value aliases interval cache memory. // The caller MUST call Close() to release the cache pin. // If the result came from a memtable (not yet flushed), // then we must do a copy; Value is copied as usual // to avoid returning stale/non-linearizable data // (best-effort zero-copy). // LAZY_SMALL can be | or-ed with Exact, GTE, GT, LTE, or LT. LAZY_SMALL SearchModifier = 32 // LAZY_LARGE means we do not fetch LARGE.VLOG values // automatically. The User must call FetchLarge() explicitly // when they are desired. // LAZY_LARGE can be | or-ed with Exact, GTE, GT, LTE, or LT. LAZY_LARGE SearchModifier = 64 // LAZY means do both LAZY_SMALL and LAZY_LARGE LAZY SearchModifier = 96 )
type VPtr ¶
type VPtr struct {
Offset uint64 // byte offset in VLOG file
Length uint64 // value length in bytes
}
VPtr is a pointer to a value stored in the VLOG file.
type VacuumKVStats ¶
type VacuumKVStats struct {
OldFileSize int64
NewFileSize int64
BytesReclaimed int64
PaddingReclaimed int64
ExtentsRewritten int64
}
VacuumKVStats reports the results of a VacuumKV operation.
func (*VacuumKVStats) String ¶
func (z *VacuumKVStats) String() (r string)
type VacuumVLOGStats ¶ added in v0.7.0
type VacuumVLOGStats struct {
OldVLOGSize int64
NewVLOGSize int64
BytesReclaimed int64
EntriesCopied int64
IntervalsRewritten int64
}
VacuumVLOGStats reports the results of a VacuumVLOG operation.
func (*VacuumVLOGStats) String ¶ added in v0.7.0
func (z *VacuumVLOGStats) String() (r string)
type WritableDB ¶ added in v0.7.0
type WritableDB interface {
ReadOnlyDB
Put(key string, value []byte) error
Delete(key string) error
DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)
Clear(includeLarge bool) (allGone bool, err error)
Merge(key string, fn func(oldVal []byte, exists bool) (newVal []byte, write bool, doDelete bool)) error
Sync() error
}
WritableDB extends ReadOnlyDB with mutation methods. Passed to Update transaction callbacks. Writes are applied immediately to the database (no buffering). Since there is only ever a single writer at a time, and no concurrent readers, there is no point in waiting to apply each action, and "reading your own writes" is often desired/expected/required. WritableDB is implemented by WriteTx.
To avoid indirect call overhead, this interface is not actually used in the Update API. It is nice for documention purposes though; to get an overview of the available methods.
type WriteTx ¶ added in v0.9.0
type WriteTx struct {
// contains filtered or unexported fields
}
WriteTx extends ReadTx with mutation methods. Passed to Update transaction callbacks. Writes are applied immediately to the database (no buffering). Since there is only ever a single writer at a time, and no concurrent readers, there is no point in waiting to apply each action, and "reading your own writes" is often desired/expected/required.
func (*WriteTx) Ascend ¶ added in v0.9.0
Ascend iterates keys >= pivot in ascending order until iter returns false. Use pivot="" to start from the first key.
func (*WriteTx) AscendRange ¶ added in v0.9.0
func (tx *WriteTx) AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)
AscendRange iterates keys in [greaterOrEqual, lessThan) in ascending order. Use "" for either bound to leave it open.
func (*WriteTx) Clear ¶ added in v0.9.0
Clear deletes all keys. When includeLarge is false, only inline-value keys are deleted; VLOG-stored keys survive. When allGone is true, the database was reinitialized and all previously obtained iterators and KV references are invalidated.
func (*WriteTx) DeleteRange ¶ added in v0.9.0
func (tx *WriteTx) DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)
DeleteRange deletes keys in the range [begKey, endKey] with configurable inclusivity. When includeLarge is false, VLOG-stored keys are skipped. If allGone is true, the entire database was reinitialized and all previously obtained iterators and KV references are invalidated.
func (*WriteTx) Descend ¶ added in v0.9.0
Descend iterates keys <= pivot in descending order until iter returns false. Use pivot="" to start from the last key.
func (*WriteTx) DescendRange ¶ added in v0.9.0
func (tx *WriteTx) DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)
DescendRange iterates keys in (greaterThan, lessOrEqual] in descending order. Use "" for either bound to leave it open.
func (*WriteTx) FetchLarge ¶ added in v0.9.0
FetchLarge retrieves the full value for a KV. For VLOG-stored values it reads from disk; for inline values it returns kv.Value directly.
func (*WriteTx) Find ¶ added in v0.9.0
Find seeks to a key relative to the given key per smod (Exact, GTE, GT, LTE, LT) and returns a KVcloser. Getting back a nil *KVcloser means not found. The returned bool 'exact' is true when the returned key equals the query. The smod supports bitwise OR-ing with the LAZY_SMALL, LAZY_LARGE, and LAZY flags to control value flow. e.g. LAZY allows near-zero-copy full-table key-only scans.
Warning: the user must call Close() on the kvc *KVcloser when done copying any value out, or else memory and resource leaks will ensue.
func (*WriteTx) FindIt ¶ added in v0.9.0
func (tx *WriteTx) FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)
FindIt is like Find but also returns an iterator positioned at the result. The returned kvc *KVcloser is independent of the iterator - the user must call kvc.Close() when done with the initial result, and use the iterator for continued scanning. The iterator is auto-closed when the transaction ends.
func (*WriteTx) Get ¶ added in v0.9.0
Get retrieves the value for key. Returns (nil, false, nil) if not found or deleted. The returned []byte is a copy, safe to retain.
func (*WriteTx) LenBigSmall ¶ added in v0.9.0
LenBigSmall returns live key counts partitioned by storage (VLOG vs inline).
func (*WriteTx) Merge ¶ added in v0.9.0
func (tx *WriteTx) Merge(key string, fn func(oldVal []byte, exists bool) (newVal []byte, write bool, doDelete bool)) error
Merge performs an atomic read-modify-write on key. fn is always called: when the key exists, oldVal is its current value and exists=true; when the key is absent or deleted, oldVal=nil and exists=false (allowing conditional creation). Return write=false to skip the write, or doDelete=true to delete the key.
This mirrors C's flexdb_merge: it looks up the old value across all layers (active memtable, inactive memtable, FlexSpace), applies the user function, and writes the result atomically.
Q: When should the callback return doWrite=false, doDelete=false? A: This means "do nothing." The main scenarios:
Conditional creation: The callback inspects exists and decides not to create the key. E.g., "only increment if the key already exists" - if exists=false, return a bare return (all zeros = no-op).
Conditional update: The callback inspects the old value and decides no change is needed. E.g., "set to X only if current value isn't already X."
Read-only peek: The callback just wants to see the current value (though Get is simpler for that).
.
func (*WriteTx) NewIter ¶ added in v0.9.0
NewIter returns a new iterator over the database. It is automatically closed when the transaction ends. It is legal to Close it sooner if you want to release resources early.
func (*WriteTx) Put ¶ added in v0.9.0
Put writes key -> value. len(value) == 0 is fine, if desired. Call Delete instead of Put to delete a key and any associated value.
Values of any size are accepted. Values > vlogInlineThreshold (64 bytes) are stored in the VLOG file; smaller values are stored inline in the FLEXSPACE.KV.SLOT_BLOCKS file with the keys.
Large values are written exactly once: to the VLOG. The WAL stores only the VPtr (16 bytes), not the full value.
Puts are not durably on disk until after the user has also completed a db.Sync() call. This allows the user to control the rate of fsyncs and trade that against their durability requirements.
Source Files
¶
Directories
¶
| Path | Synopsis |
|---|---|
|
cmd
|
|
|
ydiff
command
|
|
|
yload
command
|
|
|
yogabench
command
cmd/yogabench/main.go — Go benchmark harness for YogaDB, equivalent to the C FlexDB benchmarks in study/eurosys22-artifact/flexdb_benchmark/.
|
cmd/yogabench/main.go — Go benchmark harness for YogaDB, equivalent to the C FlexDB benchmarks in study/eurosys22-artifact/flexdb_benchmark/. |
|
yvac
command
|
|
|
yview
command
|
|