yogadb

package module
v0.11.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 8, 2026 License: MIT Imports: 34 Imported by: 0

README

YogaDB: a flexible-address-space key-value store in Go (golang)

Bottom line up front: with minor modifications, and some sophisticated system design, B-trees can beat LSM-trees on both write throughput and latency. (And B-trees have always stomped LSM-trees on reads).

benchmarks up front:

Let's compare three embedded, persistent, ordered key-value stores in Go. Each of these can support more data than can fit in main memory, persists its data on disk to survive reboots, and each supports range queries and reading keys back in sorted order.

  • random writes
Pebble (github.com/cockroachdb/pebble): Log Structured Merge (LSM) tree
-------------------------
1.2 seconds to write 100K random 21 byte keys and values.

BoltDB (go.etcd.io/bbolt): LMDB based memory-mapped B-tree
-------------------------
9 seconds to write 100K random 21 byte keys and values.

YogaDB: FlexSpace architecture
-----------------
400 msec to write 100K random 21 byte keys and values.

Here YogaDB is 2x faster than Pebble and 20x faster than BoltDB at random writes.

  • read (iterate sequentially through all keys)
Pebble      114.9 iter_ns/key
Bolt         16.7 iter_ns/key
YogaDB        7.6 iter_ns/key

YogaDB is about 2x faster than BoltDB for full-table-scan read performance, and 15x faster than Pebble.

what is YogaDB?

YogaDB is a Go port of the https://github.com/flexible-address-space/flexspace FlexSpace C project. YogaDB is an embedded key-value engine-as-a-library, similar in form to BoltDB(bbolt), Sqlite3, RocksDB, and LevelDB. YogaDB offers sorted-order key-range iteration methods Ascend/Descend alongside simple Put, Get, and Delete. Write batching and transactions are available. Large values are stored in a separate VLOG. Small values are stored inline with their keys.

Originally called FlexSpaceKV, we changed the name to YogaDB because (a) yoga keeps you flexible, and (b) there were too many other things with "flex" in the name[4]. Below you may see both terms; YogaDB in Go is the port of FlexDB in C. In the architecture, in both cases, the name FlexDB refers to the top layer. FlexDB calls down into the FlexSpace, which in turn calls down to the FlexTree.

Personally, I find it charming that a "back to basics" design is competitive with the latest "fractured", "splintered"[2], or "turtled" write-optimized research[3] that the $\beta^{\epsilon}$-tree folks have been pursuing. These could be worth reviewing, but to my thinking are overly baroque (and difficult to build; their C/C++ dependencies are a nightmare).

image

Figure 2: the fundamentally new data structure is the FlexTree.

The FlexSpace architecture is heavily inspired by Log-structured File Systems (LFS), and how filesystems in general handle extents (contiguous spans of disk space). In addition, FlexSpace introduces a new data structure called the FlexTree; a B-tree variant that has asymptotically better big-O performance for Insert-Range extent operations. The third idea is to separate logical from physical address space, and to sort in logical space to minimize physical data movement.

  • Why YogaDB: faster reads and writes than LSM trees. About 2x faster writes.

Part of the speed is due to deferring compaction and on-disk garbage collection (often called vacuuming, cf PostgreSQL). So YogaDB becomes a space-time tradeoff. You can (temporarily) use more disk space to go faster. Then call VacuumKV() and VacuumVLOG() only when you are ready.

Quick benchmark loading the lines in assets/ as both key and value:

I. Pebble (LSM-tree based engine under CockroachDB; in Go): 6.45 seconds.

jaten@jbook ~/yogadb/load_pebble (master) $ make
rm -rf bench_pebble.db
rm -f load_pebble
go build
./load_pebble ~/all
2026/03/05 20:56:02 Found 0 WALs

pebble.go:75 [goID 1] 2026-03-05 20:56:02.667426000 +0000 UTC note: duplicating key as value too to see VLOG fill. This will double our logical write!
Starting Batched Pebble ingestion to dbPath: bench_pebble.db ...
Starting Batched Pebble ingestion...
Pebble ingestion finished in: 6.447456109s
--------------------------------------------------
Total Bytes Ingested (Logical):  24129150 bytes
Total Bytes Flushed (Physical):  34065912 bytes
Total Bytes Compacted:           21021586 bytes
Total Write Amplification (WA):  2.28
du -sh bench_pebble.db
 14M	bench_pebble.db
jaten@jbook ~/yogadb/load_pebble (master) $ 

II. YogaDB (herein; the Flexspace design in Go): 3.57 seconds.

jaten@jbook ~/yogadb/load_yogadb (master) $ make
rm -rf bench.db
rm -f load_yogadb
go build
./load_yogadb ~/all
using REDO.LOG: true

yoga.go:120 [goID 1] 2026-03-05 20:55:44.415275000 +0000 UTC note: duplicating key as value too to see VLOG fill. This will double our logical write!
Starting Batched YogaDB ingestion to dbPath: bench.db ...
YogaDB ingestion finished in: 3.565796103s
--------------------------------------------------
finMetrics = Metrics{
      (just this) Session: true
        KV128 BytesWritten: 92434432
       MemWAL BytesWritten: 27422057
      REDOLog BytesWritten: 1354032
FlexTreePages BytesWritten: 823472
   LARGE.VLOG BytesWritten: 2910390
      Logical BytesWritten: 22816132
        Total BytesWritten: 124944383
                 WriteAmp: 5.476

   -------- lifetime totals over all sessions  --------  
    TotalLogical BytesWrit: 22816132
   TotalPhysical BytesWrit: 124944383
       CumulativeWriteAmp: 5.476

   -------- free space / block utilization --------  
            TotalLiveBytes: 46217216 (44.08 MB)
    TotalFreeBytesInBlocks: 4114432 (3.92 MB)
      FLEXSPACE_BLOCK_SIZE: 4.00 MB
               BlocksInUse: 12  (48.00 MB)
  BlocksWithLowUtilization: 1

   -------- based on parameters used --------  
    LowBlockUtilizationPct: 50.0 %
}

du -sh bench.db
 49M	bench.db
 
III. Run the batch_bench_test.go random write benchmarks
(versus the assets/ which are kind of sorted already)

YogaDB   no redo log: 3427 ns/random write
YogaDB with redo log: 2955 ns/random write
Pebble:               6011 ns/random write

These benchmarks write in batches, and defer the one fsync
until the end of all writes for maximum performance.

YogaDB is 2x faster if the Redo log is used, a tiny bit
slower without the redo log (omitting the redo log
has the advantage of reducing the write amplification
a little; again it is a space-time trade-off).

go test -v -run=xxx -bench BigRandomRWBatch
goos: darwin
goarch: amd64
pkg: github.com/glycerine/yogadb
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz

BenchmarkYogaDB_BigRandomRWBatch
BenchmarkYogaDB_BigRandomRWBatch/batch_1000/redo=0/write=1_
BenchmarkYogaDB_BigRandomRWBatch/batch_1000/redo=0/write=1_-8         	  313790	      3427 ns/op	      1000 batch_size	  41277492 disk_bytes	  17572240 logical_bytes	      3427 our_PUT_ns/op	         2.349 write_amp	 103005932 yoga_total_physical_bytes_written	         5.862 yoga_write_amp

BenchmarkYogaDB_BigRandomRWBatch/batch_1000/redo=1/write=1_
BenchmarkYogaDB_BigRandomRWBatch/batch_1000/redo=1/write=1_-8         	  369402	      2955 ns/op	      1000 batch_size	  48995128 disk_bytes	  20686512 logical_bytes	      2955 our_PUT_ns/op	         2.368 write_amp	 121808896 yoga_total_physical_bytes_written	         5.888 yoga_write_amp

BenchmarkPebble_BigRandomRWBatch
BenchmarkPebble_BigRandomRWBatch/batch_1000_write=1
BenchmarkPebble_BigRandomRWBatch/batch_1000_write=1-8                 	  168066	      6011 ns/op	      1000 batch_size	  14579319 disk_bytes	   9411696 logical_bytes	      6011 our_PUT_ns/op	         1.926 pebble_write_amp	         1.549 write_amp

finished at Thu Mar  5 17:43:01
  • How: Data is write-ahead-log appended Bitcask/Haystack style to disk once, and then (almost) never re-written until manual vacuuming is invoked for on-disk GC. Keys are sorted in memory in a FlexSpace, which is optimized for logical sorting. Persistence of the sorted structure is done with Copy-on-write. YogaDB's FlexSpace can handle more keys than can fit in RAM.

  • The result: the C version gets 2x the throughput of RocksDB (on both writes and reads). See pages 10-14 of the paper. Also, much better latency.

  • Bottom line: Your SSD and NVMe drives are your friends. They want to help you write quickly while not sacrificing read performance. Let them! (This design is somewhat optimized for them rather than traditional spinning platters).

image

A note about Table 2: the Write-Amplification circa 1.03 only applies to the middle layer, FlexSpace, alone. Write Amplification is much greater (3-6x, not atypical for comparable systems) once the entire system is assembled.

[2] https://arxiv.org/abs/2510.05518 https://splinterdb.org/

[3] https://arxiv.org/abs/2509.10714

[4] The FlexSpace paper should not be confused with "Flex-KV: Enabling High-performance and Flexible KV Systems" 2012 by Phanishayee et al. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/flexkv_mbds12.pdf which is from a completely different research group.

more thorough (YCSB) benchmarks (these are ports in progress)

external; from PingCap: https://github.com/glycerine/go-ycsb/tree/yogadb

internal: https://github.com/glycerine/yogadb/tree/master/cmd/yogabench

Reference C code repository

[5] https://github.com/flexible-address-space/flexspace

theory of operation (the paper)

[6] "Building an Efficient Key-Value Store in a Flexible Address Space", by C. Chen, W. Zhong, X. Wu, EuroSys ’22, April 5–8, 2022, RENNES, France.

[ACM DL] [Paper PDF] [Slides]

design summary

In his Recap slide #56, Dr. Chen says,

"FlexDB manages sorted data without using extra persistent indirections."

The approach is to map keys to physical locations on disk without maintaining a massive, constantly updating on-disk index.

The strategy is to split the problem into two in-memory structures.

The Logical Illusion (FlexDB): FlexDB uses the insert-range API to shift logical addresses. The keys then appear perfectly sorted in a contiguous logical space. FlexDB only needs a tiny, lightweight, sparse index (e.g., "Keys A through M are in Logical Block 1"), which easily fits entirely in RAM.

The Extent Map (FlexSpace/FlexTree): The FlexTree maps those logical blocks to physical SSD blocks. Because it tracks large extents (megabytes of data at a time) rather than individual keys, it is also highly compressed and also fits entirely in RAM.

comment on the API / goroutine safety

  • API notes: all Go API calls are goroutine safe. At most one writer at a time is enforced with a top-level sync.RWMutex.

getting started

go get github.com/glycerine/yogadb

import "github.com/glycerine/yogadb"

The small example programs that have a single aim are probably the easiest way to understand how to embed yogadb. See cmd/yload to load a bunch of key/value pairs. See cmd/yview to dump them back out. See cmd/yvac to vacuum your database. load_yogadb does all of the above and was used for running some benchmarks; it is another example.

api documentation

https://pkg.go.dev/github.com/glycerine/yogadb

VLOG data integrity with Blake3

  1. vlog.read() now verifies Blake3 cryptographic hash of the value bytes on every VLOG read, in addition to the existing hdrCRC and valCRC checks. Any VLOG corruption (bit flip, truncation, wrong entry) will be caught.

  2. db.CheckIntegrity() has a new Check 8 which walks every decoded KV in every anchor interval; for each KV with a VPtr, calls vlog.read() which verifies hdrCRC, valCRC, and blake3. It reports the key name and VPtr location on failure.

  3. TestFlexDB_CheckIntegrity_VLOGBlake3 is a dedicated test that writes 50 keys with large values (VLOG-bound), overwrites half with same values (exercises dedup path) and half with different values, then verifies integrity before and after close/reopen.

grab bag of implementation notes

in no particular order -- and mostly for my reference. Users/casual readers can ignore the rest of this file(!)

Current Architecture

FlexDB has three layers:
1. FlexTree (flextree.go, pages.go) - B-tree extent index with CoW page persistence
2. FlexSpace (flexspace.go) - Log-structured file layer, stores data in 4MB blocks
3. FlexDB (db.go) - Full KV store with memtable, sparse index, interval cache

Current read path (FlexDB.Get):

1. Check memtables (in-memory btree) - no disk I/O
2. Check FlexSpace via sparse index -> getPassthrough() -> loadInterval():
  - Allocates itvbuf := make([]byte, anchor.psize)
  - Calls db.ff.ReadFragmentation(itvbuf, ...) which does fd.ReadAt() (pread)
  - Calls kv128Decode(src) which allocates new slices for key and value (make([]byte, klen), copy)
  - Cache entries hold the decoded KVs (heap-allocated copies)
  - Get() then copies the value again before returning to caller

Current write path:

- FlexSpace.fd is opened O_RDWR
- Block manager has a 4MB in-memory write buffer
- Writes go to buffer first, then fd.WriteAt() when buffer is full or on sync
- Reads check buffer first, fall back to fd.ReadAt() for committed data

File layout:

FlexDB dir/
├── FLEXSPACE.KV.SLOT_BLOCKS <- All KV data (random pread/pwrite, grows)
├── FLEXSPACE.REDO.LOG     <- Redo log (append-only)
├── FLEXTREE.COMMIT        <- CoW commit records (append-only, 64KB)
├── FLEXTREE.PAGES         <- CoW pages (random read/write, 1024-byte pages)
├── FLEXDB.MEMWAL1         <- WAL for memtable 0
├── FLEXDB.MEMWAL2         <- WAL for memtable 1
└── LARGE.VLOG             <- values > 64 bytes are stored separately here.

notes:

- kv128Decode already copies keys and values into new allocations - NOT zero-copy
- FlexTree nodes are decoded from 1024-byte pages into Go structs - NOT zero-copy
- Get() copies the value before returning - caller always gets an independent copy
- The block manager buffer handles recent writes; pread handles committed data

VLOG: the Value Log (big values (those > 64 bytes) are kept here).

VLOG Properties

Not all values go to VLOG. Only values > vlogInlineThreshold (64 bytes) 
are stored in VLOG. Smaller values are stored inline in 
the kv128-encoded entry within FlexSpace. The split is:

┌──────────────┬─────────────────────────────────┬───────────────────────────────────────────────┐
│  Value size  │            Stored in            │                 WAL contains                  │
├──────────────┼─────────────────────────────────┼───────────────────────────────────────────────┤
│nil(tombstone)│ nothing                         │ kv128 tombstone marker                        │
├──────────────┼─────────────────────────────────┼───────────────────────────────────────────────┤
│<= 64 bytes   │ inline in FlexSpace kv128 entry │ full inline kv128 entry                       │
├──────────────┼─────────────────────────────────┼───────────────────────────────────────────────┤
│> 64 bytes    │ VLOG file                       │ kv128 entry with 12-byte VPtr (offset+length) │
└──────────────┴─────────────────────────────────┴───────────────────────────────────────────────┘

VLOG can be disabled entirely with Config{DisableVLOG: true}, in 
which case all values are inline regardless of size.

VLOG entry format (post-HLC) has 5 fields (HLC = hybrid logical/physical clock; BE = big endian)

field 1: [header CRC: 4-byte CRC32C]  bytes [0:4)
field 2: [8-byte HLC BE]              bytes [4:12)
field 3: [8-byte length: N]           bytes [12:20)
field 4: [value CRC: 4-byte CRC32C]   bytes [20:24)
field 5: [N-byte value]               bytes [24:Z) where Z=24+N

legend:
[header CRC: 4-byte CRC32C] = covering bytes [4:24) also known as fields 2, 3, and 4: 
from the HLC through the value CRC inclusive.

[ value CRC: 4-byte CRC32C] = covering only the N-byte value that follows it.

The length field N stores only the value length (not including HLC). 
So the total on-disk entry size is 4 + 8 + 8 + 4 + len(value) == 24 + len(value).

VLOG's role in crash recovery

The VLOG plays no active role in the recovery procedure. 
Recovery (db.go:2282) does two things:

1. Rebuilds the sparse index by scanning FlexSpace tags
2. Replays WAL logs - decoding kv128 entries and 
applying them via putPassthrough/deletePassthrough

The WAL entries for large values contain VPtrs (not the actual values). 
The actual values already reside durably in the VLOG file because 
the VLOG is always fsynced before the WAL entry referencing it is
written:

- Put(): calls vlog.appendAndSync() (fsync), then writes VPtr to WAL
- Batch.Commit(): calls vlog.appendBatchAndSync() (single fsync 
for all large values), then writes VPtr kv128 entries to WAL

So after a crash, the WAL replay reconstructs VPtrs in FlexSpace 
that point to already-durable VLOG entries. The VLOG file itself 
is never scanned or replayed - it's just a passive store that
resolveVPtr() reads from at query time.

About VLOG garbage collection:

Dead VLOG entries (from overwrites/deletes) accumulate until 
VacuumVLOG() is called explicitly. 
VacuumVLOG() walks all live intervals, copies live VPtr values
to a new VLOG file (preserving HLCs), rewrites VPtrs
in FlexSpace, then atomically renames new -> old.

Transactions (Tx)

YogaDB offers linearizable data access. A sync.RWMutex allows one writer at a time, and multiple readers can read simultaneously (when there is no writer).

Read-only (View) and read-write (Update) transactions are available.


  In db.go:
   - topMutRW sync.RWMutex in FlexDB for serializing write transactions
   - Tx struct with atomic state, COW btree snapshots, write buffer
   - db.Update(fn) - serializable read-write transaction (single-writer via txMu)
   - db.View(fn) - concurrent read-only transaction
   - tx.Get() - reads write buffer > memtable snapshots > FlexSpace
   - tx.Put() / tx.Delete() - buffers writes locally (Update only)
   - tx.Commit() / tx.Cancel() - first-wins semantics
          
 Tests of Tx: flexdb_tx_test.go (11 tests):
   - Double Commit, Double Cancel (both idempotent)
   - Cancel->Commit (returns ErrTxDone), Commit->Cancel (Cancel is no-op)
   - Update basic, View basic, Update sees own writes
   - Cancel discards writes, Auto-cancel on return without Commit/Cancel
   - View rejects Put/Delete with ErrTxNotWritable
   - Serialized concurrent Updates (10 goroutines increment a counter to exactly 10)

Q and A

Q1: What is the system architecture? I see mention of FlexTree, FlexSpace, and FlexDB. How are they related?

A1:

System Architecture (Three Layers)

Note that these three layers are the same in the Go version and the C version. The C/paper calls the top layer FlexDB. For ease of comprehension and better continuity, we keep the name FlexDB for the top level Go data struct in db.go. To make the Go versus C distinction clear, we gave the Go project as a whole its own distinct name (YogaDB). The architecture is fundamentally the same, although the Go version has added features above and beyond the C. The slotted page design supporting updates in place; the hybrid logical clock timestamps to give each pair a last written timestamp; the iterator design; and the transaction design: these are original and specific to the Go project.


FlexDB (kv store)               <- db.go
    └── FlexSpace (file layer)  <- flexspace.go
            └── FlexTree (B-tree extent index)  <- flextree.go
                    └── CoW persistence         <- pages.go

(bottom) Layer 1: FlexTree (flextree.h/.c -> flextree.go)

A B-tree that maps logical offsets (loff) to physical offsets (poff). Each entry is an Extent. Gentle reader, this is largely a metaphor, inspired by the filesystem extent idea. It is not actually referring to actual filesystem extents. Sublimely, however, the physical offset, the poff, is actually pointing to an actual file/offset on disk. Insertions shift all subsequent logical offsets, so the tree supports true insert-range semantics (not just overwrite).

Key insight: Each internal node child carries a shift accumulator. When data is inserted at a logical offset, only the path from root to that leaf is updated; other subtrees lazily absorb the shift when traversed. This gives ~O(log n) insert cost and write amplification near 1.

(middle) Layer 2: FlexSpace (flexfile.h/.c -> flexspace.go)

A log-structured file abstraction on top of FlexTree providing:

  • Log-structured writes with 8 MB in-memory buffer, 2 GB max on-disk log
  • 4 MB blocks with a block manager
  • Garbage collection (GC threshold: 64 blocks)
  • Crash recovery by replaying the logical log
  • Operations: Read, Write, Insert, Collapse (delete range), Sync
  • Handler API (FlexSpaceHandler) for stateful cursors with sequential reads

C vs Go I/O: The C version uses io_uring for async I/O. The Go port uses synchronous pread/pwrite via os.File.ReadAt/WriteAt. This is a known difference; Go's goroutine scheduler provides concurrency at a different level.

(top) Layer 3: FlexDB (flexdb.h/.c -> db.go) a.k.a YogaDB

A full persistent ordered key-value store on top of FlexSpace providing:

  • MemTable: tidwall/btree B-tree (replaces C's Wormhole skip list); 1 GB cap, dual-memtable for concurrent flush
  • Sparse KV Index: Separate B-tree of anchors with configurable interval (default 32 KV pairs per anchor)
  • Interval Cache: 1024-partition LRU cache of decoded KV intervals from FlexSpace
  • WAL: Write-ahead log for crash recovery with CRC32 checksums
  • Flush Worker: Background goroutine that flushes MemTable to FlexSpace + updates sparse index
  • Operations: Put, Get, Del, Scan

Q2: We use the very fast in-memory B-tree https://github.com/tidwall/btree as the top level memtable for the top YogaDB layer. This is nice since it has Copy-on-Write snapshots. But why are there two memtables?

A2: The dual-memtable design enables non-blocking writes during flush.

With a single memtable, when it's time to flush to FlexSpace, there are two (bad) options:

a) Block all writes while the flush completes (the memtable is being read by the flush worker and can't accept new writes simultaneously)

b) Copy the entire memtable before flushing. This is expensive for a 1 GB memtable

With two memtables, the flush is just a pointer swap.

The active memtable (say mt[0]) fills up or a timer fires

doFlush() atomically swaps: mt[1] becomes the new active (accepting writes immediately), mt[0] becomes inactive.

The flush worker drains mt[0] to FlexSpace in the background. Writes to mt[1] proceed concurrently with zero blocking. When the flush finishes, mt[0] is cleared and ready for the next swap. The mtMu.Lock() is held only for the swap itself (a pointer swap - nanoseconds), not for the entire flush duration. This is the same pattern used by LSM-tree stores like LevelDB/RocksDB (they call it the "immutable memtable").

write-amplification measurement

We note when bytes are written to disk along the write path, to accurately measure write amplification.

A. planning:

File: flexspace.go
  --------------------------------------------------------

  | Line | Operation | File Desc | Size | Type | Context |
  |------|-----------|-----------|------|------|---------|
  | 198  | WriteAt   | fd (KV.SLOT_BLOCKS) | VAR  | KV.SLOT_BLOCKS | nextBlock() flush 4MB block |
  | 207  | Sync      | fd (KV.SLOT_BLOCKS) | -    | SYNC | blockMgr.flush() |
  | 313  | WriteAt   | logFD     | VAR  | KV.SLOT_BLOCKS | logSync() append log buffer |
  | 317  | Sync      | logFD     | -    | SYNC | logSync() fsync log |
  | 322  | Truncate  | logFD     | 0    | TRUNC| logTruncate() reset |
  | 329  | WriteAt   | logFD     | 8B   | FLEXTREE.COMMIT | writeLogVersion() header |

  Key Constants (flexspace.go):
  - FLEXSPACE_BLOCK_SIZE = 4 MB (1 << 22)
  - FLEXSPACE_LOG_MEM_CAP = 8 MB
  - FLEXSPACE_LOG_MAX_SIZE = 2 GB
  - flexLogVersionSize = 8 bytes

  File: db.go
  --------------------------------------------------------

  | Line | Operation | File Desc | Size | Type | Context |
  |------|-----------|-----------|------|------|---------|
  | 1095 | Write     | logFD     | VAR  | KV.SLOT_BLOCKS | logFlushLocked() WAL append |
  | 1108 | Truncate  | logFD     | 0    | TRUNC| logTruncate() reset WAL |
  | 1111 | WriteAt   | logFD     | 8B   | FLEXTREE.COMMIT | logTruncate() timestamp |

  Key Constants (db.go):
  - flexdbMemtableLogBufCap = 4 MB
  - flexdbSparseInterval = 32 KVs
  - KV128 encoded: varint(klen) + varint(vlen) + key + value

  File: saver.go
  -------------------------------------------

  | Line | Operation | File Desc | Size | Type | Context |
  |------|-----------|-----------|------|------|---------|
  | 95   | Sync      | f (TREE)  | -    | SYNC | saveFlexTree() fsync checkpoint |

  Note: Actual data writes are in msgp.Writer.Write/Flush
  - Full FlexTree serialization (greenpack msgpack format)
  - Variable size, potentially 100s of MB for large trees

  File: fdatasync_linux.go
  -------------------------------------------

  | Line | Function | Syscall | Notes |
  |------|----------|---------|-------|
  | 21   | fdatasync| Fdatasync| Linux-specific, data-only flush |
  | 30   | fallocate| Fallocate| Pre-allocate blocks |

  | Line | Function | Syscall | Notes |
  |------|----------|---------|-------|
  | 21   | fdatasync| Fdatasync| Linux-specific, data-only flush |
  | 30   | fallocate| Fallocate| Pre-allocate blocks |

  File: fdatasync_other.go
  -------------------------------------------

  | Line | Function | Fallback | Notes |
  |------|----------|----------|-------|
  | 18   | fdatasync| f.Sync() | macOS uses F_FULLFSYNC |

  ================================================================
  WRITE CALL GRAPH
  ================================================================

  User Operation: Put(key, value)
    └─> FlexDB.Put()
        └─> dbMemtable.put()
            └─> [memtable btree operation - no disk write]
        └─> dbMemtable.logAppend() [OPTIONAL: only if logBufSize full]
            └─> db.go:1095 Write(logFD) [WAL WRITE 1]
                └─> [appends to memtable WAL: MEMWAL1 or MEMWAL2]

  Periodic: Flush worker detects memtable full
    └─> FlexDB.flush()
        └─> FlexSpace.Write(flattened_memtable_kvs)
            └─> flexspace.go:719 blockManager.write()
                └─> flexspace.go:198 WriteAt(fd, KV.SLOT_BLOCKS) [KV.SLOT_BLOCKS WRITE 2]
            └─> flexspace.go:721 logWrite(flexOpTreeInsert)
                └─> [appends 16-byte entry to logBuf]
        └─> FlexSpace.Sync() [if logFull]
            └─> flexspace.go:313 WriteAt(logFD, REDO.LOG) [REDO.LOG WRITE 3]
            └─> flexspace.go:207 Sync(fd, KV.SLOT_BLOCKS) [SYNC 1]
            └─> flexspace.go:317 Sync(logFD) [SYNC 2]

  Checkpoint: FlexSpace.Sync() when log exceeds 2 GB
    └─> saveFlexTree(tree, FLEXTREE)
        └─> msgp.Writer.Write() [TREE WRITE 4]
        └─> saver.go:95 Sync(f) [SYNC 3]
        OR
    └─> FlexTree.SyncCoW()
        └─> pages.go:373 WriteAt(nodeFD, FLEXTREE.PAGES) [NODE WRITE 4]
        └─> pages.go:411 WriteAt(nodeFD, FLEXTREE.PAGES) [NODE WRITE 4]
        └─> pages.go:306 fdatasync(nodeFD) [SYNC 3]
        └─> pages.go:332 WriteAt(metaFD, FLEXTREE.COMMIT) [FLEXTREE.COMMIT WRITE 5]
        └─> pages.go:335 fdatasync(metaFD) [SYNC 4]

  Growth events:
    └─> pages.go:274 fallocateFile(nodeFD) [PRE-ALLOC]
    └─> pages.go:278 Sync(nodeFD) [GROWTH SYNC]

  ================================================================
  WRITE AMPLIFICATION ANALYSIS
  ================================================================

  Example: Insert 1 KB of KV data

  Path 1: Memtable -> FlexSpace -> Greenpack checkpoint
  -------------------------------------------------
  1. Memtable WAL (MEMWAL1):
     - KV128 encode: ~20 bytes (varint headers) + 1000 bytes = ~1020 bytes
     - Buffered; flushed when buffer >= 4 MB
     - WRITE COUNT: ~1020 bytes (amortized across batch)

  2. FlexSpace -> KV.SLOT_BLOCKS block:
     - Data write: 1024 bytes (to 4 MB block)
     - LogOp entry: 16 bytes
     - WRITE COUNT: 1024 bytes data + 16 bytes log

  3. FlexSpace REDO.LOG:
     - Flush when >= 8 MB or explicit Sync
     - Contains 16-byte entries per operation
     - WRITE COUNT: amortized 16 bytes per operation

  4. FlexTree Checkpoint (greenpack):
     - Full tree serialization when log exceeds 2 GB
     - Variable size; compressed msgpack
     - Estimate: ~30-50 bytes per leaf extent node state
     - WRITE COUNT: depends on tree size (sparse data structure)

  Total for one 1 KB operation:
  - WAL: 1020 bytes
  - KV.SLOT_BLOCKS: 1024 bytes
  - REDO.LOG: 16 bytes
  - Checkpoint: amortized 30-50 bytes (infrequent)
  - ≈ 2060+ bytes per 1000 bytes data = 2x+ amplification minimum

  Path 2: Memtable -> FlexSpace -> CoW checkpoint
  -------------------------------------------------
  Same as Path 1, but checkpoint is:

  4. FlexTree Checkpoint (CoW):
     - Write dirty nodes only: 1024 bytes per node
     - Memtable write affects leaf node (holds extents)
     - FLEXTREE.COMMIT record: 64 bytes per checkpoint
     - WRITE COUNT: 1024 bytes (node) + 64 bytes (FLEXTREE.COMMIT)

  Total for one 1 KB operation:
  - WAL: 1020 bytes
  - KV.SLOT_BLOCKS: 1024 bytes
  - REDO.LOG: 16 bytes
  - CoW: 1024 bytes (node) + 64 bytes (FLEXTREE.COMMIT) if dirty
  - ≈ 3148+ bytes per 1000 bytes data = 3x+ amplification

  ================================================================
  INSTRUMENTATION OPPORTUNITIES
  ================================================================

  Current Write Sites:

  1. pages.go:373 - Leaf page write
     Counter: bytes += 1024 (or len(pageBuf[:]))

  2. pages.go:411 - Internal page write
     Counter: bytes += 1024

  3. flexspace.go:198 - KV.SLOT_BLOCKS block write
     Counter: bytes += bm.blkoff

  4. flexspace.go:313 - REDO.LOG buffer write
     Counter: bytes += ff.logBufSize

  5. db.go:1095 - Memtable WAL write
     Counter: bytes += len(m.logBuf[:m.logBufSize])

  6. saver.go - Greenpack write (via msgp.Writer)
     Counter: Wrap msgp.Writer to count bytes

  7. pages.go:332 - FLEXTREE.COMMIT record write
     Counter: bytes += 64

  Total formula:
  WA = (CoW_pages + REDO.LOG_buffer + KV.SLOT_BLOCKS + WAL_buffer) / original_kv_size

  ================================================================

summary document with the complete mapping:

Summary

I have completed a thorough inventory of every disk write operation in the yogadb codebase. Here are the findings:

Complete Write Operations Inventory

Total Unique Write Sites: 20 operations across 6 files

Layer 1: FlexTree CoW Persistence (pages.go)

┌──────┬───────────┬───────────────┬─────────────┬───────────┬─────────────────────┐
│ Line │ Operation │       FD      │    Size     │   Type    │       Context       │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 373  │ WriteAt   │FLEXTREE.PAGES │ 1024B       │ Data      │ Leaf page write     │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 411  │ WriteAt   │FLEXTREE.PAGES │ 1024B       │ Data      │ Internal page write │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 332  │ WriteAt   │FLEXTREE.COMMIT│ 64B         │ Meta      │ Commit record       │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 306  │ fdatasync │FLEXTREE.PAGES │ -           │ Sync      │ Data flush          │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 335  │ fdatasync │FLEXTREE.COMMIT│ -           │ Sync      │ Data flush          │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 274  │ fallocate │FLEXTREE.PAGES │ 2x previous │ Pre-alloc │ Doubling growth     │
├──────┼───────────┼───────────────┼─────────────┼───────────┼─────────────────────┤
│ 278  │ Sync      │FLEXTREE.PAGES │ -           │ Sync      │ Growth fsync        │
└──────┴───────────┴───────────────┴─────────────┴───────────┴─────────────────────┘

Layer 2: FlexSpace Log-Structured (flexspace.go)

┌──────┬───────────┬─────────────┬──────────┬───────┬───────────────────┐
│ Line │ Operation │     FD      │   Size   │ Type  │      Context      │
├──────┼───────────┼─────────────┼──────────┼───────┼───────────────────┤
│ 198  │ WriteAt   │ fd (KV.SLOT_BLOCKS)   │ variable │ Data  │ 4MB block flush   │
├──────┼───────────┼─────────────┼──────────┼───────┼───────────────────┤
│ 313  │ WriteAt   │ logFD (REDO.LOG) │ variable │ Data  │ Log buffer append │
├──────┼───────────┼─────────────┼──────────┼───────┼───────────────────┤
│ 329  │ WriteAt   │ logFD (REDO.LOG) │ 8B       │ Meta  │ Version header    │
├──────┼───────────┼─────────────┼──────────┼───────┼───────────────────┤
│ 207  │ Sync      │ fd (KV.SLOT_BLOCKS)   │ -        │ Sync  │ Block flush sync  │
├──────┼───────────┼─────────────┼──────────┼───────┼───────────────────┤
│ 317  │ Sync      │ logFD (REDO.LOG) │ -        │ Sync  │ Log flush sync    │
├──────┼───────────┼─────────────┼──────────┼───────┼───────────────────┤
│ 322  │ Truncate  │ logFD (REDO.LOG) │ 0        │ Reset │ Log reset         │
└──────┴───────────┴─────────────┴──────────┴───────┴───────────────────┘

Layer 3: FlexDB Memtable (db.go)

┌──────┬───────────┬────────────────┬──────────┬───────┬──────────────────┐
│ Line │ Operation │       FD       │   Size   │ Type  │     Context      │
├──────┼───────────┼────────────────┼──────────┼───────┼──────────────────┤
│ 1095 │ Write     │ logFD (MEMWAL1/2) │ variable │ Data  │ WAL append       │
├──────┼───────────┼────────────────┼──────────┼───────┼──────────────────┤
│ 1111 │ WriteAt   │ logFD (MEMWAL1/2) │ 8B       │ Meta  │ Timestamp header │
├──────┼───────────┼────────────────┼──────────┼───────┼──────────────────┤
│ 1108 │ Truncate  │ logFD (MEMWAL1/2) │ 0        │ Reset │ WAL reset        │
└──────┴───────────┴────────────────┴──────────┴───────┴──────────────────┘

Files Created on Disk

CoW Mode:
- FLEXTREE.PAGES - 1024-byte pages (one per tree node)
- FLEXTREE.COMMIT - 64-byte commit records (append-only)
- REDO.LOG - 8-byte header + 16-byte redo entries
- KV.SLOT_BLOCKS - 4 MB blocks

FlexDB:
- MEMWAL1, MEMWAL2 - memtable WALs (one per memtable)
- FLEXSPACE/ - complete FlexSpace directory

Write Amplification Paths

For a single 1 KB write:

Put(key, value)
  ├─ Memtable WAL write: ~1020 bytes (KV128 encoded)
  ├─ FlexSpace KV.SLOT_BLOCKS block: 1024 bytes (to 4 MB block)
  ├─ FlexSpace REDO.LOG entry: 16 bytes
  ├─ Tree checkpoint (CoW): 1024 bytes (node) + 64 bytes (FLEXTREE.COMMIT)
  └─ Tree checkpoint (greenpack): 30-50 bytes (amortized)

Total: 2.5x-3.2x write amplification minimum

Key Files and Line Numbers

Complete mapping for instrumentation:

1. /Users/me/go/src/github.com/glycerine/yogadb/pages.go:373 - CoW leaf write
2. /Users/me/go/src/github.com/glycerine/yogadb/pages.go:411 - CoW internal write
3. /Users/me/go/src/github.com/glycerine/yogadb/pages.go:332 - FLEXTREE.COMMIT commit
4. /Users/me/go/src/github.com/glycerine/yogadb/flexspace.go:198 - KV.SLOT_BLOCKS block write
5. /Users/me/go/src/github.com/glycerine/yogadb/flexspace.go:313 - REDO.LOG buffer write
6. /Users/me/go/src/github.com/glycerine/yogadb/db.go:1095 - Memtable WAL write

These 6 sites account for all data writes; wrapping with 
atomic counters at these points will give complete write 
amplification visibility.

B. final write amplification work, summary of changes

  Changes Summary

  flextree.go

  - Added TreeBytesWritten int64 counter field to FlexTree struct

  pages.go

  - Added sync/atomic import
  - Instrumented 3 write sites with atomic.AddInt64(&t.TreeBytesWritten, ...):
    - Leaf page write (cowPageSize = 1024 bytes)
    - Internal page write (cowPageSize = 1024 bytes)
    - FLEXTREE.COMMIT record write (cowMetaSize = 64 bytes)

  flexspace.go

  - Added sync/atomic import
  - Added DataBytesWritten int64 and LogBytesWritten int64 counter fields to FlexSpace struct
  - Instrumented 3 write sites:
    - KV.SLOT_BLOCKS block flush (variable size up to 4MB)
    - Redo REDO.LOG write (variable size)
    - Log version header write (8 bytes)

  db.go

  - Added WALBytesWritten int64 and LogicalBytesWritten int64 counter fields to FlexDB struct
  - Added walCounter *int64 field to dbMemtable struct (pointer back to FlexDB.WALBytesWritten)
  - Wired up walCounter in OpenFlexDB
  - Instrumented 2 WAL write sites via the walCounter pointer
  - Added logical byte tracking in Put() and Batch.Commit()
  - Added Metrics struct with WriteAmp() method
  - Added FlexDB.Metrics() method that aggregates counters from all three layers

  batch_bench_test.go

  - Updated BenchmarkBatchYogaDB and BenchmarkPutYogaDB to report yoga_write_amp from db.Metrics().WriteAmp()
  

Plan for more aggressive Garbage Collection: Piggyback GC on Sync/Flush for YogaDB

The flush-worker goroutine already tries to flush every 5 seconds in the background. The user can choose to have that flush also do GC at the end of its flush, or to manually run Sync themselves with the Config flags set to indicate that a GC should be done at the end of a Sync; or as a third very manual option, simply call VacuumKV and/or VacuumVLOG when they wish.

## Context

YogaDB's FlexSpace GC is currently synchronous and demand-driven: it only runs when a write needs a new block and free blocks < 64. For workloads with heavy overwrites/deletes, fragmentation accumulates in `FLEXSPACE.KV.SLOT_BLOCKS` without being reclaimed until critically low on blocks. Adding a simple opt-in flag to run GC at every Sync/flush boundary when garbage exceeds a threshold.

## Approach: Piggyback on Sync/Flush (Mode A only)

Run GC at the end of `writeLockHeldSync()` and `doFlush()` when the write lock is already held. No new goroutines. No timer/cooldown — just check garbage fraction on every sync.

## Config Fields (db.go ~line 533)

```go
// PiggybackGC_on_SyncOrFlush enables automatic GC at the end of
// every Sync() and flush operation. GC runs only if the garbage
// fraction (wasted bytes / total bytes in used blocks) exceeds
// GCGarbagePct. Default false (disabled).
PiggybackGC_on_SyncOrFlush bool

// GCGarbagePct is the minimum fraction of wasted bytes in used
// blocks (garbage / (garbage + live)) required to trigger piggyback GC.
// Value between 0.0 and 1.0. Default 0.50 (50%) when zero and
// PiggybackGC_on_SyncOrFlush is true.
GCGarbagePct float64
```

## FlexDB Fields (db.go ~line 541)

```go
piggyGCStats PiggybackGCStats
```

## New Types

```go
type PiggybackGCStats struct {
    LastGCTime     time.Time
    LastGCDuration time.Duration
    TotalGCRuns    int64
}
```

## New Method: `maybePiggybackGC()` (db.go)

```go
func (db *FlexDB) maybePiggybackGC() {
    if !db.cfg.PiggybackGC_on_SyncOrFlush {
        return
    }
    threshold := db.cfg.GCGarbagePct
    if threshold <= 0 {
        threshold = 0.50
    }
    live, garbage, _, _ := db.ff.garbageMetrics(db.cfg.LowBlockUtilizationPct)
    total := live + garbage
    if total == 0 || float64(garbage)/float64(total) < threshold {
        return
    }
    start := time.Now()
    db.ff.GC()
    db.piggyGCStats.LastGCTime = time.Now()
    db.piggyGCStats.LastGCDuration = time.Since(start)
    db.piggyGCStats.TotalGCRuns++
}
```

## Call Sites

1. `writeLockHeldSync()` (db.go:1881) — after `db.ff.Sync()`:
   ```go
   db.ff.Sync()
   db.maybePiggybackGC()  // <-- insert here
   db.verifyAnchorTags()
   ```

2. `doFlush()` (db.go:3461) — after `db.ff.Sync()`:
   ```go
   db.ff.Sync()
   db.maybePiggybackGC()  // <-- insert here
   ```

## Metrics (db.go Metrics struct ~line 945)

Add to `Metrics`:
```go
PiggybackGCRuns      int64
PiggybackGCLastDurMs int64
```

Populate in `SessionMetrics()`, `CumulativeMetrics()`, `finalMetrics()`.

## Defaults
- `PiggybackGC_on_SyncOrFlush = false` → disabled (opt-in)
- `GCGarbagePct = 0` → default 0.50 when PiggybackGC_on_SyncOrFlush is true

## Files to Modify

| File | Changes |
|------|---------|
| `db.go` | Config fields, FlexDB field, `maybePiggybackGC()`, `PiggybackGCStats` type, Metrics fields, call sites in `writeLockHeldSync` and `doFlush` |
| `gc_test.go` | Add piggyback GC tests |

## Verification

1. Unit tests in `gc_test.go`:
   - `TestPiggybackGC_TriggersOnSync` — enable flag, create fragmentation, Sync, verify `TotalGCRuns > 0`
   - `TestPiggybackGC_RespectsGarbageThreshold` — set threshold=0.99, verify GC does NOT run
   - `TestPiggybackGC_DisabledByDefault` — default config, verify no piggyback GC
   - `TestPiggybackGC_ReclaimsSpace` — write, delete half, enable, Sync, check free blocks increased

2. Run full test suite: `go test ./...` — ensure no regressions

## Critical Files

- `db.go` — Config (line 498), FlexDB struct (line 541), `writeLockHeldSync` (line 1857), `doFlush` (line 3440), Metrics (line 945)
- `flexspace.go` — `FlexSpace.GC()` (line 1272), `garbageMetrics()` (line 1081)
- `gc_test.go` — existing GC test patterns, test helpers (`openTestDB`, etc.)


  db.go

  1. Config fields (after line 533): Added PiggybackGC_on_SyncOrFlush bool and GCGarbagePct float64
  2. PiggybackGCStats type (before FlexDB struct): Tracks LastGCTime, LastGCDuration, TotalGCRuns
  3. FlexDB field: Added piggyGCStats PiggybackGCStats
  4. maybePiggybackGC() method: Checks if piggyback GC is enabled, computes garbage fraction via garbageMetrics(), runs
   ff.GC() if above threshold (default 50%)
  5. Call sites: Inserted db.maybePiggybackGC() after db.ff.Sync() in both writeLockHeldSync() and doFlush()
  6. Metrics fields: Added PiggybackGCRuns int64 and PiggybackGCLastDurMs int64
  7. Metrics population: Added piggyback stats to writeLockHeldSessionMetrics(), finalMetrics(), and
  CumulativeMetrics()
  8. String(): Conditionally prints piggyback GC section when PiggybackGCRuns > 0

  gc_test.go

  Added 4 tests:
  - TestPiggybackGC_TriggersOnSync — enables flag, creates fragmentation, verifies GC runs
  - TestPiggybackGC_RespectsGarbageThreshold — threshold=1.0, verifies GC does NOT run
  - TestPiggybackGC_DisabledByDefault — default config, verifies no piggyback GC
  - TestPiggybackGC_ReclaimsSpace — write, delete half, sync, verify data correctness


Author: Jason E. Aten, Ph.D.

YogaDB is Copyright (C) 2026, Jason E. Aten, Ph.D. All rights reserved.

LICENSE: MIT. See LICENSE file.

Documentation

Overview

YogaDB is an embedded persistent ordered key-value store built on the FlexSpace log-structured storage engine architecture. A three-layer architecture is deployed: at the lowest level is the FlexTree (a B-tree extent index with shift-propagation); at the middle level is the FlexSpace (log-structured file layer with GC); and at the top sits the FlexDB (the KV store with memtables, WAL, sparse index, and interval cache).

The Iterator Optimization History:

Getting 42x faster at iteration: from 340 ns/key to 8 ns/key.

The Iter type went through a series of CPU-profile-driven optimizations to bring forward iteration throughput from ~340 ns/key down to ~8 ns/key, beating bbolt (~12 ns/key) and roughly 25x faster than Pebble (~210 ns/key). Each optimization was benchmarked individually using Benchmark_Iter_YogaDB_Ascend with 100K keys. The progression:

1. Stateful FlexSpace Cursor (flexCursor)

Replaced stateless re-seeking (O(log n) binary search per Next()) with a persistent flexCursor that maintains a pointer to the current sparse-index leaf node, anchor index, and position within the cached interval. Sequential Next() calls advance the cursor in O(1) amortized time by incrementing kvIdx within the current interval, only doing tree traversal at interval boundaries. The cursor holds a refcounted interval cache entry so KV data is accessed zero-copy.

2. HLC-Based Version Detection

Instead of re-acquiring topMutRW.RLock on every Next() call, the iterator snapshots the database's Hybrid Logical Clock (HLC) at seek time. On Next(), a single atomic load compares the current HLC against the snapshot. If they match, no mutations have occurred and the cursor remains valid - the entire lock acquisition and merged re-seek are skipped. When mutations are detected, the iterator falls back to a full re-seek. This turns the common case (iteration without concurrent writes) into an atomic load instead of a mutex round-trip.

3. Prefetch Buffer

Added a prefetch buffer that fills multiple KV entries per lock acquisition. A single RLock fills up to iterPreFetchKeyCount entries; subsequent Next() calls serve from the buffer with only the atomic HLC check. This amortized the per-key lock overhead to near zero, since one lock acquisition covers hundreds of keys.

4. Pointer-Based KV Access (*KV instead of embedded KV)

Changed the Iter struct from embedding a full KV (with Key/Value byte slices copied per entry) to holding a *KV pointer directly into the interval cache's kvs[] slice. On the fast path, Next() just updates a pointer - no struct copy, no byte slice copy. Key() and Value() read through the pointer. This eliminated ~80 bytes of copying per key. (~18 ns/key to ~16 ns/key.)

5. Batched Interval Recording in prefetchFill

Restructured prefetchFillFlexSpaceOnly to process entries in batches per interval rather than one at a time. Each interval contributes a contiguous block of entries, avoiding repeated per-entry function calls and pointer chasing across interval boundaries. (~16 ns/key to ~14 ns/key.)

6. Split Advance/Retreat into Fast Path + Boundary Crossing

Split flexCursorAdvance into a two-instruction fast path (kvIdx++ and bounds check) plus a separate flexCursorNextInterval function for the rare interval boundary crossing. The fast path is fully inlineable by the Go compiler. flexCursorRetreat was split similarly into flexCursorPrevInterval. This eliminated function call overhead from the hot loop. (~14 ns/key to ~13.5 ns/key.)

7. Span-Based Prefetch (O(intervals) not O(keys))

Replaced the per-key []*KV prefetch buffer with an array of prefetchSpan descriptors. Each span records a (kvs slice, pos, end) triple referencing a contiguous range within one interval's kvs[]. The fill function (prefetchFillFlexSpaceOnly) runs under the lock and is now O(intervals): it just records ~3 slice boundaries, touching no per-KV data. The per-KV work (tombstone checking, pointer dereferencing) is deferred to servePrefetch which runs outside the lock. The kvs slice is pre-sliced to [:count] as a bounds-check elimination (BCE) hint so the Go compiler can prove index safety within the inner loop. This reduced lock hold time proportional to the number of intervals (~3) rather than keys (~512). (~13.5 ns/key to ~13 ns/key, with the main benefit being reduced lock contention under concurrent access.)

8. Increased Prefetch Count + Eliminated Redundant Atomic Loads

Raised iterPreFetchKeyCount from 90 to 512. With span-based fill being O(intervals), larger prefetch counts add negligible fill cost while reducing the frequency of lock acquisitions for refills. Also eliminated a redundant db.hlc.Aload() in the Next() refill path by caching the HLC value read during the initial check. (~13-14 ns/key, reduced variance.)

9. Atomic Reference Counting in Interval Cache

Converted the interval cache entry's refcnt and access fields from mutex-protected int to atomic int32 operations. releaseEntry (called on every interval boundary crossing) went from mutex lock/unlock to a single atomic.AddInt32. The getEntry cache-hit path was reduced from two mutex lock/unlock pairs to one (for the anchor.fce read) plus an atomic store for the access field. calibrate and allocEntryForNewAnchor use atomic.LoadInt32/AddInt32/StoreInt32. This eliminated ~660ms of mutex overhead from releaseEntry and reduced getEntry's lock contention.

10. Tuning the overwrite in place slotted page sizes SLOTTED_PAGE_KB=4 and the flexdbUnsortedWriteQuota=6 brought iteration down to about 8 nsec/key. Using SLOTTED_PAGE_KB=128 we could cut that in half again, but we are already beating Bolt by 2x (why bother with 3x; it is available if you need it though) and we want to balance reads against insertion efficiency/on-disk space consumption.

Index

Constants

View Source
const (
	// the address space that flexspace.go manages
	FLEXSPACE_MAX_OFFSET = 800 << 30 // 800 GB logical address space

	// block config
	FLEXSPACE_BLOCK_BITS  = 22                                           // 4 MB blocks
	FLEXSPACE_BLOCK_SIZE  = 1 << FLEXSPACE_BLOCK_BITS                    // 4_194_304 bytes
	FLEXSPACE_BLOCK_COUNT = FLEXSPACE_MAX_OFFSET >> FLEXSPACE_BLOCK_BITS // 204800 blocks

	FLEXSPACE_MAX_EXTENT_BIT  = 5
	FLEXSPACE_MAX_EXTENT_SIZE = FLEXSPACE_BLOCK_SIZE >> FLEXSPACE_MAX_EXTENT_BIT // 131_072 bytes == 128 KB (1/32)

	// logical logging
	FLEXSPACE_LOG_MEM_CAP  = 8 << 20 // 8 MB in-memory log buffer
	FLEXSPACE_LOG_MAX_SIZE = 2 << 30 // 2 GB max on-disk log

	// garbage collector
	FLEXSPACE_GC_QUEUE_DEPTH = 8192
	FLEXSPACE_GC_THRESHOLD   = 64 // free blocks below this triggers GC

	// block manager
	FLEXSPACE_BM_BLKDIST_BITS = 16
	FLEXSPACE_BM_BLKDIST_SIZE = (FLEXSPACE_BLOCK_SIZE >> FLEXSPACE_BM_BLKDIST_BITS) + 1 // 65 buckets

)
View Source
const (
	// SLOTTED_PAGE_KB is the target page size for slotted page intervals.
	// Tune this for different workloads. Larger pages amortize overhead
	// better but increase rewrite cost. At 4 KB pages we are 2x faster at
	// load than pebble and 2x faster at read than Bolt. At 64 KB pages
	// our load time is the same as pebble, but we are 3x faster than Bolt.
	// See go test -v -tags memfs -run=xxx -bench Iter
	// And go test -v -tags memfs -run=xxx -bench BigRandomRWBatch
	//
	// 10 is a nice middle ground: 2x write vs Pebble,2x read vs Bolt. With tight splits (Config.PaddedSplits = false, the default).
	SLOTTED_PAGE_KB = 10 // 10:5.214 8:5.84 12:7.874 // up from 2 to 10 seems to help scans alot. 10: (9.984, 10.01, 9.817 ns/key); 2: (23.72 ns/key); 20:14.84 128: 4ns/key sequential scan, nice. 8: 5.347 ns/key. 4: very fast insert. 7.6 ns/key full scan. 32:4.473 ns/key. choice for now: keep at 4 for a litle balance between insert and scan through. but 64:4.168 ns/key, but random rw slows 2x.

	// MAX_KEY_BYTES is the maximum key size in bytes that a slotted page
	// can hold. Derived from SLOTTED_PAGE_KB to ensure at least one key
	// always fits in a page.
	MAX_KEY_BYTES = SLOTTED_PAGE_KB * 512
)
View Source
const DefaultFlexTreeMaxExtentSizeLimit = (64 << 20) // 64 MB
View Source
const FLEXTREE_HOLE = (1 << 47) // highest bit set to indicate a hole
View Source
const FLEXTREE_INTERNAL_CAP = 30
View Source
const FLEXTREE_INTERNAL_CAP_PLUS_ONE = FLEXTREE_INTERNAL_CAP + 1
View Source
const FLEXTREE_LEAF_CAP = 60

Tunable parameters

View Source
const FLEXTREE_MAX_EXTENT_SIZE_LIMIT = 64 << 20
View Source
const FLEXTREE_PATH_DEPTH = 7 // at most 7 levels
View Source
const FLEXTREE_POFF_MASK = 0xffffffffffff // 48 bits
View Source
const (
	// MaxKeySize is actually 16 bytes smaller (really the limit is 4080 bytes).
	MaxKeySize = 4096
)

Variables

View Source
var ErrKeyEmpty = fmt.Errorf("key cannot be the empty string")

Functions

func FirstDiff added in v0.9.8

func FirstDiff(dbA, dbB *FlexDB) string

Types

type Batch

type Batch struct {
	// contains filtered or unexported fields
}

Batch submits a set of writes all together at once for load efficiency and/or atomic change to the database.

func (*Batch) Close

func (s *Batch) Close()

Close forgets any existing queued up puts, and frees any other resources associated with the Batch.

func (*Batch) Commit

func (s *Batch) Commit(doFsync bool) (interv HLCInterval, err error)

Commit flushes the batch atomically all the way to disk but does not fsync unless set doFsync true.

After Commit the batch is empty and can be re-used immediately.

Returns the half-open HLC interval [Begin, Endx) assigned to this batch.

Again, with doFsync false, we do not wait for the data to be fdatasynced to disk. Set doFsync true to fsync into a MEMWAL log, or do multiple batches and then db.Sync() if you need durability across power restarts. Usually if performance is required this is done once after all your batches are loaded.

Metrics are useful, but relatively expensive as we must scan all of the FlexSpace blocks linearly; use CommitGetMetrics() to view them. Commit() itself now skips them for speed.

func (*Batch) CommitGetMetrics

func (s *Batch) CommitGetMetrics(doFsync bool) (HLCInterval, *Metrics, error)

CommitGetMetrics does Commit, and then returns metrics on the flex space for garbage collection and write-amplification study purposes; hence it is slower. It does a linear scan through all the FLEXSPACE.KV.SLOT_BLOCKS to see how much free space could be reclaimed.

func (*Batch) Delete added in v0.9.0

func (s *Batch) Delete(key string)

Delete marks key for deletion in this batch.

func (*Batch) Reset

func (s *Batch) Reset()

Reset forgets any existing queued up puts.

func (*Batch) Set

func (s *Batch) Set(key string, value []byte) (err error)

Set copies key and value internally, so the original memory is safe to be re-used by the caller immediately after Set returns.

type CommonNode

type CommonNode struct {

	// SlotID is the on-disk page slot for CoW persistence; -1 = unassigned
	SlotID int64 `zid:"0"`

	// know thyself.
	NodeID NodeID `zid:"1"`

	Count  uint32 `zid:"2"`
	IsLeaf bool   `zid:"3"`
	Dirty  bool   `zid:"4"`
	Freed  bool   `zid:"5"`
}

CommonNode is embedded in both LeafNode and InternalNode. 20 bytes at the moment.

func (*CommonNode) DecodeMsg

func (z *CommonNode) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*CommonNode) EncodeMsg

func (z *CommonNode) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*CommonNode) Gstring

func (z *CommonNode) Gstring() (r string)

func (*CommonNode) MarshalMsg

func (z *CommonNode) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*CommonNode) Msgsize

func (z *CommonNode) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*CommonNode) UnmarshalMsg

func (z *CommonNode) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*CommonNode) UnmarshalMsgWithCfg

func (z *CommonNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type Config

type Config struct {
	CacheMB uint64 // default 32 (for 32 MB)

	// NoDisk runs the entire database in-memory using MemVFS.
	// No files are created on disk. Useful for testing.
	NoDisk bool

	// FS overrides the filesystem implementation. When nil, RealVFS{}
	// is used (or MemVFS if NoDisk is true). Allows injecting a
	// custom VFS for testing (e.g. fault injection).
	FS vfs.FS

	// DisableVLOG disables the value log. When true, all values are stored
	// inline in FlexSpace regardless of size (original behavior).
	DisableVLOG bool

	// OmitFlexSpaceOpsRedoLog skips FlexSpace redo-log writes and instead
	// calls SyncCoW() on every Sync(). This eliminates ~0.86x write
	// amplification from the redo log at the cost of slightly more CoW
	// tree page writes. The net effect is lower total write amp (~3.2x
	// vs ~4.1x).
	OmitFlexSpaceOpsRedoLog bool

	// LowBlockUtilizationPct sets the threshold (0.0–1.0) for counting
	// blocks as "low utilization" in Metrics.BlocksWithLowUtilization.
	// A block whose live bytes / FLEXSPACE_BLOCK_SIZE is below this
	// fraction is counted. Default 0.25 (25%) when zero.
	LowBlockUtilizationPct float64

	// OmitMemWalFsync true means we do not durably fdatasync the MEMWAL1/2 files.
	// This is useful for batch loading a lot of data quickly, and then doing
	// one fsync at the end for durability. The proviso of course is that
	// if your process crashes you have no intermediate state and have to
	// start again at the beginning; which may be fine. True also disables
	// the LARGE.VLOG fsyncs until db.Sync() is called.
	OmitMemWalFsync bool

	// PiggybackGC_on_SyncOrFlush enables automatic GC at the end of
	// every Sync() and flush operation. GC runs only if the garbage
	// fraction (wasted bytes / total bytes in used blocks) exceeds
	// GCGarbagePct. Default false (disabled).
	PiggybackGC_on_SyncOrFlush bool

	// GCGarbagePct is the minimum fraction of wasted bytes in used
	// blocks (garbage / (garbage + live)) required to trigger piggyback GC.
	// Value between 0.0 and 1.0. Default 0.50 (50%) when zero and
	// PiggybackGC_on_SyncOrFlush is true.
	GCGarbagePct float64

	// DisableBackgroundFlush disables the background flush worker goroutine.
	// When true, memtable flushes only happen on explicit Sync() or Close() calls.
	// This is useful for fuzz testing where background goroutines can crash
	// the entire fuzz worker subprocess if they panic (the test's recover()
	// only catches panics on the test goroutine, not background goroutines).
	DisableBackgroundFlush bool

	// PaddedSplits controls whether treeInsertAnchor pads both split
	// halves to slottedPageMaxSize. Default false uses tight encoding,
	// which cuts space amplification substantially. When true, the old
	// padded behavior is used (useful for A/B comparison).
	// Padding allows in-place additions to a slotted page to
	// occur, making key updates more efficient. The trade-off
	// is pre-allocating space for new additions.
	PaddedSplits bool
}

Config allows configuration of a FlexDB.

type FlexDB

type FlexDB struct {
	Path string

	// Write-byte counters (accessed atomically)
	MemWALBytesWritten  int64 // WAL (FLEXDB.MEMWAL) bytes written
	LogicalBytesWritten int64 // user payload bytes (key+value)
	// contains filtered or unexported fields
}

FlexDB is a persistent ordered key-value store backed by FlexSpace. It is thread-safe, except for iteration via Ascend/Descend--which allows deletions and updates on the fly.

func OpenFlexDB

func OpenFlexDB(path string, pCfg *Config) (*FlexDB, error)

OpenFlexDB opens or creates a FlexDB at the given directory path. cacheMB is the cache capacity in megabytes.

func (*FlexDB) BeginUpdate added in v0.9.8

func (db *FlexDB) BeginUpdate() *WriteTx

func (*FlexDB) BeginView added in v0.9.8

func (db *FlexDB) BeginView() *ReadOnlyTx

func (*FlexDB) CheckIntegrity

func (db *FlexDB) CheckIntegrity() []IntegrityError

CheckIntegrity performs a read-only consistency check of the FlexDB. It flushes the memtable first, then acquires a read lock on FlexSpace.

Checks performed:

  1. FlexTree leaf linked list: no cycles, prev/next consistency
  2. Extent validity: every non-hole extent has poff + len within file bounds
  3. Extent readability: data at every extent can be read from disk
  4. Block usage: recomputed from FlexTree matches the block manager's state
  5. Sparse index: every anchor interval is readable and kv128-decodable
  6. Sorted keys: keys within each decoded interval are in sorted order
  7. Anchor coverage: anchor loff+psize spans tile the FlexSpace without gaps/overlaps
  8. VLOG blake3: for every KV with a VPtr, read the VLOG entry and verify hdrCRC, valCRC, and blake3 checksum of value bytes

Returns nil if no errors found.

func (*FlexDB) Clear

func (db *FlexDB) Clear(includeLarge bool) (allGone bool, err error)

Clear deletes all keys in the database.

When includeLarge is true, the entire database is wiped and re-initialized (fast path). When false, only keys with inline (small) values are deleted; keys with large values stored in the VLOG survive.

Returns allGone=true when the database was re-initialized. In that case, ALL previously held iterators, cursors, and pointers into the database are invalid and must be re-acquired.

Goroutine safe. Acquires the database write lock for the duration of the call, serializing against all other operations.

func (*FlexDB) Close

func (db *FlexDB) Close() *Metrics

Close syncs and shuts down the FlexDB.

func (*FlexDB) CumulativeMetrics

func (db *FlexDB) CumulativeMetrics() *Metrics

CumulativeMetrics reports file sizes on disk, reflecting the cumulative history of all sessions. Each physical metric is the current file size (via Stat), so it captures bytes written by previous sessions as well. LogicalBytesWritten uses FlexSpace's MaxLoff as an approximation of total user payload stored (it includes kv128 encoding overhead of ~10-20 bytes per entry; values separated to VLOG are represented by 16-byte VPtrs).

func (*FlexDB) Delete

func (db *FlexDB) Delete(key string) error

Delete removes key from the store.

func (*FlexDB) DeleteRange

func (db *FlexDB) DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)

DeleteRange deletes all keys in the range [begKey, endKey] with configurable inclusivity on each bound.

Returns:

  • n: number of tombstones written (0 when allGone is true)
  • allGone: true if the entire database was wiped and re-initialized. When true, ALL previously held iterators, cursors, and pointers into the database are invalid and must be re-acquired.
  • err: non-nil on failure

When includeLarge is false, keys whose values are stored in the VLOG (large values, > 64 bytes) are skipped and survive the deletion.

The begInclusive and endInclusive parameters control whether the bounds are inclusive or exclusive:

DeleteRange(true,  a, z, true,  true)   // [a, z]  - both inclusive, include large values
DeleteRange(true,  a, z, true,  false)  // [a, z)  - half-open, include large values
DeleteRange(false, a, z, true,  true)   // [a, z]  - both inclusive, skip large values

Goroutine safe. Concurrent reads and writes are serialized via the database write lock. However, when allGone is returned true, all previously held iterators, cursors, and references are invalidated.

func (*FlexDB) FetchLarge added in v0.7.0

func (db *FlexDB) FetchLarge(kv *KV) ([]byte, error)

FetchLarge retrieves the value bytes for a KV whose value is stored in the VLOG (kv.Large() returns true). For inline values, it simply returns kv.Value. The returned bytes are a fresh copy safe to retain.

Goroutine safe. Acquires the read lock internally.

func (*FlexDB) Find added in v0.6.2

func (db *FlexDB) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)

Find allows GTE, GT, LTE, LT, and Exact searches.

GTE: find the smallest key greater-than-or-equal to key.

GT: find the smallest key strictly greater-than key.

LTE: find the largest key less-than-or-equal to key.

LT: find the largest key strictly less-than key.

Exact: find a matching key exactly.

If key is the empty string, then GTE and GT return the first key in the tree, while LTE and LT return the last key.

Any of the LAZY* set of flag can be bitwise-OR-ed with the smod to request that large/small/all values not be returned unless and until we decide we want them with an explicit FetchLarge() call. For example: Find(Exact|LAZY, "needle")

The returned *KVcloser contains the found key and its value; unless laziness was requested.

The returned bool, 'exact', indicates an exact match to the query key.

If the returned kvc *KVcloser is nil, this means that the key was not found, or there was an I/O error. The caller should always check the returned error (err) first rule out I/O error before concluding the key was not found from a nil kvc.

A typical call sequence would be:

kvc, _, err := dbHaystack.Find(Exact, "needle")

if err != nil {
   return err
}

if kvc != nil {
  // found exact match! (we know, because Exact was
  // requested; if this was a GTE search we would need
  // to check the 'exact' bool return to know if we found
  // our "needle", or went past it).

  // Here all value sizes are automatically pulled in, since
  // none of the (LAZY_SMALL, LAZY_LARGE, LAZY) smod were requested

  // For performance, we do not copy kvc.Value for you.
  // So you must copy kvc.Value, if you need it later,
  // before doing kvc.Close().
  processKeyAndValueAtHlcTimestamp(kvc.Key, kvc.Value, kvc.Hlc)

  kvc.Close() // unpin from internal caches. Allows zero-copy reads.
}

The returned iterator is a locked iterator (holds the exclusive Find looks up the first key matching the SearchModifier and returns an owned copy of the KV (safe to retain indefinitely). For scanning beyond the found key, use Find inside a View or Update transaction.

Goroutine safe. Acquires the read lock internally.

Warning: if kvc != nil, the user must call Close() on the returned kvc *KVcloser when done copying any Value out, or else memory and resource leaks will ensue.

The kvc.Close() can be skipped if kvc is nil (key not found). However it is always fine to do the Close() even then, as kvc.Close() is a no-op if kvc is nil.

func (*FlexDB) Get

func (db *FlexDB) Get(key string) (value []byte, found bool, err error)

Get retrieves the value for key. Returns nil, false if not found. Get can return nil, true if a nil value was stored with the key. Get is value size agnostic. It returns large and small values immediately. This is tested at, for example, gc_test.go Test_GC1K_write_1k_keys_with_large_values.

func (*FlexDB) GetKV added in v0.9.1

func (db *FlexDB) GetKV(key string) (kv *KVcloser, err error)

GetKV is like Get but allows lazy loading of Large values; they are not fetched automatically. If the user sees kv.Large() true, then db.FetchLarge(kv) will return the large value. GetKV is equivalent to db.Find(Exact, key).

func (*FlexDB) Len

func (db *FlexDB) Len() int64

Len returns the total number of live (non-tombstone) keys in the database. O(1) - reads a pre-maintained counter. Goroutine safe.

func (*FlexDB) LenBigSmall

func (db *FlexDB) LenBigSmall() (big int64, small int64)

LenBigSmall returns the live key count partitioned by storage location. big: keys whose values are stored in the VLOG (> 64 bytes). small: keys whose values are stored inline. O(1) - reads pre-maintained counters. Goroutine safe.

func (*FlexDB) Merge

func (db *FlexDB) Merge(key string, fn func(oldVal []byte, exists bool) (newVal []byte, write bool, doDelete bool)) error

Merge performs an atomic read-modify-write on key. fn is always called: when the key exists, oldVal is its current value and exists=true; when the key is absent or deleted, oldVal=nil and exists=false (allowing conditional creation). Return write=false to skip the write, or doDelete=true to delete the key.

This mirrors C's flexdb_merge: it looks up the old value across all layers (active memtable, inactive memtable, FlexSpace), applies the user function, and writes the result atomically.

Q: When should the callback return doWrite=false, doDelete=false? A: This means "do nothing." The main scenarios:

  1. Conditional creation: The callback inspects exists and decides not to create the key. E.g., "only increment if the key already exists" - if exists=false, return a bare return (all zeros = no-op).

  2. Conditional update: The callback inspects the old value and decides no change is needed. E.g., "set to X only if current value isn't already X."

  3. Read-only peek: The callback just wants to see the current value (though Get is simpler for that).

.

func (*FlexDB) NewBatch

func (db *FlexDB) NewBatch() (b *Batch)

NewBatch returns an empty new Batch.

func (*FlexDB) Put

func (db *FlexDB) Put(key string, value []byte) error

Put writes key -> value. len(value) == 0 is fine, if desired. Call Delete instead of Put to delete a key and any associated value.

Values of any size are accepted. Values > vlogInlineThreshold (64 bytes) are stored in the VLOG file; smaller values are stored inline in the FLEXSPACE.KV.SLOT_BLOCKS file with the keys.

Large values are written exactly once: to the VLOG. The WAL stores only the VPtr (16 bytes), not the full value.

Puts are not durably on disk until after the user has also completed a db.Sync() call. This allows the user to control the rate of fsyncs and trade that against their durability requirements.

func (*FlexDB) SessionMetrics

func (db *FlexDB) SessionMetrics() *Metrics

Metrics returns a snapshot of write-byte counters aggregated from all layers.

func (*FlexDB) Sync

func (db *FlexDB) Sync() error

Sync flushes all in-memory data in the active memtable to disk in FLEXSPACE.KV128.BLOCKS and fsyncs it. Users must call Sync after Puts for them to be durable.

func (*FlexDB) Update

func (db *FlexDB) Update(fn func(rw *WriteTx) error) (err error)

Update runs fn inside an exclusive write transaction. The write lock (topMutRW.Lock()) is held for the duration of fn, blocking all other readers and writers including the flush worker.

All iterators created within fn via rwDB.NewIter() or rwDB.FindIt() are automatically closed when fn returns.

Writes via rwDB.Put/rwDB.Delete are applied immediately to the database (no buffering, no Commit needed).

Do NOT call db.Put/db.Get/db.Delete/db.Sync inside fn - use rwDB methods instead (deadlock).

func (*FlexDB) VacuumKV

func (db *FlexDB) VacuumKV() (*VacuumKVStats, error)

VacuumKV reclaims dead FLEXSPACE.KV.SLOT_BLOCKS space by rewriting all live extents sequentially to a new file and replacing the old file. This is an exclusive operation that acquires topMutRW.

Crash safety: if the process crashes before the rename completes, the old FLEXSPACE.KV.SLOT_BLOCKS and old FlexTree remain intact. The stale .vacuum file (if present) is harmless and will be overwritten on the next vacuum.

VacuumKV does a one-time compaction of already-bloated databases. Algorithm: 1. Flush memtable, acquire exclusive locks 1b. Compact slotted page padding: decode each page, compute tight size,

Collapse the zero-padding in reverse loff order

2. Walk FlexTree leaf linked list, read/rewrite all live extents sequentially to a .vacuum file 3. Close old fd, rename .vacuum -> FLEXSPACE.KV.SLOT_BLOCKS, reopen 4. Rebuild block manager, checkpoint FlexTree 5. Rebuild anchor tree from FlexTree tags (replaces stale anchor loffs) 6. Clean up stale .vacuum files on OpenFlexSpaceCoW

See the tests: TestFlexDB_VacuumKV_Basic - overwrites 200 keys, vacuums, verifies data integrity across reopen TestFlexDB_VacuumKV_WithDeletes - deletes half of 100 keys, vacuums, verifies correct keys survive .

func (*FlexDB) VacuumVLOG

func (db *FlexDB) VacuumVLOG() (*VacuumVLOGStats, error)

VacuumVLOG reclaims dead LARGE.VLOG space by copying live values to a new VLOG file and rewriting their VPtrs in FlexSpace. This is an exclusive operation that acquires topMutRW.

Crash safety: if the process crashes before the rename completes, the old VLOG and old intervals remain intact. The stale VLOG.new file (if present) is harmless and will be overwritten on the next vacuum.

func (*FlexDB) View

func (db *FlexDB) View(fn func(ro *ReadOnlyTx) error) (err error)

View runs fn inside a read-only transaction. The read lock (topMutRW.RLock()) is held for the duration of fn, blocking writers (including the flush worker) but allowing concurrent readers.

All iterators created within fn via roDB.NewIter() or roDB.FindIt() are automatically closed when fn returns.

Do NOT call db.Get inside fn - use roDB methods instead (deadlock).

type FlexSpace

type FlexSpace struct {
	Path string

	// Write-byte counters (accessed atomically)
	KV128BytesWritten   int64 // FLEXSPACE.KV.SLOT_BLOCKS file bytes written
	REDOLogBytesWritten int64 // redo LOG bytes written
	// contains filtered or unexported fields
}

FlexSpace is a log-structured file-like address space providing insert-range, collapse-range, read, write, and GC operations. Corresponds to struct flexfile in flexfile.h/flexfile.c.

FlexSpace itself does no internal locking. Clients must guarantee single user at a time; e.g. via FlexDB.ffMu.

The main data file within the FLEXSPACE directory, the file that holds all the keys + small values is "FLEXSPACE.KV.SLOT_BLOCKS".

Here's how the block management works:

The FLEXSPACE.KV.SLOT_BLOCKS file is divided into 4 MB blocks. The block manager (blockManager) keeps one 4 MB in-memory write buffer and appends data into it sequentially:

1. Writing: When FlexDB flushes a memtable, it calls putPassthrough which calls ff.Insert(). Insert copies the kv128-encoded bytes into the block manager's buffer at the current offset (bm.buf[blkoff:]). If the current block can't fit the data, nextBlock() writes the filled buffer to disk at blkid × 4MB and moves to the next empty block.

2. Mapping: Each chunk written gets a physical offset (poff = blkid × 4MB + blkoff). That poff is inserted into the FlexTree, which maps logical offset -> physical offset. So the FlexTree is the indirection layer that lets you read kv128 intervals by logical position even though they're scattered across 4 MB blocks on disk.

3. GC: The block manager tracks per-block usage (blkusage[]). When data is deleted (Collapse), the block's usage count decreases. GC reclaims blocks with low utilization by rewriting their live extents into the current write block, then freeing the old block.

So "block-managed" means the FLEXSPACE.KV.SLOT_BLOCKS file is a pool of 4 MB blocks with an append-only write cursor, a FlexTree for logical -> physical mapping, and a GC that reclaims fragmented blocks. It's essentially a log-structured store at the block level - writes always go to the current block, never overwrite in place.

func OpenFlexSpaceCoW

func OpenFlexSpaceCoW(path string, omitRedoLog bool, fs vfs.FS) (*FlexSpace, error)

OpenFlexSpaceCoW opens or creates a FlexSpace using CoW page-based persistence for the FlexTree instead of greenpack full-tree serialization. The directory will contain: FLEXSPACE.KV.SLOT_BLOCKS, FLEXTREE.COMMIT, FLEXTREE.PAGES, FLEXSPACE.REDO.LOG. When omitRedoLog is true, redo log writes are skipped and SyncCoW is called on every Sync.

func (*FlexSpace) Close

func (ff *FlexSpace) Close() (kvBlocksOnDiskFootprintBytes int64)

Close syncs and cleanly shuts down the FlexSpace.

func (*FlexSpace) Collapse

func (ff *FlexSpace) Collapse(loff, length uint64) error

Collapse deletes len bytes at loff, shifting subsequent extents left.

func (*FlexSpace) Defrag

func (ff *FlexSpace) Defrag(buf []byte, loff, length uint64) error

Defrag rewrites the len bytes at loff as a fresh contiguous physical extent.

func (*FlexSpace) Fallocate

func (ff *FlexSpace) Fallocate(loff, size uint64) error

Fallocate pre-allocates size bytes starting at loff by inserting zero data. Bug fix: C code used FLEXSPACE_MAX_EXTENT_BIT (5) instead of FLEXSPACE_MAX_EXTENT_SIZE (131072).

func (*FlexSpace) Ftruncate

func (ff *FlexSpace) Ftruncate(size uint64) error

Ftruncate truncates the FlexSpace to size bytes by collapsing the tail.

func (*FlexSpace) GC

func (ff *FlexSpace) GC()

GC runs garbage collection if free blocks are below the threshold. Iterates up to 4 rounds of decreasing aggressiveness.

func (*FlexSpace) GetHandler

func (ff *FlexSpace) GetHandler(loff uint64) FlexSpaceHandler

GetHandler returns a handler positioned at loff.

func (*FlexSpace) GetTag

func (ff *FlexSpace) GetTag(loff uint64) (uint16, error)

GetTag returns the tag at loff, or an error if not found.

func (*FlexSpace) Insert

func (ff *FlexSpace) Insert(buf []byte, loff, length uint64) (int, error)

Insert inserts len bytes at loff, shifting all subsequent extents right.

func (*FlexSpace) Overwrite

func (ff *FlexSpace) Overwrite(buf []byte, loff uint64, length uint64) error

Overwrite writes buf directly to the physical location backing the extent at loff, without mutating the FlexTree or writing redo log entries. The extent at loff must already exist and have length == len(buf). This creates zero garbage - the same physical blocks are reused in-place.

func (*FlexSpace) Read

func (ff *FlexSpace) Read(buf []byte, loff, length uint64) (int, error)

Read reads len bytes from loff into buf. Returns bytes read or -1 on error.

func (*FlexSpace) ReadFragmentation

func (ff *FlexSpace) ReadFragmentation(buf []byte, loff, length uint64) (int, uint64, error)

ReadFragmentation reads and also returns the number of physical extents (frag).

func (*FlexSpace) SetTag

func (ff *FlexSpace) SetTag(loff uint64, tag uint16) error

SetTag sets the 16-bit tag on the extent at loff.

func (*FlexSpace) Size

func (ff *FlexSpace) Size() uint64

Size returns the current logical size of the FlexSpace.

func (*FlexSpace) Sync

func (ff *FlexSpace) Sync()

Sync flushes all in-memory state to disk.

func (*FlexSpace) Update

func (ff *FlexSpace) Update(buf []byte, loff, length, olen uint64) (int, error)

Update atomically replaces olen bytes at loff with len bytes from buf. The tag (if any) is preserved.

func (*FlexSpace) Write

func (ff *FlexSpace) Write(buf []byte, loff, length uint64) (int, error)

Write overwrites len bytes at loff. Semantics:

  • if loff == size: equivalent to Insert (append)
  • if loff + len > size: truncate tail then Insert
  • otherwise: Update (collapse old + insert new, same size)

type FlexSpaceHandler

type FlexSpaceHandler struct {
	// contains filtered or unexported fields
}

FlexSpaceHandler is a stateful read-only cursor over a FlexSpace. Corresponds to struct flexfile_handler.

func (*FlexSpaceHandler) Backward

func (fh *FlexSpaceHandler) Backward(step uint64)

Backward moves the handler's position backward by step bytes.

func (*FlexSpaceHandler) Forward

func (fh *FlexSpaceHandler) Forward(step uint64)

Forward advances the handler's position by step bytes.

func (*FlexSpaceHandler) ForwardExtent

func (fh *FlexSpaceHandler) ForwardExtent()

ForwardExtent advances the handler to the start of the next extent.

func (*FlexSpaceHandler) GetTag

func (fh *FlexSpaceHandler) GetTag() (uint16, error)

GetTag returns the tag at the handler's current position.

func (*FlexSpaceHandler) Loff

func (fh *FlexSpaceHandler) Loff() uint64

Loff returns the current logical offset.

func (*FlexSpaceHandler) Poff

func (fh *FlexSpaceHandler) Poff() uint64

Poff returns the current physical offset.

func (*FlexSpaceHandler) Read

func (fh *FlexSpaceHandler) Read(buf []byte, length uint64) (int, error)

Read reads len bytes from the handler's current position into buf.

func (*FlexSpaceHandler) Valid

func (fh *FlexSpaceHandler) Valid() bool

Valid returns true if the handler points to a valid extent.

type FlexTree

type FlexTree struct {
	MaxLoff       uint64 `zid:"0"`
	MaxExtentSize uint32 `zid:"1"`

	Path              string `zid:"2"`
	NodeCount         uint64 `zid:"3"`
	PersistentVersion uint64 `zid:"4"`

	Root NodeID `zid:"5"`

	LeafHead NodeID `zid:"6"` // linked list head

	// the arenas that hold all nodes.
	InternalArena []InternalNode `zid:"7"`
	LeafArena     []LeafNode     `zid:"8"`

	// Stack of available internal node indices
	FreeInternals []int32 `zid:"9"`

	// Stack of available leaf node indices
	FreeLeaves []int32 `zid:"10"`

	// Write-byte counters (accessed atomically)
	FlexTreePagesBytesWritten int64 // FLEXTREE.PAGES + FLEXTREE.COMMIT bytes written to disk

	MaxHLC int64 // highest HLC ever assigned; restored on reopen
	// contains filtered or unexported fields
}

FlexTree acts as our Arena allocator and entry point.

func NewFlexTree

func NewFlexTree(fs vfs.FS) *FlexTree

NewTree initializes the tree and pre-allocates some capacity to avoid early slice reallocations.

Warning: after getting a new node with AllocLeaf() or AllocInternal() be aware that the Go allocator may have just re-sized the underlying arena to grow it, which means any existing address (rather than index) will need to be re-converted from index into pointer before use. We had several early bugs from this. You have been warned!

We could and maybe should just allocate the arenas with malloc directly from the operating system... but meh. It's usually fragile to get into the business of creating your own memory manager.

func OpenFlexTreeCoW

func OpenFlexTreeCoW(dirPath string, fs vfs.FS) (*FlexTree, error)

OpenFlexTreeCoW opens an existing CoW FlexTree from dirPath, or creates a new one if the directory does not contain FLEXTREE.COMMIT/FLEXTREE.PAGES files.

func (*FlexTree) AllocInternal

func (t *FlexTree) AllocInternal() (ie *InternalNode)

AllocInternal mirrors the leaf logic for internal nodes.

func (*FlexTree) AllocLeaf

func (t *FlexTree) AllocLeaf() (le *LeafNode)

AllocLeaf grabs a slot, preferring recycled nodes from the free list.

func (*FlexTree) Close

func (tree *FlexTree) Close()

func (*FlexTree) CloseCoW

func (t *FlexTree) CloseCoW() error

CloseCoW syncs the tree and closes the file descriptors.

func (*FlexTree) DecodeMsg

func (z *FlexTree) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*FlexTree) Delete

func (tree *FlexTree) Delete(loff, length uint64) int

func (*FlexTree) Diff

func (tree *FlexTree) Diff(b *FlexTree) (diff string)

func (*FlexTree) EncodeMsg

func (z *FlexTree) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*FlexTree) FreeNode

func (t *FlexTree) FreeNode(id NodeID)

FreeNode releases a node's index back to the pool, strictly checking for double-free corruption.

func (*FlexTree) GetInternal

func (t *FlexTree) GetInternal(id NodeID) *InternalNode

GetInternal retrieves a pointer to an internal node.

func (*FlexTree) GetLeaf

func (t *FlexTree) GetLeaf(id NodeID) *LeafNode

GetLeaf retrieves a pointer to a leaf node for mutation.

func (*FlexTree) GetMaxLoff

func (tree *FlexTree) GetMaxLoff() uint64

func (*FlexTree) GetNodeString

func (t *FlexTree) GetNodeString(id NodeID) string

func (*FlexTree) GetTag

func (tree *FlexTree) GetTag(loff uint64) (uint16, int)

func (*FlexTree) GrandparentIdx

func (tree *FlexTree) GrandparentIdx(path *FlexTreePath) (index uint32)

func (*FlexTree) GrandparentNode

func (tree *FlexTree) GrandparentNode(path *FlexTreePath) (ie *InternalNode)

func (*FlexTree) Gstring

func (z *FlexTree) Gstring() (r string)

func (*FlexTree) Insert

func (tree *FlexTree) Insert(loff, poff uint64, len uint32) int

func (*FlexTree) InsertWTag

func (tree *FlexTree) InsertWTag(loff, poff uint64, len uint32, tag uint16) int

func (*FlexTree) MarkAllInternalsDirty

func (t *FlexTree) MarkAllInternalsDirty()

MarkAllInternalsDirty sets Dirty=true on every allocated internal node. This is needed before SyncCoW when leaf nodes have been modified (e.g., by VacuumKV) without going through the normal insert/delete path that propagates dirty flags up the tree.

func (*FlexTree) MarshalMsg

func (z *FlexTree) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*FlexTree) Msgsize

func (z *FlexTree) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*FlexTree) NodeSlotID

func (t *FlexTree) NodeSlotID(id NodeID) int64

NodeSlotID returns the SlotID for the node identified by id.

func (*FlexTree) PDelete

func (tree *FlexTree) PDelete(loff uint64) int

func (*FlexTree) PQuery

func (tree *FlexTree) PQuery(loff uint64) uint64

func (*FlexTree) ParentIdx

func (tree *FlexTree) ParentIdx(path *FlexTreePath) (index uint32, ok bool)

func (*FlexTree) ParentNode

func (tree *FlexTree) ParentNode(path *FlexTreePath) (ie *InternalNode, nodeID NodeID, ok bool)

func (*FlexTree) ParentNodeIdx

func (tree *FlexTree) ParentNodeIdx(path *FlexTreePath) (ie *InternalNode, index uint32, ok bool)

func (*FlexTree) PosGet

func (tree *FlexTree) PosGet(loff uint64) Pos

PosGet returns a Pos cursor positioned at loff in the tree. If loff is not within any extent, returns an invalid (node==nil) Pos.

func (*FlexTree) Print

func (tree *FlexTree) Print()

func (*FlexTree) Query

func (tree *FlexTree) Query(loff, length uint64) *FlextreeQueryResult

func (*FlexTree) SetTag

func (tree *FlexTree) SetTag(loff uint64, tag uint16) int

func (*FlexTree) String

func (tree *FlexTree) String() string

func (*FlexTree) Sync

func (tree *FlexTree) Sync()

func (*FlexTree) SyncCoW

func (t *FlexTree) SyncCoW() error

SyncCoW writes only dirty nodes to disk using copy-on-write. The FLEXTREE.COMMIT file is the commit point: written only after FLEXTREE.PAGES is synced.

func (*FlexTree) UnmarshalMsg

func (z *FlexTree) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*FlexTree) UnmarshalMsgWithCfg

func (z *FlexTree) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type FlexTreeExtent

type FlexTreeExtent struct {
	Loff uint32 `zid:"0"`
	Len  uint32 `zid:"1"`

	// Combines tag (16 bits) and poff (48 bits)
	TagPoff uint64 `zid:"2"`
}

FlexTreeExtent is a foundation piece.

Q: FlexTreeExtent is defined with the logical offset Loff uint32. How it is possible that the small 32-bit uint32 is correct here, rather than a larger uint64? It seems that the logical address space needs to span 62 or 63 bits, does it not? Please help me understand the design here.

A: Great question. Here's the key insight.

The Loff in each FlexTreeExtent is not a global logical offset. It's a leaf-local relative offset. The global 64-bit logical address is reconstructed on-the-fly during tree traversal by accumulating Shift values down the path.

Here's how it works:

1. Internal nodes carry Shift int64 per child

type InternalChild struct {
    NodeID NodeID
    Shift  int64   // <- this is the 64-bit accumulator
}

When traversing from root to leaf (line 697), the global loff uint64 is reduced at each level: loff -= uint64(ie.Children[target].Shift)

By the time you reach a leaf, loff has been transformed into a small, leaf-local coordinate. The extents in that leaf use uint32 offsets relative to this local frame.

2. Inserts propagate shifts upward

When data is inserted at a logical offset, FlexTree.shiftUpPropagate() (line 752) adds the inserted length to the Shift of sibling children at each level -- but only for siblings to the right of the insertion point. This means only the path from root to the affected leaf is updated; all other subtrees lazily absorb the shift when they are next traversed.

3. rebase() prevents uint32 overflow

Line 736, in FlexTree.rebase():

if le.Extents[le.Count-1].Loff >= ^uint32(0) - tree.MaxExtentSize*2 {
    newBase := le.Extents[0].Loff
    parent.Children[pIdx].Shift += int64(newBase)
    for i := uint32(0); i < le.Count; i++ {
        le.Extents[i].Loff -= newBase
    }
}

When a leaf's local offsets approach uint32 max (~4 GB), rebase() shifts the entire leaf's coordinates back to near zero by absorbing the base offset into the parent's Shift. This is O(leaf capacity) = O(60) work -- cheap.

Summary: The 64-bit logical address space lives in the tree edges (Shift int64 on each InternalChild), not in the leaf extents. Each leaf only needs to represent offsets within its own local window, which never exceeds ~4 GB thanks to rebase().

This is the paper's key trick -- it keeps the FlexTreeExtent struct at exactly 16 bytes (matching a cache line quarter), which is critical for the B-tree's I/O efficiency. Storing a uint64 Loff would bloat each extent to 20+ bytes and reduce the leaf fan-out from 60 to ~48. .

func (*FlexTreeExtent) Address

func (e *FlexTreeExtent) Address() uint64

Address extracts the 47-bit physical offset address

func (*FlexTreeExtent) DecodeMsg

func (z *FlexTreeExtent) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*FlexTreeExtent) EncodeMsg

func (z *FlexTreeExtent) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*FlexTreeExtent) Gstring

func (z *FlexTreeExtent) Gstring() (r string)

func (*FlexTreeExtent) IsHole

func (e *FlexTreeExtent) IsHole() bool

IsHole returns true if the 48th bit of poff is set

func (*FlexTreeExtent) MarshalMsg

func (z *FlexTreeExtent) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*FlexTreeExtent) Msgsize

func (z *FlexTreeExtent) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*FlexTreeExtent) Poff

func (e *FlexTreeExtent) Poff() uint64

Poff extracts the upper 48 bits

func (*FlexTreeExtent) SetAddress

func (e *FlexTreeExtent) SetAddress(addr uint64)

SetAddress sets the 47-bit address without altering the hole flag

func (*FlexTreeExtent) SetHole

func (e *FlexTreeExtent) SetHole(isHole bool)

SetHole sets or clears the 1-bit hole flag without altering the address

func (*FlexTreeExtent) SetPoff

func (e *FlexTreeExtent) SetPoff(poff uint64)

SetPoff replaces the upper 48 bits while keeping the lower 16 bits intact

func (*FlexTreeExtent) SetTag

func (e *FlexTreeExtent) SetTag(tag uint16)

SetTag replaces the lower 16 bits while keeping the upper 48 bits intact

func (*FlexTreeExtent) Tag

func (e *FlexTreeExtent) Tag() uint16

Tag extracts the lower 16 bits

func (*FlexTreeExtent) UnmarshalMsg

func (z *FlexTreeExtent) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*FlexTreeExtent) UnmarshalMsgWithCfg

func (z *FlexTreeExtent) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type FlexTreePath

type FlexTreePath struct {
	Level uint8                       `zid:"0"`
	Path  [FLEXTREE_PATH_DEPTH]uint8  `zid:"1"`
	Nodes [FLEXTREE_PATH_DEPTH]NodeID `zid:"2"`
}

func (*FlexTreePath) DecodeMsg

func (z *FlexTreePath) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*FlexTreePath) EncodeMsg

func (z *FlexTreePath) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*FlexTreePath) Gstring

func (z *FlexTreePath) Gstring() (r string)

func (*FlexTreePath) MarshalMsg

func (z *FlexTreePath) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*FlexTreePath) Msgsize

func (z *FlexTreePath) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*FlexTreePath) UnmarshalMsg

func (z *FlexTreePath) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*FlexTreePath) UnmarshalMsgWithCfg

func (z *FlexTreePath) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type FlextreeQueryResult

type FlextreeQueryResult struct {
	Loff  uint64   `zid:"0"`
	Len   uint64   `zid:"1"`
	Count uint64   `zid:"2"`
	V     []Offlen `zid:"3"`
}

func (*FlextreeQueryResult) DecodeMsg

func (z *FlextreeQueryResult) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*FlextreeQueryResult) EncodeMsg

func (z *FlextreeQueryResult) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*FlextreeQueryResult) Gstring

func (z *FlextreeQueryResult) Gstring() (r string)

func (*FlextreeQueryResult) MarshalMsg

func (z *FlextreeQueryResult) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*FlextreeQueryResult) Msgsize

func (z *FlextreeQueryResult) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*FlextreeQueryResult) UnmarshalMsg

func (z *FlextreeQueryResult) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*FlextreeQueryResult) UnmarshalMsgWithCfg

func (z *FlextreeQueryResult) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type HLC

type HLC int64

HLC is a hybrid logical/physical clock, see the literature references in the README for citations.

Goroutine safety: all HLC routines are goroutine safe except for the argument m to ReceiveMessageWithHLC(m) by default, since m is typically from a network message that is exclusive owned. If m could be shared, read its HLC with Aload() first before passing it to ReceiveMessageWithHLC().

Warning about 32-bit CPU systems: this package uses atomic.LoadInt64() which is safe and fine on 64-bit systems but which can be buggy on 32-bit systems if the compiler happens to not 64-bit align your data in memory.

So: if you need this package on a 32-bit system, ensure your HLCs are always the very first field(s) in your struct, so they are 64-bit aligned.

See https://pkg.go.dev/sync/atomic#pkg-note-BUG where it discusses 32-bit systems (i.e. ARM not ARM64):

"On ARM, 386, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically via the primitive atomic functions (types atomic.Int64 and atomic.Uint64 are automatically aligned). The first word in an allocated struct, array, or slice; in a global variable; or in a local variable (because on 32-bit architectures, the subject of 64-bit atomic operations will escape to the heap) can be relied upon to be 64-bit aligned."

func AssembleHLC

func AssembleHLC(lc int64, count int64) HLC

AssembleHLC does the simple addition, but takes care of the type conversation too. For safety, it masks off the low 16 bits of lc that should always be 0 anyway before doing the addition.

func PhysicalTime48

func PhysicalTime48() HLC

PhysicalTime48 rounds up to the 16th bit the UnixNano() of the current time, as requested by the Hybrid-Logical-Clock algorithm. The low order 16 bits are used for a logical counter rather than nanoseconds. The low 16 bits are always zero on return from this function.

func (*HLC) Aload

func (hlc *HLC) Aload() (r HLC)

Aload does an atomic load of hlc and returns it. Callers of ReceiveMessageWithHLC, for instance, should use Aload() to read their HLC atomically (for the argument m) before calling ReceiveMessageWithHLC(m), if the possibility of data races exists (if more than one goroutine can see m).

func (*HLC) Count

func (hlc *HLC) Count() int64

Count atomically reads the hlc and returns the lower 16 bits. The naming reflects the terminology in the original paper.

func (*HLC) CreateAndNow

func (hlc *HLC) CreateAndNow() (r HLC, now time.Time)

CreateAndNow is the same as CreateSendOrLocalEvent but also returns the raw time.Time before hlc conversion of the low 16 bits.

func (*HLC) CreateSendOrLocalEvent

func (hlc *HLC) CreateSendOrLocalEvent() (r HLC)

CreateSendOrLocalEvent updates the local hybrid clock j based on PhysicalTime48. POST: r == *hlc

func (*HLC) LC

func (hlc *HLC) LC() int64

LC atomically reads the hlc and returns the upper 48 bits. The naming reflects the terminology in the original paper.

func (*HLC) ReceiveMessageWithHLC

func (hlc *HLC) ReceiveMessageWithHLC(m HLC) (r HLC)

ReceiveMessageWithHLC updates the local hybrid clock hlc based on the received message m's hybrid clock.

PRE: to avoid data races, m should be owned exclusively by the calling goroutine, or if shared, m should be the result of an atomic load with m := sourceHLC.Aload().

POST: r == *hlc

func (*HLC) String

func (hlc *HLC) String() string

func (*HLC) ToTime

func (hlc *HLC) ToTime() time.Time

ToTime returns the LC and Count as the nanoseconds since Unix epoch using the time.Unix() call directly.

func (*HLC) ToTime48

func (hlc *HLC) ToTime48() (r time.Time)

ToTime48 returns only the LC in the upper 48 bits of hlc; the lower 16 bits of r.UnixNano() will be all 0. See ToTime to include the Count as well.

type HLCInterval

type HLCInterval struct {
	Begin HLC // first HLC assigned
	Endx  HLC // exclusive upper bound (one past last)
}

HLCInterval represents a half-open interval [Begin, Endx) of HLC timestamps assigned during a Batch.Commit.

type IntegrityError

type IntegrityError struct {
	Check  string // which check failed
	Detail string // human-readable details
	Fatal  bool   // if true, subsequent checks may be unreliable
}

IntegrityError describes a single integrity violation.

func (IntegrityError) Error

func (e IntegrityError) Error() string

type InternalChild

type InternalChild struct {
	NodeID NodeID `zid:"0"`

	// 64-bit accumulator
	Shift int64 `zid:"1"`
}

func (*InternalChild) DecodeMsg

func (z *InternalChild) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*InternalChild) EncodeMsg

func (z *InternalChild) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*InternalChild) Gstring

func (z *InternalChild) Gstring() (r string)

func (*InternalChild) MarshalMsg

func (z *InternalChild) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*InternalChild) Msgsize

func (z *InternalChild) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*InternalChild) UnmarshalMsg

func (z *InternalChild) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*InternalChild) UnmarshalMsgWithCfg

func (z *InternalChild) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type InternalNode

type InternalNode struct {
	CommonNode `zid:"0"` // 20 bytes

	// given n pivots,
	Pivots [FLEXTREE_INTERNAL_CAP]uint64 `zid:"1"` // 8 bytes * 30 = 240 bytes (260 total)
	// we need n+1 children.
	Children [FLEXTREE_INTERNAL_CAP_PLUS_ONE]InternalChild `zid:"2"` // 12 bytes * 31 == 372 bytes (632 total)

	// On-disk slot IDs for CoW persistence: tracks which page slot
	// each child occupies on disk. During CoW sync, dirty nodes are
	// written to freshly allocated slots, and the parent's
	// ChildSlotIDs[idx] records the child's new slot location.
	ChildSlotIDs [FLEXTREE_INTERNAL_CAP_PLUS_ONE]int64 `zid:"3"` // 8 bytes * 31 == 248 bytes (880 total)
}

InternalNode contains keys and offsets to children. Size is fixed (arrays not slices).

func (*InternalNode) DecodeMsg

func (z *InternalNode) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*InternalNode) EncodeMsg

func (z *InternalNode) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*InternalNode) Gstring

func (z *InternalNode) Gstring() (r string)

func (*InternalNode) MarshalMsg

func (z *InternalNode) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*InternalNode) Msgsize

func (z *InternalNode) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*InternalNode) StringOnTree

func (ie *InternalNode) StringOnTree(tree *FlexTree) (r string)

func (*InternalNode) UnmarshalMsg

func (z *InternalNode) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*InternalNode) UnmarshalMsgWithCfg

func (z *InternalNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type Iter

type Iter struct {
	// contains filtered or unexported fields
}

func (*Iter) Close

func (it *Iter) Close()

Close marks the iterator as closed and releases cursor state. The transaction manages lock lifetime, not the iterator.

func (*Iter) FetchV

func (it *Iter) FetchV() ([]byte, error)

FetchV fetches the current iterator value from the VLOG if the iterator points to a large value. For inline values, this is equivalent to Vin(). Returns the value bytes and any error from the VLOG read. If the value is inline and not large, it will still be returned (and the error will be nil).

func (*Iter) GetAnySize added in v0.9.7

func (it *Iter) GetAnySize() (key string, val []byte, found bool, err error)

GetAnySize returns values large or small, if available.

func (*Iter) Hlc

func (it *Iter) Hlc() HLC

Hlc returns the HLC timestamp of the current key-value pair.

func (*Iter) KV added in v0.7.1

func (it *Iter) KV() *KV

KV returns the current *KV with any inline value resolved (lazy-copied from cache memory). It does not fetch large values from the VLOG. The returned pointer is owned by the iterator and must not be retained past the next Next()/Prev()/Seek() call. Copy Key and Value if you need them to outlive the iterator step.

func (*Iter) Key

func (it *Iter) Key() string

Key returns the current key. On the fast path this is a direct pointer into cache memory (zero-copy). Call dupBytes(it.Key) if you need to keep a copy beyond the next Next()/Prev() call.

func (*Iter) Large

func (it *Iter) Large() bool

Large returns false if the current value is stored inline (small value). When true, the value is stored in the VLOG and must be fetched via FetchV(). Iteration remains cheap until you really need to see the large value.

func (*Iter) Next

func (it *Iter) Next()

Next advances to the next key in ascending order.

func (*Iter) Prev

func (it *Iter) Prev()

Prev moves to the previous key in descending order.

func (*Iter) Seek

func (it *Iter) Seek(target string)

Seek positions the iterator at the first key >= target.

func (*Iter) SeekFirst added in v0.11.0

func (it *Iter) SeekFirst()

SeekFirst positions the iterator at the first key.

func (*Iter) SeekLast added in v0.11.0

func (it *Iter) SeekLast()

SeekLast positions the iterator at the last key.

func (*Iter) Valid

func (it *Iter) Valid() bool

Valid returns true if the iterator is positioned at a valid key.

func (*Iter) Vel

func (it *Iter) Vel() (val []byte, empty, large bool)

Vel can be more ergonomic than Vin. Vel returns the current inline Value as well as the two important flags to tell you how to interpret the zero len(val) situation: whether the value is truly empty, or just large and sitting in the VLOG waiting for a separate FetchV call.

Vel is a mnemonic for the returns: value, empty, large.

Plus it's fun to say with your best Young Frankenstein style faux German accent (Gene Wilder, Mel Brooks 1974)

"Vel, vel, vel... vhat have ve here? A large value??"

func (*Iter) Vin

func (it *Iter) Vin() (val []byte)

Vin returns the current value for _inline_ values. For large values stored in VLOG, Vin returns nil. If you get back nil, you must use Large() to check if the value is actually large, and then FetchV() to fetch it on demand if so. Or just use Vel() below to find out what you've got in one call.

On the fast path (empty memtable), the first call copies from the cache reference into valBuf. Subsequent calls return valBuf directly.

type KV

type KV struct {
	Key   string
	Value []byte
	Vptr  VPtr // Vptr.Length==1 means tombstone; Length>1: VLOG pointer; Length==0: inline/nil
	Hlc   HLC  // hybrid logical clock timestamp. LSN like per mini batch, but has big gaps.
}

KV is a key-value pair. Tombstones are marked by Vptr.Length == tombstoneVPtrLength == 1 A nil Value with Vptr.Length == 0 is a live key with nil value (just a key that is present but has no value; this is fine).

When Vptr.Length > tombstoneVPtrLength (== 1), the value is stored in the VLOG file and Vptr contains the location. Use kv.HasVPtr() to test this.

KV is currently 64 bytes, a cache line on most systems. Be very wary of making it any bigger, as this could really slow things down.

The Key is just a string. There is no loss of generality over []byte, just advantages: a) being immutable, we can avoid slow copies; and being smaller (2 words, not 3) than a []byte our KV now fits in one 64B cache line. Moreover users cannot corrupt the Key, so it is safe to return from our internal caches. The technical reason memory-mapped systems must use []byte keys is that they are read straight from a read-only memory map. Since we do not use a memory-mapped design we can get compile time safety, cache speed, and the benefits of immutability by making Key a string.

func (*KV) HasVPtr

func (kv *KV) HasVPtr() bool

HasVPtr returns true if the value is stored in the VLOG file. Real VLOG entries always have Length > tombstoneVPtrLength.

func (*KV) Large added in v0.7.0

func (kv *KV) Large() bool

Large returns true if this KV's value is stored in the VLOG (too large for inline storage). Use db.FetchLarge(kv) to retrieve the value bytes.

func (*KV) String

func (z *KV) String() (r string)

type KVcloser added in v0.9.25

type KVcloser struct {
	KV
	// contains filtered or unexported fields
}

KVcloser is the result of a Find or GetKV call. The user must call Close() on the KVcloser when done copying any value out, or else memory and resource leaks will ensue.

func (*KVcloser) Close added in v0.9.25

func (s *KVcloser) Close()

Close must be called when done with the non-nil *KVcloser result of a GetKV or Find call. Otherwise memory and resource leaks will ensue. Close() is a no-op if called on a nil *KVcloser.

func (*KVcloser) Fetch added in v0.9.25

func (s *KVcloser) Fetch() error

Fetch retrieves the large value from the VLOG if this KV has a VPtr (kvc.Large() == true). For inline values, Fetch is a no-op. After Fetch returns nil error, kvc.Value holds the bytes. Fetch is only needed when LAZY_LARGE was used in the Find call.

Much more detail: after Find() returns, kvc.Value is always populated for inline (small) values, regardless of whether LAZY_SMALL was used. Only SKIP_VALUES will result in kvc.Values always being nil (even for non-nil Values in the db; SKIP_VALUES means we do not retrieve them). What "lazy" means in each case:

LAZY_LARGE: Value is not fetched from VLOG. kvc.Value == nil, kvc.Large() == true. You must call Fetch() to get bytes.

LAZY_SMALL: Value is present in kvc.Value, but it's a zero-copy alias into cache memory instead of an owned copy. The "lazy" here means "lazy about copying", not "lazy about providing the value".

So after Find() with LAZY_SMALL:

- Inline values: kvc.Value points into cache memory (or a copy if it fell back). Ready to use. - Large values (no LAZY_LARGE): auto-fetched, kvc.Value populated. Ready to use. - Large values (with LAZY_LARGE): kvc.Value == nil, need Fetch().

Requiring Fetch() only applies to the VLOG/large-value case. It doesn't need a LAZY_SMALL path because the small value is already there, it is just borrowed rather than copied.

The only obligation LAZY_SMALL imposes is that you must call Close() to release the cache pin (so the entry can be evicted).

Currently without LAZY_SMALL, Close() is a no-op (the value is an owned copy); but we reserve the right to alter this, and so require that users properly use Close() if kvc != nil.

type LeafNode

type LeafNode struct {
	CommonNode `zid:"0"` //

	Extents [FLEXTREE_LEAF_CAP]FlexTreeExtent `zid:"1"`

	// leaf linked list
	Prev NodeID `zid:"2"`
	Next NodeID `zid:"3"`
}

LeafNode contains keys and values. Size is fixed.

func (*LeafNode) DecodeMsg

func (z *LeafNode) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*LeafNode) EncodeMsg

func (z *LeafNode) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*LeafNode) Gstring

func (z *LeafNode) Gstring() (r string)

func (*LeafNode) MarshalMsg

func (z *LeafNode) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*LeafNode) Msgsize

func (z *LeafNode) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*LeafNode) String

func (le *LeafNode) String() (r string)

func (*LeafNode) UnmarshalMsg

func (z *LeafNode) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*LeafNode) UnmarshalMsgWithCfg

func (z *LeafNode) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type Metrics

type Metrics struct {
	Session                   bool
	LiveKeyCount              int64
	KV128BytesWritten         int64 // FlexSpace FLEXSPACE.KV.SLOT_BLOCKS file
	MemWALBytesWritten        int64 // FlexDB WAL (FLEXDB.MEMWAL)
	REDOLogBytesWritten       int64 // FLEXSPACE.REDO.LOG
	FlexTreePagesBytesWritten int64 // CoW FLEXTREE.PAGES + FLEXTREE.COMMIT
	VLOGBytesWritten          int64 // VLOG value log
	LogicalBytesWritten       int64 // user payload (key + value)
	TotalBytesWritten         int64 // sum of all physical writes

	// WriteAmp returns the write amplification factor (total physical / logical).
	// Returns 0 if no logical bytes have been written.
	WriteAmp float64 //  TotalBytesWritten / LogicalBytesWritten

	CumulativeWriteAmp float64 // totalPhysicalBytesWrit / totalLogicalBytesWrit

	// Garbage metrics computed from FlexSpace block usage tracking.
	// TotalFreeBytesInBlocks is the sum of dead (unused) bytes across all
	// non-empty blocks. A block with 1000 live bytes out of 4 MB has
	// 4_193_304 garbage bytes. Completely empty blocks are not counted
	// (they are already free for reuse).
	TotalFreeBytesInBlocks int64

	// BlocksInUse shows how many 4MB blocks FLEXSPACE.KV.SLOT_BLOCKS is using.
	BlocksInUse int64

	// BlocksWithLowUtilization is the count of non-empty blocks whose
	// utilization (live bytes / block size) is below the configured
	// LowBlockUtilizationPct threshold (default 25%).
	BlocksWithLowUtilization int64

	// KVBlocksTotalLiveBytes is the sum of live (used) bytes across all blocks,
	// as tracked by the block manager.
	KVBlocksTotalLiveBytes int64

	// KVBlocksOnDiskFootprintBytes is the on-disk footprint of bytes for FLEXSPACE.KV.SLOT_BLOCKS.
	KVBlocksOnDiskFootprintBytes int64

	// VlogOnDiskFootprintBytes is the on-disk footprints for LARGE.VLOG.
	VlogOnDiskFootprintBytes int64

	// LowBlockUtilizationPct we used; copied from Config
	// or what default we used if not set.
	LowBlockUtilizationPct float64

	// PiggybackGCRuns is the number of piggyback GC runs during this session.
	PiggybackGCRuns int64

	// PiggybackGCLastDurMs is the duration of the last piggyback GC run in milliseconds.
	PiggybackGCLastDurMs int64
	// contains filtered or unexported fields
}

Metrics holds byte-level write counters for computing write amplification.

func (*Metrics) String

func (z *Metrics) String() (r string)

type NodeID

type NodeID int32

NodeID acts as our union pointer.

A FlexTree LeafNode will have a NodeID > 0. A FlexTree InternalNode will have a NodeID < 0.

A zero NodeID is reserved as IllegalID so we can terminate our linked list of NodeIDs properly. It is naturally neither positive nor negative, so neither Leaf nor Internal.

const IllegalID NodeID = 0

IllegalID is used instead of nil pointers to represent an uninitialized NodeID.

func NewInternalID

func NewInternalID(index int32) NodeID

func NewLeafID

func NewLeafID(index int32) NodeID

func (*NodeID) DecodeMsg

func (z *NodeID) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (NodeID) EncodeMsg

func (z NodeID) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (NodeID) Index

func (id NodeID) Index() int32

func (NodeID) IsIllegal

func (id NodeID) IsIllegal() bool

func (NodeID) IsInternal

func (id NodeID) IsInternal() bool

func (NodeID) IsLeaf

func (id NodeID) IsLeaf() bool

func (NodeID) MarshalMsg

func (z NodeID) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (NodeID) Msgsize

func (z NodeID) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*NodeID) UnmarshalMsg

func (z *NodeID) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*NodeID) UnmarshalMsgWithCfg

func (z *NodeID) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type Offlen

type Offlen struct {
	Poff uint64 `zid:"0"`
	Len  uint64 `zid:"1"`
}

func (*Offlen) DecodeMsg

func (z *Offlen) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*Offlen) EncodeMsg

func (z *Offlen) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*Offlen) Gstring

func (z *Offlen) Gstring() (r string)

func (*Offlen) MarshalMsg

func (z *Offlen) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*Offlen) Msgsize

func (z *Offlen) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*Offlen) UnmarshalMsg

func (z *Offlen) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*Offlen) UnmarshalMsgWithCfg

func (z *Offlen) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

type PiggybackGCStats added in v0.6.2

type PiggybackGCStats struct {
	LastGCTime     time.Time
	LastGCDuration time.Duration
	TotalGCRuns    int64
}

PiggybackGCStats tracks statistics for piggyback GC runs.

type Pos

type Pos struct {
	Loff uint64 `zid:"0"` // logical offset of the position
	Idx  uint32 `zid:"1"` // index of the entry
	Diff uint32 `zid:"2"` // the current position within the entry
	// contains filtered or unexported fields
}

low level position on leaf nodes

func (*Pos) Backward

func (p *Pos) Backward(step uint64)

Backward moves the position backward by step bytes.

func (*Pos) DecodeMsg

func (z *Pos) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable We treat empty fields as if we read a Nil from the wire.

func (*Pos) EncodeMsg

func (z *Pos) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*Pos) Forward

func (p *Pos) Forward(step uint64)

Forward advances the position by step bytes.

func (*Pos) ForwardExtent

func (p *Pos) ForwardExtent()

ForwardExtent advances the position to the end of the current extent.

func (*Pos) GetLoff

func (p *Pos) GetLoff() uint64

GetLoff returns the current logical offset.

func (*Pos) GetPoff

func (p *Pos) GetPoff() uint64

GetPoff returns the physical offset at the current position.

func (*Pos) GetTag

func (p *Pos) GetTag() (uint16, bool)

GetTag returns the tag of the current extent. Returns (0, false) if not at the start of an extent.

func (*Pos) Gstring

func (z *Pos) Gstring() (r string)

func (*Pos) MarshalMsg

func (z *Pos) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*Pos) Msgsize

func (z *Pos) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*Pos) Rewind

func (p *Pos) Rewind()

Rewind moves the position to the start of the current leaf node. Bug fix: C code used i < fp->idx-1 which missed the last extent when idx==1.

func (*Pos) UnmarshalMsg

func (z *Pos) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*Pos) UnmarshalMsgWithCfg

func (z *Pos) UnmarshalMsgWithCfg(bts []byte, cfg *msgp.RuntimeConfig) (o []byte, err error)

func (*Pos) Valid

func (p *Pos) Valid() bool

Valid returns true if the Pos points to a valid extent.

type ReadOnlyDB added in v0.7.0

type ReadOnlyDB interface {
	Get(key string) ([]byte, bool, error)
	GetKV(key string) (kv *KVcloser, err error)
	Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)
	FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)
	FetchLarge(kv *KV) ([]byte, error)
	NewIter() *Iter
	Ascend(pivot string, iter func(key string, value []byte) bool)
	Descend(pivot string, iter func(key string, value []byte) bool)
	AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)
	DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)
	Len() int64
	LenBigSmall() (big, small int64)
}

ReadOnlyDB provides read-only access to the database within a View transaction. Methods must not be used after the callback returns. ReadOnlyDB is implemented by ReadOnlyTx.

To avoid indirect call overhead, this interface is not actually used in the View API. It is nice for documention purposes though; to get an overview of the available methods.

type ReadOnlyTx added in v0.9.0

type ReadOnlyTx struct {
	// contains filtered or unexported fields
}

ReadOnlyTx provides read-only access to the database within a View transaction. Methods must not be used after the callback returns.

func (*ReadOnlyTx) Ascend added in v0.9.0

func (roTx *ReadOnlyTx) Ascend(pivot string, iter func(key string, value []byte) bool)

Ascend iterates keys >= pivot in ascending order until iter returns false. Use pivot="" to start from the first key.

func (*ReadOnlyTx) AscendRange added in v0.9.0

func (roTx *ReadOnlyTx) AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)

AscendRange iterates keys in [greaterOrEqual, lessThan) in ascending order. Use "" for either bound to leave it open.

func (*ReadOnlyTx) Close added in v0.9.8

func (rtx *ReadOnlyTx) Close()

func (*ReadOnlyTx) Descend added in v0.9.0

func (roTx *ReadOnlyTx) Descend(pivot string, iter func(key string, value []byte) bool)

Descend iterates keys <= pivot in descending order until iter returns false. Use pivot="" to start from the last key.

func (*ReadOnlyTx) DescendRange added in v0.9.0

func (roTx *ReadOnlyTx) DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)

DescendRange iterates keys in (greaterThan, lessOrEqual] in descending order. Use "" for either bound to leave it open.

func (*ReadOnlyTx) FetchLarge added in v0.9.0

func (roTx *ReadOnlyTx) FetchLarge(kv *KV) ([]byte, error)

FetchLarge retrieves the full value for a KV. For VLOG-stored values it reads from disk; for inline values it returns kv.Value directly.

func (*ReadOnlyTx) Find added in v0.9.0

func (roTx *ReadOnlyTx) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)

Find seeks to a key relative to the given key per smod (Exact, GTE, GT, LTE, LT) and returns a KVcloser. nil KVcloser means not found. exact is true when the returned key equals the query. Supports LAZY_SMALL, LAZY_LARGE, and LAZY flags.

func (*ReadOnlyTx) FindIt added in v0.9.0

func (roTx *ReadOnlyTx) FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)

FindIt is like Find but also returns an iterator positioned at the result. The returned KVcloser is independent of the iterator - the user must call kvc.Close() when done with the initial result, and use the iterator for continued scanning. The iterator is auto-closed when the transaction ends.

func (*ReadOnlyTx) Get added in v0.9.0

func (roTx *ReadOnlyTx) Get(key string) (value []byte, found bool, err error)

Get retrieves the value for key. Returns (nil, false, nil) if not found or deleted. The returned []byte is a copy, safe to retain.

func (*ReadOnlyTx) GetKV added in v0.9.1

func (roTx *ReadOnlyTx) GetKV(key string) (kv *KVcloser, err error)

GetKV is equivalent to roTx.Find(Exact, key).

func (*ReadOnlyTx) Len added in v0.9.0

func (roTx *ReadOnlyTx) Len() int64

Len returns the total number of live keys in the database. See also LenBigSmall to get the count partitioned by the size class (small inline, large in the VLOG).

func (*ReadOnlyTx) LenBigSmall added in v0.9.0

func (roTx *ReadOnlyTx) LenBigSmall() (big, small int64)

LenBigSmall returns live key counts partitioned by size class. The returned count big counts the keys in VLOG, while small gives the number of keys with inline values. See also Len to get the total directly.

func (*ReadOnlyTx) NewIter added in v0.9.0

func (roTx *ReadOnlyTx) NewIter() *Iter

NewIter returns a new iterator over the database. It is automatically closed when the transaction ends. It is legal to Close it sooner if you want to release resources early.

type SearchModifier added in v0.6.2

type SearchModifier int

SearchModifier controls the matching behavior of Find.

const (
	// Exact matches only; like a hash table.
	Exact SearchModifier = 0
	// GTE finds the smallest key greater-than-or-equal to the query.
	GTE SearchModifier = 1
	// LTE finds the largest key less-than-or-equal to the query.
	LTE SearchModifier = 2
	// GT finds the smallest key strictly greater-than the query.
	GT SearchModifier = 3
	// LT finds the largest key strictly less-than the query.
	LT SearchModifier = 4

	// SKIP_VALUES returns KV.Values = nil; we make
	// no effort to retieve values, only keys. This is
	// useful for very fast full-table scans of just the keys,
	// when the user knows they will not inspect values
	// at all. In contrast, LAZY keeps open the option to
	// look at the values, but will pay some time in overhead.
	SKIP_VALUES SearchModifier = 16

	// LAZY_SMALL requests zero-copy return of inline values.
	// The returned KVcloser.Value aliases interval cache memory.
	// The caller MUST call Close() to release the cache pin.
	// If the result came from a memtable (not yet flushed),
	// then we must do a copy; Value is copied as usual
	// to avoid returning stale/non-linearizable data
	// (best-effort zero-copy).
	// LAZY_SMALL can be | or-ed with Exact, GTE, GT, LTE, or LT.
	LAZY_SMALL SearchModifier = 32

	// LAZY_LARGE means we do not fetch LARGE.VLOG values
	// automatically. The User must call FetchLarge() explicitly
	// when they are desired.
	// LAZY_LARGE can be | or-ed with Exact, GTE, GT, LTE, or LT.
	LAZY_LARGE SearchModifier = 64

	// LAZY means do both LAZY_SMALL and LAZY_LARGE
	LAZY SearchModifier = 96
)

type VPtr

type VPtr struct {
	Offset uint64 // byte offset in VLOG file
	Length uint64 // value length in bytes
}

VPtr is a pointer to a value stored in the VLOG file.

type VacuumKVStats

type VacuumKVStats struct {
	OldFileSize      int64
	NewFileSize      int64
	BytesReclaimed   int64
	PaddingReclaimed int64
	ExtentsRewritten int64
}

VacuumKVStats reports the results of a VacuumKV operation.

func (*VacuumKVStats) String

func (z *VacuumKVStats) String() (r string)

type VacuumVLOGStats added in v0.7.0

type VacuumVLOGStats struct {
	OldVLOGSize        int64
	NewVLOGSize        int64
	BytesReclaimed     int64
	EntriesCopied      int64
	IntervalsRewritten int64
}

VacuumVLOGStats reports the results of a VacuumVLOG operation.

func (*VacuumVLOGStats) String added in v0.7.0

func (z *VacuumVLOGStats) String() (r string)

type WritableDB added in v0.7.0

type WritableDB interface {
	ReadOnlyDB
	Put(key string, value []byte) error
	Delete(key string) error
	DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)
	Clear(includeLarge bool) (allGone bool, err error)
	Merge(key string, fn func(oldVal []byte, exists bool) (newVal []byte, write bool, doDelete bool)) error
	Sync() error
}

WritableDB extends ReadOnlyDB with mutation methods. Passed to Update transaction callbacks. Writes are applied immediately to the database (no buffering). Since there is only ever a single writer at a time, and no concurrent readers, there is no point in waiting to apply each action, and "reading your own writes" is often desired/expected/required. WritableDB is implemented by WriteTx.

To avoid indirect call overhead, this interface is not actually used in the Update API. It is nice for documention purposes though; to get an overview of the available methods.

type WriteTx added in v0.9.0

type WriteTx struct {
	// contains filtered or unexported fields
}

WriteTx extends ReadTx with mutation methods. Passed to Update transaction callbacks. Writes are applied immediately to the database (no buffering). Since there is only ever a single writer at a time, and no concurrent readers, there is no point in waiting to apply each action, and "reading your own writes" is often desired/expected/required.

func (*WriteTx) Ascend added in v0.9.0

func (tx *WriteTx) Ascend(pivot string, iter func(key string, value []byte) bool)

Ascend iterates keys >= pivot in ascending order until iter returns false. Use pivot="" to start from the first key.

func (*WriteTx) AscendRange added in v0.9.0

func (tx *WriteTx) AscendRange(greaterOrEqual, lessThan string, iter func(key string, value []byte) bool)

AscendRange iterates keys in [greaterOrEqual, lessThan) in ascending order. Use "" for either bound to leave it open.

func (*WriteTx) Clear added in v0.9.0

func (tx *WriteTx) Clear(includeLarge bool) (allGone bool, err error)

Clear deletes all keys. When includeLarge is false, only inline-value keys are deleted; VLOG-stored keys survive. When allGone is true, the database was reinitialized and all previously obtained iterators and KV references are invalidated.

func (*WriteTx) Close added in v0.9.8

func (wtx *WriteTx) Close()

func (*WriteTx) Delete added in v0.9.0

func (tx *WriteTx) Delete(key string) error

Delete removes key from the store.

func (*WriteTx) DeleteRange added in v0.9.0

func (tx *WriteTx) DeleteRange(includeLarge bool, begKey, endKey string, begInclusive, endInclusive bool) (n int64, allGone bool, err error)

DeleteRange deletes keys in the range [begKey, endKey] with configurable inclusivity. When includeLarge is false, VLOG-stored keys are skipped. If allGone is true, the entire database was reinitialized and all previously obtained iterators and KV references are invalidated.

func (*WriteTx) Descend added in v0.9.0

func (tx *WriteTx) Descend(pivot string, iter func(key string, value []byte) bool)

Descend iterates keys <= pivot in descending order until iter returns false. Use pivot="" to start from the last key.

func (*WriteTx) DescendRange added in v0.9.0

func (tx *WriteTx) DescendRange(lessOrEqual, greaterThan string, iter func(key string, value []byte) bool)

DescendRange iterates keys in (greaterThan, lessOrEqual] in descending order. Use "" for either bound to leave it open.

func (*WriteTx) FetchLarge added in v0.9.0

func (tx *WriteTx) FetchLarge(kv *KV) ([]byte, error)

FetchLarge retrieves the full value for a KV. For VLOG-stored values it reads from disk; for inline values it returns kv.Value directly.

func (*WriteTx) Find added in v0.9.0

func (tx *WriteTx) Find(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error)

Find seeks to a key relative to the given key per smod (Exact, GTE, GT, LTE, LT) and returns a KVcloser. Getting back a nil *KVcloser means not found. The returned bool 'exact' is true when the returned key equals the query. The smod supports bitwise OR-ing with the LAZY_SMALL, LAZY_LARGE, and LAZY flags to control value flow. e.g. LAZY allows near-zero-copy full-table key-only scans.

Warning: the user must call Close() on the kvc *KVcloser when done copying any value out, or else memory and resource leaks will ensue.

func (*WriteTx) FindIt added in v0.9.0

func (tx *WriteTx) FindIt(smod SearchModifier, key string) (kvc *KVcloser, exact bool, err error, it *Iter)

FindIt is like Find but also returns an iterator positioned at the result. The returned kvc *KVcloser is independent of the iterator - the user must call kvc.Close() when done with the initial result, and use the iterator for continued scanning. The iterator is auto-closed when the transaction ends.

func (*WriteTx) Get added in v0.9.0

func (tx *WriteTx) Get(key string) (value []byte, found bool, err error)

Get retrieves the value for key. Returns (nil, false, nil) if not found or deleted. The returned []byte is a copy, safe to retain.

func (*WriteTx) GetKV added in v0.9.1

func (tx *WriteTx) GetKV(key string) (kv *KVcloser, err error)

GetKV is equivalent to tx.Find(Exact, key).

func (*WriteTx) Len added in v0.9.0

func (tx *WriteTx) Len() int64

Len returns the total number of live keys in the database.

func (*WriteTx) LenBigSmall added in v0.9.0

func (tx *WriteTx) LenBigSmall() (big, small int64)

LenBigSmall returns live key counts partitioned by storage (VLOG vs inline).

func (*WriteTx) Merge added in v0.9.0

func (tx *WriteTx) Merge(key string, fn func(oldVal []byte, exists bool) (newVal []byte, write bool, doDelete bool)) error

Merge performs an atomic read-modify-write on key. fn is always called: when the key exists, oldVal is its current value and exists=true; when the key is absent or deleted, oldVal=nil and exists=false (allowing conditional creation). Return write=false to skip the write, or doDelete=true to delete the key.

This mirrors C's flexdb_merge: it looks up the old value across all layers (active memtable, inactive memtable, FlexSpace), applies the user function, and writes the result atomically.

Q: When should the callback return doWrite=false, doDelete=false? A: This means "do nothing." The main scenarios:

  1. Conditional creation: The callback inspects exists and decides not to create the key. E.g., "only increment if the key already exists" - if exists=false, return a bare return (all zeros = no-op).

  2. Conditional update: The callback inspects the old value and decides no change is needed. E.g., "set to X only if current value isn't already X."

  3. Read-only peek: The callback just wants to see the current value (though Get is simpler for that).

.

func (*WriteTx) NewIter added in v0.9.0

func (tx *WriteTx) NewIter() *Iter

NewIter returns a new iterator over the database. It is automatically closed when the transaction ends. It is legal to Close it sooner if you want to release resources early.

func (*WriteTx) Put added in v0.9.0

func (tx *WriteTx) Put(key string, value []byte) error

Put writes key -> value. len(value) == 0 is fine, if desired. Call Delete instead of Put to delete a key and any associated value.

Values of any size are accepted. Values > vlogInlineThreshold (64 bytes) are stored in the VLOG file; smaller values are stored inline in the FLEXSPACE.KV.SLOT_BLOCKS file with the keys.

Large values are written exactly once: to the VLOG. The WAL stores only the VPtr (16 bytes), not the full value.

Puts are not durably on disk until after the user has also completed a db.Sync() call. This allows the user to control the rate of fsyncs and trade that against their durability requirements.

func (*WriteTx) Sync added in v0.9.0

func (tx *WriteTx) Sync() error

Sync flushes all in-memory data to disk and fsyncs.

Directories

Path Synopsis
cmd
ydiff command
yload command
yogabench command
cmd/yogabench/main.go — Go benchmark harness for YogaDB, equivalent to the C FlexDB benchmarks in study/eurosys22-artifact/flexdb_benchmark/.
cmd/yogabench/main.go — Go benchmark harness for YogaDB, equivalent to the C FlexDB benchmarks in study/eurosys22-artifact/flexdb_benchmark/.
yvac command
yview command

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL