orchid_sync

v1.0.4 Latest
Published: May 31, 2026 License: MIT Imports: 17 Imported by: 0

README

OrchidSync

OrchidSync is a distributed, secure search engine built with Go, designed for high-performance information retrieval using the Okapi BM25 ranking algorithm and an embedded B+ Tree database.

Overview

The engine manages local document indexing and provides a mechanism for distributed search across a mesh network. Key features include:

  • Secure Networking: Uses a peer-to-peer mesh network layer for communication between nodes.
  • Persistent Storage: Persists the index in the embedded B+ Tree database with atomic, transaction-aware writes.
  • BM25 Ranking: Implements the Okapi BM25 algorithm, featuring term frequency saturation and length normalization.
  • NLP Pipeline: Built-in analyzer for tokenization, case normalization, and stop-word filtering.

Architecture

The engine ingests documents, tokenizes them, and stores the resulting terms in an inverted index, where each term maps to a list of postings.
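As a rough illustration of the inverted index described above, the sketch below keeps posting lists in an in-memory map. The InvertedIndex type and its Add method are hypothetical stand-ins, not part of this package; the real engine persists postings through its storage layer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Posting mirrors the shape of the package's Posting type:
// one document's term frequency for a given term.
type Posting struct {
	DocID string
	TF    float64
}

// InvertedIndex is a hypothetical in-memory stand-in for the
// persistent index: each term maps to its posting list.
type InvertedIndex map[string][]Posting

// Add counts term occurrences in text and appends one posting
// per distinct term. Tokenization here is a plain whitespace split.
func (idx InvertedIndex) Add(docID, text string) {
	counts := make(map[string]int)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		counts[tok]++
	}
	for term, n := range counts {
		idx[term] = append(idx[term], Posting{DocID: docID, TF: float64(n)})
	}
}

func main() {
	idx := InvertedIndex{}
	idx.Add("doc-001", "quick brown fox")
	idx.Add("doc-002", "quick quick dog")

	// Print terms in sorted order so the output is stable.
	terms := make([]string, 0, len(idx))
	for t := range idx {
		terms = append(terms, t)
	}
	sort.Strings(terms)
	for _, t := range terms {
		fmt.Println(t, idx[t])
	}
}
```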

Core Components

  • Engine: The top-level wrapper that manages the database connection, the network node, and the scoring logic.
  • Indexer: Bridges the NLP analyzer and the storage layer to perform atomic updates on the inverted index.
  • Search: Processes queries, fetches posting lists, calculates document scores using BM25, and returns a ranked list of hits.
  • ScatterGather: Enables distributed search by broadcasting queries to the peer mesh and merging results.

Usage

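Initializing the Engine
// db, node, and sysLog must be constructed beforehand; their types are
// *ultimate_db.DB, *secure_network.MeshNode, and *logger.LogDispatcher.
engine, err := NewEngine(db, node, sysLog)
if err != nil {
    log.Fatal(err)
}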

Indexing a Document
err = engine.Index("doc-001", "The quick brown fox jumps over the lazy dog.")

Searching
results, err := engine.Search("quick fox", 10)

Testing

The package includes a suite of tests to ensure the integrity of the NLP pipeline, the BM25 mathematical implementation, and the engine's initialization routines. Run the tests using the standard Go toolchain:

go test -v ./...

Documentation

Constants

const (
	DefaultVirtualNodes = 64
	MaxShardReplicas    = 3
)

const IndexPageID ultimate_db.PageID = 10

IndexPageID is strictly reserved for inverted index postings to avoid collisions.

const MetadataPageID ultimate_db.PageID = 11

Variables

This section is empty.

Functions

func CompareTrees

func CompareTrees(
	local *MerkleNode,
	remote *MerkleNode,
	divergent *[]ultimate_db.PageID,
)

func ComputePageHashes

func ComputePageHashes(
	ctx context.Context,
	db *ultimate_db.DB,
	pageIDs []ultimate_db.PageID,
	workers int,
) (map[ultimate_db.PageID]MerkleHash, error)

func CountLeaves

func CountLeaves(root *MerkleNode) int

func DiffTrees

func DiffTrees(
	local *MerkleTree,
	remote *MerkleTree,
) []ultimate_db.PageID

func TreeDepth

func TreeDepth(root *MerkleNode) int

func ValidateTree

func ValidateTree(root *MerkleNode) bool
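The Merkle helpers above are listed without descriptions. As a rough illustration of the general technique their names suggest, the sketch below builds a hash tree over page contents and walks two trees in lockstep to collect divergent page IDs. Everything here is an assumption: the build and diff functions, the use of SHA-256, the promotion of an odd node, and the in-memory page map are hypothetical and may differ from what BuildTree and CompareTrees actually do.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type hash [32]byte

// node is a simplified, hypothetical counterpart of MerkleNode.
type node struct {
	Hash        hash
	Left, Right *node
	PageID      uint64
	Leaf        bool
}

// build hashes each page into a leaf, then pairs nodes level by level
// until one root remains. An unpaired node is promoted unchanged.
func build(pages map[uint64][]byte, ids []uint64) *node {
	if len(ids) == 0 {
		return nil
	}
	level := make([]*node, 0, len(ids))
	for _, id := range ids {
		level = append(level, &node{Hash: sha256.Sum256(pages[id]), PageID: id, Leaf: true})
	}
	for len(level) > 1 {
		var next []*node
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i])
				continue
			}
			l, r := level[i], level[i+1]
			buf := make([]byte, 0, 64)
			buf = append(buf, l.Hash[:]...)
			buf = append(buf, r.Hash[:]...)
			next = append(next, &node{Hash: sha256.Sum256(buf), Left: l, Right: r})
		}
		level = next
	}
	return level[0]
}

// diff descends only into subtrees whose hashes differ and records the
// page ID of every differing leaf. It assumes both trees were built
// over the same ordered list of page IDs, so their shapes match.
func diff(local, remote *node, divergent *[]uint64) {
	if local == nil || remote == nil || local.Hash == remote.Hash {
		return
	}
	if local.Leaf || remote.Leaf {
		*divergent = append(*divergent, local.PageID)
		return
	}
	diff(local.Left, remote.Left, divergent)
	diff(local.Right, remote.Right, divergent)
}

func main() {
	ids := []uint64{10, 11, 12}
	local := build(map[uint64][]byte{10: []byte("postings"), 11: []byte("meta-v1"), 12: []byte("data")}, ids)
	remote := build(map[uint64][]byte{10: []byte("postings"), 11: []byte("meta-v2"), 12: []byte("data")}, ids)

	var divergent []uint64
	diff(local, remote, &divergent)
	fmt.Println(divergent) // prints [11]
}
```

Because matching subtree hashes are skipped, the comparison only visits branches that contain a changed page.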

Types

type Analyzer

type Analyzer struct {
	// contains filtered or unexported fields
}

Analyzer processes raw text into indexable tokens. It is completely stateless and safe for concurrent use across multiple indexing goroutines.

func NewAnalyzer

func NewAnalyzer() *Analyzer

NewAnalyzer initializes the analyzer with default stop words.

func (*Analyzer) Tokenize

func (a *Analyzer) Tokenize(text string) []string

Tokenize splits text into tokens, normalizes their case, and filters out stop words.

func (*Analyzer) TokenizeSDF

func (a *Analyzer) TokenizeSDF(script string, profileType string, target string) []string

TokenizeSDF extracts standard full-text tokens from an SDF script body while appending namespaced metadata facets for profile types and targets.

type BM25Scorer

type BM25Scorer struct {
	// contains filtered or unexported fields
}

BM25Scorer implements the Okapi BM25 ranking function. It is cluster-safe, deterministic, and optimized to score both unstructured text terms and namespaced SDF (Secure Data Format) structural facets.

func NewBM25Scorer

func NewBM25Scorer() *BM25Scorer

NewBM25Scorer creates a scorer using Lucene-compatible defaults.

func (*BM25Scorer) Score

func (s *BM25Scorer) Score(
	tf float64,
	docLen float64,
	avgDocLen float64,
	totalDocs int,
	docFreq int,
) float64

Score calculates the relevance score for a given token or structural facet.

Parameters:

tf         -> term frequency inside document
docLen     -> total token count for document (including structural metadata)
avgDocLen  -> average token count across corpus
totalDocs  -> total indexed documents
docFreq    -> number of docs containing the term/facet

BM25 Formula:

IDF * ((tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgDocLen))))

type ClusterQuery

type ClusterQuery struct {
	QueryID   string `json:"query_id"`
	QueryText string `json:"query_text"`
	Limit     int    `json:"limit"`
}

type Engine

type Engine struct {
	TotalDocs int
	AvgDocLen float64
	// contains filtered or unexported fields
}

func NewEngine

func NewEngine(
	db *ultimate_db.DB,
	node *secure_network.MeshNode,
	sysLog *logger.LogDispatcher,
) (*Engine, error)

func NewEngineWithNode

func NewEngineWithNode(
	db *ultimate_db.DB,
	sdEngine *secure_data_format.SecureDataEngine,
	signerKey []byte,
	sysLog *logger.LogDispatcher,
) (*Engine, error)

func (*Engine) Index

func (e *Engine) Index(docID string, text string) error

func (*Engine) IndexSecureData

func (e *Engine) IndexSecureData(docID string, script string, profileType string, target string) error

func (*Engine) NetNode

func (e *Engine) NetNode() *secure_network.MeshNode

func (*Engine) ScatterGather

func (e *Engine) ScatterGather(ctx context.Context, query string, limit int) ([]SearchResult, error)

func (*Engine) Search

func (e *Engine) Search(query string, limit int) ([]SearchResult, error)

type EngineState

type EngineState struct {
	TotalDocs int     `json:"total_docs"`
	AvgDocLen float64 `json:"avg_doc_len"`
}

type Indexer

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer bridges the NLP analyzer pipeline and the ultimate_db storage layer.

func NewIndexer

func NewIndexer(db *ultimate_db.DB, analyzer *Analyzer) *Indexer

NewIndexer initializes the pipeline worker.

func (*Indexer) AddDocument

func (idx *Indexer) AddDocument(docID string, text string) (map[string]int, []string)

AddDocument tokenizes raw text, calculates term frequencies, and returns the aggregated term data, which helps prevent write amplification in the lower storage layers.
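The aggregation step can be sketched as below. The meaning of the second return value is not documented, so this example assumes it is the list of distinct terms in first-seen order; termFrequencies is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// termFrequencies counts how often each token occurs and records the
// distinct terms in first-seen order. Aggregating up front means each
// term needs only one posting update per document.
func termFrequencies(tokens []string) (map[string]int, []string) {
	freqs := make(map[string]int, len(tokens))
	var terms []string
	for _, tok := range tokens {
		if freqs[tok] == 0 {
			terms = append(terms, tok)
		}
		freqs[tok]++
	}
	return freqs, terms
}

func main() {
	freqs, terms := termFrequencies([]string{"quick", "fox", "quick"})
	fmt.Println(freqs, terms)
	// prints map[fox:1 quick:2] [quick fox]
}
```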

func (*Indexer) AddSecureDocument

func (idx *Indexer) AddSecureDocument(docID string, script string, profileType string, target string) (map[string]int, []string)

AddSecureDocument tokenizes an SDF script alongside its metadata facets and returns computed term maps.

type MerkleHash

type MerkleHash [32]byte

func ComputePageHash

func ComputePageHash(
	ctx context.Context,
	db *ultimate_db.DB,
	id ultimate_db.PageID,
) (MerkleHash, error)

func ZeroHash

func ZeroHash() MerkleHash

func (MerkleHash) String

func (m MerkleHash) String() string

type MerkleNode

type MerkleNode struct {
	Hash      MerkleHash
	Left      *MerkleNode
	Right     *MerkleNode
	Parent    *MerkleNode
	PageID    ultimate_db.PageID
	Leaf      bool
	Timestamp int64
}

func FlattenTree

func FlattenTree(root *MerkleNode) []*MerkleNode

func (*MerkleNode) IsLeaf

func (m *MerkleNode) IsLeaf() bool

type MerkleSyncRequest

type MerkleSyncRequest struct {
	NodeID    string        `json:"node_id"`
	RootHash  string        `json:"root_hash"`
	PageIDs   []uint64      `json:"page_ids"`
	Requested time.Time     `json:"requested"`
	Timeout   time.Duration `json:"timeout"`
}

type MerkleSyncResponse

type MerkleSyncResponse struct {
	NodeID         string   `json:"node_id"`
	RemoteRootHash string   `json:"remote_root_hash"`
	DivergentPages []uint64 `json:"divergent_pages"`
	Synced         bool     `json:"synced"`
	Error          string   `json:"error,omitempty"`
}

type MerkleTree

type MerkleTree struct {
	Root      *MerkleNode
	PageCount int
	CreatedAt time.Time
}

func BuildTree

func BuildTree(
	ctx context.Context,
	db *ultimate_db.DB,
	pageIDs []ultimate_db.PageID,
) (*MerkleTree, error)

func (*MerkleTree) RootHash

func (t *MerkleTree) RootHash() MerkleHash

type Posting

type Posting struct {
	DocID string  `json:"doc_id"`
	TF    float64 `json:"tf"`
}

Posting represents a single document's relationship to a specific term. This holds the exact metrics needed for the BM25 scorer.

type RoutingEntry

type RoutingEntry struct {
	ID       string
	Address  string
	ShardIDs []uint64
	Healthy  bool
	Load     int64
}

RoutingEntry represents a peer that owns shards within the FabricStack cluster.

type SDFProfile

type SDFProfile string

SDFProfile defines the supported structural classifications for the Secure Data Format.

const (
	ProfileStructuredLog SDFProfile = "LOG"
	ProfileGrant         SDFProfile = "GRANT"
	ProfileProofOfPoss   SDFProfile = "POP"
)

type SearchResult

type SearchResult struct {
	DocID string  `json:"doc_id"`
	Score float64 `json:"score"`
}

type Shard

type Shard struct {
	ID       uint64
	Owner    string
	Replicas []string

	DocCount uint64
}

Shard represents a logical index partition distributed across the mesh network.
