orchid_sync

package module

Version: v1.0.4 (latest) | Published: May 31, 2026 | License: MIT | Imports: 17 | Imported by: 0

Repository

github.com/0TrustCloud/orchid_sync

README ¶

OrchidSync

OrchidSync is a distributed, secure search engine built with Go, designed for high-performance information retrieval using the Okapi BM25 ranking algorithm and an embedded B+ Tree database.

Overview

The engine manages local document indexing and provides a mechanism for distributed search across a mesh network. Key features include:

Secure Networking: Utilizes a peer-to-peer network layer for communication.
Persistent Storage: Built for atomic, transaction-aware data persistence.
BM25 Ranking: Implements the Okapi BM25 algorithm, featuring term frequency saturation and length normalization.
NLP Pipeline: Built-in analyzer for tokenization, case normalization, and stop-word filtering.

Architecture

The engine ingests documents, tokenizes them, and stores the resulting terms in an inverted index in which each term maps to a list of postings.
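
The inverted-index idea described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the lowercase `posting` type and the `buildIndex` helper are hypothetical stand-ins, and whitespace splitting replaces the real analyzer.

```go
package main

import (
	"fmt"
	"strings"
)

// posting is a hypothetical stand-in for the package's Posting type.
type posting struct {
	DocID string
	TF    float64
}

// buildIndex maps each term to the postings of the documents that contain it.
// Tokenization here is plain lowercasing plus whitespace splitting.
func buildIndex(docs map[string]string) map[string][]posting {
	index := make(map[string][]posting)
	for docID, text := range docs {
		counts := make(map[string]int)
		for _, tok := range strings.Fields(strings.ToLower(text)) {
			counts[tok]++
		}
		for term, n := range counts {
			index[term] = append(index[term], posting{DocID: docID, TF: float64(n)})
		}
	}
	return index
}

func main() {
	index := buildIndex(map[string]string{
		"doc-001": "quick brown fox",
		"doc-002": "lazy brown dog brown",
	})
	// "brown" appears in both documents, "fox" in only one.
	fmt.Println(len(index["brown"]), len(index["fox"]))
}
```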

Core Components

Engine: The top-level wrapper that manages the database connection, the network node, and the scoring logic.
Indexer: Bridges the NLP analyzer and the storage layer to perform atomic updates on the inverted index.
Search: Processes queries, fetches posting lists, calculates document scores using BM25, and returns a ranked list of hits.
ScatterGather: Enables distributed search by broadcasting queries to the peer mesh and merging results.
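
The merge step of a scatter-gather search might look like the sketch below. It assumes each peer returns its own ranked hit list and that a document reported by several peers keeps its highest score; neither assumption is confirmed by the source, and `searchResult` and `mergeResults` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// searchResult is a hypothetical stand-in for the package's SearchResult type.
type searchResult struct {
	DocID string
	Score float64
}

// mergeResults combines per-peer hit lists, keeps the best score seen for each
// document, sorts by descending score (ties broken by DocID), and truncates
// the output to limit entries.
func mergeResults(perPeer [][]searchResult, limit int) []searchResult {
	best := make(map[string]float64)
	for _, hits := range perPeer {
		for _, h := range hits {
			if s, ok := best[h.DocID]; !ok || h.Score > s {
				best[h.DocID] = h.Score
			}
		}
	}
	merged := make([]searchResult, 0, len(best))
	for id, s := range best {
		merged = append(merged, searchResult{DocID: id, Score: s})
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Score != merged[j].Score {
			return merged[i].Score > merged[j].Score
		}
		return merged[i].DocID < merged[j].DocID
	})
	if limit >= 0 && len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

func main() {
	peerA := []searchResult{{DocID: "a", Score: 1}, {DocID: "b", Score: 3}}
	peerB := []searchResult{{DocID: "a", Score: 2}, {DocID: "c", Score: 0.5}}
	fmt.Println(mergeResults([][]searchResult{peerA, peerB}, 2))
}
```

Breaking ties on DocID keeps the merged order deterministic regardless of which peer answers first.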

Usage

Initializing the Engine

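// NewEngine takes already-constructed dependencies rather than a path and port:
// db is a *ultimate_db.DB, node a *secure_network.MeshNode, sysLog a *logger.LogDispatcher.
engine, err := NewEngine(db, node, sysLog)
if err != nil {
    log.Fatal(err)
}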

Indexing a Document

err := engine.Index("doc-001", "The quick brown fox jumps over the lazy dog.")

Performing a Search

results, err := engine.Search("quick fox", 10)

Testing

The package includes a suite of tests to ensure the integrity of the NLP pipeline, the BM25 mathematical implementation, and the engine's initialization routines. Run the tests using the standard Go toolchain:

go test -v ./...

Documentation ¶

Index ¶

Constants
func CompareTrees(local *MerkleNode, remote *MerkleNode, divergent *[]ultimate_db.PageID)
func ComputePageHashes(ctx context.Context, db *ultimate_db.DB, pageIDs []ultimate_db.PageID, ...) (map[ultimate_db.PageID]MerkleHash, error)
func CountLeaves(root *MerkleNode) int
func DiffTrees(local *MerkleTree, remote *MerkleTree) []ultimate_db.PageID
func TreeDepth(root *MerkleNode) int
func ValidateTree(root *MerkleNode) bool
type Analyzer
- func NewAnalyzer() *Analyzer
- func (a *Analyzer) Tokenize(text string) []string
- func (a *Analyzer) TokenizeSDF(script string, profileType string, target string) []string
type BM25Scorer
- func NewBM25Scorer() *BM25Scorer
- func (s *BM25Scorer) Score(tf float64, docLen float64, avgDocLen float64, totalDocs int, docFreq int) float64
type ClusterQuery
type Engine
- func NewEngine(db *ultimate_db.DB, node *secure_network.MeshNode, ...) (*Engine, error)
- func NewEngineWithNode(db *ultimate_db.DB, sdEngine *secure_data_format.SecureDataEngine, ...) (*Engine, error)
- func (e *Engine) Index(docID string, text string) error
- func (e *Engine) IndexSecureData(docID string, script string, profileType string, target string) error
- func (e *Engine) NetNode() *secure_network.MeshNode
- func (e *Engine) ScatterGather(ctx context.Context, query string, limit int) ([]SearchResult, error)
- func (e *Engine) Search(query string, limit int) ([]SearchResult, error)
type EngineState
type Indexer
- func NewIndexer(db *ultimate_db.DB, analyzer *Analyzer) *Indexer
- func (idx *Indexer) AddDocument(docID string, text string) (map[string]int, []string)
- func (idx *Indexer) AddSecureDocument(docID string, script string, profileType string, target string) (map[string]int, []string)
type MerkleHash
- func ComputePageHash(ctx context.Context, db *ultimate_db.DB, id ultimate_db.PageID) (MerkleHash, error)
- func ZeroHash() MerkleHash
- func (m MerkleHash) String() string
type MerkleNode
- func FlattenTree(root *MerkleNode) []*MerkleNode
- func (m *MerkleNode) IsLeaf() bool
type MerkleSyncRequest
type MerkleSyncResponse
type MerkleTree
- func BuildTree(ctx context.Context, db *ultimate_db.DB, pageIDs []ultimate_db.PageID) (*MerkleTree, error)
- func (t *MerkleTree) RootHash() MerkleHash
type Posting
type RoutingEntry
type SDFProfile
type SearchResult
type Shard

Constants ¶

const (
	DefaultVirtualNodes = 64
	MaxShardReplicas    = 3
)

const IndexPageID ultimate_db.PageID = 10

IndexPageID is strictly reserved for inverted index postings to avoid collisions.

const MetadataPageID ultimate_db.PageID = 11

Variables ¶

This section is empty.

Functions ¶

func CompareTrees ¶

func CompareTrees(
	local *MerkleNode,
	remote *MerkleNode,
	divergent *[]ultimate_db.PageID,
)

func ComputePageHashes ¶

func ComputePageHashes(
	ctx context.Context,
	db *ultimate_db.DB,
	pageIDs []ultimate_db.PageID,
	workers int,
) (map[ultimate_db.PageID]MerkleHash, error)

func CountLeaves ¶

func CountLeaves(root *MerkleNode) int

func DiffTrees ¶

func DiffTrees(
	local *MerkleTree,
	remote *MerkleTree,
) []ultimate_db.PageID
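
The source gives no description of how divergent pages are found. A common approach, sketched here under the assumption that both trees have the same shape, is to descend only into subtrees whose hashes differ and collect the page IDs of mismatched leaves. The `node` type and the `leaf`, `parent`, and `diff` helpers are hypothetical and use plain `uint64` page IDs instead of `ultimate_db.PageID`.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// node is a simplified, hypothetical stand-in for MerkleNode.
type node struct {
	Hash        [32]byte
	Left, Right *node
	PageID      uint64
	Leaf        bool
}

// leaf hashes the page content directly.
func leaf(id uint64, content string) *node {
	return &node{Hash: sha256.Sum256([]byte(content)), PageID: id, Leaf: true}
}

// parent hashes the concatenation of its children's hashes.
func parent(l, r *node) *node {
	buf := make([]byte, 0, 64)
	buf = append(buf, l.Hash[:]...)
	buf = append(buf, r.Hash[:]...)
	return &node{Hash: sha256.Sum256(buf), Left: l, Right: r}
}

// diff appends to out the page IDs of leaves whose hashes differ.
// Subtrees with equal hashes are skipped without being visited.
func diff(local, remote *node, out *[]uint64) {
	if local == nil || remote == nil || local.Hash == remote.Hash {
		return
	}
	if local.Leaf && remote.Leaf {
		*out = append(*out, local.PageID)
		return
	}
	diff(local.Left, remote.Left, out)
	diff(local.Right, remote.Right, out)
}

func main() {
	local := parent(parent(leaf(1, "a"), leaf(2, "b")), parent(leaf(3, "c"), leaf(4, "d")))
	remote := parent(parent(leaf(1, "a"), leaf(2, "B")), parent(leaf(3, "c"), leaf(4, "d")))
	var divergent []uint64
	diff(local, remote, &divergent)
	fmt.Println(divergent) // prints [2]
}
```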

func TreeDepth ¶

func TreeDepth(root *MerkleNode) int

func ValidateTree ¶

func ValidateTree(root *MerkleNode) bool

Types ¶

type Analyzer ¶

type Analyzer struct {
	// contains filtered or unexported fields
}

Analyzer processes raw text into indexable tokens. It is completely stateless and safe for concurrent use across multiple indexing goroutines.

func NewAnalyzer ¶

func NewAnalyzer() *Analyzer

NewAnalyzer initializes the analyzer with default stop words.

func (*Analyzer) Tokenize ¶

func (a *Analyzer) Tokenize(text string) []string

Tokenize splits strings, normalizes them, and filters stop words.
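
The three steps named above (splitting, normalization, stop-word filtering) can be sketched as follows. The stop-word list and the splitting rule are illustrative assumptions; the package's actual defaults are not documented here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a small illustrative list, not the package's default set.
var stopWords = map[string]struct{}{
	"the": {}, "over": {}, "a": {}, "an": {}, "and": {},
}

// tokenize lowercases the input, splits on any rune that is neither a letter
// nor a digit, and drops stop words.
func tokenize(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		if _, stop := stopWords[f]; !stop {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

func main() {
	fmt.Println(tokenize("The quick brown fox jumps over the lazy dog."))
	// prints [quick brown fox jumps lazy dog]
}
```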

func (*Analyzer) TokenizeSDF ¶

func (a *Analyzer) TokenizeSDF(script string, profileType string, target string) []string

TokenizeSDF extracts standard full-text tokens from an SDF script body while appending namespaced metadata facets for profile types and targets.
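
One way to picture "namespaced metadata facets" is shown below. The `profile:` and `target:` prefixes are assumptions chosen for illustration; the source does not state the namespace format, and `tokenizeWithFacets` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenizeWithFacets returns the body's full-text tokens followed by
// namespaced facet tokens. The prefixes are illustrative assumptions.
func tokenizeWithFacets(body, profileType, target string) []string {
	tokens := strings.Fields(strings.ToLower(body))
	if profileType != "" {
		tokens = append(tokens, "profile:"+profileType)
	}
	if target != "" {
		tokens = append(tokens, "target:"+target)
	}
	return tokens
}

func main() {
	fmt.Println(tokenizeWithFacets("grant read access", "GRANT", "node-7"))
	// prints [grant read access profile:GRANT target:node-7]
}
```

Because facets carry a prefix, a facet such as `profile:GRANT` cannot collide with the ordinary body token `grant`, so both can share one inverted index.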

type BM25Scorer ¶

type BM25Scorer struct {
	// contains filtered or unexported fields
}

BM25Scorer implements the Okapi BM25 ranking function. It is cluster-safe, deterministic, and optimized to score both unstructured text terms and namespaced SDF (Secure Data Format) structural facets.

func NewBM25Scorer ¶

func NewBM25Scorer() *BM25Scorer

NewBM25Scorer creates a scorer using Lucene-compatible defaults.

func (*BM25Scorer) Score ¶

func (s *BM25Scorer) Score(
	tf float64,
	docLen float64,
	avgDocLen float64,
	totalDocs int,
	docFreq int,
) float64

Score calculates the relevance score for a given token or structural facet.

Parameters:

tf         -> term frequency inside document
docLen     -> total token count for document (including structural metadata)
avgDocLen  -> average token count across corpus
totalDocs  -> total indexed documents
docFreq    -> number of docs containing the term/facet

BM25 Formula:

IDF * ((tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgDocLen))))

type ClusterQuery ¶

type ClusterQuery struct {
	QueryID   string `json:"query_id"`
	QueryText string `json:"query_text"`
	Limit     int    `json:"limit"`
}

type Engine ¶

type Engine struct {
	TotalDocs int
	AvgDocLen float64
	// contains filtered or unexported fields
}

func NewEngine ¶

func NewEngine(
	db *ultimate_db.DB,
	node *secure_network.MeshNode,
	sysLog *logger.LogDispatcher,
) (*Engine, error)

func NewEngineWithNode ¶

func NewEngineWithNode(
	db *ultimate_db.DB,
	sdEngine *secure_data_format.SecureDataEngine,
	signerKey []byte,
	sysLog *logger.LogDispatcher,
) (*Engine, error)

func (*Engine) Index ¶

func (e *Engine) Index(docID string, text string) error

func (*Engine) IndexSecureData ¶

func (e *Engine) IndexSecureData(docID string, script string, profileType string, target string) error

func (*Engine) NetNode ¶

func (e *Engine) NetNode() *secure_network.MeshNode

func (*Engine) ScatterGather ¶

func (e *Engine) ScatterGather(ctx context.Context, query string, limit int) ([]SearchResult, error)

func (*Engine) Search ¶

func (e *Engine) Search(query string, limit int) ([]SearchResult, error)

type EngineState ¶

type EngineState struct {
	TotalDocs int     `json:"total_docs"`
	AvgDocLen float64 `json:"avg_doc_len"`
}

type Indexer ¶

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer bridges the NLP analyzer pipeline and the ultimate_db storage layer.

func NewIndexer ¶

func NewIndexer(db *ultimate_db.DB, analyzer *Analyzer) *Indexer

NewIndexer initializes the pipeline worker.

func (*Indexer) AddDocument ¶

func (idx *Indexer) AddDocument(docID string, text string) (map[string]int, []string)

AddDocument tokenizes raw text, calculates term frequencies, and returns them in aggregated form (a map[string]int and a []string) to prevent write amplification in the lower storage layers.
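
The aggregation step might be sketched like this. Interpreting the second return value as the list of distinct terms in first-seen order is an assumption; the source does not say what the []string holds. `termFrequencies` is a hypothetical helper.

```go
package main

import "fmt"

// termFrequencies counts how often each token occurs and also returns the
// distinct terms in first-seen order (an assumed meaning for the []string).
func termFrequencies(tokens []string) (map[string]int, []string) {
	freqs := make(map[string]int)
	var terms []string
	for _, tok := range tokens {
		if freqs[tok] == 0 {
			terms = append(terms, tok)
		}
		freqs[tok]++
	}
	return freqs, terms
}

func main() {
	freqs, terms := termFrequencies([]string{"quick", "fox", "quick"})
	fmt.Println(freqs, terms) // prints map[fox:1 quick:2] [quick fox]
}
```

Aggregating per document first means each term needs one posting write instead of one write per occurrence.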

func (*Indexer) AddSecureDocument ¶

func (idx *Indexer) AddSecureDocument(docID string, script string, profileType string, target string) (map[string]int, []string)

AddSecureDocument tokenizes an SDF script alongside its metadata facets and returns computed term maps.

type MerkleHash ¶

type MerkleHash [32]byte

func ComputePageHash ¶

func ComputePageHash(
	ctx context.Context,
	db *ultimate_db.DB,
	id ultimate_db.PageID,
) (MerkleHash, error)

func ZeroHash ¶

func ZeroHash() MerkleHash

func (MerkleHash) String ¶

func (m MerkleHash) String() string

type MerkleNode ¶

type MerkleNode struct {
	Hash      MerkleHash
	Left      *MerkleNode
	Right     *MerkleNode
	Parent    *MerkleNode
	PageID    ultimate_db.PageID
	Leaf      bool
	Timestamp int64
}

func FlattenTree ¶

func FlattenTree(root *MerkleNode) []*MerkleNode

func (*MerkleNode) IsLeaf ¶

func (m *MerkleNode) IsLeaf() bool

type MerkleSyncRequest ¶

type MerkleSyncRequest struct {
	NodeID    string        `json:"node_id"`
	RootHash  string        `json:"root_hash"`
	PageIDs   []uint64      `json:"page_ids"`
	Requested time.Time     `json:"requested"`
	Timeout   time.Duration `json:"timeout"`
}

type MerkleSyncResponse ¶

type MerkleSyncResponse struct {
	NodeID         string   `json:"node_id"`
	RemoteRootHash string   `json:"remote_root_hash"`
	DivergentPages []uint64 `json:"divergent_pages"`
	Synced         bool     `json:"synced"`
	Error          string   `json:"error,omitempty"`
}

type MerkleTree ¶

type MerkleTree struct {
	Root      *MerkleNode
	PageCount int
	CreatedAt time.Time
}

func BuildTree ¶

func BuildTree(
	ctx context.Context,
	db *ultimate_db.DB,
	pageIDs []ultimate_db.PageID,
) (*MerkleTree, error)
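
A bottom-up Merkle tree construction could look like the sketch below. It hashes in-memory byte slices with SHA-256 and pairs an odd trailing node with itself; both choices are assumptions, since the source shows only that MerkleHash is 32 bytes. The `node` type and `build` function are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// node is a simplified, hypothetical stand-in for MerkleNode.
type node struct {
	Hash        [32]byte
	Left, Right *node
}

// build hashes each page into a leaf, then repeatedly pairs adjacent nodes
// until a single root remains. An unpaired node is combined with itself.
func build(pages [][]byte) *node {
	if len(pages) == 0 {
		return nil
	}
	level := make([]*node, 0, len(pages))
	for _, p := range pages {
		level = append(level, &node{Hash: sha256.Sum256(p)})
	}
	for len(level) > 1 {
		next := make([]*node, 0, (len(level)+1)/2)
		for i := 0; i < len(level); i += 2 {
			l := level[i]
			r := l
			if i+1 < len(level) {
				r = level[i+1]
			}
			buf := make([]byte, 0, 64)
			buf = append(buf, l.Hash[:]...)
			buf = append(buf, r.Hash[:]...)
			next = append(next, &node{Hash: sha256.Sum256(buf), Left: l, Right: r})
		}
		level = next
	}
	return level[0]
}

func main() {
	a := build([][]byte{[]byte("p1"), []byte("p2"), []byte("p3")})
	b := build([][]byte{[]byte("p1"), []byte("p2"), []byte("p3")})
	c := build([][]byte{[]byte("p1"), []byte("p2"), []byte("changed")})
	fmt.Println(a.Hash == b.Hash, a.Hash == c.Hash) // prints true false
}
```

Two replicas with identical pages produce identical root hashes, so a single 32-byte comparison is enough to tell whether any synchronization is needed.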

func (*MerkleTree) RootHash ¶

func (t *MerkleTree) RootHash() MerkleHash

type Posting ¶

type Posting struct {
	DocID string  `json:"doc_id"`
	TF    float64 `json:"tf"`
}

Posting represents a single document's relationship to a specific term. It holds the document ID and the term frequency (TF) that the BM25 scorer needs.

type RoutingEntry ¶

type RoutingEntry struct {
	ID       string
	Address  string
	ShardIDs []uint64
	Healthy  bool
	Load     int64
}

RoutingEntry represents a peer that owns shards within the FabricStack cluster.

type SDFProfile ¶

type SDFProfile string

SDFProfile defines the supported structural classifications for the Secure Data Format.

const (
	ProfileStructuredLog SDFProfile = "LOG"
	ProfileGrant         SDFProfile = "GRANT"
	ProfileProofOfPoss   SDFProfile = "POP"
)

type SearchResult ¶

type SearchResult struct {
	DocID string  `json:"doc_id"`
	Score float64 `json:"score"`
}

type Shard ¶

type Shard struct {
	ID       uint64
	Owner    string
	Replicas []string

	DocCount uint64
}

Shard represents a logical index partition distributed across the mesh network.
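
The DefaultVirtualNodes constant suggests that shards are placed with consistent hashing, although the source does not confirm this. Under that assumption, a minimal ring with 64 virtual nodes per peer might look like the sketch below; the `ring` type, the FNV hash, and the `peer#n` labeling are all hypothetical choices.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// virtualNodes mirrors DefaultVirtualNodes; its use here is an assumption.
const virtualNodes = 64

// ring is a hypothetical consistent-hash ring mapping hash points to peers.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

// hashKey hashes a string with 32-bit FNV-1a.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places virtualNodes points on the ring for every peer.
func newRing(peers []string) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, p := range peers {
		for v := 0; v < virtualNodes; v++ {
			pt := hashKey(fmt.Sprintf("%s#%d", p, v))
			r.points = append(r.points, pt)
			r.owner[pt] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the peer owning the first point at or after the key's hash,
// wrapping around to the start of the ring when necessary.
func (r *ring) lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"peer-a", "peer-b", "peer-c"})
	fmt.Println(r.lookup("doc-001") == r.lookup("doc-001")) // prints true
}
```

Spreading each peer over many virtual points evens out the share of keys per peer and limits how many keys move when a peer joins or leaves.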
