minhashlsh

package module
v0.0.0-...-f9372a2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 6, 2026 License: MIT Imports: 8 Imported by: 0

README

Minhash LSH in Golang

GoDoc

This is a fork of ekzhu/minhash-lsh with generics, two backend implementations, and significant performance optimizations.

Install: go get github.com/stillmatic/minhash-lsh

Two backends

MinhashLSH — batch mode (sorted slices)

Best when you add all entries first, then query many times. Requires an explicit Index() call after adding entries.

  • Insert: O(1) append per band
  • Index: O(n log n) sort per band (called once)
  • Query: O(log n) binary search per band, zero allocations
MinhashLSHMap — online/streaming mode (hash maps)

Best for interleaved add/query workloads where new entries must be immediately searchable. No Index() call needed.

  • Insert: O(1) map insert per band
  • Query: O(1) map lookup per band, zero allocations

At 100k entries, the map backend is 5,000x faster than re-indexing sorted slices for interleaved add+query workloads.

Example

Deduplicating text using the map backend:

package minhashlsh_test

import (
	"fmt"
	minhashlsh "github.com/stillmatic/minhash-lsh"
)

type newsItem struct {
	URL         string
	Description string
}

func ExampleMinhashLSHMap() {
	newsItems := []newsItem{
		{URL: "https://example.com/1", Description: "This is a test"},
		{URL: "https://example.com/2", Description: "This is another test"},
		{URL: "https://example.com/3", Description: "This is a test"},
	}

	// key on the URL, so instantiate with `string` generic
	lsh := minhashlsh.NewMinhashLSHMapWithSize[string](88, 0.7, len(newsItems))
	for _, item := range newsItems {
		mh := minhashlsh.NewMinhashWithDefaults()
		mh.Push([]byte(item.Description))
		lsh.Add(item.URL, mh.Signature())
	}

	// no need to build index with map backend

	// find duplicate entries
	dupeKeys := make(map[string]struct{})
	for _, item := range newsItems {
		if _, ok := dupeKeys[item.URL]; ok {
			continue
		}
		mh := minhashlsh.NewMinhashWithDefaults()
		mh.Push([]byte(item.Description))
		queryRes := lsh.Query(mh.Signature())
		if len(queryRes) == 0 {
			continue
		}

		for _, res := range queryRes {
			if res != item.URL {
				dupeKeys[res] = struct{}{}
			}
		}
	}
	// should be 1 duplicate to remove
	fmt.Println(dupeKeys)
}

Performance

See docs/performance-improvements.md for full benchmark tables. Highlights vs the original implementation:

Metric Before After
Insert allocs (100k) 1,599,931 199,917 (8x fewer)
Query allocs 30 0
Query latency (100k) 1,375 ns 556 ns (1.9x)
Interleaved add+query (100k) 37,400,000 ns 6,600 ns (5,700x)

Documentation

Overview

Package minhashlsh implements Locality Sensitive Hashing using MinHash signatures

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Minhash

type Minhash struct {
	// contains filtered or unexported fields
}

Minhash represents a MinHash object

func NewMinhash

func NewMinhash(seed int64, numHash int) *Minhash

NewMinhash initialize a MinHash object with a seed and the number of hash functions.

func NewMinhashWithDefaults

func NewMinhashWithDefaults() *Minhash

NewMinhashWithDefaults initializes a MinHash object with default seed and number of hashes. Matches Python implementation: https://github.com/ekzhu/datasketch/blob/5512549871d29c55b3cd8e99a79fcbd14859b77d/datasketch/minhash.py#L69C9-L69C17

func (*Minhash) Merge

func (m *Minhash) Merge(o *Minhash)

Merge combines the signature of the other Minhash with this one, making this one carry the signature of the union.

func (*Minhash) Push

func (m *Minhash) Push(b []byte)

Push a new value to the MinHash object. The value should be serialized to byte slice.

func (*Minhash) Signature

func (m *Minhash) Signature() []uint64

Signature exports the MinHash as a list of hash values.

type MinhashLSH

type MinhashLSH[T comparable] struct {
	// contains filtered or unexported fields
}

MinhashLSH represents a MinHash LSH implemented using LSH Forest (http://ilpubs.stanford.edu:8090/678/1/2005-14.pdf). It supports query-time setting of the MinHash LSH parameters L (number of bands) and K (number of hash functions per band).

func NewMinhashLSH

func NewMinhashLSH[T comparable](numHash int, threshold float64, initSize int) *MinhashLSH[T]

NewMinhashLSH is the default constructor uses 32 bit hash value with pre-allocation of hash tables.

func NewMinhashLSH16

func NewMinhashLSH16[T comparable](numHash int, threshold float64, initSize int) *MinhashLSH[T]

NewMinhashLSH16 uses 16-bit hash values and pre-allocation of hash tables. MinHash signatures with 64 or 32 bit hash values will have their hash values trimmed.

func NewMinhashLSH32

func NewMinhashLSH32[T comparable](numHash int, threshold float64, initSize int) *MinhashLSH[T]

NewMinhashLSH32 uses 32-bit hash values and pre-allocation of hash tables. MinHash signatures with 64 bit hash values will have their hash values trimmed.

func NewMinhashLSH64

func NewMinhashLSH64[T comparable](numHash int, threshold float64, initSize int) *MinhashLSH[T]

NewMinhashLSH64 uses 64-bit hash values and pre-allocation of hash tables.

func NewMinhashLSHWithDefaults

func NewMinhashLSHWithDefaults[T comparable](initSize int) *MinhashLSH[T]

NewMinhashLSHWithDefaults is the default constructor with default parameters

func (*MinhashLSH[T]) Add

func (f *MinhashLSH[T]) Add(key T, sig []uint64)

Add a key with MinHash signature into the index. The key won't be searchable until Index() is called.

func (*MinhashLSH[T]) Index

func (f *MinhashLSH[T]) Index()

Index makes all the keys added searchable.

func (*MinhashLSH[T]) Params

func (f *MinhashLSH[T]) Params() (k, l int)

Params returns the LSH parameters k and l

func (*MinhashLSH[T]) Query

func (f *MinhashLSH[T]) Query(sig []uint64) []T

Query returns candidate keys given the query signature.

type MinhashLSHMap

type MinhashLSHMap[T comparable] struct {
	// contains filtered or unexported fields
}

MinhashLSHMap is a map-backed Minhash LSH optimized for interleaved Add/Query workloads. Entries are immediately searchable after Add() with O(1) insert and O(1) lookup per band — no sorting or explicit Index() call required.

Example
package main

import (
	"fmt"
	minhashlsh "github.com/stillmatic/minhash-lsh"
)

type newsItem struct {
	URL         string
	Description string
}

func main() {
	newsItems := []newsItem{
		{URL: "https://example.com/1", Description: "This is a test"},
		{URL: "https://example.com/2", Description: "This is another test"},
		{URL: "https://example.com/3", Description: "This is a test"},
	}

	// key on the URL, so instantiate with `string` generic
	lsh := minhashlsh.NewMinhashLSHMapWithSize[string](88, 0.7, len(newsItems))
	for _, item := range newsItems {
		mh := minhashlsh.NewMinhashWithDefaults()
		mh.Push([]byte(item.Description))
		lsh.Add(item.URL, mh.Signature())
	}

	// no need to build index with map backend

	// find duplicate entries
	dupeKeys := make(map[string]struct{})
	for _, item := range newsItems {
		if _, ok := dupeKeys[item.URL]; ok {
			//already a duplicate
			continue
		}
		mh := minhashlsh.NewMinhashWithDefaults()
		mh.Push([]byte(item.Description))
		queryRes := lsh.Query(mh.Signature())
		if len(queryRes) == 0 {
			continue
		}

		for _, res := range queryRes {
			if res != item.URL {
				dupeKeys[res] = struct{}{}
			}
		}
	}
	// should be 1 duplicate to remove
	fmt.Println(dupeKeys)
}

func NewMinhashLSHMap

func NewMinhashLSHMap[T comparable](numHash int, threshold float64) *MinhashLSHMap[T]

func NewMinhashLSHMapWithSize

func NewMinhashLSHMapWithSize[T comparable](numHash int, threshold float64, initSize int) *MinhashLSHMap[T]

func (*MinhashLSHMap[T]) Add

func (f *MinhashLSHMap[T]) Add(key T, sig []uint64)

func (*MinhashLSHMap[T]) Query

func (f *MinhashLSHMap[T]) Query(sig []uint64) []T

Query returns candidate keys given the query signature.

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL