NornicDB Search Methodology¶
Complete guide to NornicDB's multi-method search architecture.
🎯 Overview¶
NornicDB implements a hybrid search system that combines multiple search methodologies for maximum accuracy and performance:
┌─────────────────────────────────────────────────────┐
│                   Search Request                    │
└─────────────────────┬───────────────────────────────┘
                      │
          ┌───────────┼────────────┐
          │           │            │
          ▼           ▼            ▼
    ┌──────────┐ ┌──────────┐ ┌───────────┐
    │  Vector  │ │   BM25   │ │ Metadata  │
    │  Search  │ │  Full-   │ │  Filters  │
    │  (HNSW)  │ │   Text   │ │           │
    └─────┬────┘ └─────┬────┘ └──────┬────┘
          │            │             │
          └────────────┼─────────────┘
                       ▼
           ┌───────────────────────┐
           │      RRF Fusion       │
           │  (Rank Aggregation)   │
           └──────────┬────────────┘
                      │
            ┌─────────▼─────────┐
            │   Cross-Encoder   │
            │     Reranking     │
            │ (Optional Stage 2)│
            └─────────┬─────────┘
                      │
            ┌─────────▼───────────┐
            │    Final Results    │
            │   (Top-K, sorted)   │
            └─────────────────────┘
1️⃣ Vector Similarity Search¶
What It Does¶
Finds documents by semantic meaning using vector embeddings. Query and documents are converted to high-dimensional vectors, then compared using cosine similarity.
Architecture¶
Query Text (e.g., "machine learning algorithms")
        │
        ▼
┌─────────────────┐
│    Embedding    │  Generate dense vector
│      Model      │  (384-1536 dimensions)
│  (OpenAI, BGE)  │
└────────┬────────┘
         │
         ▼
┌───────────────────────┐
│      Query Vector     │
│  (0.12, -0.45, 0.23)  │
└────────┬──────────────┘
         │
         ▼
┌───────────────────────────┐
│         HNSW Index        │
│ (Fast nearest neighbors)  │
└────────┬──────────────────┘
         │
         ├─────────────────────────────────┐
         │                                 │
         ▼                                 ▼
Document A (similarity 0.92)    Document B (similarity 0.87)
Implementation: HNSW Index¶
HNSW (Hierarchical Navigable Small-World) provides O(log N) approximate nearest neighbor search.
Layer 3 (top):  [●]────────────────────────[●]
                 │                          │
Layer 2:        [●]──[●]──────────────[●]──[●]
                 │    │                │    │
Layer 1:        [●]──[●]──[●]──[●]───[●]──[●]──[●]──[●]
                 │    │    │    │     │    │    │    │
Layer 0:        [●]-[●]-[●]-[●]-[●]-[●]-[●]-[●]-[●]-[●]  ← All documents

●  = Document node (with embedding vector)
── = Connection (learned during index build)
Search Process:
- Entry Point: Start at random node in top layer
- Layer Traversal: Greedily move closer to query vector in each layer
- Candidate Pool: At bottom layer, find K nearest neighbors
Parameters:
| Parameter | Default | Impact |
|---|---|---|
| M | 16 | Max connections per layer (larger = more thorough, slower build) |
| efConstruction | 200 | Candidate pool size during build (larger = better index, slower build) |
| efSearch | 100 | Candidate pool size during search (larger = more accurate, slower search) |
Performance:
Vector similarity search is accelerated by NornicDB's SIMD layer, which uses AVX2+FMA assembly for float32 operations via the viterin/vek library. Real-world benchmarks on Intel i9-9900KF (AVX2+FMA):
Operation Vector Size Pure Go AVX2 SIMD Speedup
─────────────────────────────────────────────────────────────────────────
Dot Product 384-dim 196.2ns 19.81ns 9.9x
Dot Product 1536-dim 750.2ns 61.41ns 12.2x
Dot Product 3072-dim 1457ns 123.3ns 11.8x
Cosine Similarity 384-dim 103.0ns 32.17ns 3.2x
Cosine Similarity 1536-dim 381.8ns 101.2ns 3.8x
Cosine Similarity 3072-dim 748.0ns 190.3ns 3.9x
Euclidean Distance 384-dim 187.6ns 26.12ns 7.2x
Euclidean Distance 1536-dim 730.6ns 64.92ns 11.3x
Euclidean Distance 3072-dim 1460ns 117.5ns 12.4x
Norm (L2 Magnitude) 384-dim 98.07ns 15.76ns 6.2x
Norm (L2 Magnitude) 1536-dim 391.6ns 48.12ns 8.1x
Norm (L2 Magnitude) 3072-dim 737.7ns 93.47ns 7.9x
Memory Bandwidth: Large vectors ~16 GB/s ~110-130 GB/s 7-8x
Summary: Core operations average 6-12x faster with vek32 SIMD. Cosine similarity (3-4x) is lower due to normalization overhead, but still substantial.
SIMD Acceleration Details:
- Technology: AVX2+FMA assembly via viterin/vek
- CPU Support: Intel Haswell+ (2013+), AMD Zen+ (2017+)
- Fallback: Pure Go on unsupported platforms (automatic)
- Memory Bandwidth: 100-130 GB/s sustained on AVX2 (vs. 16 GB/s pure Go)
- Compilation: vek32 functions compiled with -ffast-math for maximum speed
Check your SIMD capabilities:
info := simd.Info()
if info.Accelerated {
fmt.Printf("Using %s SIMD (%v)\n", info.Implementation, info.Features)
}
Advantages & Limitations¶
| Aspect | ✅ Strength | ❌ Limitation |
|---|---|---|
| Semantic Match | Captures meaning, not just keywords | May miss exact keyword matches |
| Speed | O(log N) with HNSW + SIMD acceleration | Requires pre-computed embeddings |
| Scalability | Handles millions of vectors with SIMD boost | Memory footprint grows with dimensions |
| Accuracy | Great for similarity ranking | Can return "close but wrong" results |
2️⃣ Full-Text Search (BM25)¶
What It Does¶
Finds documents by exact keyword matching. Uses BM25 scoring algorithm (same as Elasticsearch, Solr).
How BM25 Works¶
BM25 scores documents based on term importance and frequency:
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
Where:
IDF(q) = log((N - df(q) + 0.5) / (df(q) + 0.5))
f(q, D) = term frequency in document D
|D| = length of document D
avgdl = average document length
k1 = 1.2 (saturation factor)
b = 0.75 (length normalization)
Example Calculation¶
Query: "machine learning"
Document A: "We teach machine learning and AI"
- Term freq: machine=1, learning=1
- Length: 8 words
- Normalized score: 2.34
Document B: "Machine learning algorithms, machine learning frameworks, machine learning tools"
- Term freq: machine=3, learning=3
- Length: 11 words
- Normalized score: 8.92 ← Higher score (more relevant)
Document C: "What is the meaning of 'machine'?"
- Term freq: machine=1, learning=0
- Length: 6 words
- Normalized score: 0.89 ← Lower (missing 'learning')
Inverted Index Structure¶
Term          Documents with frequency
─────────────────────────────────────────
"machine"   → {docA: 1, docB: 3, docC: 1}
"learning"  → {docA: 1, docB: 3}
"algorithm" → {docB: 2, docD: 1}
"framework" → {docB: 1, docE: 4}
Search Process:
- Tokenize query: "machine learning" → ["machine", "learning"]
- Look up the inverted index for each term
- Score documents containing the terms using BM25 formula
- Return top-K sorted by score
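The lookup steps above can be sketched with a plain Go map. The index layout mirrors the table in this section; the structure and function names are illustrative, not NornicDB's internals.

```go
package main

import "fmt"

// index maps term -> document -> term frequency, mirroring the
// inverted index table above.
var index = map[string]map[string]int{
	"machine":  {"docA": 1, "docB": 3, "docC": 1},
	"learning": {"docA": 1, "docB": 3},
}

// lookup gathers candidate documents and their accumulated term
// frequencies for a tokenized query (steps 2-3 of the search process).
// A real engine would feed these frequencies into the BM25 formula.
func lookup(terms []string) map[string]int {
	totals := map[string]int{}
	for _, t := range terms {
		for doc, tf := range index[t] {
			totals[doc] += tf
		}
	}
	return totals
}

func main() {
	fmt.Println(lookup([]string{"machine", "learning"}))
	// docB accumulates the highest raw frequency (3+3); BM25 then
	// normalizes for document length before the final ranking.
}
```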
Advantages & Limitations¶
| Aspect | ✅ Strength | ❌ Limitation |
|---|---|---|
| Keyword Matching | Exact term matching guaranteed | Misses semantic variations |
| Speed | Very fast (inverted index O(log N)) | Must build index upfront |
| Predictability | Reproducible, interpretable | Can't understand context |
| Maturity | Proven, used in Lucene/Elasticsearch | Over-matches on stop words |
Search index persistence¶
By default, both the vector (HNSW) and BM25 indexes are built in memory on startup. For large graphs, you can persist search indexes to disk so they are saved after updates and loaded on the next startup, avoiding a full storage scan. Enable with NORNICDB_PERSIST_SEARCH_INDEXES=true or database.persist_search_indexes: true in YAML (requires data_dir to be set). See Search index persistence in the Configuration Guide.
Experimental: Search index persistence is currently experimental. If startup must rebuild IVF-HNSW (for example due to missing/incompatible persisted artifacts), large datasets can still take a long time to become ready. Observed reference point: ~30 minutes to rebuild IVF-HNSW for ~1M embeddings on startup (hardware dependent).
3️⃣ Reciprocal Rank Fusion (RRF)¶
What It Does¶
Combines vector search + BM25 results into a single ranking using rank positions (not scores).
Why RRF Instead of Score Blending?¶
Scores from different algorithms are incomparable:
Vector Search Score          BM25 Score
───────────────────          ──────────
Range: 0-1                   Range: 0-∞
Higher = better              Higher = better
Based on direction           Based on term frequency

⚠️ Can't simply add them: (0.92) + (15.3) = meaningless!
RRF Formula¶
RRF_score = Σ (weight / (k + rank))
Where:
weight = importance of this ranking (default 1.0)
k = constant (default 60) to reduce high-rank dominance
rank = position in result list (1-indexed)
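A minimal fusion routine in Go, implementing the formula above with k = 60 and weight = 1.0. The function name is illustrative; NornicDB's actual fusion lives in pkg/search.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfFuse combines ranked lists of document IDs into one ranking.
// Each appearance contributes 1/(k + rank); documents ranked well in
// several lists accumulate the highest fused score.
func rrfFuse(rankings ...[]string) []string {
	const k = 60.0
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, doc := range ranking {
			scores[doc] += 1.0 / (k + float64(i+1)) // rank is 1-indexed
		}
	}
	docs := make([]string, 0, len(scores))
	for d := range scores {
		docs = append(docs, d)
	}
	sort.Slice(docs, func(i, j int) bool { return scores[docs[i]] > scores[docs[j]] })
	return docs
}

func main() {
	vector := []string{"A", "B", "D", "C"} // vector-search ranking
	bm25 := []string{"C", "A", "E", "B"}   // BM25 ranking
	fmt.Println(rrfFuse(vector, bm25))
	// A ranks first: it appears high in both lists, matching the
	// worked example in the next section.
}
```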
Example: Fusing Two Rankings¶
Query: "python data science"
Vector Search Results: BM25 Full-Text Results:
─────────────────────          ─────────────────────
Rank 1: Doc A (sim 0.95) Rank 1: Doc C (score 18.5)
Rank 2: Doc B (sim 0.88) Rank 2: Doc A (score 16.2)
Rank 3: Doc D (sim 0.82) Rank 3: Doc E (score 14.1)
Rank 4: Doc C (sim 0.79) Rank 4: Doc B (score 12.3)
↓ Apply RRF ↓
Document A:
Vector rank=1: 1.0/(60+1) = 0.0164
BM25 rank=2: 1.0/(60+2) = 0.0159
RRF = 0.0164 + 0.0159 = 0.0323 ← High (appears in both, ranked well)
Document B:
Vector rank=2: 1.0/(60+2) = 0.0159
BM25 rank=4: 1.0/(60+4) = 0.0152
RRF = 0.0159 + 0.0152 = 0.0311
Document C:
Vector rank=4: 1.0/(60+4) = 0.0152
BM25 rank=1: 1.0/(60+1) = 0.0164
RRF = 0.0152 + 0.0164 = 0.0316
Document D:
Vector rank=3: 1.0/(60+3) = 0.0157
BM25 rank=none: (missing)
RRF = 0.0157 ← Lower (only in vector results)
↓ Final Ranking ↓
Rank 1: Document A (RRF 0.0323) ← Best of both worlds
Rank 2: Document C (RRF 0.0316)
Rank 3: Document B (RRF 0.0311)
Rank 4: Document D (RRF 0.0157)
Why This Works¶
Agreement Score
Document Vector BM25 Combined
─────────────────────────────────────────
A (appears in both) 1 1 HIGH
B (appears in both) 1 1 HIGH
C (disagreement) -1 1 MEDIUM
D (vector only) 1 - LOW
Documents that both algorithms agree on get boosted. Documents found by only one method still receive a score, just a lower one, so disagreements degrade gracefully instead of being discarded.
When RRF Helps Most¶
Query: "enterprise database"
Scenario 1: Vector & BM25 Agree
Vector: [Product A, Product B, Product C]
BM25: [Product A, Product B, Product C]
RRF: No change (both agree)
Scenario 2: Vector & BM25 Disagree
Vector: [NewsArticle (semantic fit), ProductX, ProductY]
BM25: [ProductA (keyword match), ProductB, ProductC]
RRF: Smart blend - products that match both move up
NewsArticle still ranked (one vote) but lower
4️⃣ Cross-Encoder Reranking (Stage 2)¶
What It Does¶
Re-scores and re-ranks top-K candidates from RRF using a more expensive but accurate cross-encoder model. Reranking can use a local GGUF (e.g. BGE-Reranker-v2-m3, loaded like the embedding model) or an external API (Cohere, HuggingFace TEI, Ollama adapter). Configure via NORNICDB_SEARCH_RERANK_ENABLED, NORNICDB_SEARCH_RERANK_PROVIDER, and related env vars or YAML; see Cross-Encoder Reranking and Configuration.
Bi-Encoder vs Cross-Encoder¶
Bi-Encoder (Stage 1: Fast)

Query: "What is machine learning?"
        │
┌───────────────┐
│ Encode Query  │ → [0.1, -0.5, 0.3, ...] (384-dim)
└───────────────┘

Document: "Machine learning is..."
        │
┌───────────────┐
│  Encode Doc   │ → [0.2, -0.4, 0.35, ...] (384-dim)
└───────────────┘

Compare: cosine_similarity(query_vec, doc_vec) = 0.94

Cross-Encoder (Stage 2: Accurate)

[Query]    "What is machine learning?"
[Document] "Machine learning is..."
        │
┌───────────────────┐
│   Cross-Encoder   │  Sees both together!
│       Model       │  Captures query-doc interaction
└─────────┬─────────┘
          ▼
Relevance score: 0.97 (more accurate)
Processing Pipeline¶
┌──────────────────────────────┐
│ RRF Results (100 candidates) │
└────────────┬─────────────────┘
             │
             ▼
┌─────────────────────────────┐
│     Cross-Encoder Batch     │
│     (Re-score all 100)      │
│     Time: ~1-2 seconds      │
└────────────┬────────────────┘
             │
             ▼
┌───────────────────────────┐
│     Re-ranked Results     │
│     (Top-10, resorted)    │
└───────────────────────────┘
Supported Models¶
| Model | Provider | Speed | Accuracy | Cost |
|---|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | HuggingFace | Fast | Good | Free |
| e5-large-v2-reranker | EmbedRank | Slow | Excellent | Free |
| rerank-3.5 | Cohere | Moderate | Excellent | $$ |
| rankgpt-3.5-turbo | OpenAI | Slow | Perfect | $$$$ |
Decision Logic¶
IF reranking enabled:
IF num_candidates > 50:
Run cross-encoder on top-100
ELSE IF num_candidates > 10:
Run cross-encoder on all
ELSE:
Skip (too few to matter)
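The policy above translates directly into a small Go helper. This is a sketch of the decision logic as documented; the function name and return convention (number of candidates to rerank, 0 = skip) are illustrative, not NornicDB's actual API.

```go
package main

import "fmt"

// rerankPlan decides how many candidates to send to the cross-encoder,
// mirroring the decision logic above.
func rerankPlan(enabled bool, numCandidates int) int {
	switch {
	case !enabled:
		return 0 // reranking disabled
	case numCandidates > 50:
		return 100 // re-score the top-100 only
	case numCandidates > 10:
		return numCandidates // re-score all candidates
	default:
		return 0 // too few to matter: skip
	}
}

func main() {
	fmt.Println(rerankPlan(true, 200)) // 100: cap at top-100
	fmt.Println(rerankPlan(true, 30))  // 30: re-score all
	fmt.Println(rerankPlan(true, 5))   // 0: skipped
}
```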
Advantages¶
✅ Significantly improves relevance
✅ Context-aware (sees query + document together)
✅ Can correct RRF mistakes
✅ Works with any model via API
Trade-offs¶
⏱️ Slower: ~1-2s for 100 documents
💰 Cost: External API calls
🔗 Dependency: Requires external service
📊 Marginal gain: ~5-15% improvement for most cases
5️⃣ SIMD Vector Acceleration¶
What It Does¶
Accelerates vector math using CPU SIMD instructions (AVX2+FMA on x86, NEON on ARM) via the viterin/vek library, which provides true assembly implementations.
Architecture¶
NornicDB uses viterin/vek32 for high-performance vector operations:
┌───────────────────────────────────────┐
│      NornicDB Vector Operations       │
│ (DotProduct, CosineSimilarity, etc.)  │
└─────────────┬─────────────────────────┘
              │
       ┌──────▼─────────┐
       │ vek32 Library  │
       │ SIMD Assembly  │
       └──────┬─────────┘
              │
      ┌───────┼───────┐
      │       │       │
      ▼       ▼       ▼
    AVX2    NEON   Generic
   x86-64   ARM64  Pure Go
    (8x)    (4x)    (1x)
Why vek32?
- True AVX2+FMA assembly (not just compiler auto-vectorization)
- Optimized for float32 (standard for embeddings)
- Implementations compiled with -ffast-math for maximum speed
- Fallback to pure Go on unsupported CPUs (automatic)
- Proven 6-20x speedups on real workloads
SIMD Fundamentals¶
Scalar (Normal) Processing:
a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
Dot Product (scalar approach):
result = 0
for i in 0..3:
result += a[i] * b[i]
Operations:
  Multiply: 4 × (a[i] * b[i])
  Add:      4 × (result += ...)
  ───────────────────────────────
  Total:    8 operations
CPU Cycles: ~16 cycles (with dependencies)
SIMD (Vectorized) Processing with AVX2:
a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
Dot Product (SIMD approach):
Registers hold 8 × float32 each (256-bit AVX2):
  a_vec = [1.0, 2.0, 3.0, 4.0, ...] (8 values)
  b_vec = [5.0, 6.0, 7.0, 8.0, ...] (8 values)
VMULPS: a_vec × b_vec in ONE instruction
  → [5.0, 12.0, 21.0, 32.0, ...]
VHADDPS: Horizontal sum → single result in ONE instruction
CPU Cycles: ~3 cycles (massive parallelism!)
Speedup: 16 cycles / 3 cycles ≈ 5-10x faster
Measured Performance (Intel i9-9900KF, AVX2+FMA)¶
Comprehensive benchmarks across multiple vector sizes:
DOT PRODUCT (AVX2 is 8-way vectorized):
──────────────────────────────────────────────────────────────
Size Pure Go SIMD Speedup MB/s (SIMD)
128 66.15ns 10.70ns 6.2x 95,742
256 129.3ns 15.27ns 8.5x 134,160
384 196.2ns 19.81ns 9.9x 155,100
512 252.8ns 23.50ns 10.8x 174,295
768 375.1ns 33.34ns 11.2x 184,277
1024 494.3ns 48.69ns 10.2x 168,242
1536 750.2ns 61.41ns 12.2x 200,091
3072 1457ns 123.3ns 11.8x 199,313
COSINE SIMILARITY (normalized dot product + magnitude division):
──────────────────────────────────────────────────────────────
Size Pure Go SIMD Speedup MB/s (SIMD)
128 41.97ns 16.25ns 2.6x 63,003
256 71.94ns 24.69ns 2.9x 82,944
384 103.0ns 32.17ns 3.2x 95,494
512 136.6ns 39.66ns 3.4x 103,271
768 195.1ns 55.76ns 3.5x 110,193
1024 253.7ns 69.37ns 3.7x 118,087
1536 381.8ns 101.2ns 3.8x 121,458
3072 748.0ns 190.3ns 3.9x 129,170
EUCLIDEAN DISTANCE (sqrt of sum of squares):
──────────────────────────────────────────────────────────────
Size Pure Go SIMD Speedup MB/s (SIMD)
128 66.98ns 13.54ns 4.9x 75,642
256 126.8ns 19.72ns 6.4x 103,866
384 187.6ns 26.12ns 7.2x 117,611
512 247.4ns 30.56ns 8.1x 134,015
768 367.0ns 39.58ns 9.3x 155,246
1024 489.6ns 48.06ns 10.2x 170,460
1536 730.6ns 64.92ns 11.3x 189,290
3072 1460ns 117.5ns 12.4x 209,111
NORM / L2 MAGNITUDE:
──────────────────────────────────────────────────────────────
Size Pure Go SIMD Speedup MB/s (SIMD)
128 37.62ns 8.871ns 4.2x 57,718
256 68.75ns 12.14ns 5.7x 84,334
384 98.07ns 15.76ns 6.2x 97,461
512 128.5ns 20.71ns 6.2x 98,872
768 187.9ns 26.80ns 7.0x 114,627
1024 247.4ns 32.72ns 7.6x 125,172
1536 391.6ns 48.12ns 8.1x 127,688
3072 737.7ns 93.47ns 7.9x 131,458
MEMORY BANDWIDTH (sustained throughput):
──────────────────────────────────────────────────────────────
Vector Size Bandwidth
8K elements 108 GB/s
16K elements 106 GB/s
32K elements 107 GB/s
64K elements 61 GB/s (L3 cache limited)
Summary:
| Operation | Speedup Range | Best Use Case |
|---|---|---|
| Dot Product | 6.2x - 12.2x | Similarity search, HNSW indexing |
| Euclidean Distance | 4.9x - 12.4x | K-means clustering, spatial queries |
| Norm (L2) | 4.2x - 8.1x | Vector normalization |
| Cosine Similarity | 2.6x - 3.9x | Text/semantic similarity |
Platform-Specific Implementations¶
x86/amd64 with AVX2 + FMA:
CPU Requirements:
- Intel: Haswell (2013+) or newer
- AMD: Zen+ (2017+) or newer
vek32 Instructions:
VMULPS  - Multiply 8 × float32 in parallel
VFMADD  - Fused multiply-add (a*b + c)
VHADDPS - Horizontal sum
Performance:
- 8-way parallelism (256-bit registers)
- 6-12x speedup on vector operations (measured)
- 100-200 GB/s memory bandwidth
ARM64 with NEON:
CPU Requirements:
- Apple Silicon (M1/M2/M3+)
- ARM servers with NEON support
vek32 Instructions:
FMUL  - Multiply 4 × float32 in parallel
FMLA  - Fused multiply-add
FADDP - Parallel add
Performance:
- 4-way parallelism (128-bit registers)
- 4-8x speedup expected on ARM64
- Excellent on Apple M-series unified memory
Generic Fallback:
Fallback Plan:
- Automatic if AVX2/NEON not detected
- Pure Go scalar loop
- Works everywhere
- Performance: Baseline (1x)
Cosine Similarity Example¶
Computing similarity of two 1536-dim embeddings (OpenAI ada-002):
Without SIMD (scalar Go):
- Dot product: 380ns
- Norm A: 369ns
- Norm B: 369ns
- Division + sqrt: 200ns
- Total: ~1.3µs per pair
For 1000 similarity searches: 1.3ms
With SIMD (AVX2 vek32):
- All operations accelerated via vek32
- Effective speedup: ~1.6-2.0x
- Total: ~0.7µs per pair
For 1000 similarity searches: 0.7ms
Speedup: 1.3ms / 0.7ms ≈ 1.9x faster
Integration with Search¶
HNSW Search (1M vectors, 1000 queries):
1. Load query embedding (1536 floats)
2. Compare against candidate vectors
→ CosineSimilarity() [SIMD-accelerated via vek32]
3. Keep top-K
RRF Fusion:
1. Score each candidate
2. Rank candidates
3. Select top-K for reranking
Impact:
Standard (pure Go): ~500ms to search 1M vectors
SIMD (vek32): ~50ms to search 1M vectors
~10x speedup!
Automatic Detection¶
NornicDB automatically detects and uses the best available SIMD backend:
// Runtime detection at startup
info := simd.Info()
On Intel i9-9900K (AVX2+FMA):
Implementation: avx2
Features: [AVX2, FMA, BMI2, ...]
Accelerated: true
On Apple Silicon M3:
Implementation: neon
Features: [NEON, FMA, ...]
Accelerated: true
On older CPU without AVX2/NEON:
Implementation: generic
Features: [SSE2]
Accelerated: false
6️⃣ K-Means Clustering (Semantic Partitioning)¶
What It Does¶
Partitions embeddings into semantic clusters to enable 10-100x faster search by searching only relevant clusters instead of all vectors.
Instead of comparing query against 1M vectors, compare against ~1000 (only closest clusters), then search within those clusters.
How K-Means Works¶
K-means partitions data into K clusters by iteratively:
- Initialize: Pick K random centroid points
- Assign: Assign each point to nearest centroid
- Update: Move centroid to mean of assigned points
- Repeat: Until centroids stop moving (convergence)
Iteration 0 (random starts):  3 centroids scattered randomly among the points
Iteration 5 (converging):     points move towards their nearest centroid
Iteration 50 (converged):     tight clusters, centroids stable
Performance: K-Means Search vs Brute-Force¶
Brute-force vector search (no clustering):
Query: [0.12, -0.45, 0.23, ..., 0.08] (1536-dim)
        ↓
Compare against ALL 1M vectors:
- 1M × CosineSimilarity operations
- 1M × 1536 float32 reads = 6GB memory traffic
- Latency: ~500ms
Clustered search (K-means + search):
Query: [0.12, -0.45, 0.23, ..., 0.08]
        ↓
1. Find the nearest CENTROIDS:
   - 100 centroids × CosineSimilarity = 0.2ms
   - Return top 3 centroids
2. Search within those 3 clusters:
   - Average cluster size (K=100): 1M / 100 = 10K vectors
   - Search top 3 clusters: 3 × 10K = 30K vectors
   - 30K << 1M, so a 30-50x speedup
Latency: ~5-10ms
Real performance numbers (1M embeddings, 1536-dim):
Cluster Count    Build Time   Search (top-10)   Speedup   Memory
─────────────────────────────────────────────────────────────────
No clustering    n/a          500ms             1x        6GB
100 clusters     2s           10ms              50x       6.5GB
500 clusters     10s          3ms               166x      7GB
1000 clusters    25s          2ms               250x      8GB

⚠️ Diminishing returns: beyond ~500 clusters, overhead outweighs the gain
Initialization Methods¶
K-Means++ (Default - Better Quality)
Algorithm:
1. Pick first centroid randomly
2. For each next centroid:
- Calculate distance to nearest existing centroid for each point
- Pick a point with probability ∝ distance²
- Add as new centroid
Benefit: Avoids poor local minima
Quality: ~2x better final clustering
Speed: Slower initialization (~100ms for 10K points)
Random Initialization (Faster)
Algorithm:
1. Pick K random points as centroids
Benefit: Instant
Speed: Much faster initialization
Quality: May need more iterations and can converge to poor local minima
Auto-K Selection¶
When dataset size is unknown, NornicDB automatically selects K using Elbow Method:
Algorithm:
1. Run k-means for K = 10, 20, 30, ..., sqrt(N)
2. Track within-cluster sum-of-squares (WCSS)
3. Find "elbow" (diminishing improvement)
4. Use that K
Example for 100K embeddings: plotting WCSS against K shows steep improvement
up to the elbow at K≈100, after which the curve flattens.

Result: K=100 chosen automatically
Drift-Based Re-Clustering¶
NornicDB can automatically re-cluster when new embeddings cause significant centroid drift:
Configuration:
drift_threshold: 0.1 # Re-cluster if drift > 10%
cluster_interval: 15m # Check every 15 minutes
Monitoring:
if (centroid_drift > threshold) {
trigger_clustering() // Adapt to new data
}
Example:
Initial clusters: 100 (aligned with data)
Add 50K new embeddings (10% of dataset)
Centroid drift: 12% (> 10% threshold)
→ Automatically re-cluster
→ New centroids optimized for all 150K embeddings
Integration with Vector Search¶
K-means is a pre-filter before HNSW:
Query: [0.12, -0.45, 0.23, ...]
    ↓
Phase 1: CLUSTER FILTERING (fast, approximate)
    ├── Find K nearest centroids
    ├── Get top-3 clusters: [C1, C2, C3]
    ├── Return 30K candidate vectors
    └── Time: ~1ms
    ↓
Phase 2: HNSW REFINEMENT (accurate, slower)
    ├── Run HNSW search on 30K candidates
    ├── Return top-10 vectors
    └── Time: ~5ms
    ↓
Final Results: Top-10 (from 1M vectors, in ~6ms!)

Without clustering:
    └── HNSW on 1M vectors: ~20-50ms
Search Strategy: Cluster-Based¶
// Cluster-aware search parameters
numClustersToSearch := 3 // Search how many clusters
candidateLimit := limit * 2 // Ensure enough candidates
// Step 1: Find nearest centroids
nearestClusters := findNearestCentroids(query, numClustersToSearch)
// Step 2: Get vectors from those clusters
candidates := getCandidatesFromClusters(nearestClusters)
// Step 3: Rank candidates with HNSW
results := hnswSearch(candidates, query, limit)
When to search multiple clusters:
Query Type Clusters Reason
──────────────────────────────────────────────────────
Specific term 1-2 Query clearly in one cluster
Broad topic 3-5 Query spans multiple topics
Exploratory/fuzzy 5+ Need diverse results
Configuration & Tuning¶
k_means_clustering:
enabled: true
# Cluster count
num_clusters: 0 # 0 = auto-detect with elbow method
# Or fixed:
# num_clusters: 100
# Convergence
max_iterations: 100 # Stop after 100 iterations
tolerance: 0.0001 # Stop when drift < 0.0001
# Initialization
init_method: "kmeans++" # Use k-means++ (better) or "random" (faster)
# Auto-clustering triggers
cluster_interval: "15m" # Re-cluster every 15 minutes
drift_threshold: 0.1 # Re-cluster if centroids drift >10%
# Constraints
min_cluster_size: 10 # Merge tiny clusters
# Search behavior
vector_search:
use_clustered_search: true
clusters_to_search: 3 # Search top 3 clusters
Performance Trade-Offs¶
Parameter           Impact on Build     Impact on Search   Quality
───────────────────────────────────────────────────────────────────
num_clusters ↑      Slower              Faster             Worse
max_iterations ↑    Much slower         n/a                Better
tolerance ↓         Much slower         n/a                Better
init_method         k-means++ slower than random
Example tuning:
Scenario: 10M vectors, need fast response
num_clusters: 1000 (large K)
max_iterations: 50 (converge quick)
tolerance: 0.01 (loose)
→ Build: 10s, Search: 1ms, Quality: Good
Automatic vs Manual Clustering¶
Automatic (Recommended):
service.EnableClustering(gpuManager, 0) // 0 = auto-detect K
// NornicDB automatically:
// 1. Detects optimal K using elbow method
// 2. Builds clusters on first ~1M embeddings
// 3. Re-clusters every 15min if drift detected
// 4. Integrates seamlessly with search
Manual (For fine control):
config := &gpu.KMeansConfig{
NumClusters: 100,
MaxIterations: 50,
Tolerance: 0.001,
InitMethod: "kmeans++",
AutoK: false,
}
service.EnableClustering(gpuManager, 100)
Monitoring & Diagnostics¶
Metrics available in search stats:
num_clusters: 100 # K value
embedding_count: 1000000 # Total vectors
avg_cluster_size: 10000 # Mean vectors per cluster
is_clustered: true # Clustering active
cluster_iterations: 47 # Iterations to converge
centroid_drift: 0.002 # How much centroids moved (0-1)
Check health:
✅ is_clustered = true       Clustering working
✅ avg_cluster_size > 100    Balanced clusters
✅ centroid_drift < 0.1      Centroids stable
⚠️ clustering disabled       Enable with feature flag
Limitations & When NOT to Use¶
❌ Don't use k-means clustering if:
- Dataset < 10K embeddings (overhead not worth it)
- Embeddings change constantly (re-clustering overhead)
- Need exact nearest neighbors (approximate only)
- Clusters need semantic meaning (pure distance-based)
✅ Use k-means clustering if:
- Dataset > 100K embeddings
- Search latency critical (need <10ms)
- Embeddings relatively stable (add new batch weekly)
- Good approximation acceptable (95%+ recall)
🔄 End-to-End Search Flow¶
Complete Request Lifecycle¶
User Query: "best machine learning frameworks for production"
    ↓
1. EMBEDDING (pkg/embed)
   ├── OpenAI / BGE / Local LLM
   └── Produces: [0.12, -0.45, 0.23, ..., 0.08] (1536 dim)
   Latency: 200-800ms (depends on provider)
    ↓
2. HNSW VECTOR SEARCH (pkg/search - HNSWIndex)
   ├── Index lookup with query embedding
   ├── SIMD acceleration (pkg/simd.CosineSimilaritySIMD)
   └── Produces: [Doc1, Doc2, Doc3, ...] with similarity scores
   Latency: 1-5ms (1M vectors)
   Returns: Top-100 candidates
    ↓
3. BM25 FULL-TEXT (pkg/search - FulltextIndex)
   ├── Tokenize query: ["machine", "learning", "framework", ...]
   ├── Look up inverted index
   ├── Score with BM25 formula
   └── Produces: [Doc3, Doc1, Doc5, ...] with BM25 scores
   Latency: 5-20ms (1M documents)
   Returns: Top-100 candidates
    ↓
4. RRF FUSION (pkg/search - Search.Fuse)
   ├── Merge rankings from Vector + BM25
   ├── Apply RRF formula to positions
   └── Produces: [Doc1, Doc3, Doc2, ...] with RRF scores
   Latency: <1ms
   Returns: Top-100 fused results
    ↓
5. CROSS-ENCODER RERANKING (Optional, pkg/search)
   ├── IF enabled && num_results > 10:
   ├── Batch top-100 through cross-encoder API
   ├── Re-score and re-rank
   └── Produces: [Doc1, Doc2, Doc5, ...] reranked
   Latency: 500-1500ms
   Returns: Top-10 final results
    ↓
6. RESPONSE (pkg/search - SearchResponse)
   ├── Format results with all metadata
   ├── Include metrics (timings, scores, ranks)
   └── Return to client

TOTAL LATENCY (without reranking): ~200-1000ms
TOTAL LATENCY (with reranking):    ~700-2500ms
Optimization Techniques¶
Technique Impact Implementation
─────────────────────────────────────────────────────
SIMD Acceleration 5-10x faster Automatic in pkg/simd
Batch Processing 2-3x faster Process 100 at once
GPU (optional) 10-100x Available in pkg/gpu
Query Caching 2-5x faster Cypher: pkg/cypher; Search: pkg/search
Index Warming 2x faster Pre-load hot shards
Adaptive EF 1.5-2x Tune efSearch per QPS
Search result caching¶
Unified search (pkg/search) caches search results so that repeated identical requests are served from memory regardless of entry point (HTTP search API, Cypher, MCP, etc.).
Behavior:
- Key: query string + options that affect the result: Limit, Types (labels), RerankEnabled, RerankTopK, RerankMinScore, MMREnabled, MMRLambda. Same (query, options) → same cache key.
- Capacity: 1000 entries (LRU eviction).
- TTL: 5 minutes (aligned with the Cypher query cache).
- Invalidation: full cache clear on IndexNode and RemoveNode, so results stay correct after index changes.
Effect: The first request for a given query+options runs the full pipeline (embedding, vector search, BM25, RRF, optional rerank). The second identical request returns the cached SearchResponse immediately (no embedding or index lookup). This is separate from the Cypher query cache, which caches Cypher execution results; search result caching is inside the search service and benefits every caller.
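One way to fold a query plus its result-affecting options into a single lookup key is to hash their canonical serialization. This is a sketch only: the field names follow the list above, but the hashing scheme and function names are assumptions, not NornicDB's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// SearchOptions holds the options that affect cached results,
// per the key description above.
type SearchOptions struct {
	Limit          int
	Types          []string
	RerankEnabled  bool
	RerankTopK     int
	RerankMinScore float64
	MMREnabled     bool
	MMRLambda      float64
}

// cacheKey serializes the query and every result-affecting option into
// a stable string, then hashes it so the key has a fixed size.
func cacheKey(query string, o SearchOptions) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s|%d|%v|%t|%d|%g|%t|%g",
		query, o.Limit, o.Types, o.RerankEnabled, o.RerankTopK,
		o.RerankMinScore, o.MMREnabled, o.MMRLambda)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := cacheKey("machine learning", SearchOptions{Limit: 10})
	b := cacheKey("machine learning", SearchOptions{Limit: 20})
	fmt.Println(a == b) // false: changing Limit changes the key
}
```

Hashing after serialization keeps keys uniform in size for the LRU map while preserving the property that any option change produces a cache miss.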
🎯 Choosing the Right Method¶
Decision Tree¶
Q: Need EXACT keyword match?
├─ YES → Use BM25 Full-Text
└─ NO  → Continue

Q: Need SEMANTIC understanding?
├─ YES → Use Vector Search (HNSW)
└─ NO  → Consider metadata filters

Q: Want BEST of both worlds?
├─ YES → Use RRF (Vector + BM25)
└─ NO  → Pick one method

Q: Critical relevance required?
├─ YES → Add Cross-Encoder reranking
└─ NO  → RRF alone is sufficient
Use Case Guide¶
| Use Case | Method | Reason |
|---|---|---|
| Full-text search (e.g., logging) | BM25 | Exact term matching |
| Semantic search (e.g., "find papers on topic X") | Vector HNSW | Meaning matters |
| General-purpose search (web, docs) | RRF | Best of both |
| High-stakes ranking (search ads, hiring) | RRF + Cross-encoder | Maximum accuracy |
| Real-time autocomplete | BM25 prefix | Speed critical |
| Fuzzy/typo-tolerant | Vector + Phonetic | Error tolerance |
📊 Performance Benchmarks¶
Field Measurements (Apple M3 Max, 64GB RAM)¶
These are measured on a live translations dataset (February 2026), using POST /nornicdb/search with varied cache-busting queries.
Embedding-query path (best collected)¶
| Scenario | Samples | p50 | p95 | p99 | Mean |
|---|---|---|---|---|---|
| Sequential varied queries | 120 | 11.28ms | 25.84ms | 34.05ms | 14.49ms |
| Concurrent varied queries (8 workers) | 120 | 76.36ms | 87.41ms | 100.07ms | 76.34ms |
Server diagnostics for this path showed embed_total as the dominant component while search_total stayed in microseconds.
Fulltext-only path (best collected)¶
| Scenario | Samples | p50 | p95 | p99 | Mean |
|---|---|---|---|---|---|
| Sequential varied queries | 40 | 0.57ms | 2.77ms | 94.90ms* | 4.44ms |
Observed handler diagnostics: embed_calls=0, embed_total=0s, search_method="fulltext" with fallback=true.
* p99 includes occasional transport/runtime outliers; server-side handler timing for these requests is typically in tens of microseconds.
Single Document Scoring¶
Method            Time     Throughput      Accuracy
────────────────────────────────────────────────────
Vector (SIMD)     2µs      500k ops/sec    0.92
BM25              5µs      200k ops/sec    0.85
RRF fusion        0.1µs    10M ops/sec     0.94
Cross-encoder     50ms     1 doc/sec       0.98
Full Search (1M documents, top-10)¶
Method Latency Cost Accuracy
────────────────────────────────────────────────────
Vector only 5ms CPU 0.88
BM25 only 20ms CPU 0.80
Vector+BM25 (RRF) 25ms CPU 0.92
+ Reranking 1500ms API 0.96
Scaling Characteristics¶
Vector Search (HNSW) vs Linear BM25
Documents HNSW Linear BM25 Ratio
──────────────────────────────────────────
1K 0.2ms 0.5ms 2.5x
10K 0.3ms 5ms 16x
100K 0.5ms 50ms 100x
1M 1.2ms 500ms 416x
10M 2.1ms 5000ms 2380x
HNSW scales logarithmically; BM25 scales linearly. RRF helps by combining both strengths.
🔧 Configuration & Tuning¶
Vector Search Parameters¶
vector_search:
enabled: true
hnsw:
m: 16 # Max connections (8-32)
ef_construction: 200 # Build thoroughness (100-500)
ef_search: 100 # Search depth (50-200)
# Trade-off:
# - Larger M/efConstruction = better quality, slower build
# - Larger efSearch = more accurate, slower search
Full-Text Search Parameters¶
fulltext_search:
enabled: true
# Priority properties get extra weight
priority_properties:
- content
- title
- description
# BM25 parameters (rarely need adjustment)
bm25:
k1: 1.2 # Term frequency saturation
b: 0.75 # Length normalization
RRF Parameters¶
rrf_fusion:
enabled: true
k: 60 # Constant (higher = more balanced)
vector_weight: 1.0
bm25_weight: 1.0
# Fallback if one method fails
fallback_enabled: true
Cross-Encoder Parameters¶
cross_encoder:
enabled: true
api_url: "https://api.cohere.com/rerank"
model: "rerank-3.5"
top_k: 100 # Re-score top-100 from RRF
threshold: 0.5 # Minimum score to include
📚 API Reference¶
Basic Search¶
// Vector search
results, err := service.SearchVector(ctx, query, embedding, opts)
// Full-text search
results, err := service.SearchFulltext(ctx, query, opts)
// Hybrid RRF
results, err := service.Search(ctx, query, embedding, opts)
Advanced Options¶
opts := search.SearchOptions{
Limit: 10,
Threshold: 0.7,
VectorSearch: true,
FulltextSearch: true,
UseReranking: true,
IncludeMetrics: true, // Include timing info
}
Response Structure¶
type SearchResponse struct {
    Status            string         // "success" or "error"
    Query             string         // Original query
    Results           []SearchResult // Top-K matches
    TotalCandidates   int            // Before filtering
    SearchMethod      string         // "vector", "fulltext", "hybrid"
    FallbackTriggered bool           // True if primary failed
    Metrics           struct {       // Per-stage timings, e.g.:
        VectorSearchTimeMs int       //   5
        BM25SearchTimeMs   int       //   20
        FusionTimeMs       int       //   1
        TotalTimeMs        int       //   26
    }
}
🐛 Debugging Search Issues¶
Common Problems¶
Q: Vector search returns irrelevant results
A: The embedding model may not fit your domain. Consider fine-tuned embeddings (e.g., BGE for Chinese text), and verify SIMD acceleration with simd.Info().

Q: BM25 is too strict (no results)
A: The query contains rare terms. Enable fuzzy matching or expand the query with synonyms.

Q: RRF still misses relevant documents
A: The documents may be absent from both indices. Check that the vector index is built (svc.BuildIndexes(ctx)) and the documents are full-text indexed (svc.Index(docID, text)).

Q: Search is slow (>1000ms)
A: Profile with metrics enabled:
resp.Metrics.VectorSearchTimeMs // Fast? Check SIMD
resp.Metrics.BM25SearchTimeMs // Slow? Check inverted index
resp.Metrics.FusionTimeMs // Check RRF config
Enabling Debug Logging¶
// Enable detailed timing
svc := search.NewService(storage)
svc.DebugLogging = true
results, _ := svc.Search(ctx, query, embedding, opts)
// Logs each stage timing and candidate counts
🚀 Future Improvements¶
- GPU-accelerated HNSW search for 100M+ vectors
- Learned sparse retrieval (LSR) for semantic BM25
- Multi-vector embeddings (different dimensions for different models)
- Real-time index updates (currently batch-only)
- Approximate cross-encoder reranking (faster)
- Geographic search integration
📖 Further Reading¶
- HNSW: Malkov & Yashunin (2018) - "Efficient and Robust Approximate Nearest Neighbor Search"
- BM25: Robertson & Zaragoza (2009) - "The Probabilistic Relevance Framework: BM25 and Beyond"
- RRF: Cormack, Clarke, & Buettcher (2009) - "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods"
- Cross-Encoder: Nogueira & Cho (2019) - "Passage Re-ranking with BERT"
Last Updated: February 17, 2026
Package: pkg/search + pkg/simd + pkg/math/vector