Local GGUF Embedding Executor: Implementation Plan¶
Companion to:
LOCAL_GGUF_EMBEDDING_FEASIBILITY.md
Scope: Tight, performant GGUF embedding integration for NornicDB
Overview¶
This plan adds a new local embedding provider that runs GGUF models directly within NornicDB. External providers (Ollama, OpenAI) remain fully supported and unchanged.
| Provider | Description | Status |
|---|---|---|
| `local` | NEW - Tightly coupled, runs in-process | This RFC |
| `ollama` | External Ollama server | Existing, unchanged |
| `openai` | OpenAI API | Existing, unchanged |
Licensing¶
BYOM (Bring Your Own Model) - licensing is delineated at the model file level.
| Component | License | Notes |
|---|---|---|
| llama.cpp | MIT | CGO static link - we're good |
| GGML | MIT | Via llama.cpp - we're good |
| NornicDB | MIT | Our code |
| Model files | User's responsibility | E5, BGE, etc. have their own licenses |
Users download their own .gguf files. We don't ship or recommend any specific model.
Design Principles¶
- Backward compatible - Existing `ollama` and `openai` configs unchanged
- BYOM - User downloads/converts their own GGUF models
- Use existing config - Same `NORNICDB_EMBEDDING_MODEL` env var, same pattern
- Simple model path - Model name → `/data/models/{name}.gguf`
- Default to BGE-M3 - `bge-m3` instead of `mxbai-embed-large`
- GPU-first, CPU fallback - Auto-detect CUDA/Metal, graceful fallback
- Tight CGO integration - Direct llama.cpp bindings, no IPC/subprocess
- Low memory footprint - mmap models, quantized weights, shared context
Configuration (Matches Existing Pattern)¶
# NEW local mode (tightly coupled with database)
NORNICDB_EMBEDDING_PROVIDER=local
NORNICDB_EMBEDDING_MODEL=bge-m3 # Model name → /data/models/{name}.gguf
NORNICDB_EMBEDDING_DIMENSIONS=1024
# Existing providers still fully supported
NORNICDB_EMBEDDING_PROVIDER=ollama # Uses external Ollama server
NORNICDB_EMBEDDING_PROVIDER=openai # Uses OpenAI API
# GPU configuration (local mode only)
NORNICDB_EMBEDDING_GPU_LAYERS=-1 # -1 = auto (all to GPU if available)
# 0 = CPU only
# N = offload N layers to GPU
Backward Compatibility: Existing ollama and openai configurations work exactly as before.
GPU Acceleration¶
Follows llama.cpp's GPU-first strategy:
Startup sequence (local mode):
1. Detect available GPU backend (CUDA → Metal → Vulkan → CPU)
2. If GPU found → offload all layers to GPU
3. If no GPU or NORNICDB_EMBEDDING_GPU_LAYERS=0 → use CPU with SIMD (AVX2/NEON)
| Backend | Platform | Detection |
|---|---|---|
| CUDA | Linux/Windows + NVIDIA | Primary, auto-detect |
| Metal | macOS Apple Silicon | Primary, auto-detect |
| Vulkan | Cross-platform | Fallback |
| CPU | All platforms | Always available |
CGO Build Flags¶
// Build with GPU support
#cgo linux,amd64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_amd64_cuda -lcudart -lcublas
#cgo darwin,arm64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_darwin_arm64 -framework Metal -framework Accelerate
Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ NornicDB Server │
├─────────────────────────────────────────────────────────────────┤
│ pkg/embed/ │
│ ├── Embedder interface (existing) │
│ └── LocalGGUFEmbedder ──────────────────────────────────────┐ │
├──────────────────────────────────────────────────────────────┼──┤
│ pkg/localllm/ │ │
│ ┌───────────────────────────────────────────────────────────┼──┤
│ │ Model (Go wrapper) │ │
│ │ • LoadModel() - mmap GGUF, create context │ │
│ │ • Embed() - tokenize → forward → pool → normalize │ │
│ │ • Close() - free resources │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────▼───────┐ │
│ │ CGO Bridge │ │
│ │ (llama.h) │ │
│ └───────┬───────┘ │
├──────────────────────────────┼──────────────────────────────────┤
│ lib/llama/ (vendored) │ │
│ ├── llama.h + ggml.h ▼ │
│ └── libllama.a ◄── Static library per platform │
└─────────────────────────────────────────────────────────────────┘
Directory Structure¶
nornicdb/
├── pkg/
│ ├── embed/
│ │ ├── embed.go # Embedder interface (unchanged)
│ │ ├── ollama.go # OllamaEmbedder (unchanged)
│ │ ├── openai.go # OpenAIEmbedder (unchanged)
│ │ └── local_gguf.go # LocalGGUFEmbedder (NEW)
│ └── localllm/
│ ├── llama.go # CGO bindings + Go wrapper
│ ├── llama_test.go
│ └── options.go # Config structs
│
├── lib/
│ └── llama/ # Vendored llama.cpp
│ ├── llama.h
│ ├── ggml.h
│ ├── libllama_linux_amd64.a # CPU only
│ ├── libllama_linux_amd64_cuda.a # With CUDA
│ ├── libllama_linux_arm64.a
│ ├── libllama_darwin_arm64.a # With Metal
│ ├── libllama_darwin_amd64.a
│ └── libllama_windows_amd64.a # With CUDA
│
└── scripts/
└── build-llama.sh # Build static libs
Implementation¶
Step 1: CGO Bindings¶
File: pkg/localllm/llama.go
package localllm
/*
#cgo CFLAGS: -I${SRCDIR}/../../lib/llama
// Linux with CUDA (GPU primary)
#cgo linux,amd64,cuda LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_amd64_cuda -lcudart -lcublas -lm -lstdc++ -lpthread
// Linux CPU fallback
#cgo linux,amd64,!cuda LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_amd64 -lm -lstdc++ -lpthread
#cgo linux,arm64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_arm64 -lm -lstdc++ -lpthread
// macOS with Metal (GPU primary on Apple Silicon)
#cgo darwin,arm64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_darwin_arm64 -lm -lc++ -framework Accelerate -framework Metal -framework MetalPerformanceShaders
#cgo darwin,amd64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_darwin_amd64 -lm -lc++ -framework Accelerate
// Windows with CUDA
#cgo windows,amd64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_windows_amd64 -lcudart -lcublas -lm -lstdc++
#include <stdlib.h>
#include <string.h>
#include "llama.h"
// Initialize backend once (handles GPU detection)
static int initialized = 0;
void init_backend() {
if (!initialized) {
llama_backend_init();
initialized = 1;
}
}
// Load model with mmap for low memory usage
struct llama_model* load_model(const char* path, int n_gpu_layers) {
init_backend();
struct llama_model_params params = llama_model_default_params();
params.n_gpu_layers = n_gpu_layers;
params.use_mmap = 1;
return llama_load_model_from_file(path, params);
}
// Create embedding context (minimal memory)
struct llama_context* create_context(struct llama_model* model, int n_ctx, int n_batch, int n_threads) {
struct llama_context_params params = llama_context_default_params();
params.n_ctx = n_ctx;
params.n_batch = n_batch;
params.n_threads = n_threads;
params.n_threads_batch = n_threads;
params.embeddings = 1;
params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
return llama_new_context_with_model(model, params);
}
// Tokenize using model's vocab
int tokenize(struct llama_model* model, const char* text, int text_len, int* tokens, int max_tokens) {
return llama_tokenize(model, text, text_len, tokens, max_tokens, 1, 1);
}
// Generate embedding
int embed(struct llama_context* ctx, int* tokens, int n_tokens, float* out, int n_embd) {
llama_kv_cache_clear(ctx);
struct llama_batch batch = llama_batch_init(n_tokens, 0, 1);
for (int i = 0; i < n_tokens; i++) {
batch.token[i] = tokens[i];
batch.pos[i] = i;
batch.n_seq_id[i] = 1;
batch.seq_id[i][0] = 0;
batch.logits[i] = 0;
}
batch.n_tokens = n_tokens;
if (llama_decode(ctx, batch) != 0) {
llama_batch_free(batch);
return -1;
}
float* embd = llama_get_embeddings_seq(ctx, 0);
if (!embd) {
llama_batch_free(batch);
return -2;
}
memcpy(out, embd, n_embd * sizeof(float));
llama_batch_free(batch);
return 0;
}
int get_n_embd(struct llama_model* model) { return llama_n_embd(model); }
void free_ctx(struct llama_context* ctx) { if (ctx) llama_free(ctx); }
void free_model(struct llama_model* model) { if (model) llama_free_model(model); }
*/
import "C"
import (
"context"
"fmt"
"math"
"runtime"
"sync"
"unsafe"
)
// Model wraps a GGUF model for embedding generation
type Model struct {
model *C.struct_llama_model
ctx *C.struct_llama_context
dims int
mu sync.Mutex
}
// Options configures model loading
type Options struct {
ModelPath string
ContextSize int // Default: 512
BatchSize int // Default: 512
Threads int // Default: NumCPU/2, capped at 8
GPULayers int // Default: -1 (auto: all layers to GPU if available)
// 0 = CPU only, N = offload N layers
}
// DefaultOptions returns options optimized for GPU with CPU fallback
func DefaultOptions(modelPath string) Options {
threads := runtime.NumCPU() / 2
if threads < 1 {
threads = 1
}
if threads > 8 {
threads = 8
}
return Options{
ModelPath: modelPath,
ContextSize: 512,
BatchSize: 512,
Threads: threads,
GPULayers: -1, // Auto: use GPU if available, fallback to CPU
}
}
// LoadModel loads a GGUF model
func LoadModel(opts Options) (*Model, error) {
cPath := C.CString(opts.ModelPath)
defer C.free(unsafe.Pointer(cPath))
model := C.load_model(cPath, C.int(opts.GPULayers))
if model == nil {
return nil, fmt.Errorf("failed to load: %s", opts.ModelPath)
}
ctx := C.create_context(model, C.int(opts.ContextSize), C.int(opts.BatchSize), C.int(opts.Threads))
if ctx == nil {
C.free_model(model)
return nil, fmt.Errorf("failed to create context")
}
return &Model{
model: model,
ctx: ctx,
dims: int(C.get_n_embd(model)),
}, nil
}
// Embed generates a normalized embedding
func (m *Model) Embed(ctx context.Context, text string) ([]float32, error) {
if text == "" {
return nil, nil
}
m.mu.Lock()
defer m.mu.Unlock()
// Tokenize (llama_tokenize returns the negated required token count if the buffer is too small)
cText := C.CString(text)
defer C.free(unsafe.Pointer(cText))
tokens := make([]C.int, 512)
n := C.tokenize(m.model, cText, C.int(len(text)), &tokens[0], 512)
if n < 0 {
return nil, fmt.Errorf("tokenization failed: input needs %d tokens, buffer holds 512", -n)
}
// Embed
emb := make([]float32, m.dims)
if C.embed(m.ctx, (*C.int)(&tokens[0]), n, (*C.float)(&emb[0]), C.int(m.dims)) != 0 {
return nil, fmt.Errorf("embedding failed")
}
// Normalize
normalize(emb)
return emb, nil
}
// EmbedBatch embeds multiple texts
func (m *Model) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
results := make([][]float32, len(texts))
for i, t := range texts {
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
emb, err := m.Embed(ctx, t)
if err != nil {
return nil, fmt.Errorf("text %d: %w", i, err)
}
results[i] = emb
}
return results, nil
}
// Dimensions returns embedding size
func (m *Model) Dimensions() int { return m.dims }
// Close frees resources
func (m *Model) Close() error {
m.mu.Lock()
defer m.mu.Unlock()
C.free_ctx(m.ctx)
C.free_model(m.model)
return nil
}
func normalize(v []float32) {
var sum float32
for _, x := range v {
sum += x * x
}
if sum == 0 {
return
}
norm := float32(1.0 / math.Sqrt(float64(sum)))
for i := range v {
v[i] *= norm
}
}
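Because Embed returns unit-normalized vectors, cosine similarity between any two stored embeddings reduces to a plain dot product. A standalone check of the normalization step above:

```go
package main

import (
	"fmt"
	"math"
)

// Same normalization as pkg/localllm: scale the vector to unit L2 norm.
func normalize(v []float32) {
	var sum float32
	for _, x := range v {
		sum += x * x
	}
	if sum == 0 {
		return
	}
	norm := float32(1.0 / math.Sqrt(float64(sum)))
	for i := range v {
		v[i] *= norm
	}
}

func main() {
	v := []float32{3, 4}
	normalize(v)
	fmt.Println(v) // [0.6 0.8] (a unit vector)
}
```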
Step 2: Embedder Integration¶
File: pkg/embed/local_gguf.go
package embed
import (
"context"
"fmt"
"os"
"path/filepath"
"github.com/orneryd/nornicdb/pkg/localllm"
)
// LocalGGUFEmbedder implements Embedder using a local GGUF model
type LocalGGUFEmbedder struct {
model *localllm.Model
modelName string
modelPath string
}
// NewLocalGGUF creates an embedder using the existing Config pattern.
// Model resolution: config.Model → /data/models/{model}.gguf
func NewLocalGGUF(config *Config) (*LocalGGUFEmbedder, error) {
// Resolve model path: model name → /data/models/{name}.gguf
modelsDir := os.Getenv("NORNICDB_MODELS_DIR")
if modelsDir == "" {
modelsDir = "/data/models"
}
modelPath := filepath.Join(modelsDir, config.Model+".gguf")
// Check if file exists
if _, err := os.Stat(modelPath); os.IsNotExist(err) {
return nil, fmt.Errorf("model not found: %s (expected at %s)", config.Model, modelPath)
}
opts := localllm.DefaultOptions(modelPath)
// Optional: raise the context window for long-context models (default is 512)
// opts.ContextSize = 8192
model, err := localllm.LoadModel(opts)
if err != nil {
return nil, fmt.Errorf("failed to load %s: %w", modelPath, err)
}
return &LocalGGUFEmbedder{
model: model,
modelName: config.Model,
modelPath: modelPath,
}, nil
}
func (e *LocalGGUFEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
return e.model.Embed(ctx, text)
}
func (e *LocalGGUFEmbedder) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
return e.model.EmbedBatch(ctx, texts)
}
func (e *LocalGGUFEmbedder) Dimensions() int { return e.model.Dimensions() }
func (e *LocalGGUFEmbedder) Model() string { return e.modelName }
func (e *LocalGGUFEmbedder) Close() error { return e.model.Close() }
Step 3: Factory Update¶
Update NewEmbedder() in pkg/embed/embed.go:
func NewEmbedder(config *Config) (Embedder, error) {
switch config.Provider {
case "local":
return NewLocalGGUF(config)
case "ollama":
return NewOllama(config), nil
case "openai":
if config.APIKey == "" {
return nil, fmt.Errorf("OpenAI requires an API key")
}
return NewOpenAI(config), nil
default:
return nil, fmt.Errorf("unknown provider: %s", config.Provider)
}
}
Step 4: Default Config Update¶
Update defaults in cmd/nornicdb/main.go:
// Change default from mxbai to bge-m3
serveCmd.Flags().String("embedding-model",
getEnvStr("NORNICDB_EMBEDDING_MODEL", "bge-m3"),
"Embedding model name")
Build System¶
Build Script: scripts/build-llama.sh¶
#!/bin/bash
set -euo pipefail
VERSION="${1:-b4535}"
OUTDIR="lib/llama"
mkdir -p "$OUTDIR"
git clone --depth 1 --branch "$VERSION" https://github.com/ggerganov/llama.cpp.git /tmp/llama.cpp
cd /tmp/llama.cpp
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
[[ "$ARCH" == "x86_64" ]] && ARCH="amd64"
[[ "$ARCH" == "aarch64" ]] && ARCH="arm64"
CMAKE_ARGS="-DGGML_STATIC=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_SERVER=OFF"
[[ "$OS" == "darwin" && "$ARCH" == "arm64" ]] && CMAKE_ARGS="$CMAKE_ARGS -DGGML_METAL=ON"
cmake -B build $CMAKE_ARGS
cmake --build build --config Release -j"$(nproc 2>/dev/null || sysctl -n hw.ncpu)"
# Recent llama.cpp trees place libllama.a under build/src/ and build separate
# static ggml archives (build/ggml/src/libggml*.a); copy or merge those as needed.
cp build/src/libllama.a "$OUTDIR/libllama_${OS}_${ARCH}.a"
cp include/llama.h ggml/include/ggml.h "$OUTDIR/"
echo "Built: libllama_${OS}_${ARCH}.a"
GitHub Actions¶
name: Build llama.cpp
on: workflow_dispatch
jobs:
build:
strategy:
matrix:
include:
- os: ubuntu-latest
lib: libllama_linux_amd64.a
- os: macos-14
lib: libllama_darwin_arm64.a
- os: macos-13
lib: libllama_darwin_amd64.a
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- run: ./scripts/build-llama.sh
- uses: actions/upload-artifact@v4
with:
name: ${{ matrix.lib }}
path: lib/llama/${{ matrix.lib }}
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `NORNICDB_EMBEDDING_PROVIDER` | `ollama` | `local`, `ollama`, `openai` |
| `NORNICDB_EMBEDDING_MODEL` | `bge-m3` | Model name (looked up in models dir) |
| `NORNICDB_EMBEDDING_DIMENSIONS` | `1024` | Vector dimensions |
| `NORNICDB_MODELS_DIR` | `/data/models` | Directory for `.gguf` files |
| `NORNICDB_EMBEDDING_GPU_LAYERS` | `-1` | GPU offload: -1=auto, 0=CPU only, N=N layers |
Note: NORNICDB_EMBEDDING_GPU_LAYERS only applies to local provider. External providers (ollama, openai) manage their own GPU usage.
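One failure mode worth guarding against at startup: `NORNICDB_EMBEDDING_DIMENSIONS` disagreeing with the dimensionality the loaded model actually reports (via `llama_n_embd`). A hedged sketch of such a check (`checkDims` is illustrative, not existing code):

```go
package main

import "fmt"

// checkDims compares the configured vector size against what the model
// produces. A mismatch means new embeddings would be incomparable with
// (or unstorable alongside) previously indexed vectors.
func checkDims(configured, modelReported int) error {
	if configured != modelReported {
		return fmt.Errorf("dimension mismatch: config says %d, model produces %d", configured, modelReported)
	}
	return nil
}

func main() {
	fmt.Println(checkDims(1024, 1024) == nil) // true
	fmt.Println(checkDims(1024, 768) != nil)  // true (mismatch detected)
}
```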
Model Setup (User's Responsibility)¶
# Create models directory
mkdir -p /data/models
# Option 1: Download pre-converted GGUF (if available)
wget -O /data/models/bge-m3.gguf \
https://huggingface.co/...some-community-conversion.../bge-m3.Q4_K_M.gguf
# Option 2: Convert from HuggingFace yourself using llama.cpp's converter
git clone --depth 1 https://github.com/ggerganov/llama.cpp /tmp/llama.cpp
pip install -r /tmp/llama.cpp/requirements.txt
huggingface-cli download BAAI/bge-m3 --local-dir /tmp/bge-m3
# convert_hf_to_gguf.py supports BERT-family embedding models;
# check the script's --help for your architecture
python /tmp/llama.cpp/convert_hf_to_gguf.py /tmp/bge-m3 \
    --outfile /data/models/bge-m3.gguf
We don't ship models. Users bring their own, handle their own licensing.
Model Selection Guide¶
Quick Reference¶
| Model | Best For | Context | Dims | License |
|---|---|---|---|---|
| bge-m3 ⭐ | Long docs, code+docs hybrid (default) | 8192 | 1024 | MIT |
| e5-large-v2 | Natural language search, multilingual | 512 | 1024 | MIT |
| jina-embeddings-v2-base-code | Pure code search | 8192 | 768 | Apache 2.0 |
When to Use Which¶
Use BGE-M3 (default) when:
- Indexing entire files (handles 8K tokens)
- Code repositories with long functions
- Need hybrid retrieval (lexical + semantic)
- Want single model for code + docs

Use E5 when:
- Need 100+ language support
- Simpler dense-only retrieval preferred
- Working with shorter content (<512 tokens)
- Slightly lower memory footprint needed

Use Jina Code when:
- Pure code-to-code similarity
- "Find functions similar to this one"
- 30+ programming languages
- Don't need natural language understanding
Practical Notes¶
- E5 and BGE-M3 both work for code - they understand variable names, comments, and semantic patterns even though they weren't code-specialized
- Context length matters - if your code files average >512 tokens, BGE-M3's 8K context is valuable
- Changing models = re-index everything - embeddings from different models are incompatible
- Quantization is fine - Q4_K_M loses ~2% quality but runs 3x faster, worth it for most use cases
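A rough way to apply the context-length advice above: estimate token counts at roughly 4 characters per token (a common heuristic for English text and code, an assumption rather than a measurement) and compare against the 512-token window:

```go
package main

import "fmt"

// approxTokens estimates token count at ~4 characters per token.
// This is a coarse heuristic, not a real tokenizer.
func approxTokens(text string) int { return len(text) / 4 }

func main() {
	file := make([]byte, 10_000) // a 10 KB source file
	fmt.Println(approxTokens(string(file)) > 512) // true: exceeds a 512-token window
}
```

Files that routinely exceed the window argue for BGE-M3's 8K context; shorter content fits comfortably in E5's 512 tokens.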
Performance Targets¶
| Hardware | Model | Latency | Memory |
|---|---|---|---|
| 4-core CPU | E5-base Q4 | <20ms | ~50MB (mmap) |
| 8-core CPU | E5-large Q4 | <30ms | ~100MB (mmap) |
| M2 (Metal) | E5-large Q4 | <10ms | ~100MB |
Quick Start¶
# 1. Put your GGUF model in /data/models/
cp my-bge-model.gguf /data/models/bge-m3.gguf
# 2. Run with local provider (same config pattern as before!)
NORNICDB_EMBEDDING_PROVIDER=local \
NORNICDB_EMBEDDING_MODEL=bge-m3 \
nornicdb serve
# Or via CLI flags
nornicdb serve --embedding-provider=local --embedding-model=bge-m3
Checklist¶
- Vendor llama.cpp headers (`lib/llama/`) - Placeholder headers created
- Build static libs with GPU support (CUDA for Linux/Windows, Metal for macOS)
- macOS ARM64 with Metal (`libllama_darwin_arm64.a`)
- Linux AMD64 with CUDA (via build script)
- Windows AMD64 with CUDA (`llama_windows.go` + `build-llama-cuda.ps1`)
- Implement `pkg/localllm/llama.go` (CGO bindings with GPU detection)
- Implement `pkg/localllm/llama_windows.go` (Windows CUDA CGO bindings)
- Implement `pkg/embed/local_gguf.go` (Embedder)
- Update `NewEmbedder()` factory to handle the `local` provider
- Ensure `ollama` and `openai` providers remain unchanged - Tests pass!
- Change default model to `bge-m3` (optional - kept mxbai for backward compat)
- Add `NORNICDB_MODELS_DIR` env var
- Add `NORNICDB_EMBEDDING_GPU_LAYERS` env var
- GPU auto-detection with graceful CPU fallback (in CGO code)
- Tests + benchmarks (CPU and GPU) - Tests created, skip without model
- Docs update - README and build workflow created
Build Tags:
- Use `-tags=localllm` to enable local GGUF support on Linux/macOS
- Use `-tags="cuda localllm"` for Windows with CUDA support

Next Steps (Linux/macOS):
1. Run `./scripts/build-llama.sh` to build llama.cpp for your platform
2. Place a `.gguf` model in `/data/models/`
3. Build with: `go build -tags=localllm ./cmd/nornicdb`
4. Run with: `NORNICDB_EMBEDDING_PROVIDER=local nornicdb serve`

Next Steps (Windows with CUDA):
1. Run `.\scripts\build-llama-cuda.ps1` to build llama.cpp with CUDA
2. Place a `.gguf` model in your models directory
3. Run `.\build-cuda.bat` or `.\build-full.bat` to build NornicDB
4. Run with: `set NORNICDB_EMBEDDING_PROVIDER=local && bin\nornicdb.exe serve`
Migration from mxbai¶
For users already using mxbai-embed-large via Ollama:
# Before (Ollama) - STILL WORKS, NO CHANGES NEEDED
NORNICDB_EMBEDDING_PROVIDER=ollama
NORNICDB_EMBEDDING_MODEL=mxbai-embed-large
# After (Local GGUF) - NEW OPTION, put model in /data/models/
NORNICDB_EMBEDDING_PROVIDER=local
NORNICDB_EMBEDDING_MODEL=bge-m3 # or e5-large-v2, or keep mxbai if you have the GGUF
Note: Changing embedding models requires re-indexing. Embeddings from different models are not compatible.
Version 2.0 - November 2024