
Local GGUF Embedding Executor: Implementation Plan

Companion to: LOCAL_GGUF_EMBEDDING_FEASIBILITY.md
Scope: Tight, performant GGUF embedding integration for NornicDB


Overview

This plan adds a new local embedding provider that runs GGUF models directly within NornicDB. External providers (Ollama, OpenAI) remain fully supported and unchanged.

| Provider | Description | Status |
|----------|-------------|--------|
| local | NEW - tightly coupled, runs in-process | This RFC |
| ollama | External Ollama server | Existing, unchanged |
| openai | OpenAI API | Existing, unchanged |

Licensing

BYOM (Bring Your Own Model) - the licensing boundary sits at the model file level.

| Component | License | Notes |
|-----------|---------|-------|
| llama.cpp | MIT | CGO static link - we're good |
| GGML | MIT | Via llama.cpp - we're good |
| NornicDB | MIT | Our code |
| Model files | User's responsibility | E5, BGE, etc. have their own licenses |

Users download their own .gguf files. We don't ship or recommend any specific model.


Design Principles

  1. Backward compatible - Existing ollama and openai configs unchanged
  2. BYOM - User downloads/converts their own GGUF models
  3. Use existing config - Same NORNICDB_EMBEDDING_MODEL env var, same pattern
  4. Simple model path - Model name → /data/models/{name}.gguf
  5. Default to BGE-M3 - bge-m3 instead of mxbai-embed-large
  6. GPU-first, CPU fallback - Auto-detect CUDA/Metal, graceful fallback
  7. Tight CGO integration - Direct llama.cpp bindings, no IPC/subprocess
  8. Low memory footprint - mmap models, quantized weights, shared context

Configuration (Matches Existing Pattern)

# NEW local mode (tightly coupled with database)
NORNICDB_EMBEDDING_PROVIDER=local
NORNICDB_EMBEDDING_MODEL=bge-m3                # Model name → /data/models/{name}.gguf
NORNICDB_EMBEDDING_DIMENSIONS=1024

# Existing providers still fully supported
NORNICDB_EMBEDDING_PROVIDER=ollama             # Uses external Ollama server
NORNICDB_EMBEDDING_PROVIDER=openai             # Uses OpenAI API

# GPU configuration (local mode only)
NORNICDB_EMBEDDING_GPU_LAYERS=-1               # -1 = auto (all to GPU if available)
                                               # 0 = CPU only
                                               # N = offload N layers to GPU

Backward Compatibility: Existing ollama and openai configurations work exactly as before.


GPU Acceleration

Follows llama.cpp's GPU-first strategy:

Startup sequence (local mode):
1. Detect available GPU backend (CUDA → Metal → Vulkan → CPU)
2. If GPU found → offload all layers to GPU
3. If no GPU or NORNICDB_EMBEDDING_GPU_LAYERS=0 → use CPU with SIMD (AVX2/NEON)
| Backend | Platform | Detection |
|---------|----------|-----------|
| CUDA | Linux/Windows + NVIDIA | Primary, auto-detect |
| Metal | macOS (Apple Silicon) | Primary, auto-detect |
| Vulkan | Cross-platform | Fallback |
| CPU | All platforms | Always available |
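
For illustration, here is a minimal sketch (in Go) of how the NORNICDB_EMBEDDING_GPU_LAYERS value could be resolved before being handed to llama.cpp. The resolveGPULayers helper and its package placement are illustrative, not part of the existing codebase.

package config // illustrative placement only

import (
    "os"
    "strconv"
)

// resolveGPULayers maps NORNICDB_EMBEDDING_GPU_LAYERS onto the llama.cpp offload count:
// -1 = auto (all layers to GPU when a backend is detected), 0 = CPU only, N = N layers.
func resolveGPULayers() int {
    raw := os.Getenv("NORNICDB_EMBEDDING_GPU_LAYERS")
    if raw == "" {
        return -1 // default: auto
    }
    n, err := strconv.Atoi(raw)
    if err != nil {
        return -1 // unparseable value: fall back to auto rather than failing startup
    }
    return n
}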

CGO Build Flags

// Build with GPU support
#cgo linux,amd64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_amd64_cuda -lcudart -lcublas
#cgo darwin,arm64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_darwin_arm64 -framework Metal -framework Accelerate

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     NornicDB Server                              │
├─────────────────────────────────────────────────────────────────┤
│  pkg/embed/                                                      │
│  ├── Embedder interface (existing)                               │
│  └── LocalGGUFEmbedder ──────────────────────────────────────┐  │
├──────────────────────────────────────────────────────────────┼──┤
│  pkg/localllm/                                               │  │
│  ┌───────────────────────────────────────────────────────────┼──┤
│  │                    Model (Go wrapper)                     │  │
│  │  • LoadModel() - mmap GGUF, create context                │  │
│  │  • Embed()     - tokenize → forward → pool → normalize    │  │
│  │  • Close()     - free resources                           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                      ┌───────▼───────┐                           │
│                      │  CGO Bridge   │                           │
│                      │  (llama.h)    │                           │
│                      └───────┬───────┘                           │
├──────────────────────────────┼──────────────────────────────────┤
│  lib/llama/ (vendored)       │                                   │
│  ├── llama.h + ggml.h        ▼                                   │
│  └── libllama.a ◄── Static library per platform                 │
└─────────────────────────────────────────────────────────────────┘
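
For orientation, the existing Embedder interface referenced in the diagram is assumed to look roughly like the sketch below. This is inferred from the methods the providers implement later in this plan; the authoritative definition remains pkg/embed/embed.go.

package embed

import "context"

// Embedder is the provider-agnostic contract that local, ollama, and openai all satisfy.
type Embedder interface {
    Embed(ctx context.Context, text string) ([]float32, error)
    EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
    Dimensions() int
    Model() string
    Close() error
}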

Directory Structure

nornicdb/
├── pkg/
│   ├── embed/
│   │   ├── embed.go              # Embedder interface (unchanged)
│   │   ├── ollama.go             # OllamaEmbedder (unchanged)
│   │   ├── openai.go             # OpenAIEmbedder (unchanged)
│   │   └── local_gguf.go         # LocalGGUFEmbedder (NEW)
│   └── localllm/
│       ├── llama.go              # CGO bindings + Go wrapper
│       ├── llama_test.go
│       └── options.go            # Config structs
├── lib/
│   └── llama/                    # Vendored llama.cpp
│       ├── llama.h
│       ├── ggml.h
│       ├── libllama_linux_amd64.a       # CPU only
│       ├── libllama_linux_amd64_cuda.a  # With CUDA
│       ├── libllama_linux_arm64.a
│       ├── libllama_darwin_arm64.a      # With Metal
│       ├── libllama_darwin_amd64.a
│       └── libllama_windows_amd64.a     # With CUDA
└── scripts/
    └── build-llama.sh            # Build static libs

Implementation

Step 1: CGO Bindings

File: pkg/localllm/llama.go

package localllm

/*
#cgo CFLAGS: -I${SRCDIR}/../../lib/llama

// Linux with CUDA (GPU primary)
#cgo linux,amd64,cuda LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_amd64_cuda -lcudart -lcublas -lm -lstdc++ -lpthread
// Linux CPU fallback
#cgo linux,amd64,!cuda LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_amd64 -lm -lstdc++ -lpthread

#cgo linux,arm64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_linux_arm64 -lm -lstdc++ -lpthread

// macOS with Metal (GPU primary on Apple Silicon)
#cgo darwin,arm64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_darwin_arm64 -lm -lc++ -framework Accelerate -framework Metal -framework MetalPerformanceShaders
#cgo darwin,amd64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_darwin_amd64 -lm -lc++ -framework Accelerate

// Windows with CUDA
#cgo windows,amd64 LDFLAGS: -L${SRCDIR}/../../lib/llama -lllama_windows_amd64 -lcudart -lcublas -lm -lstdc++

#include <stdlib.h>
#include <string.h>
#include "llama.h"

// Initialize backend once (handles GPU detection)
static int initialized = 0;
void init_backend() {
    if (!initialized) {
        llama_backend_init();
        initialized = 1;
    }
}

// Load model with mmap for low memory usage
// (llama.h exposes llama_model / llama_context as opaque structs, hence the struct tags)
struct llama_model* load_model(const char* path, int n_gpu_layers) {
    init_backend();
    struct llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = n_gpu_layers;
    params.use_mmap = 1;
    return llama_load_model_from_file(path, params);
}

// Create embedding context (minimal memory)
struct llama_context* create_context(struct llama_model* model, int n_ctx, int n_batch, int n_threads) {
    struct llama_context_params params = llama_context_default_params();
    params.n_ctx = n_ctx;
    params.n_batch = n_batch;
    params.n_threads = n_threads;
    params.n_threads_batch = n_threads;
    params.embeddings = 1;
    params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
    return llama_new_context_with_model(model, params);
}

// Tokenize using model's vocab
int tokenize(struct llama_model* model, const char* text, int text_len, int* tokens, int max_tokens) {
    return llama_tokenize(model, text, text_len, tokens, max_tokens, 1, 1);
}

// Generate embedding
int embed(struct llama_context* ctx, int* tokens, int n_tokens, float* out, int n_embd) {
    llama_kv_cache_clear(ctx);

    struct llama_batch batch = llama_batch_init(n_tokens, 0, 1);
    for (int i = 0; i < n_tokens; i++) {
        batch.token[i] = tokens[i];
        batch.pos[i] = i;
        batch.n_seq_id[i] = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i] = 0;
    }
    batch.n_tokens = n_tokens;

    if (llama_decode(ctx, batch) != 0) {
        llama_batch_free(batch);
        return -1;
    }

    float* embd = llama_get_embeddings_seq(ctx, 0);
    if (!embd) {
        llama_batch_free(batch);
        return -2;
    }

    memcpy(out, embd, n_embd * sizeof(float));
    llama_batch_free(batch);
    return 0;
}

int get_n_embd(struct llama_model* model) { return llama_n_embd(model); }
void free_ctx(struct llama_context* ctx) { if (ctx) llama_free(ctx); }
void free_model(struct llama_model* model) { if (model) llama_free_model(model); }
*/
import "C"

import (
    "context"
    "fmt"
    "math"
    "runtime"
    "sync"
    "unsafe"
)

// Model wraps a GGUF model for embedding generation
type Model struct {
    model *C.struct_llama_model
    ctx   *C.struct_llama_context
    dims  int
    mu    sync.Mutex
}

// Options configures model loading
type Options struct {
    ModelPath   string
    ContextSize int  // Default: 512
    BatchSize   int  // Default: 512
    Threads     int  // Default: NumCPU/2, capped at 8
    GPULayers   int  // Default: -1 (auto: all layers to GPU if available)
                     // 0 = CPU only, N = offload N layers
}

// DefaultOptions returns options optimized for GPU with CPU fallback
func DefaultOptions(modelPath string) Options {
    threads := runtime.NumCPU() / 2
    if threads < 1 {
        threads = 1
    }
    if threads > 8 {
        threads = 8
    }
    return Options{
        ModelPath:   modelPath,
        ContextSize: 512,
        BatchSize:   512,
        Threads:     threads,
        GPULayers:   -1, // Auto: use GPU if available, fallback to CPU
    }
}

// LoadModel loads a GGUF model
func LoadModel(opts Options) (*Model, error) {
    cPath := C.CString(opts.ModelPath)
    defer C.free(unsafe.Pointer(cPath))

    model := C.load_model(cPath, C.int(opts.GPULayers))
    if model == nil {
        return nil, fmt.Errorf("failed to load: %s", opts.ModelPath)
    }

    ctx := C.create_context(model, C.int(opts.ContextSize), C.int(opts.BatchSize), C.int(opts.Threads))
    if ctx == nil {
        C.free_model(model)
        return nil, fmt.Errorf("failed to create context")
    }

    return &Model{
        model: model,
        ctx:   ctx,
        dims:  int(C.get_n_embd(model)),
    }, nil
}

// Embed generates a normalized embedding
func (m *Model) Embed(ctx context.Context, text string) ([]float32, error) {
    if text == "" {
        return nil, nil
    }

    m.mu.Lock()
    defer m.mu.Unlock()

    // Tokenize
    cText := C.CString(text)
    defer C.free(unsafe.Pointer(cText))

    tokens := make([]C.int, 512)
    n := C.tokenize(m.model, cText, C.int(len(text)), &tokens[0], 512)
    if n < 0 {
        return nil, fmt.Errorf("tokenization failed")
    }

    // Embed
    emb := make([]float32, m.dims)
    if C.embed(m.ctx, (*C.int)(&tokens[0]), n, (*C.float)(&emb[0]), C.int(m.dims)) != 0 {
        return nil, fmt.Errorf("embedding failed")
    }

    // Normalize
    normalize(emb)
    return emb, nil
}

// EmbedBatch embeds multiple texts
func (m *Model) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
    results := make([][]float32, len(texts))
    for i, t := range texts {
        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        default:
        }
        emb, err := m.Embed(ctx, t)
        if err != nil {
            return nil, fmt.Errorf("text %d: %w", i, err)
        }
        results[i] = emb
    }
    return results, nil
}

// Dimensions returns embedding size
func (m *Model) Dimensions() int { return m.dims }

// Close frees resources
func (m *Model) Close() error {
    m.mu.Lock()
    defer m.mu.Unlock()
    C.free_ctx(m.ctx)
    C.free_model(m.model)
    return nil
}

func normalize(v []float32) {
    var sum float32
    for _, x := range v {
        sum += x * x
    }
    if sum == 0 {
        return
    }
    norm := float32(1.0 / math.Sqrt(float64(sum)))
    for i := range v {
        v[i] *= norm
    }
}
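
A minimal usage sketch of the wrapper above, assuming a bge-m3 GGUF is already present under /data/models and the binary was built with the localllm tag:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/orneryd/nornicdb/pkg/localllm"
)

func main() {
    // Load with GPU-first defaults (GPULayers: -1, mmap-backed weights).
    opts := localllm.DefaultOptions("/data/models/bge-m3.gguf")

    model, err := localllm.LoadModel(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    // Embed a short code snippet; the result is already L2-normalized.
    emb, err := model.Embed(context.Background(), "func add(a, b int) int { return a + b }")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("dims=%d first=%f\n", model.Dimensions(), emb[0])
}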

Step 2: Embedder Integration

File: pkg/embed/local_gguf.go

package embed

import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    "github.com/orneryd/nornicdb/pkg/localllm"
)

// LocalGGUFEmbedder implements Embedder using a local GGUF model
type LocalGGUFEmbedder struct {
    model     *localllm.Model
    modelName string
    modelPath string
}

// NewLocalGGUF creates an embedder using the existing Config pattern.
// Model resolution: config.Model → /data/models/{model}.gguf
func NewLocalGGUF(config *Config) (*LocalGGUFEmbedder, error) {
    // Resolve model path: model name → /data/models/{name}.gguf
    modelsDir := os.Getenv("NORNICDB_MODELS_DIR")
    if modelsDir == "" {
        modelsDir = "/data/models"
    }

    modelPath := filepath.Join(modelsDir, config.Model+".gguf")

    // Check if file exists
    if _, err := os.Stat(modelPath); os.IsNotExist(err) {
        return nil, fmt.Errorf("model not found: %s (expected at %s)", config.Model, modelPath)
    }

    opts := localllm.DefaultOptions(modelPath)

    // Optional: override the context window explicitly (e.g. larger for long files,
    // smaller to save memory)
    // opts.ContextSize = 4096

    model, err := localllm.LoadModel(opts)
    if err != nil {
        return nil, fmt.Errorf("failed to load %s: %w", modelPath, err)
    }

    return &LocalGGUFEmbedder{
        model:     model,
        modelName: config.Model,
        modelPath: modelPath,
    }, nil
}

func (e *LocalGGUFEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
    return e.model.Embed(ctx, text)
}

func (e *LocalGGUFEmbedder) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
    return e.model.EmbedBatch(ctx, texts)
}

func (e *LocalGGUFEmbedder) Dimensions() int { return e.model.Dimensions() }
func (e *LocalGGUFEmbedder) Model() string   { return e.modelName }
func (e *LocalGGUFEmbedder) Close() error    { return e.model.Close() }
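
A sketch of the skip-without-model tests mentioned in the checklist. It relies only on the Config.Model field used above (other Config fields are not assumed) and skips cleanly when no GGUF file is available:

package embed

import (
    "context"
    "os"
    "path/filepath"
    "testing"
)

func TestLocalGGUFEmbedder(t *testing.T) {
    // Mirror the embedder's model-path resolution and skip if the file is absent.
    modelsDir := os.Getenv("NORNICDB_MODELS_DIR")
    if modelsDir == "" {
        modelsDir = "/data/models"
    }
    if _, err := os.Stat(filepath.Join(modelsDir, "bge-m3.gguf")); err != nil {
        t.Skip("bge-m3.gguf not present; skipping local GGUF test")
    }

    e, err := NewLocalGGUF(&Config{Model: "bge-m3"})
    if err != nil {
        t.Fatal(err)
    }
    defer e.Close()

    emb, err := e.Embed(context.Background(), "hello world")
    if err != nil {
        t.Fatal(err)
    }
    if len(emb) != e.Dimensions() {
        t.Fatalf("got %d dims, want %d", len(emb), e.Dimensions())
    }
}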

Step 3: Factory Update

Update NewEmbedder() in pkg/embed/embed.go:

func NewEmbedder(config *Config) (Embedder, error) {
    switch config.Provider {
    case "local":
        return NewLocalGGUF(config)
    case "ollama":
        return NewOllama(config), nil
    case "openai":
        if config.APIKey == "" {
            return nil, fmt.Errorf("OpenAI requires an API key")
        }
        return NewOpenAI(config), nil
    default:
        return nil, fmt.Errorf("unknown provider: %s", config.Provider)
    }
}
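
On the caller side nothing changes for existing providers; a sketch of the wiring, using only the Provider and Model fields that the factory above relies on:

package main

import (
    "log"
    "os"

    "github.com/orneryd/nornicdb/pkg/embed"
)

func main() {
    cfg := &embed.Config{
        Provider: os.Getenv("NORNICDB_EMBEDDING_PROVIDER"), // "local", "ollama", or "openai"
        Model:    os.Getenv("NORNICDB_EMBEDDING_MODEL"),    // e.g. "bge-m3"
    }

    embedder, err := embed.NewEmbedder(cfg)
    if err != nil {
        log.Fatalf("embedding init: %v", err)
    }
    defer embedder.Close()
}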

Step 4: Default Config Update

Update defaults in cmd/nornicdb/main.go:

// Change default from mxbai to bge-m3
serveCmd.Flags().String("embedding-model", 
    getEnvStr("NORNICDB_EMBEDDING_MODEL", "bge-m3"), 
    "Embedding model name")

Build System

Build Script: scripts/build-llama.sh

#!/bin/bash
set -euo pipefail

VERSION="${1:-b4535}"
OUTDIR="lib/llama"
mkdir -p "$OUTDIR"

git clone --depth 1 --branch "$VERSION" https://github.com/ggerganov/llama.cpp.git /tmp/llama.cpp
cd /tmp/llama.cpp

OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
[[ "$ARCH" == "x86_64" ]] && ARCH="amd64"
[[ "$ARCH" == "aarch64" ]] && ARCH="arm64"

CMAKE_ARGS="-DLLAMA_STATIC=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_SERVER=OFF"
[[ "$OS" == "darwin" && "$ARCH" == "arm64" ]] && CMAKE_ARGS="$CMAKE_ARGS -DLLAMA_METAL=ON"

cmake -B build $CMAKE_ARGS
cmake --build build --config Release -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# Note: artifact locations vary between llama.cpp versions - newer trees emit
# build/src/libllama.a and keep headers under include/ and ggml/include/.
cp build/libllama.a "$OUTDIR/libllama_${OS}_${ARCH}.a"
cp llama.h ggml.h "$OUTDIR/"
echo "Built: libllama_${OS}_${ARCH}.a"

GitHub Actions

name: Build llama.cpp
on: workflow_dispatch

jobs:
  build:
    strategy:
      matrix:
        include:
          - os: ubuntu-latest
            lib: libllama_linux_amd64.a
          - os: macos-14
            lib: libllama_darwin_arm64.a
          - os: macos-13  
            lib: libllama_darwin_amd64.a
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/build-llama.sh
      - uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.lib }}
          path: lib/llama/${{ matrix.lib }}

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| NORNICDB_EMBEDDING_PROVIDER | ollama | local, ollama, or openai |
| NORNICDB_EMBEDDING_MODEL | bge-m3 | Model name (looked up in models dir) |
| NORNICDB_EMBEDDING_DIMENSIONS | 1024 | Vector dimensions |
| NORNICDB_MODELS_DIR | /data/models | Directory for .gguf files |
| NORNICDB_EMBEDDING_GPU_LAYERS | -1 | GPU offload: -1 = auto, 0 = CPU only, N = N layers |

Note: NORNICDB_EMBEDDING_GPU_LAYERS only applies to local provider. External providers (ollama, openai) manage their own GPU usage.


Model Setup (User's Responsibility)

# Create models directory
mkdir -p /data/models

# Option 1: Download pre-converted GGUF (if available)
wget -O /data/models/bge-m3.gguf \
  https://huggingface.co/...some-community-conversion.../bge-m3.Q4_K_M.gguf

# Option 2: Convert from HuggingFace yourself with llama.cpp's converter script
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt huggingface_hub
huggingface-cli download BAAI/bge-m3 --local-dir ./bge-m3
python llama.cpp/convert_hf_to_gguf.py ./bge-m3 --outfile /data/models/bge-m3.gguf

We don't ship models. Users bring their own, handle their own licensing.


Model Selection Guide

Quick Reference

| Model | Best For | Context | Dims | License |
|-------|----------|---------|------|---------|
| bge-m3 | Long docs, code+docs hybrid (default) | 8192 | 1024 | MIT |
| e5-large-v2 | Natural language search, multilingual | 512 | 1024 | MIT |
| jina-embeddings-v2-base-code | Pure code search | 8192 | 768 | Apache 2.0 |

When to Use Which

Use BGE-M3 (default) when:
  • Indexing entire files (handles 8K tokens)
  • Code repositories with long functions
  • Need hybrid retrieval (lexical + semantic)
  • Want a single model for code + docs

Use E5 when:
  • Need 100+ language support
  • Simpler dense-only retrieval preferred
  • Working with shorter content (<512 tokens)
  • Slightly lower memory footprint needed

Use Jina Code when:
  • Pure code-to-code similarity
  • "Find functions similar to this one"
  • 30+ programming languages
  • Don't need natural language understanding

Practical Notes

  1. E5 and BGE-M3 both work for code - they understand variable names, comments, and semantic patterns even though they weren't code-specialized

  2. Context length matters - if your code files average >512 tokens, BGE-M3's 8K context is valuable

  3. Changing models = re-index everything - embeddings from different models are incompatible

  4. Quantization is fine - Q4_K_M loses ~2% quality but runs 3x faster, worth it for most use cases


Performance Targets

| Hardware | Model | Latency | Memory |
|----------|-------|---------|--------|
| 4-core CPU | E5-base Q4 | <20ms | ~50MB (mmap) |
| 8-core CPU | E5-large Q4 | <30ms | ~100MB (mmap) |
| M2 (Metal) | E5-large Q4 | <10ms | ~100MB |
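
These numbers are targets rather than measurements; a benchmark sketch along the following lines can validate them on real hardware. NORNICDB_BENCH_MODEL is a hypothetical override used only by this benchmark, not one of the configuration variables above.

package localllm

import (
    "context"
    "os"
    "testing"
)

// BenchmarkEmbed measures single-text embedding latency against whatever GGUF
// model is present locally; it skips when no model file is available.
func BenchmarkEmbed(b *testing.B) {
    path := os.Getenv("NORNICDB_BENCH_MODEL") // hypothetical: point CI at any local GGUF
    if path == "" {
        path = "/data/models/bge-m3.gguf"
    }
    if _, err := os.Stat(path); err != nil {
        b.Skip("no GGUF model available; skipping benchmark")
    }

    m, err := LoadModel(DefaultOptions(path))
    if err != nil {
        b.Fatal(err)
    }
    defer m.Close()

    text := "func add(a, b int) int { return a + b }"
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := m.Embed(context.Background(), text); err != nil {
            b.Fatal(err)
        }
    }
}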

Quick Start

# 1. Put your GGUF model in /data/models/
cp my-bge-model.gguf /data/models/bge-m3.gguf

# 2. Run with local provider (same config pattern as before!)
NORNICDB_EMBEDDING_PROVIDER=local \
NORNICDB_EMBEDDING_MODEL=bge-m3 \
nornicdb serve

# Or via CLI flags
nornicdb serve --embedding-provider=local --embedding-model=bge-m3

Checklist

  • Vendor llama.cpp headers (lib/llama/) - Placeholder headers created
  • Build static libs with GPU support (CUDA for Linux/Windows, Metal for macOS)
  • macOS ARM64 with Metal (libllama_darwin_arm64.a)
  • Linux AMD64 with CUDA (via build script)
  • Windows AMD64 with CUDA (llama_windows.go + build-llama-cuda.ps1)
  • Implement pkg/localllm/llama.go (CGO bindings with GPU detection)
  • Implement pkg/localllm/llama_windows.go (Windows CUDA CGO bindings)
  • Implement pkg/embed/local_gguf.go (Embedder)
  • Update NewEmbedder() factory to handle local provider
  • Ensure ollama and openai providers remain unchanged - Tests pass!
  • Change default model to bge-m3 (optional - kept mxbai for backward compat)
  • Add NORNICDB_MODELS_DIR env var
  • Add NORNICDB_EMBEDDING_GPU_LAYERS env var
  • GPU auto-detection with graceful CPU fallback (in CGO code)
  • Tests + benchmarks (CPU and GPU) - Tests created, skip without model
  • Docs update - README and build workflow created

Build Tags:
  • Use -tags=localllm to enable local GGUF support on Linux/macOS
  • Use -tags="cuda localllm" for Windows with CUDA support

Next Steps (Linux/macOS):
  1. Run ./scripts/build-llama.sh to build llama.cpp for your platform
  2. Place a .gguf model in /data/models/
  3. Build with: go build -tags=localllm ./cmd/nornicdb
  4. Run with: NORNICDB_EMBEDDING_PROVIDER=local nornicdb serve

Next Steps (Windows with CUDA):
  1. Run .\scripts\build-llama-cuda.ps1 to build llama.cpp with CUDA
  2. Place a .gguf model in your models directory
  3. Run .\build-cuda.bat or .\build-full.bat to build NornicDB
  4. Run with: set NORNICDB_EMBEDDING_PROVIDER=local && bin\nornicdb.exe serve


Migration from mxbai

For users already using mxbai-embed-large via Ollama:

# Before (Ollama) - STILL WORKS, NO CHANGES NEEDED
NORNICDB_EMBEDDING_PROVIDER=ollama
NORNICDB_EMBEDDING_MODEL=mxbai-embed-large

# After (Local GGUF) - NEW OPTION, put model in /data/models/
NORNICDB_EMBEDDING_PROVIDER=local
NORNICDB_EMBEDDING_MODEL=bge-m3   # or e5-large-v2, or keep mxbai if you have the GGUF

Note: Changing embedding models requires re-indexing. Embeddings from different models are not compatible.


Version 2.0 - November 2024