RFC: Heimdall SLM Quality Control for Auto-TLP¶
Status: Proposal
Author: NornicDB Team
Created: December 2024
Summary¶
Add an optional LLM-based quality control layer to Auto-TLP (Automatic Topological Link prediction) that validates relationship suggestions before they're created. This "Heimdall" layer uses a small, local instruction-tuned model to review TLP's algorithmic suggestions and can optionally suggest additional relationships.
Motivation¶
Auto-TLP automatically creates edges based on: - Embedding similarity - Co-access patterns - Temporal proximity - Transitive inference
While these algorithms are fast and effective, they can produce false positives: - Similarity noise: Similar embeddings don't always mean meaningful relationships - Spurious co-access: Users might access unrelated nodes in the same session - Transitive errors: AβB and BβC doesn't always mean A should connect to C
An LLM can provide semantic validation that algorithms can't: - "These two notes are about the same project" β - "These nodes share keywords but aren't actually related" β - "This relationship would be more accurately typed as INSPIRED_BY" π
Design Goals¶
- Opt-in via feature flags - Disabled by default, zero impact if not enabled
- Small model friendly - Works with 1-3B parameter instruction models
- Fail-open - LLM failures don't block edge creation
- Batch efficient - Multiple suggestions per LLM call
- Size aware - Gracefully handles large nodes that exceed context limits
- Augmentation capable - LLM can suggest edges TLP missed (optional)
Architecture¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Auto-TLP Pipeline β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Node Created/Accessed β
β β β
β βΌ β
β ββββββββββββββββββββ β
β β TLP Algorithms β Fast, algorithmic candidate generation β
β β β’ Similarity β β
β β β’ Co-access β β
β β β’ Temporal β β
β β β’ Transitive β β
β ββββββββββ¬ββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββ βββββββββββββββββββββββββββββββββββ β
β β LLM_QC Enabled? ββββββΆβ Skip QC, return all candidates β β
β ββββββββββ¬ββββββββββ No βββββββββββββββββββββββββββββββββββ β
β β Yes β
β βΌ β
β ββββββββββββββββββββ β
β β Batch & Check β Group candidates, check size limits β
β β Size Limits β β
β ββββββββββ¬ββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββ βββββββββββββββββββββββββββββββββββ β
β β Prompt too big? ββββββΆβ Log warning, pass batch through β β
β ββββββββββ¬ββββββββββ Yes βββββββββββββββββββββββββββββββββββ β
β β No β
β βΌ β
β ββββββββββββββββββββ β
β β Heimdall SLM β Local instruct model reviews batch β
β β Batch Review β β
β ββββββββββ¬ββββββββββ β
β β β
β βββββββββ LLM Error βββββββΆ Log, pass through β
β β β
β βΌ β
β ββββββββββββββββββββ β
β β Parse Response β Extract approved/rejected indices β
β ββββββββββ¬ββββββββββ β
β β β
β βββββββββ Parse Error βββββΆ Fuzzy parse or approve β
β β β
β βΌ β
β ββββββββββββββββββββ βββββββββββββββββββββββββββββββββββ β
β β Augment Enabled? ββββββΆβ Include LLM's new suggestions β β
β ββββββββββ¬ββββββββββ Yes βββββββββββββββββββββββββββββββββββ β
β β No β
β βΌ β
β Return approved edges β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Feature Flags¶
| Flag | Default | Description |
|---|---|---|
NORNICDB_AUTO_TLP_ENABLED | β Off | Enable TLP candidate generation |
NORNICDB_AUTO_TLP_LLM_QC_ENABLED | β Off | Enable Heimdall batch review |
NORNICDB_AUTO_TLP_LLM_AUGMENT_ENABLED | β Off | Allow Heimdall to suggest new edges |
Progressive enablement:
# Stage 1: TLP only (fast, no LLM)
export NORNICDB_AUTO_TLP_ENABLED=true
# Stage 2: TLP + Heimdall review (higher quality)
export NORNICDB_AUTO_TLP_ENABLED=true
export NORNICDB_AUTO_TLP_LLM_QC_ENABLED=true
# Stage 3: Full hybrid (TLP + review + augmentation)
export NORNICDB_AUTO_TLP_ENABLED=true
export NORNICDB_AUTO_TLP_LLM_QC_ENABLED=true
export NORNICDB_AUTO_TLP_LLM_AUGMENT_ENABLED=true
Unified SLM Architecture¶
Heimdall QC uses the same SLM instance as Bifrost commands: - Stateless: No context accumulates between calls - One-shot: Each call is independent, complete in single pass - KV Cache: Static system prompt cached, only data varies
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SINGLE SLM INSTANCE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β KV Cache (static, loaded once): β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β [Bifrost Commands] [Heimdall QC Instructions] ββ
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Per-call (dynamic): β
β βββββββββββββββββββββ βββββββββββββββββββββββββββββββββ
β β Bifrost: "CREATE β β Heimdall: "SRC:node-1[Note] ββ
β β (n:Person)" β β EDGES:0.node-2βREL(80%)" ββ
β βββββββββββββββββββββ βββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Prompt Format¶
System Prompt (static, KV cached):
Review graph edges. Output JSON only.
Format: {"approved":[indices],"rejected":[indices],"reasoning":"why"}
Approve if nodes are meaningfully related. Reject spam/duplicates.
User Content (dynamic, per-call):
SRC:node-123[Memory,Note]
title:Machine Learning Basics
content:Introduction to neural networks...
EDGES:
0.node-456βRELATES_TO(85%)
1.node-789βRELATES_TO(72%)
Response (JSON only):
With augmentation:
{"approved":[0],"additional":[{"target_id":"node-999","type":"INSPIRED_BY","conf":0.8,"reason":"both discuss backprop"}]}
Configuration¶
type HeimdallQCConfig struct {
Enabled bool // Master switch
Timeout time.Duration // Default: 10s
MaxContextBytes int // Default: 4096 (~1000 tokens)
MaxBatchSize int // Default: 5 suggestions per call
MaxNodeSummaryLen int // Default: 200 chars per property
MinConfidenceToReview float64 // Default: 0.5 (skip weak candidates)
CacheDecisions bool // Default: true
CacheTTL time.Duration // Default: 1 hour
}
Error Handling¶
Principle: Fail-open, log, continue
| Error | Action |
|---|---|
| LLM timeout | Log warning, approve batch, continue |
| LLM crash | Log error, approve batch, continue |
| Invalid JSON | Fuzzy parse or approve all |
| Prompt too large | Log warning, skip review, pass through |
| Context cancelled | Return immediately with current results |
No retries - If the LLM fails, we don't retry. We log the decision made without LLM input and move on.
Usage Example¶
import (
"github.com/orneryd/nornicdb/pkg/inference"
"github.com/orneryd/nornicdb/pkg/config"
"github.com/orneryd/nornicdb/pkg/heimdall"
)
// Heimdall QC uses the SAME Generator as Bifrost commands
// Direct llama.cpp via localllm - no HTTP calls
func setupHeimdallQC(generator heimdall.Generator) {
systemPrompt := inference.GetSystemPrompt(config.IsAutoTLPLLMAugmentEnabled())
heimdallFunc := func(ctx context.Context, userContent string) (string, error) {
// Combine static system prompt + dynamic user content
prompt := systemPrompt + "\n\n" + userContent
return generator.Generate(ctx, prompt, heimdall.GenerateParams{
MaxTokens: 256,
Temperature: 0.1, // Low temp for deterministic QC
})
}
qc := inference.NewHeimdallQC(heimdallFunc, nil)
engine.SetHeimdallQC(qc)
}
// Both Bifrost commands and Heimdall QC share:
// - Same heimdall.Generator (in-memory llama.cpp)
// - Same KV cache (system prompts cached)
// - Stateless one-shot calls
Performance Expectations¶
| Metric | Without QC | With QC |
|---|---|---|
| Latency per node | ~5-20ms | ~100-500ms |
| Edge quality | Good | Better |
| False positives | Some | Fewer |
| LLM calls | 0 | ~1 per 5 suggestions |
Mitigations: - Batch processing reduces calls - Caching prevents redundant reviews - Size limits prevent slow large-context calls - Async processing possible for background indexing
Alternatives Considered¶
1. Pre-trained classifier¶
- Pro: Faster than LLM
- Con: Requires training data, less flexible
- Decision: LLM is more adaptable to diverse data
2. Rule-based filtering¶
- Pro: Zero latency
- Con: Can't understand semantics
- Decision: TLP already does this; LLM adds semantic layer
3. Post-hoc cleanup¶
- Pro: Doesn't slow down creation
- Con: Edges exist until cleaned; user sees noise
- Decision: Better to validate before creation
Open Questions¶
-
Batch size tuning: Is 5 the right default? Should it auto-tune based on model?
-
Augmentation scope: Should augmented edges have lower initial confidence?
-
Model recommendations: Which small models work best? (Qwen 1.5B? Phi-3? Gemma 2B?)
-
Async mode: Should there be an option to review edges asynchronously?
Feedback Requested¶
- Does this solve a real problem for your use case?
- Are the feature flags granular enough?
- What small models have you had success with?
- Should there be a "strict mode" that blocks on LLM errors?
- Other edge types Heimdall should suggest?
Implementation PR: [Link to PR when ready]
Related Issues: - #XXX Auto-TLP implementation - #XXX Edge decay system