# Graph-RAG: Typical Distributed vs NornicDB In-Memory

This document compares a typical distributed Graph-RAG architecture (separate services for embedding, reranking, vector store, and LLM) with NornicDB's unified in-process design, and summarizes the latency reduction when all three model roles (embedding, reranking, inference) run in memory in a single process.

## Typical Distributed Graph-RAG (Reference)

Multiple network hops: Orchestrator → Tool Plugin → Embedding API → Vector Store → Reranker API → LLM. Each arrow implies serialization, network RTT, and often queueing.

```mermaid
flowchart LR
subgraph Users
U[("Users")]
end
subgraph Typical["Typical Graph-RAG (Distributed)"]
direction TB
Orch["Orchestrator"]
Tool["Tool Plugin"]
QE["Query Embedding<br/>TF-IDF + Embedding API"]
VS["Vector Store<br/>(Qdrant)"]
RRF["RRF + Adjacent Chunks"]
Rerank["Reranker API<br/>bge-reranker-base"]
LLM["LLM API<br/>(Meta/OpenAI)"]
U -->|Query| Orch
Orch -->|Query| Tool
Tool -->|Query| QE
QE -->|"Sparse + Dense Vec"| VS
VS -->|Top k×2| RRF
RRF -->|Chunks| Rerank
Rerank -->|Top k + Metadata| Tool
Tool --> Orch
Orch -->|Context + Query| LLM
LLM -->|Response| Orch
Orch -->|Response| U
end
style QE fill:#ffeb3b,color:#000
style Rerank fill:#ffeb3b,color:#000
style LLM fill:#2196f3,color:#fff
    style VS fill:#9e9e9e,color:#fff
```

Latency (typical ballpark per request):

| Step | Service | Est. latency (network + compute) |
|---|---|---|
| 1 | Orchestrator → Tool Plugin | 1–5 ms |
| 2 | Query → Embedding API (e.g. FastAPI bge-small) | 20–80 ms |
| 3 | Vectors → Vector Store (Qdrant) retrieval | 10–50 ms |
| 4 | Chunks → Reranker API (e.g. FastAPI bge-reranker) | 30–100 ms |
| 5 | Reranked context → Orchestrator | 1–5 ms |
| 6 | Context + Query → LLM API (generation) | 200–2000+ ms |
| Total (retrieval path) | — | ~60–240 ms (before LLM) |
| Total (full request) | — | ~260–2240+ ms |
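
To make the hop structure concrete, here is a hedged Go sketch of the distributed retrieval path. The URLs, ports, and JSON shapes are illustrative assumptions, not a real service contract (Qdrant's actual search API differs, for instance); the point is that each step is a separate serialize–POST–deserialize round-trip.

```go
// Hedged sketch of the distributed retrieval path: three separate HTTP
// round-trips (embedding API, vector store, reranker API) before any
// context reaches the LLM. URLs, ports, and JSON shapes are illustrative
// assumptions, not a real service contract.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// postJSON marshals req, POSTs it, and decodes the response into resp.
// Every call pays serialization plus at least one network RTT.
func postJSON(url string, req, resp any) error {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	r, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer r.Body.Close()
	return json.NewDecoder(r.Body).Decode(resp)
}

func retrieve(query string) ([]string, error) {
	// Hop 1: query text -> embedding API (e.g. a FastAPI bge-small service).
	var emb struct {
		Vector []float32 `json:"vector"`
	}
	if err := postJSON("http://embedder:8000/embed",
		map[string]string{"text": query}, &emb); err != nil {
		return nil, err
	}
	// Hop 2: dense vector -> vector store for top-k candidates.
	// (Qdrant's real search endpoint differs; this is a stand-in.)
	var hits struct {
		Chunks []string `json:"chunks"`
	}
	if err := postJSON("http://qdrant:6333/search",
		map[string]any{"vector": emb.Vector, "limit": 20}, &hits); err != nil {
		return nil, err
	}
	// Hop 3: candidate chunks -> reranker API (e.g. bge-reranker-base).
	var ranked struct {
		Chunks []string `json:"chunks"`
	}
	if err := postJSON("http://reranker:8001/rerank",
		map[string]any{"query": query, "chunks": hits.Chunks}, &ranked); err != nil {
		return nil, err
	}
	return ranked.Chunks, nil
}

func main() {
	chunks, err := retrieve("what is graph-rag?")
	if err != nil {
		fmt.Println("retrieval failed:", err)
		return
	}
	fmt.Println("top chunks:", chunks)
}
```

Each of the three hops pays marshalling plus a network RTT before the LLM is ever called, which is where the ~60–240 ms retrieval floor comes from.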
## NornicDB: Single-Process Graph-RAG (Simplified)

Embedding, vector search (+ optional BM25), reranking, and graph traversal live in the same process as the application. The inference LLM can run locally (e.g. as a GGUF model) on the same host or behind a separate API; when local, all three model roles (embedder, reranker, inference) are in-memory on the same box.

```mermaid
flowchart LR
subgraph Users
U2[("Users")]
end
subgraph NornicDB["NornicDB (Single Process)"]
direction TB
App["App / Heimdall<br/>Orchestrator + Tool"]
Embed["Embedding Model<br/>(in-memory, e.g. bge-m3)"]
Store["Storage + Vector Index<br/>(Badger + HNSW/BM25)"]
Rerank2["Reranker<br/>(in-memory, optional)"]
Infer["Inference LLM<br/>(local GGUF or API)"]
U2 -->|Query| App
App -->|Query text| Embed
Embed -->|Dense vec| Store
Store -->|Top k and graph neighbors| Rerank2
Rerank2 -->|Ranked chunks| App
App -->|Context + Query| Infer
Infer -->|Response| App
App -->|Response| U2
end
style Embed fill:#4caf50,color:#fff
style Rerank2 fill:#4caf50,color:#fff
style Infer fill:#2196f3,color:#fff
    style Store fill:#795548,color:#fff
```

Latency (in-process, no network between components):

| Step | Component | Est. latency (in-process) |
|---|---|---|
| 1 | Query → Embedding (same process) | 1–15 ms |
| 2 | Vector + BM25 search (local index) | 0.5–5 ms |
| 3 | Graph traversal (depth 1, same store) | 0.5–3 ms |
| 4 | Rerank (same process, optional) | 2–20 ms |
| 5 | Context + Query → LLM (local or API) | 200–2000+ ms (unchanged if API) |
| Total (retrieval path) | — | ~4–43 ms |
| Total (full request, local LLM) | — | ~204–2043+ ms (LLM dominates) |
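
For contrast, a minimal sketch of the same steps as plain in-process calls. The type names, method signatures, and stub bodies are illustrative assumptions, not NornicDB's actual API; the stubs exist only so the sketch compiles and runs.

```go
// Minimal sketch of the retrieval path as in-process function calls: no
// serialization, no socket between steps. Names and signatures below are
// illustrative assumptions, not NornicDB's actual API.
package main

import "fmt"

type embedder struct{}
type index struct{}
type reranker struct{}

// Embed maps query text to a dense vector entirely in-memory (stubbed).
func (embedder) Embed(text string) []float32 { return make([]float32, 384) }

// Search is a local HNSW/BM25 lookup; Neighbors is a depth-d graph hop over
// the same store, so adjacent chunks cost no extra round-trip.
func (index) Search(vec []float32, k int) []string      { return []string{"chunk-a", "chunk-b"} }
func (index) Neighbors(chunks []string, d int) []string { return []string{"chunk-a-next"} }

// Rerank reorders candidates with an in-memory cross-encoder (stubbed).
func (reranker) Rerank(q string, chunks []string) []string { return chunks }

func retrieve(e embedder, ix index, rr reranker, query string) []string {
	vec := e.Embed(query)                         // ~1–15 ms, same process
	hits := ix.Search(vec, 20)                    // ~0.5–5 ms, local index
	hits = append(hits, ix.Neighbors(hits, 1)...) // ~0.5–3 ms, depth-1 traversal
	return rr.Rerank(query, hits)                 // ~2–20 ms, optional
}

func main() {
	fmt.Println(retrieve(embedder{}, index{}, reranker{}, "what is graph-rag?"))
}
```

Because every step is a method call on memory-resident state, the per-step cost is model and index compute only; there is nothing to serialize and no socket to wait on.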
## Side-by-Side Latency Comparison

```mermaid
flowchart TB
subgraph Legend
L1["🟡 Distributed: network + service hop"]
L2["🟢 NornicDB: in-process"]
end
subgraph Distributed["Retrieval path (typical)"]
D1["Orchestrator"] -->|"20–80 ms"| D2["Embedding API"]
D2 -->|"10–50 ms"| D3["Vector Store"]
D3 -->|"30–100 ms"| D4["Reranker API"]
D4 --> D5["Back to Orchestrator"]
end
subgraph InProcess["Retrieval path (NornicDB)"]
N1["App"] -->|"1–15 ms"| N2["Embedding (in-mem)"]
N2 -->|"0.5–5 ms"| N3["Vector Index"]
N3 -->|"2–20 ms"| N4["Reranker (in-mem)"]
N4 --> N5["Back to App"]
end
Distributed -.->|"~60–240 ms total"| Summary1["Retrieval total"]
    InProcess -.->|"~4–43 ms total"| Summary2["Retrieval total"]
```

| Metric | Typical distributed | NornicDB in-memory |
|---|---|---|
| Retrieval path (embed → search → rerank) | ~60–240 ms | ~4–43 ms |
| Latency reduction (retrieval) | — | ~5–15× lower (ballpark: 240/43 ≈ 5.6× up to 60/4 = 15×) |
| Network hops (retrieval) | 3–4 (embed, store, rerank) | 0 |
| All 3 “LLMs” (embed, rerank, infer) | Separate services/APIs | Embed + rerank in-process; infer local or API |
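
Rather than taking the ballpark on faith, a spot-check harness is enough to verify the numbers on your own stack. The `retrieve` function below is a placeholder standing in for either retrieval path sketched above; for rigorous measurement, prefer Go's built-in `testing.B` benchmarks.

```go
// Spot-check harness: time n end-to-end retrievals and report the average.
// retrieve is a placeholder; swap in either retrieval path sketched above.
package main

import (
	"fmt"
	"time"
)

func retrieve(query string) error {
	time.Sleep(5 * time.Millisecond) // placeholder for real retrieval work
	return nil
}

func main() {
	const n = 50
	start := time.Now()
	for i := 0; i < n; i++ {
		if err := retrieve("what is graph-rag?"); err != nil {
			fmt.Println("run failed:", err)
			return
		}
	}
	fmt.Printf("avg retrieval latency over %d runs: %v\n", n, time.Since(start)/n)
}
```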
## Deployment: Containers and Services

Typical Graph-RAG uses multiple discrete services, each often run in its own container with its own process, networking, and scaling. NornicDB collapses retrieval (embedding, vector store, reranker, graph) into one in-memory process deployable as a single Docker container.

### Typical Graph-RAG: Multiple Containers

Each logical service typically runs in a separate container; the orchestrator and tool plugin may share a container, but embedding, vector store, reranker, and (if self-hosted) the LLM each add at least one container.

```mermaid
flowchart TB
subgraph Host["Single host (e.g. Docker Compose)"]
subgraph C1["Container 1: App / Orchestrator"]
Orch["Orchestrator + Tool Plugin"]
end
subgraph C2["Container 2: Embedding API"]
EmbedSvc["FastAPI<br/>bge-small / bge-m3"]
end
subgraph C3["Container 3: Vector Store"]
Qdrant["Qdrant<br/>Vector DB"]
end
subgraph C4["Container 4: Reranker API"]
RerankSvc["FastAPI<br/>bge-reranker-base"]
end
subgraph C5["Container 5: LLM (if self-hosted)"]
LLMSvc["vLLM / Ollama / etc."]
end
end
Orch <-->|HTTP/gRPC| EmbedSvc
Orch <-->|gRPC| Qdrant
Orch <-->|HTTP| RerankSvc
Orch <-->|HTTP| LLMSvc
style C1 fill:#e3f2fd
style C2 fill:#fff3e0
style C3 fill:#f5f5f5
style C4 fill:#fff3e0
    style C5 fill:#e8f5e9
```

| # | Container / service | Role |
|---|---|---|
| 1 | App / Orchestrator | Request handling, tool plugin, orchestration |
| 2 | Embedding API (e.g. FastAPI) | Query + chunk embedding (bge-small / bge-m3) |
| 3 | Vector Store (e.g. Qdrant) | Vector + optional sparse index, persistence |
| 4 | Reranker API (e.g. FastAPI) | Cross-encoder reranking (bge-reranker) |
| 5 | LLM (if self-hosted) | vLLM, Ollama, or similar for generation |
| Total | 5 containers (4 if LLM is external API) | — |
### NornicDB: Single Container, Single Process

All retrieval components run in one process inside one container: embedding model, vector/BM25 index, optional reranker, graph storage, and (if configured) local inference LLM. No inter-container networking for the retrieval path.

```mermaid
flowchart TB
subgraph Single["Single container: NornicDB"]
subgraph Process["One process (in-memory)"]
App2["App / Heimdall<br/>Orchestrator + Tool"]
Embed2["Embedding<br/>(in-memory)"]
Store2["Storage + Vector Index<br/>Badger + HNSW + BM25"]
Rerank2["Reranker<br/>(in-memory, optional)"]
Infer2["Inference LLM<br/>(local GGUF or outbound API)"]
App2 --> Embed2
Embed2 --> Store2
Store2 --> Rerank2
Rerank2 --> App2
App2 --> Infer2
Infer2 --> App2
end
end
    style Process fill:#c8e6c9
```

| # | Container / process | Contents |
|---|---|---|
| 1 | NornicDB (single container, single process) | Orchestrator, Tool Plugin, Embedding, Vector Index + BM25, Graph Store (Badger), Reranker (optional), local LLM (optional) |
| Total | 1 container | All retrieval + optional local inference in one deployable unit |
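
As a sketch of what "one container, one process" means at boot time: a single binary constructs every retrieval component in-process and exposes one port. The commented-out constructor names and model file paths are invented for illustration; they are not NornicDB's real API.

```go
// Hypothetical boot sequence for the single-container layout: one binary
// wires up every retrieval component and listens on one port. Constructor
// names and model paths in the comments are illustrative, not real API.
package main

import (
	"log"
	"net/http"
)

func main() {
	// All retrieval components share this process; none opens its own port:
	//   store  := openBadgerStore("/data")          // graph + vector storage
	//   embed  := loadEmbedder("bge-m3.gguf")       // in-memory embedding model
	//   rerank := loadReranker("bge-reranker.gguf") // optional cross-encoder
	//   infer  := loadLLM("llama.gguf")             // optional local inference

	http.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) {
		// embed -> search -> traverse -> rerank -> generate, all in-process
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // the container's single port
}
```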
### Side-by-Side: Container Count and Complexity

```mermaid
flowchart LR
subgraph TypicalDeploy["Typical Graph-RAG deployment"]
direction TB
T1["📦 Container 1<br/>Orchestrator"]
T2["📦 Container 2<br/>Embedding API"]
T3["📦 Container 3<br/>Qdrant"]
T4["📦 Container 4<br/>Reranker API"]
T5["📦 Container 5<br/>LLM (optional)"]
end
subgraph NornicDeploy["NornicDB deployment"]
N1["📦 Single container<br/>Embed + Store + Rerank + App<br/>(+ optional local LLM)"]
end
TypicalDeploy -->|"5 containers, multi-service config, inter-container network"| L1[Typical]
    NornicDeploy -->|"1 container, single process, no retrieval network hops"| L2[NornicDB]
```

| Aspect | Typical Graph-RAG | NornicDB |
|---|---|---|
| Containers (min) | 4 (orchestrator, embedding, vector store, reranker) | 1 |
| Containers (with self-hosted LLM) | 5 | 1 |
| Processes (retrieval path) | 4+ (one per service) | 1 |
| Inter-service networking | Yes (HTTP/gRPC between containers) | No (in-process only) |
| Config / env | Multiple images, ports, envs, health checks | Single image, one port, one env |
| Scaling | Scale each service independently (more ops) | Scale single container (simpler) |
## Summary
- Typical Graph-RAG: Orchestrator, Tool Plugin, Embedding API, Vector Store (e.g. Qdrant), Reranker API, and LLM API are separate; each step adds network and serialization cost. Deployment usually means 4–5 containers (orchestrator, embedding API, vector store, reranker, and optionally self-hosted LLM), each with its own image, port, and config.
- NornicDB: Embedding model, vector/BM25 index, optional reranker, and graph storage run in one process; retrieval is in-process and much faster. When inference is also local (e.g. GGUF), all three model roles (embedding, reranking, inference) are in-memory on the same machine. Deployment is a single Docker container (one process, one port, no inter-container networking for retrieval).
- Latency: Retrieval path drops from roughly 60–240 ms to 4–43 ms in the NornicDB case; end-to-end latency is then dominated by the inference LLM (same as in the distributed setup if both use the same LLM API or local model).
- Ops simplification: One container and one process to deploy, scale, and monitor instead of 4–5; no retrieval-path networking or cross-service health checks.