Replication Architecture¶
This document describes the internal architecture of NornicDB's replication system for contributors and advanced users.
For user documentation, see the Clustering Guide.
Overview¶
NornicDB supports a standalone mode plus three replication modes to meet different availability and consistency requirements:
| Mode | Nodes | Consistency | Use Case |
|---|---|---|---|
| Standalone | 1 | N/A | Development, testing, small workloads |
| Hot Standby | 2 | Eventual | Simple HA, fast failover |
| Raft Cluster | 3-5 | Strong | Production HA, consistent reads |
| Multi-Region | 6+ | Configurable | Global distribution, disaster recovery |
Architecture Diagram¶
┌────────────────────────────────────────────────────────────────────────────────┐
│ NORNICDB REPLICATION ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ MODE 1: HOT STANDBY (2 nodes) │
│ ┌─────────────┐ WAL Stream ┌─────────────┐ │
│ │ Primary │ ──────────────────► │ Standby │ │
│ │ (writes) │ (async/quorum) │ (failover) │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ MODE 2: RAFT CLUSTER (3-5 nodes) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Leader │◄──►│ Follower │◄──►│ Follower │ │
│ │ (writes) │ │ (reads) │ │ (reads) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ Raft Consensus │
│ │
│ MODE 3: MULTI-REGION (Raft clusters + cross-region HA) │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ US-EAST REGION │ │ EU-WEST REGION │ │
│ │ ┌───┐ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ ┌───┐ │ │
│ │ │ L │ │ F │ │ F │ │ WAL │ │ L │ │ F │ │ F │ │ │
│ │ └───┘ └───┘ └───┘ │◄────►│ └───┘ └───┘ └───┘ │ │
│ │ Raft Cluster A │async │ Raft Cluster B │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘
Package Structure¶
pkg/replication/
├── config.go # Configuration loading and validation
├── replicator.go # Core Replicator interface and factory
├── transport.go # ClusterTransport over TCP (gob-framed messages)
├── ha_standby.go # Hot Standby implementation
├── raft.go # Raft consensus implementation
├── multi_region.go # Multi-region coordinator
├── codec.go # Gob codecs (node/edge payloads, helpers)
├── handlers.go # Cluster message dispatch/handler registration
├── chaos_test.go # Chaos testing infrastructure
├── scenario_test.go # E2E scenario tests
└── replication_test.go # Unit tests
Core Interfaces¶
Replicator Interface¶
All replication modes implement this interface:
type Replicator interface {
// Start starts the replicator
Start(ctx context.Context) error
// Apply applies a write operation (routes to leader if needed)
Apply(cmd *Command, timeout time.Duration) error
// IsLeader returns true if this node can accept writes
IsLeader() bool
// LeaderAddr returns the address of the current leader
LeaderAddr() string
// LeaderID returns the ID of the current leader
LeaderID() string
// Health returns health status
Health() *HealthStatus
// WaitForLeader blocks until a leader is elected
WaitForLeader(timeout time.Duration) error
// Mode returns the replication mode
Mode() ReplicationMode
// NodeID returns this node's ID
NodeID() string
// Shutdown gracefully shuts down
Shutdown() error
}
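To illustrate how a caller interacts with this interface, the sketch below routes a write through a `Replicator` and surfaces a leader redirect. The `noopReplicator`, the `Command` fields, and the `write` helper are illustrative assumptions, not NornicDB's actual types.

```go
package main

import (
	"fmt"
	"time"
)

// Command is a simplified stand-in for replication.Command.
type Command struct {
	Op  string
	Key string
}

// Replicator is the subset of the interface used by the write path.
type Replicator interface {
	Apply(cmd *Command, timeout time.Duration) error
	IsLeader() bool
	LeaderAddr() string
}

// noopReplicator is a toy implementation for demonstration only.
type noopReplicator struct{ leader bool }

func (n *noopReplicator) Apply(cmd *Command, timeout time.Duration) error {
	if !n.leader {
		// A follower cannot accept writes; report where the leader lives.
		return fmt.Errorf("not leader; redirect to %s", n.LeaderAddr())
	}
	return nil
}
func (n *noopReplicator) IsLeader() bool     { return n.leader }
func (n *noopReplicator) LeaderAddr() string { return "10.0.0.1:7688" }

// write applies a command with a bounded timeout, returning a redirect
// error when this node is a follower.
func write(r Replicator, cmd *Command) error {
	return r.Apply(cmd, 5*time.Second)
}

func main() {
	leader := &noopReplicator{leader: true}
	follower := &noopReplicator{leader: false}
	fmt.Println(write(leader, &Command{Op: "create_node", Key: "n1"}))
	fmt.Println(write(follower, &Command{Op: "create_node", Key: "n1"}))
}
```

In the real system `Apply` is the single choke point for writes, which is what lets one storage engine run unchanged under all three replication modes.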
Current Implementation Notes¶
- Transport: pkg/replication/transport.go uses a plain TCP listener with framed gob messages for cluster RPCs (heartbeat, WAL batch, Raft messages).
- Payloads: pkg/replication/codec.go defines replication-safe payloads for nodes/edges. Node payloads include embeddings (NamedEmbeddings, ChunkEmbeddings) so embeddings replicate in HA mode.
- Write Path: the base storage engine is wrapped by pkg/replication/replicated_engine.go, which converts writes into replication.Command and routes them through the active Replicator.
- WAL Streaming: pkg/replication/storage_adapter.go maintains a persistent WAL (restart/recovery) plus an in-memory WAL slice to avoid repeated full-file scans during streaming, and uses an async WAL write queue for lower per-write latency.
Transport Interface¶
Node-to-node communication:
type Transport interface {
// Connect establishes a connection to a peer
Connect(ctx context.Context, addr string) (PeerConnection, error)
// Listen accepts incoming connections
Listen(ctx context.Context, addr string, handler ConnectionHandler) error
// Close shuts down the transport
Close() error
}
type PeerConnection interface {
// WAL streaming (Hot Standby)
SendWALBatch(ctx context.Context, entries []*WALEntry) (*WALBatchResponse, error)
SendHeartbeat(ctx context.Context, req *HeartbeatRequest) (*HeartbeatResponse, error)
SendFence(ctx context.Context, req *FenceRequest) (*FenceResponse, error)
SendPromote(ctx context.Context, req *PromoteRequest) (*PromoteResponse, error)
// Raft consensus
SendRaftVote(ctx context.Context, req *RaftVoteRequest) (*RaftVoteResponse, error)
SendRaftAppendEntries(ctx context.Context, req *RaftAppendEntriesRequest) (*RaftAppendEntriesResponse, error)
Close() error
IsConnected() bool
}
Storage Interface¶
Replication layer's view of storage:
type Storage interface {
// Commands
ApplyCommand(cmd *Command) error
// WAL position tracking
GetWALPosition() (uint64, error)
SetWALPosition(pos uint64) error
// Node/Edge operations (used by WAL applier)
CreateNode(node *Node) error
UpdateNode(node *Node) error
DeleteNode(id NodeID) error
CreateEdge(edge *Edge) error
DeleteEdge(from, to NodeID, relType string) error
SetProperty(nodeID NodeID, key string, value interface{}) error
}
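The heart of this interface is `ApplyCommand`, which turns a replicated command back into concrete storage operations. The sketch below shows the dispatch shape with an in-memory map; the command type strings, the `memStorage` type, and the WAL-position bookkeeping are assumptions for illustration.

```go
package main

import "fmt"

// Command is a simplified stand-in for replication.Command.
type Command struct {
	Type string
	Node string
}

// memStorage is a toy Storage implementation backed by a map.
type memStorage struct {
	nodes  map[string]bool
	walPos uint64
}

// ApplyCommand dispatches a replicated command to a concrete operation
// and advances the WAL position on success.
func (s *memStorage) ApplyCommand(cmd *Command) error {
	switch cmd.Type {
	case "create_node":
		s.nodes[cmd.Node] = true
	case "delete_node":
		delete(s.nodes, cmd.Node)
	default:
		return fmt.Errorf("unknown command type %q", cmd.Type)
	}
	s.walPos++ // each applied command advances the position reported by GetWALPosition
	return nil
}

func main() {
	s := &memStorage{nodes: map[string]bool{}}
	_ = s.ApplyCommand(&Command{Type: "create_node", Node: "a"})
	fmt.Println(len(s.nodes), s.walPos)
}
```

Tracking the WAL position alongside each applied command is what allows a restarted standby to ask the primary to resume streaming from exactly where it left off.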
Network Protocol¶
Port Allocation¶
| Port | Protocol | Purpose |
|---|---|---|
| 7474 | HTTP | REST API, Admin, Health checks |
| 7687 | Bolt | Neo4j-compatible client queries |
| 7688 | Cluster | Replication, Raft consensus |
Wire Format¶
The cluster protocol uses length-prefixed, gob-encoded messages over TCP:
┌─────────────┬─────────────────────────────────────┐
│ Length (4B) │ Gob Payload │
│ Big Endian │ │
└─────────────┴─────────────────────────────────────┘
Message Types¶
| Type | Code | Direction | Description |
|---|---|---|---|
| VoteRequest | 1 | Candidate → Follower | Request vote in election |
| VoteResponse | 2 | Follower → Candidate | Grant/deny vote |
| AppendEntries | 3 | Leader → Follower | Replicate log entries |
| AppendEntriesResponse | 4 | Follower → Leader | Acknowledge entries |
| WALBatch | 5 | Primary → Standby | Stream WAL entries |
| WALBatchResponse | 6 | Standby → Primary | Acknowledge WAL |
| Heartbeat | 7 | Primary → Standby | Health check |
| HeartbeatResponse | 8 | Standby → Primary | Health status |
| Fence | 9 | Standby → Primary | Fence old primary |
| FenceResponse | 10 | Primary → Standby | Acknowledge fence |
| Promote | 11 | Admin → Standby | Promote to primary |
| PromoteResponse | 12 | Standby → Admin | Promotion status |
Mode 1: Hot Standby¶
Components¶
- Primary: Accepts writes, streams WAL to standby
- Standby: Receives WAL, ready for failover
- WALStreamer: Manages WAL position and batching
- WALApplier: Applies WAL entries to storage
Write Flow¶
Client Primary Standby
│ │ │
│─── WRITE (Bolt) ───────► │
│ │ │
│ │── WALBatch ───────────►│
│ │ │
│ │◄─ WALBatchResponse ────│
│ │ │
│◄── SUCCESS ────────────│ │
Sync Modes¶
| Mode | Acknowledgment | Data Safety | Latency |
|---|---|---|---|
| async | Primary only | Risk of data loss | Lowest |
| quorum | Standby applied | Strongest | Highest |
Failover Process¶
1. The standby detects missing heartbeats
2. After FAILOVER_TIMEOUT, the standby attempts to fence the primary
3. The standby promotes itself to primary
4. Clients reconnect to the new primary
Mode 2: Raft Consensus¶
Components¶
- RaftReplicator: Main Raft node implementation
- Election Timer: Triggers leader election on timeout
- Log: In-memory Raft log with commit tracking
- Heartbeat Loop: Leader sends heartbeats to maintain authority
State Machine¶
┌────────────┐
│ Follower │◄────────────────┐
└─────┬──────┘ │
│ election timeout │
▼ │
┌────────────┐ │
┌─────►│ Candidate │─────────────────┤
│ └─────┬──────┘ loses election │
│ │ │
│ │ wins election │
│ ▼ │
│ ┌────────────┐ │
│ │ Leader │─────────────────┘
│ └─────┬──────┘ discovers higher term
│ │
└────────────┘
starts new election
Leader Election¶
1. A follower's election timer expires
2. It increments its term and transitions to Candidate
3. It votes for itself and requests votes from its peers
4. If it receives votes from a majority, it becomes Leader
5. The new leader sends heartbeats to maintain its leadership
Log Replication¶
1. A client sends a write to the Leader
2. The leader appends the entry to its log
3. The leader sends AppendEntries to all followers
4. Once a majority acknowledge, the entry is committed
5. The leader applies the entry to the state machine and responds to the client
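The "majority acknowledge" step has a compact implementation: sort every node's match index (the highest log index known to be replicated on that node) and take the median, which is the highest index stored on a majority. A sketch:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority of
// nodes, given each node's match index (including the leader's own).
func commitIndex(matchIndexes []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With n nodes, the entry at position (n-1)/2 of the ascending sort is
	// present on at least ceil((n+1)/2) nodes, i.e. a majority.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// Leader at index 7, followers at 5 and 3: two of three nodes have >= 5.
	fmt.Println(commitIndex([]uint64{7, 5, 3})) // 5
}
```

Raft adds one caveat on top of this: a leader only advances the commit index over entries from its own term, which this sketch omits.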
Raft RPC Messages¶
RequestVote:
type RaftVoteRequest struct {
Term uint64
CandidateID string
LastLogIndex uint64
LastLogTerm uint64
}
type RaftVoteResponse struct {
Term uint64
VoteGranted bool
VoterID string
}
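The `LastLogIndex`/`LastLogTerm` fields drive Raft's "at least as up-to-date" rule: a voter grants its vote only if the candidate's log would not lose committed entries. A sketch of that comparison (the helper name is illustrative):

```go
package main

import "fmt"

// logUpToDate implements the Raft up-to-date check: compare the last log
// terms first; if they are equal, the longer log wins.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(logUpToDate(3, 10, 3, 12)) // same term, shorter log: deny
	fmt.Println(logUpToDate(4, 1, 3, 12))  // higher last term: grant
}
```

This check, combined with the majority requirement, guarantees that any elected leader already holds every committed entry.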
AppendEntries:
type RaftAppendEntriesRequest struct {
Term uint64
LeaderID string
LeaderAddr string
PrevLogIndex uint64
PrevLogTerm uint64
Entries []*RaftLogEntry
LeaderCommit uint64
}
type RaftAppendEntriesResponse struct {
Term uint64
Success bool
MatchIndex uint64
ConflictIndex uint64
ConflictTerm uint64
}
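The `ConflictIndex`/`ConflictTerm` fields support fast log backtracking: instead of decrementing one index per round-trip, a follower that rejects `PrevLogIndex`/`PrevLogTerm` reports the first index of the conflicting term so the leader can skip the whole run. A follower-side sketch, using a slice of terms as a stand-in for the log (indexed from 1):

```go
package main

import "fmt"

// checkPrevEntry validates the leader's prev entry against the local log.
// On a term mismatch it walks back to the first entry of the conflicting
// term; on a too-short log it reports the first missing index.
func checkPrevEntry(logTerms []uint64, prevLogIndex, prevLogTerm uint64) (ok bool, conflictIndex uint64) {
	if prevLogIndex == 0 {
		return true, 0 // nothing before the log; always consistent
	}
	if prevLogIndex > uint64(len(logTerms)) {
		return false, uint64(len(logTerms)) + 1 // follower's log is too short
	}
	term := logTerms[prevLogIndex-1]
	if term == prevLogTerm {
		return true, 0
	}
	// Walk back to the first entry of the conflicting term.
	i := prevLogIndex
	for i > 1 && logTerms[i-2] == term {
		i--
	}
	return false, i
}

func main() {
	log := []uint64{1, 1, 2, 2, 2} // terms at indexes 1..5
	ok, ci := checkPrevEntry(log, 5, 3)
	fmt.Println(ok, ci) // false 3: term 2 begins at index 3
}
```

With this, the leader can retry from `ConflictIndex` directly, turning O(log length) rejection rounds into at most one per divergent term.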
Mode 3: Multi-Region¶
Components¶
- MultiRegionReplicator: Coordinates local Raft + cross-region
- Local Raft Cluster: Strong consistency within region
- Cross-Region Streamer: Async WAL replication between regions
Write Flow¶
1. A write arrives at the region's Raft leader
2. Raft commits it locally (strong consistency)
3. The entry is replicated asynchronously to remote regions
4. Remote regions apply the WAL entries
Conflict Resolution¶
When async replication causes conflicts:
| Strategy | Description |
|---|---|
| last_write_wins | Latest timestamp wins |
| first_write_wins | Earliest timestamp wins |
| manual | Flag for manual resolution |
Configuration¶
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| NORNICDB_CLUSTER_MODE | standalone | standalone, ha_standby, raft, multi_region |
| NORNICDB_CLUSTER_NODE_ID | auto | Unique node identifier |
| NORNICDB_CLUSTER_BIND_ADDR | 0.0.0.0:7688 | Cluster port binding |
| NORNICDB_CLUSTER_ADVERTISE_ADDR | same as bind | Address advertised to peers |
See the Clustering Guide for the complete configuration reference.
Testing¶
Test Categories¶
| File | Purpose |
|---|---|
| replication_test.go | Unit tests for each component |
| scenario_test.go | E2E tests for all modes (A/B/C/D scenarios) |
| chaos_test.go | Network failure simulation |
Chaos Testing¶
The chaos testing infrastructure simulates:
- Packet loss
- High latency (2000ms+)
- Connection drops
- Data corruption
- Packet duplication
- Packet reordering
- Byzantine failures
Running Tests¶
# All replication tests
go test ./pkg/replication/... -v
# Specific test
go test ./pkg/replication/... -run TestScenario_Raft -v
# With race detection
go test ./pkg/replication/... -race
# Skip long-running tests
go test ./pkg/replication/... -short
Implementation Status¶
| Component | Status | Details |
|---|---|---|
| Hot Standby | ✅ Complete | 2-node HA with auto-failover |
| Raft Cluster | ✅ Complete | 3-5 node strong consistency |
| Multi-Region | ✅ Complete | Async cross-region replication |
| Chaos Testing | ✅ Complete | Extreme latency, packet loss, Byzantine failures |
See Also¶
- Clustering Guide - User documentation
- System Design - Overall architecture
- Plugin System - APOC plugin architecture