Replication Architecture¶
This document describes the internal architecture of NornicDB's replication system for contributors and advanced users.
For user documentation, see the Clustering Guide.
Overview¶
NornicDB supports a standalone mode plus three replication modes to meet different availability and consistency requirements:
| Mode | Nodes | Consistency | Use Case |
|---|---|---|---|
| Standalone | 1 | N/A | Development, testing, small workloads |
| Hot Standby | 2 | Eventual | Simple HA, fast failover |
| Raft Cluster | 3-5 | Strong | Production HA, consistent reads |
| Multi-Region | 6+ | Configurable | Global distribution, disaster recovery |
Architecture Diagram¶
┌────────────────────────────────────────────────────────────────────────────────┐
│ NORNICDB REPLICATION ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ MODE 1: HOT STANDBY (2 nodes) │
│ ┌─────────────┐ WAL Stream ┌─────────────┐ │
│ │ Primary │ ──────────────────► │ Standby │ │
│ │ (writes) │ (async/quorum) │ (failover) │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ MODE 2: RAFT CLUSTER (3-5 nodes) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Leader │◄──►│ Follower │◄──►│ Follower │ │
│ │ (writes) │ │ (reads) │ │ (reads) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ Raft Consensus │
│ │
│ MODE 3: MULTI-REGION (Raft clusters + cross-region HA) │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ US-EAST REGION │ │ EU-WEST REGION │ │
│ │ ┌───┐ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ ┌───┐ │ │
│ │ │ L │ │ F │ │ F │ │ WAL │ │ L │ │ F │ │ F │ │ │
│ │ └───┘ └───┘ └───┘ │◄────►│ └───┘ └───┘ └───┘ │ │
│ │ Raft Cluster A │async │ Raft Cluster B │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘
Package Structure¶
pkg/replication/
├── config.go # Configuration loading and validation
├── replicator.go # Core Replicator interface and factory
├── transport.go # ClusterTransport over TCP (gob-framed messages)
├── ha_standby.go # Hot Standby implementation
├── raft.go # Raft consensus implementation
├── multi_region.go # Multi-region coordinator
├── codec.go # Gob codecs (node/edge payloads, helpers)
├── handlers.go # Cluster message dispatch/handler registration
├── chaos_test.go # Chaos testing infrastructure
├── scenario_test.go # E2E scenario tests
└── replication_test.go # Unit tests
Core Interfaces¶
Replicator Interface¶
All replication modes implement this interface:
type Replicator interface {
// Start starts the replicator
Start(ctx context.Context) error
// Apply applies a write operation (routes to leader if needed)
Apply(cmd *Command, timeout time.Duration) error
// IsLeader returns true if this node can accept writes
IsLeader() bool
// LeaderAddr returns the address of the current leader
LeaderAddr() string
// LeaderID returns the ID of the current leader
LeaderID() string
// Health returns health status
Health() *HealthStatus
// WaitForLeader blocks until a leader is elected
WaitForLeader(timeout time.Duration) error
// Mode returns the replication mode
Mode() ReplicationMode
// NodeID returns this node's ID
NodeID() string
// Shutdown gracefully shuts down
Shutdown() error
}
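To illustrate how a caller interacts with this interface, the sketch below routes a write through a `Replicator` and surfaces a leader redirect. The `noopReplicator`, the `Command` fields, and the `write` helper are illustrative assumptions, not NornicDB's actual types.

```go
package main

import (
	"fmt"
	"time"
)

// Command is a simplified stand-in for replication.Command.
type Command struct {
	Op  string
	Key string
}

// Replicator is the subset of the interface used by the write path.
type Replicator interface {
	Apply(cmd *Command, timeout time.Duration) error
	IsLeader() bool
	LeaderAddr() string
}

// noopReplicator is a toy implementation for demonstration only.
type noopReplicator struct{ leader bool }

func (n *noopReplicator) Apply(cmd *Command, timeout time.Duration) error {
	if !n.leader {
		// A follower cannot accept writes; report where the leader lives.
		return fmt.Errorf("not leader; redirect to %s", n.LeaderAddr())
	}
	return nil
}
func (n *noopReplicator) IsLeader() bool     { return n.leader }
func (n *noopReplicator) LeaderAddr() string { return "10.0.0.1:7688" }

// write applies a command with a bounded timeout, returning a redirect
// error when this node is a follower.
func write(r Replicator, cmd *Command) error {
	return r.Apply(cmd, 5*time.Second)
}

func main() {
	leader := &noopReplicator{leader: true}
	follower := &noopReplicator{leader: false}
	fmt.Println(write(leader, &Command{Op: "create_node", Key: "n1"}))
	fmt.Println(write(follower, &Command{Op: "create_node", Key: "n1"}))
}
```

In the real system `Apply` is the single choke point for writes, which is what lets one storage engine run unchanged under all three replication modes.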
Current Implementation Notes¶
- Transport: pkg/replication/transport.go uses a plain TCP listener with framed gob messages for cluster RPCs (heartbeat, WAL batch, Raft messages).
- Payloads: pkg/replication/codec.go defines replication-safe payloads for nodes/edges. Node payloads include embeddings (NamedEmbeddings, ChunkEmbeddings) so embeddings replicate in HA mode.
- Write Path: the base storage engine is wrapped by pkg/replication/replicated_engine.go, which converts writes into replication.Command and routes them through the active Replicator.
- WAL Streaming: pkg/replication/storage_adapter.go maintains a persistent WAL (restart/recovery) plus an in-memory WAL slice to avoid repeated full-file scans during streaming, and uses an async WAL write queue for lower per-write latency.
Transport Interface¶
Node-to-node communication:
type Transport interface {
// Connect establishes a connection to a peer
Connect(ctx context.Context, addr string) (PeerConnection, error)
// Listen accepts incoming connections
Listen(ctx context.Context, addr string, handler ConnectionHandler) error
// Close shuts down the transport
Close() error
}
type PeerConnection interface {
// WAL streaming (Hot Standby)
SendWALBatch(ctx context.Context, entries []*WALEntry) (*WALBatchResponse, error)
SendHeartbeat(ctx context.Context, req *HeartbeatRequest) (*HeartbeatResponse, error)
SendFence(ctx context.Context, req *FenceRequest) (*FenceResponse, error)
SendPromote(ctx context.Context, req *PromoteRequest) (*PromoteResponse, error)
// Raft consensus
SendRaftVote(ctx context.Context, req *RaftVoteRequest) (*RaftVoteResponse, error)
SendRaftAppendEntries(ctx context.Context, req *RaftAppendEntriesRequest) (*RaftAppendEntriesResponse, error)
Close() error
IsConnected() bool
}
Storage Interface¶
Replication layer's view of storage:
type Storage interface {
// Commands
ApplyCommand(cmd *Command) error
// WAL position tracking
GetWALPosition() (uint64, error)
SetWALPosition(pos uint64) error
// Node/Edge operations (used by WAL applier)
CreateNode(node *Node) error
UpdateNode(node *Node) error
DeleteNode(id NodeID) error
CreateEdge(edge *Edge) error
DeleteEdge(from, to NodeID, relType string) error
SetProperty(nodeID NodeID, key string, value interface{}) error
}
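The heart of this interface is `ApplyCommand`, which turns a replicated command back into concrete storage operations. The sketch below shows the dispatch shape with an in-memory map; the command type strings, the `memStorage` type, and the WAL-position bookkeeping are assumptions for illustration.

```go
package main

import "fmt"

// Command is a simplified stand-in for replication.Command.
type Command struct {
	Type string
	Node string
}

// memStorage is a toy Storage implementation backed by a map.
type memStorage struct {
	nodes  map[string]bool
	walPos uint64
}

// ApplyCommand dispatches a replicated command to a concrete operation
// and advances the WAL position on success.
func (s *memStorage) ApplyCommand(cmd *Command) error {
	switch cmd.Type {
	case "create_node":
		s.nodes[cmd.Node] = true
	case "delete_node":
		delete(s.nodes, cmd.Node)
	default:
		return fmt.Errorf("unknown command type %q", cmd.Type)
	}
	s.walPos++ // each applied command advances the position reported by GetWALPosition
	return nil
}

func main() {
	s := &memStorage{nodes: map[string]bool{}}
	_ = s.ApplyCommand(&Command{Type: "create_node", Node: "a"})
	fmt.Println(len(s.nodes), s.walPos)
}
```

Tracking the WAL position alongside each applied command is what allows a restarted standby to ask the primary to resume streaming from exactly where it left off.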
Network Protocol¶
Port Allocation¶
| Port | Protocol | Purpose |
|---|---|---|
| 7474 | HTTP | REST API, Admin, Health checks |
| 7687 | Bolt | Neo4j-compatible client queries |
| 7688 | Cluster | Replication, Raft consensus |
Wire Format¶
The cluster protocol uses length-prefixed, gob-encoded messages over TCP:
┌─────────────┬─────────────────────────────────────┐
│ Length (4B) │ Gob Payload │
│ Big Endian │ │
└─────────────┴─────────────────────────────────────┘
Message Types¶
| Type | Code | Direction | Description |
|---|---|---|---|
| VoteRequest | 1 | Candidate → Follower | Request vote in election |
| VoteResponse | 2 | Follower → Candidate | Grant/deny vote |
| AppendEntries | 3 | Leader → Follower | Replicate log entries |
| AppendEntriesResponse | 4 | Follower → Leader | Acknowledge entries |
| WALBatch | 5 | Primary → Standby | Stream WAL entries |
| WALBatchResponse | 6 | Standby → Primary | Acknowledge WAL |
| Heartbeat | 7 | Primary → Standby | Health check |
| HeartbeatResponse | 8 | Standby → Primary | Health status |
| Fence | 9 | Standby → Primary | Fence old primary |
| FenceResponse | 10 | Primary → Standby | Acknowledge fence |
| Promote | 11 | Admin → Standby | Promote to primary |
| PromoteResponse | 12 | Standby → Admin | Promotion status |
Mode 1: Hot Standby¶
Components¶
- Primary: Accepts writes, streams WAL to standby
- Standby: Receives WAL, ready for failover
- WALStreamer: Manages WAL position and batching
- WALApplier: Applies WAL entries to storage
Write Flow¶
Client Primary Standby
│ │ │
│─── WRITE (Bolt) ───────► │
│ │ │
│ │── WALBatch ───────────►│
│ │ │
│ │◄─ WALBatchResponse ────│
│ │ │
│◄── SUCCESS ────────────│ │
Sync Modes¶
| Mode | Acknowledgment | Data Safety | Latency |
|---|---|---|---|
| async | Primary only | Risk of data loss | Lowest |
| quorum | Standby applied | Strongest | Highest |
Failover Process¶
1. The standby detects missing heartbeats
2. After FAILOVER_TIMEOUT, the standby attempts to fence the primary
3. The standby promotes itself to primary
4. Clients reconnect to the new primary
Mode 2: Raft Consensus¶
Components¶
- RaftReplicator: Main Raft node implementation
- Election Timer: Triggers leader election on timeout
- Log: In-memory Raft log with commit tracking
- Heartbeat Loop: Leader sends heartbeats to maintain authority
State Machine¶
┌────────────┐
│ Follower │◄────────────────┐
└─────┬──────┘ │
│ election timeout │
▼ │
┌────────────┐ │
┌─────►│ Candidate │─────────────────┤
│ └─────┬──────┘ loses election │
│ │ │
│ │ wins election │
│ ▼ │
│ ┌────────────┐ │
│ │ Leader │─────────────────┘
│ └─────┬──────┘ discovers higher term
│ │
└────────────┘
starts new election
Leader Election¶
1. A follower's election timer expires
2. It increments its term and transitions to Candidate
3. It votes for itself and requests votes from its peers
4. If it receives votes from a majority, it becomes Leader
5. The new leader sends heartbeats to maintain its leadership
Log Replication¶
1. A client sends a write to the Leader
2. The leader appends the entry to its log
3. The leader sends AppendEntries to all followers
4. Once a majority acknowledge, the entry is committed
5. The leader applies the entry to the state machine and responds to the client
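The "majority acknowledge" step has a compact implementation: sort every node's match index (the highest log index known to be replicated on that node) and take the median, which is the highest index stored on a majority. A sketch:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority of
// nodes, given each node's match index (including the leader's own).
func commitIndex(matchIndexes []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With n nodes, the entry at position (n-1)/2 of the ascending sort is
	// present on at least ceil((n+1)/2) nodes, i.e. a majority.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// Leader at index 7, followers at 5 and 3: two of three nodes have >= 5.
	fmt.Println(commitIndex([]uint64{7, 5, 3})) // 5
}
```

Raft adds one caveat on top of this: a leader only advances the commit index over entries from its own term, which this sketch omits.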
Raft RPC Messages¶
RequestVote:
type RaftVoteRequest struct {
Term uint64
CandidateID string
LastLogIndex uint64
LastLogTerm uint64
}
type RaftVoteResponse struct {
Term uint64
VoteGranted bool
VoterID string
}
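The `LastLogIndex`/`LastLogTerm` fields drive Raft's "at least as up-to-date" rule: a voter grants its vote only if the candidate's log would not lose committed entries. A sketch of that comparison (the helper name is illustrative):

```go
package main

import "fmt"

// logUpToDate implements the Raft up-to-date check: compare the last log
// terms first; if they are equal, the longer log wins.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(logUpToDate(3, 10, 3, 12)) // same term, shorter log: deny
	fmt.Println(logUpToDate(4, 1, 3, 12))  // higher last term: grant
}
```

This check, combined with the majority requirement, guarantees that any elected leader already holds every committed entry.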
AppendEntries:
type RaftAppendEntriesRequest struct {
Term uint64
LeaderID string
LeaderAddr string
PrevLogIndex uint64
PrevLogTerm uint64
Entries []*RaftLogEntry
LeaderCommit uint64
}
type RaftAppendEntriesResponse struct {
Term uint64
Success bool
MatchIndex uint64
ConflictIndex uint64
ConflictTerm uint64
}
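The `ConflictIndex`/`ConflictTerm` fields support fast log backtracking: instead of decrementing one index per round-trip, a follower that rejects `PrevLogIndex`/`PrevLogTerm` reports the first index of the conflicting term so the leader can skip the whole run. A follower-side sketch, using a slice of terms as a stand-in for the log (indexed from 1):

```go
package main

import "fmt"

// checkPrevEntry validates the leader's prev entry against the local log.
// On a term mismatch it walks back to the first entry of the conflicting
// term; on a too-short log it reports the first missing index.
func checkPrevEntry(logTerms []uint64, prevLogIndex, prevLogTerm uint64) (ok bool, conflictIndex uint64) {
	if prevLogIndex == 0 {
		return true, 0 // nothing before the log; always consistent
	}
	if prevLogIndex > uint64(len(logTerms)) {
		return false, uint64(len(logTerms)) + 1 // follower's log is too short
	}
	term := logTerms[prevLogIndex-1]
	if term == prevLogTerm {
		return true, 0
	}
	// Walk back to the first entry of the conflicting term.
	i := prevLogIndex
	for i > 1 && logTerms[i-2] == term {
		i--
	}
	return false, i
}

func main() {
	log := []uint64{1, 1, 2, 2, 2} // terms at indexes 1..5
	ok, ci := checkPrevEntry(log, 5, 3)
	fmt.Println(ok, ci) // false 3: term 2 begins at index 3
}
```

With this, the leader can retry from `ConflictIndex` directly, turning O(log length) rejection rounds into at most one per divergent term.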
Mode 3: Multi-Region¶
Components¶
- MultiRegionReplicator: Coordinates local Raft + cross-region
- Local Raft Cluster: Strong consistency within region
- Cross-Region Streamer: Async WAL replication between regions
Write Flow¶
1. A write arrives at the region's Raft leader
2. Raft commits it locally (strong consistency)
3. The entry is replicated asynchronously to remote regions
4. Remote regions apply the WAL entries
Conflict Resolution¶
When async replication causes conflicts:
| Strategy | Description |
|---|---|
| last_write_wins | Latest timestamp wins |
| first_write_wins | Earliest timestamp wins |
| manual | Flag for manual resolution |
Configuration¶
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| NORNICDB_CLUSTER_MODE | standalone | standalone, ha_standby, raft, multi_region |
| NORNICDB_CLUSTER_NODE_ID | auto | Unique node identifier |
| NORNICDB_CLUSTER_BIND_ADDR | 0.0.0.0:7688 | Cluster port binding |
| NORNICDB_CLUSTER_ADVERTISE_ADDR | same as bind | Address advertised to peers |
See the Clustering Guide for the complete configuration reference.
Testing¶
Test Categories¶
| File | Purpose |
|---|---|
| replication_test.go | Unit tests for each component |
| scenario_test.go | E2E tests for all modes (A/B/C/D scenarios) |
| chaos_test.go | Network failure simulation |
Chaos Testing¶
The chaos testing infrastructure simulates:
- Packet loss
- High latency (2000ms+)
- Connection drops
- Data corruption
- Packet duplication
- Packet reordering
- Byzantine failures
Running Tests¶
# All replication tests
go test ./pkg/replication/... -v
# Specific test
go test ./pkg/replication/... -run TestScenario_Raft -v
# With race detection
go test ./pkg/replication/... -race
# Skip long-running tests
go test ./pkg/replication/... -short
Implementation Status¶
| Component | Status | Details |
|---|---|---|
| Hot Standby | ✅ Complete | 2-node HA with auto-failover |
| Raft Cluster | ✅ Complete | 3-5 node strong consistency |
| Multi-Region | ✅ Complete | Async cross-region replication |
| Chaos Testing | ✅ Complete | Extreme latency, packet loss, Byzantine failures |
See Also¶
- Clustering Guide - User documentation
- System Design - Overall architecture
- Plugin System - APOC plugin architecture