
Building a Distributed Consensus Algorithm - Raft

When multiple servers need to agree on shared state, you need a consensus algorithm. Raft was designed to be understandable. Unlike Paxos, which is notoriously difficult to reason about, Raft breaks consensus into three clearly separated subproblems: leader election, log replication, and safety.

This post walks through each piece of Raft, explains why it works, and includes a simplified Go implementation.

Why Consensus Matters

Imagine a key-value store replicated across three servers. A client writes x = 5. All three servers must agree that x = 5 and apply it in the same order relative to other writes. If servers disagree, clients will read stale or inconsistent data depending on which server they hit.

flowchart LR
    Client -->|"set x = 5"| S1["Server 1"]
    S1 -->|replicate| S2["Server 2"]
    S1 -->|replicate| S3["Server 3"]

    S2 -->|"x = 5"| Client2["Client B reads x"]
    S3 -->|"x = 5"| Client3["Client C reads x"]

    style S1 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style S2 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px
    style S3 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px

Consensus algorithms guarantee that all servers process the same sequence of commands, even if some servers crash or network partitions occur, as long as a majority remains available.

The Three States

Every Raft node is in one of three states at any time: follower, candidate, or leader. All nodes start as followers.

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate: Election timeout\n(no heartbeat received)
    Candidate --> Leader: Receives majority votes
    Candidate --> Follower: Discovers higher term
    Candidate --> Candidate: Election timeout\n(split vote)
    Leader --> Follower: Discovers higher term

  • Follower - passive, responds to requests from leaders and candidates
  • Candidate - actively seeking votes to become leader
  • Leader - handles all client requests and replicates log entries to followers

Only one leader exists per term. A term is a logical clock that increases monotonically. If a node sees a message with a higher term than its own, it steps down to follower.
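These rules are small enough to sketch directly in Go. The `State`, `Node`, and `observeTerm` names below are illustrative, not from any particular implementation:

```go
package main

import "fmt"

// State is one of the three roles a Raft node can hold.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node holds the minimal per-node state discussed so far.
type Node struct {
	state       State
	currentTerm int
}

// observeTerm applies the step-down rule: any message carrying a
// higher term forces the node back to follower and updates its term.
// Messages with lower terms never move the term backwards.
func (n *Node) observeTerm(msgTerm int) {
	if msgTerm > n.currentTerm {
		n.currentTerm = msgTerm
		n.state = Follower
	}
}

func main() {
	n := &Node{state: Leader, currentTerm: 3}
	n.observeTerm(5)                               // a term-5 message arrives
	fmt.Println(n.state == Follower, n.currentTerm) // true 5
}
```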

Leader Election

Raft uses randomized timeouts to trigger elections. Each follower waits for a random duration (typically 150-300ms). If it receives no heartbeat from a leader before the timeout expires, it becomes a candidate.

Election Steps

  1. The candidate increments its current term
  2. It votes for itself
  3. It sends RequestVote RPCs to all other nodes
  4. It waits for responses

sequenceDiagram
    participant A as Node A (Candidate)
    participant B as Node B (Follower)
    participant C as Node C (Follower)

    Note over A: Election timeout expires
    A->>A: Increment term to 2, vote for self
    A->>B: RequestVote(term=2)
    A->>C: RequestVote(term=2)
    B->>A: VoteGranted(term=2)
    C->>A: VoteGranted(term=2)
    Note over A: Received majority (3/3), becomes Leader
    A->>B: AppendEntries(heartbeat)
    A->>C: AppendEntries(heartbeat)

A candidate wins if it receives votes from a majority of nodes (including itself). Since each node votes for at most one candidate per term, at most one leader can be elected per term.
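The one-vote-per-term rule and the majority check can be sketched as follows (the `voter` type and method names are illustrative; the log up-to-date check described below is omitted here):

```go
package main

import "fmt"

// voter tracks which candidate this node voted for in the current term
// (empty string = no vote cast yet this term).
type voter struct {
	currentTerm int
	votedFor    string
}

// grant implements "at most one vote per term": a node votes for the
// first candidate that asks in a term and refuses everyone else until
// a higher term resets the vote.
func (v *voter) grant(term int, candidate string) bool {
	if term > v.currentTerm {
		v.currentTerm = term
		v.votedFor = ""
	}
	if term < v.currentTerm || (v.votedFor != "" && v.votedFor != candidate) {
		return false
	}
	v.votedFor = candidate
	return true
}

// wonElection reports whether `votes` (including the candidate's own)
// is a majority of `clusterSize`.
func wonElection(votes, clusterSize int) bool {
	return votes > clusterSize/2
}

func main() {
	v := &voter{}
	fmt.Println(v.grant(2, "A")) // true: first request in term 2
	fmt.Println(v.grant(2, "B")) // false: already voted for A this term
	fmt.Println(v.grant(3, "B")) // true: a new term resets the vote
	fmt.Println(wonElection(2, 3), wonElection(2, 4)) // true false
}
```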

Split Votes

If two candidates start elections simultaneously, neither may get a majority. Both will time out and start a new election with an incremented term. The randomized timeout makes repeated splits unlikely.

sequenceDiagram
    participant A as Node A (Candidate)
    participant B as Node B (Candidate)
    participant C as Node C (Follower)

    Note over A,B: Both timeout simultaneously
    A->>A: Term 2, vote for self
    B->>B: Term 2, vote for self
    A->>C: RequestVote(term=2)
    B->>C: RequestVote(term=2)
    C->>A: VoteGranted(term=2)
    Note over A: Has 2/3 votes, becomes Leader
    Note over B: Has 1/3 votes, remains Candidate
    A->>B: AppendEntries(term=2)
    Note over B: Discovers leader, becomes Follower

Voting Rules

A node grants its vote only if:

  • It has not already voted in this term
  • The candidate’s log is at least as up-to-date as the voter’s log

The second rule is critical for safety. It prevents a node with a stale log from becoming leader and overwriting committed entries.
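"At least as up-to-date" has a precise definition in the Raft paper: compare the term of each log's last entry, and break ties on log length. A sketch (the function name is ours):

```go
package main

import "fmt"

// atLeastAsUpToDate implements Raft's log comparison: the log whose
// last entry has the higher term wins; on equal last terms, the
// longer log wins.
func atLeastAsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(atLeastAsUpToDate(3, 5, 2, 9)) // true: higher last term beats a longer log
	fmt.Println(atLeastAsUpToDate(2, 4, 2, 7)) // false: same term, shorter log
	fmt.Println(atLeastAsUpToDate(2, 7, 2, 7)) // true: an identical log is "at least as" up-to-date
}
```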

Log Replication

Once elected, the leader handles all client requests. Each request becomes a log entry. The leader appends the entry to its log, then sends AppendEntries RPCs to all followers.

flowchart TB
    subgraph leader["Leader Log"]
        L1["1: set x=1"]
        L2["2: set y=2"]
        L3["3: set x=3"]
    end

    subgraph followerA["Follower A Log"]
        A1["1: set x=1"]
        A2["2: set y=2"]
        A3["3: set x=3"]
    end

    subgraph followerB["Follower B Log"]
        B1["1: set x=1"]
        B2["2: set y=2"]
    end

    leader -->|"AppendEntries"| followerA
    leader -->|"AppendEntries"| followerB

    style L1 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style L2 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style L3 fill:#fff3cd,stroke:#ffc107,stroke-width:1px
    style A1 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style A2 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style A3 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style B1 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style B2 fill:#d4edda,stroke:#28a745,stroke-width:1px

Commit Process

An entry is committed once the leader has replicated it to a majority of nodes. The leader tracks the highest committed index and includes it in AppendEntries messages so followers know which entries are safe to apply.

sequenceDiagram
    participant Client
    participant Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    Client->>Leader: set x = 5
    Note over Leader: Append to log (index 4)
    Leader->>F1: AppendEntries(index=4, "set x=5")
    Leader->>F2: AppendEntries(index=4, "set x=5")
    F1->>Leader: Success
    Note over Leader: 2/3 have entry, committed
    Leader->>Client: OK
    F2->>Leader: Success
    Note over Leader: Update commitIndex, notify followers

The client receives a response only after the entry is committed (replicated to a majority). This guarantees durability: even if the leader crashes immediately after responding, the entry will survive because a majority of nodes have it.
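One common way for the leader to compute the committed index is a median over replication progress; a sketch, assuming the leader keeps a `matchIndex` entry per follower recording the highest index known to be stored there:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority.
// matchIndex[i] is the highest entry known to be stored on follower i;
// leaderLast counts the leader's own log. (The full Raft rule also
// requires log[N].term == currentTerm before committing N; omitted here.)
func commitIndex(matchIndex []int, leaderLast int) int {
	all := append([]int{leaderLast}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))
	// The value at position len(all)/2 is held by a majority of nodes.
	return all[len(all)/2]
}

func main() {
	// 3-node cluster: leader at index 4, followers at 4 and 3.
	fmt.Println(commitIndex([]int{4, 3}, 4)) // 4: two of three nodes have index 4
	// 5-node cluster: leader at 7, followers at 7, 5, 5, 2.
	fmt.Println(commitIndex([]int{7, 5, 5, 2}, 7)) // 5: three of five nodes reach index 5
}
```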

Log Consistency

Raft maintains two key properties:

  1. If two entries in different logs have the same index and term, they store the same command
  2. If two entries in different logs have the same index and term, all preceding entries are also identical

The leader enforces consistency by including the index and term of the entry immediately preceding the new entries in each AppendEntries RPC. If a follower does not find a matching entry, it rejects the request. The leader then decrements the index and retries until it finds a common point, at which point the follower’s log is overwritten from there forward.
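The consistency check and overwrite behavior can be sketched as a single follower-side function (simplified: term checks on the RPC itself are omitted, and indices are 1-based as in the Raft paper):

```go
package main

import "fmt"

// Entry is one log entry: the term it was created in, plus a command.
type Entry struct {
	Term int
	Cmd  string
}

// appendEntries applies the consistency check: the follower must hold an
// entry at prevIndex with prevTerm, or it rejects; on success, any
// conflicting suffix is overwritten by the leader's entries.
// prevIndex 0 means "append from the start of the log".
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	if prevIndex > 0 {
		if len(log) < prevIndex || log[prevIndex-1].Term != prevTerm {
			return log, false // no matching entry: the leader will decrement and retry
		}
	}
	return append(log[:prevIndex], entries...), true
}

func main() {
	// Follower has a stale term-2 entry at index 3.
	log := []Entry{{1, "set x=1"}, {1, "set y=2"}, {2, "set x=9"}}
	// Leader tries to append after (index=3, term=3): follower rejects.
	_, ok := appendEntries(log, 3, 3, []Entry{{3, "set z=4"}})
	fmt.Println(ok) // false
	// Leader retries one entry earlier; the stale suffix is overwritten.
	log, ok = appendEntries(log, 2, 1, []Entry{{3, "set x=3"}, {3, "set z=4"}})
	fmt.Println(ok, len(log), log[2].Cmd) // true 4 set x=3
}
```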

Safety: The Election Restriction

The voting rule mentioned earlier (a candidate’s log must be at least as up-to-date) ensures that any elected leader contains all committed entries. Combined with the fact that entries are committed only when replicated to a majority, and elections require a majority of votes, at least one voter in every successful election has every committed entry.

flowchart LR
    subgraph majority1["Commit majority"]
        N1["Node 1"]
        N2["Node 2"]
        N3["Node 3"]
    end

    subgraph majority2["Election majority"]
        N2b["Node 2"]
        N3b["Node 3"]
        N4["Node 4"]
    end

    N2 -.->|"overlap"| N2b
    N3 -.->|"overlap"| N3b

    style N1 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style N2 fill:#fff3cd,stroke:#ffc107,stroke-width:1px
    style N3 fill:#fff3cd,stroke:#ffc107,stroke-width:1px
    style N2b fill:#fff3cd,stroke:#ffc107,stroke-width:1px
    style N3b fill:#fff3cd,stroke:#ffc107,stroke-width:1px
    style N4 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px

In a 5-node cluster, both commit and election require 3 nodes. Any two majorities overlap by at least one node (yellow), guaranteeing that the new leader has all committed entries.

Handling Failures

Leader Crash

When the leader crashes, followers stop receiving heartbeats. After their election timeout expires, one of them starts an election. The new leader’s log contains all committed entries (guaranteed by the election restriction). Uncommitted entries from the old leader may be lost, which is safe because the client never received a confirmation.

Follower Crash

A crashed follower simply misses some AppendEntries calls. When it recovers, the leader detects the gap through the log consistency check and sends the missing entries. No special recovery protocol is needed.

Network Partition

If the network splits, the partition with a majority can still elect a leader and make progress. The minority partition cannot commit new entries because it cannot achieve a majority. When the partition heals, nodes in the minority catch up from the current leader.

flowchart TB
    subgraph partition1["Majority Partition"]
        L["Leader (Node 1)"]
        F1["Follower (Node 2)"]
        F2["Follower (Node 3)"]
    end

    subgraph partition2["Minority Partition"]
        F3["Follower (Node 4)"]
        F4["Follower (Node 5)"]
    end

    L <-->|"replication"| F1
    L <-->|"replication"| F2
    F3 <-.->|"no quorum"| F4

    style L fill:#d4edda,stroke:#28a745,stroke-width:1px
    style F1 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style F2 fill:#d4edda,stroke:#28a745,stroke-width:1px
    style F3 fill:#f8d7da,stroke:#dc3545,stroke-width:1px
    style F4 fill:#f8d7da,stroke:#dc3545,stroke-width:1px

Raft vs Paxos

| Aspect | Raft | Paxos |
| --- | --- | --- |
| Understandability | Designed to be easy to learn | Notoriously difficult |
| Leader | Single leader per term | Multi-proposer (Multi-Paxos uses a leader) |
| Log | Ordered log of commands | Agreement on individual values |
| Membership changes | Built-in joint consensus | Requires a separate protocol |
| Implementations | etcd, Consul, CockroachDB | Chubby, Spanner (Multi-Paxos) |

Raft and Multi-Paxos achieve equivalent safety guarantees. The difference is primarily pedagogical: Raft decomposes the problem into independent subproblems that are easier to reason about individually.

Production Considerations

A production Raft implementation needs several features beyond the core algorithm:

  • Log compaction - logs grow without bound; snapshotting truncates old entries while preserving committed state
  • Cluster membership changes - adding or removing nodes safely requires a joint consensus protocol to avoid split-brain during transitions
  • Persistent state - currentTerm, votedFor, and the log must survive restarts; write them to durable storage before responding to RPCs
  • Batching - grouping multiple client commands into a single AppendEntries improves throughput
  • Read optimization - leader can serve reads without log replication by confirming it is still the leader (lease-based or heartbeat-based)
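A minimal sketch of what "persist before responding" involves, using `encoding/gob` into a buffer where a real node would fsync a file (the `HardState` name is ours, borrowed for illustration):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// HardState is the subset of node state that must survive a restart:
// currentTerm, votedFor, and the log. A real node writes and fsyncs
// this to durable storage before answering any RPC.
type HardState struct {
	CurrentTerm int
	VotedFor    string
	Log         []string
}

func persist(hs HardState) ([]byte, error) {
	var buf bytes.Buffer
	err := gob.NewEncoder(&buf).Encode(hs)
	return buf.Bytes(), err
}

func restore(data []byte) (HardState, error) {
	var hs HardState
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&hs)
	return hs, err
}

func main() {
	data, _ := persist(HardState{CurrentTerm: 7, VotedFor: "node-2", Log: []string{"set x=5"}})
	hs, _ := restore(data)
	fmt.Println(hs.CurrentTerm, hs.VotedFor) // 7 node-2
}
```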

Multi-Raft

A single Raft group handles one replicated log. This works well when the entire dataset fits into one group, but it becomes a bottleneck at scale: the single leader handles all writes, and the log grows with every operation across the entire dataset.

Multi-Raft solves this by running many independent Raft groups on the same cluster. Each group manages a subset (shard or range) of the data. Different groups can have different leaders, spreading write load across nodes.

flowchart TB
    subgraph node1["Node 1"]
        G1L["Group 1 (Leader)"]
        G2F1["Group 2 (Follower)"]
        G3F1["Group 3 (Follower)"]
    end

    subgraph node2["Node 2"]
        G1F1["Group 1 (Follower)"]
        G2L["Group 2 (Leader)"]
        G3F2["Group 3 (Follower)"]
    end

    subgraph node3["Node 3"]
        G1F2["Group 1 (Follower)"]
        G2F2["Group 2 (Follower)"]
        G3L["Group 3 (Leader)"]
    end

    G1L <-->|"keys a-m"| G1F1
    G1L <-->|"keys a-m"| G1F2
    G2L <-->|"keys n-s"| G2F1
    G2L <-->|"keys n-s"| G2F2
    G3L <-->|"keys t-z"| G3F1
    G3L <-->|"keys t-z"| G3F2

    style G1L fill:#d4edda,stroke:#28a745,stroke-width:1px
    style G2L fill:#d4edda,stroke:#28a745,stroke-width:1px
    style G3L fill:#d4edda,stroke:#28a745,stroke-width:1px
    style G1F1 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px
    style G1F2 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px
    style G2F1 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px
    style G2F2 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px
    style G3F1 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px
    style G3F2 fill:#e1f5ff,stroke:#0366d6,stroke-width:1px

Each node participates in multiple Raft groups simultaneously. Leaders (green) are spread across nodes, so no single node is responsible for all writes.

Why Not Just Run Separate Raft Instances?

Running hundreds or thousands of independent Raft processes wastes resources. Each would maintain its own timers, goroutines, network connections, and storage. Multi-Raft implementations share these resources across groups on the same node:

  • Batched network I/O - messages for different groups destined for the same node are combined into a single network call
  • Shared storage - log entries from all groups are written to a single WAL (write-ahead log) in batch, reducing disk fsync overhead
  • Shared ticker - one timer drives election and heartbeat logic for all groups instead of one timer per group
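The shared-ticker idea can be sketched with logical ticks (the `group` type is a stand-in for a full Raft group; etcd's raft library uses a similar logical-tick design):

```go
package main

import "fmt"

// group is a stand-in for one Raft group; tick() advances its election
// and heartbeat timers by one logical step.
type group struct {
	id    int
	ticks int
}

func (g *group) tick() { g.ticks++ }

func main() {
	// One shared clock drives every group hosted on this node,
	// instead of one goroutine and timer per group.
	groups := []*group{{id: 1}, {id: 2}, {id: 3}}
	for i := 0; i < 5; i++ { // five ticks of the shared clock
		for _, g := range groups {
			g.tick()
		}
	}
	fmt.Println(groups[0].ticks, groups[2].ticks) // 5 5
}
```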

Dragonboat

Dragonboat is a Multi-Raft library written in Go. It handles the hard parts of running thousands of Raft groups efficiently on a cluster of nodes.

Key features:

  • Manages thousands of Raft groups per node with shared resources
  • Batches disk writes across groups into a single WAL for high throughput
  • Coalesces network messages to reduce per-group overhead
  • Built-in snapshotting and log compaction per group
  • Pluggable state machine interface: implement Update, Lookup, and SaveSnapshot methods to define your application logic

The application implements the IStateMachine interface with five methods: Update (apply a committed log entry), Lookup (read-only queries), SaveSnapshot and RecoverFromSnapshot (serialize and restore state for log compaction), and Close. Dragonboat handles leader election, log replication, and group management.

When to Use Multi-Raft

| Scenario | Single Raft | Multi-Raft |
| --- | --- | --- |
| Small dataset, few writes | Good fit | Unnecessary complexity |
| Large dataset, sharded | Bottleneck | Distributes load |
| Many independent state machines | One group per machine | Efficient resource sharing |
| High write throughput needed | Single-leader limit | Parallel leaders |

Multi-Raft adds operational complexity. The routing layer must know which group owns each key, and cross-group transactions require additional coordination (two-phase commit or similar). Use it when a single Raft group cannot handle the write volume or data size.
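The routing layer can be sketched as a sorted table of range start keys (the `rangeTable` type is hypothetical; the boundaries match the earlier keys a-m / n-s / t-z diagram):

```go
package main

import (
	"fmt"
	"sort"
)

// rangeTable maps sorted range start keys to Raft group IDs; a key
// belongs to the last range whose start key is <= the key.
type rangeTable struct {
	starts []string // sorted ascending; starts[0] == "" covers the lowest keys
	groups []int
}

// groupFor finds which Raft group owns a key.
func (t *rangeTable) groupFor(key string) int {
	// Smallest i with starts[i] >= key; step back unless it is an exact match.
	i := sort.SearchStrings(t.starts, key)
	if i == len(t.starts) || t.starts[i] != key {
		i--
	}
	return t.groups[i]
}

func main() {
	rt := &rangeTable{starts: []string{"", "n", "t"}, groups: []int{1, 2, 3}}
	fmt.Println(rt.groupFor("apple"))  // 1: keys a-m
	fmt.Println(rt.groupFor("orange")) // 2: keys n-s
	fmt.Println(rt.groupFor("tiger"))  // 3: keys t-z
}
```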

Real-World Implementations

  • etcd - Kubernetes uses etcd as its backing store; etcd uses Raft for replication
  • Consul - HashiCorp’s service mesh uses Raft for its catalog and KV store
  • CockroachDB - distributed SQL database uses Multi-Raft with one group per range (subset of data)
  • TiKV - distributed key-value store (part of TiDB) uses Multi-Raft with one group per region
  • Dragonboat - Go library for building Multi-Raft applications with thousands of groups per node

Further Reading