Distributed Systems - Leader Election: How to elect the one
In any distributed system that needs a single coordinator, you need a reliable way to elect one server and to replace it when it fails. Over the course of time researchers and enthusiasts have formed several ways to achieve leader election, and each algorithm is based on its assumptions and inherent trade-offs.
- Quorum-Based(Raft1) - has a voting phase in which a server that receives a majority vote first, wins.
- Lock & Lease2 - a server acquires a shared resource for a period of time via a conditional write and attempts to renew periodically. This isn’t really disjoint from quorum-based though. It just relocated the consensus, not removed it.
- ID Priority (Bully/Ring) - every server has a unique ID and pre-defined rule/s picks the winner. In case of leader’s heartbeat failure, due to timeout or failed server, vote request is propagated around the ring on a rule based order.
- Deterministic or static assignment - hash(key) % num_server is the leader for example.
A raft style leader election is clean and digestible and one which we will explore further.
Assumptions
- Safety over Liveness - It optimizes for safety using
termas logical time for correctness. Thus, safety doesn’t require synchronized clocks. - Crash-Fault-Tolerance - Servers can crash-stop, recover, drop, delay, reorder, duplicate messages, but never lie. see (BFT3)
- 2F+1 majority - Requires more than half of servers to be reachable inorder to elect a leader. In other words, can tolerate f failures with 2f+1 servers.
In a Raft like leader election protocol, a server can be in three states Leader, Candidate, Follower. Raft servers communicate using remote procedure calls (RPCs).
Invariants
- A server votes for at most one candidate per term. This is the mechanism that makes safety possible. Any two majorities share at least one server, so if a server can only vote once, two candidates cannot both reach majority in the same term.
- At most one leader is elected per term. This is the safety property we want. It is not an independent rule though. It follows directly from invariant 1 combined with the quorum requirement.
- Terms only increase. A server that sees a higher term steps down immediately. This is what prevents split-brain after a partition heals. A leader that was isolated comes back, sees a newer term, and becomes a follower. It cannot hold on to leadership it no longer has a claim to.
In full Raft, there is a fourth: a candidate’s log must be at least as up to date as any voter’s log. Without this you could elect a leader missing committed entries. Out of scope here since this is election only, but worth knowing before extending to full consensus.
Algorithm
Every server starts as a follower. Each follower has a randomized election timeout (say, 500–1000ms). If it does not hear from a leader within that window, it concludes the leader is dead and calls an election.
To start an election, a follower increments its term, the monotonically increasing logical clock, and promotes itself to candidate. A candidate votes for itself, and broadcasts a VoteRequest to every peer. Peers grant a vote if they haven’t already voted in that term and the candidate’s term is at least as large as their own. The first candidate to collect a majority wins and becomes leader.
Two details make this work in practice.
- Randomized timeouts - ensure split votes are rare as every server picks a random timeout interval, and a leader is elected relatively quickly.
- Terms provide versioned view of leadership: any server that receives a message from a higher term knows it’s out of date and steps down. A partitioned leader that reconnects can’t reclaim authority, it just sees a newer term and becomes a follower again.
The leader then sends periodic heartbeats to reset followers’ timers and suppress new elections. Heartbeats could be empty messages. Their job can be simple as to say: I’m alive, don’t start a vote.
What about split-brain?
Split-brain is when two servers simultaneously believe they are the leader. It happens during a network partition when the leader gets cut off. The majority side times out and elects a new leader with a higher term, and now both sides have a leader.
Raft contains the damage through quorum and terms. The old leader on the minority side cannot commit anything to log without reaching a majority. When the partition heals it receives a message with the higher term, sees it is out of date, and steps down. It could not do real damage while isolated, and it cannot hold on to leadership once reconnected.
Sample Implementation
An attempt at a vanilla version of Raft style leader election using a simulated network is at djleaderelection. Nodes run as goroutines communicating over in-memory channels, which makes the state machine easy to follow without network noise.
Further Probing
- What happens in a production grade system when leader is unreachable, and none of the follower randomized timers have expired ? - You deliberately trade off availability for consistency/safety, and tune it in myriad ways to minimize the affects.
- What are the cons of using an external shared mutex ? hint: SPOF if unreplicated, and if replicated what stops a paused old lock holder from acting if its lock expired ?
Footnotes
-
Ongaro, D. & Ousterhout, J. (2014). In Search of an Understandable Consensus Algorithm. USENIX ATC. ↩