paxos vs raft for leader election

distributed-system consensus paxos raft

13,554

Solution 1

It is a common misconception that the original Paxos papers don't use a stable leader. In Paxos Made Simple on page 6 in the section entitled “The Implementation” Lamport wrote:

The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

This is simply achieved using the Phase 1 messaging of prepare and promises.

Then on pages 9 and 10 under the section “Implementing a State Machine” we have:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm.

Here it is using the term 'state machine' in the most generic sense covering the obvious cases such as a key value store or database server where we replicate a log of actions applied to the store.

People get confused about this because of the way Lamport proved Paxos which is now the way it is taught. Lamport proved the correctness of a class of applications known as Paxos by stripping it down to a mathematical model that can be reasoned about. He called this “The Single-Decree Synod” in the original paper The Part-Time Parliament:

Paxon religious leaders asked mathematicians to formulate a protocol for choosing the Synod’s decree. The protocol’s requirements and assumptions were essentially the same as those of the later Parliament except that instead of containing a sequence of decrees, a ledger would have at most one decree. The resulting Synod protocol is described here; the Parliamentary protocol is described in Section 3.

If you find that statement confusing don’t worry it is a bad joke; literally. A translation of this in my own words would be:

“In order to prove the correctness of the consensus algorithm for choosing a stream of commands we can first demonstrate the correctness of a mathematical model which chooses a single command. The mathematical model for selecting a single command can then be extended to the practical algorithm for selecting a stream of commands (Section 3) as long as the invariants of the single command mathematical model are not violated.” – simbo1905

In order to justify my interpretation we can look at Section 3 entitled “The Multi-Decree Parliament” which says:

Instead of passing just one decree, the Paxon Parliament had to pass a series of numbered decrees. As in the Synod protocol, a president was elected. Anyone who wanted a decree passed would inform the president, who would assign a number to the decree and attempt to pass it. Logically, the parliamentary protocol used a separate instance of the complete Synod protocol for each decree number. However, a single president was selected for all these instances, and he performed the first two steps of the protocol just once.

To labour the point both the original “The Part-Time Parliment” paper introducing Paxos as interesting to computer scientists because of its multi-degree algorithm; the parliament protocol. That and the clarification paper “Paxos Made Simple” both define Paxos as having a distinguished leader assigning sequence numbers to a stream of commands. Furthermore, the distinguished leader only sends “prepare” messages when it assumes leadership; after that in steady-state the distinguished leader streams only “accept” messages. He also says elsewhere in the paper to collapse the roles and have all servers run all three roles of the algorithm.

Where you ask "What's the advantage of Paxos's approach over the simple random timeout approach in raft's leader election?" I am not sure what approach you are referring to? With Paxos, you can just randomise the timeouts and issue Prepare messages. The Paxos Made Simple paper indicates that you are free to use timeouts else some other faster mechanism long as you follow the Paxos protocol to “ensure safety”:

The famous result of Fischer, Lynch, and Pat- terson 1 implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

Randomised timeouts are very easy to code and are very understandable. Yet in the worse case, they can lead to a long delay in recovery. You don't like that you can use your own leader election mechanism. For example this one.

Solution 2

After reading the question and @simbo1905's answer, I feel like I have to throw in my 2 cents as I don't think the question has been answered.

What's the advantage of paxos's approach over the simple random timeout approach in raft's leader election?

tl;dr: Paxos is optimal, but Raft has stronger practical guarantees of liveness.

For more information, read on.

As Lamport states in section 3 of Paxos Made Simple,

It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2]. Hence, the Paxos algorithm is essentially optimal.

So Paxos implements consensus in a way that is maximally efficient when there are no faults.

On the other hand in the same section he also states

If multiple servers think they are leaders, then they can all propose values in the same instance of the consensus algorithm, which could prevent any value from being chosen. However, safety is preserved—two different servers will never disagree on the value chosen as the ith state machine command. Election of a single leader is needed only to ensure progress.

Which means practically Paxos can see violations of it's liveness guarantee.

13,554

Oliver Young

Updated on August 05, 2022

Comments

Oliver Young almost 2 years

After reading paxos and raft paper, I have following confusion: paxos paper only describe consensus on single log entry, which is equivalent the leader election part of the raft algorithm. What's the advantage of paxos's approach over the simple random timeout approach in raft's leader election?
- simbo1905 almost 7 years
  
  This is common misconception due to the way the original paper was written to teach the correctness proof. The author wrote a follow up paper to demystify the original which is cited in this post which explains leadership simbo1905.blog/2016/01/02/…
- simbo1905 almost 7 years
  
  Another detailed post about Paxos leader elections is at simbo1905.blog/2017/08/22/pre-voting-in-distributed-consensu‌s in the comments to that post there is a discussion about renaming paxos states and messages to make the leader election process more explicit.
- Oliver Young almost 7 years
  
  Thank you for your blog. I didn't read part-time parliament paper before. After reading it, I understand that Multi-Paxos may also need a way to elect leader to avoid too many competing leaders to prevent the previous leader progressing. However, the paper doesn't give details about how to get a leader. I imagine that in real world, Multi-Paxos may also use raft-like leader election policy?
- Oliver Young almost 7 years
  
  Also, since single-decree paxos could be used to elect leader. Is it feasible to elect a leader (using single-decree paxos) everytime we need a new leader?
- simbo1905 almost 7 years
  
  Yes. If a node thinks there is no leader it runs the single decree and if it gets back promises after that it just streams accept messages under the same ballot number. So under steady state the leader of a three node cluster can self-accept each value and then commit it after exchanging one message with a second node (sending the value and receiving an ack) which is the optimal message pattern for a quorum scheme of a three node cluster. The paper Paxos Made Simple is just that algorithm. microsoft.com/en-us/research/wp-content/uploads/2016/12/…
- simbo1905 almost 7 years
  
  now written up over at github.com/trex-paxos/trex/wiki/Leader-Elections
- Oliver Young almost 7 years
  
  @simbo1905: Hey I just read it. Thank you for your clarification. Two questions: (1) Just need to verify that you are using random timeout + heartbeat (not single decree paxos) to do the leader election (same as raft), Because you mention using single decree to elect leader in the previous comments. (2) You are doing additional check to prevent infinite loop of leadership changes. The infinite loop of leadership changes problem should also occur in raft, right?
- Oliver Young almost 7 years
  
  And Also, after reading your clarification, I realize the paxos and raft are essentially equivalent. (Single leader, tolerate leader failure, quorum, etc) But somehow, raft seems to be more natural to me. Any thoughts on why we would prefer paxos over raft?
- simbo1905 almost 7 years
  
  They are equivalent when applied to replication so it may be down to personal taste. I find the way the raft paper trashed paxos with unscientific teaching comparisons not to my taste. I am aware of people using paxos techniques such as single synod in more flexible scenarios such as distributed transactions that raft wouldnt fit. Also corfu scales paxos to huge levels simbo1905.blog/category/corfu One could argue that paxos is more fundamental so worth mastering. The counter argument would be that raft solves a very common problem in a very clear way. So its probably down to taste.
- Oliver Young almost 7 years
  
  That makes sense. And also raft does not allow parallel execution of agreement protocol of multiple log entries. This may also be a reason if one want to prefer paxos.
Oliver Young almost 7 years

Thank you again for your detailed answer. I believe I have read that in your blog. It really clarifies my original confusion. Now I understand what happens in steady state and why is it correct. However, I guess my question is that since Lamport doesn't talk much about details of leader election mechanism. Could clarify the main techniques industry use to elect a president in Multi-Paxos protocol?
Oliver Young almost 7 years

Your blog github.com/trex-paxos/trex/wiki/Leader-Elections describes the leader election mechanism. If you may, please clarify the two questions : (1) Just need to verify that you are using random timeout + heartbeat (not single decree paxos) to do the leader election (same as raft), since you mention possibility of using single decree to elect leader in the previous comments. Which technique is more common for leader election? (2) You are doing additional check to prevent infinite loop of leadership changes. The loop of leadership changes problem should also occur in raft, right?
simbo1905 almost 7 years

(1) single decree is prepare -> promise -> accept -> learn (aka commit) exchanges. this happens on leader heartbeat timeout. so both.
simbo1905 almost 7 years

what other technique should industry use beyond what Lamport wrote that "just works" when you write the code and it causes a leader to "just happen"? the only reason to do anything more than what he said is to make it more efficient in terms of timeouts and leader duels and interruptions. if people don't understand that leader election is part of the algorithm or know of a more efficient mechanism to pick a leader such as a dedicated algorithm they can just use that. i have seen zookeeper used often as a leader election service (not for paxos but for many distributed applications).
simbo1905 almost 7 years

(2) see CAP theorem. Paxos, Raft, ZAB aim for strong consistency ("C"). Everyone wants high availability ("A"), you cannot have those when you have a network partition ("P"). You cannot avoid network partitions with normal hardware. So you must choose either consistency (C) or availability (A). The leader loops of Paxos is its demonstration of this reality. Whilst the leader loops you have no "A" as no client gets work done.
simbo1905 almost 7 years

every strongly consistent algorithm must have a mode where under partition or message loss you cannot make progress. else it violates CAP theorem and would be an amazing new discovery. so Raft must have a failure to make progress and be available (e.g. elect leader) during partitions or lost messages. i am not an expert in Raft but it must have a mode where lost messages cause no clear leader to emerge. how similar to paxos that looks could not comment.
Oliver Young almost 7 years

For the second point, we can't have C and A at the same time in the presence of Partition. That's right. However, I don't think that leader loops is a demonstration of that. Since one single link going down in a 3-node cluster does not partition the network. It's still one single partition. Since your strong/weak leader evidence solves the problem (guarantee the availability), if it's truly portioned network, it will be violating the CAP theorem, right?
Oliver Young almost 7 years

I guess your strong/weak leader evidence approach is trying to guarantee progress for a corner case (which is not network partition), right?
simbo1905 almost 7 years

Fair points yes. It was inaccurate of me to equate it to a partition. The node that cannot see the leader can stay up to date by fetching committed values from the node it can see. So its just a temporary inefficiency in propagating data between all nodes until the network heals.
GManNickG over 6 years

@OliverYoung: The P in CAP means "all possible partitions", not just any particular one. That is, there is no system that is both consistent and available under arbitrary partitions. There are (as you've demonstrated) plenty of network partitions that don't trigger the C/A trade-off. However, these are just a subset of the space of all possible partitions; there is always some partition out there that triggers this trade-off. So, that is a network partition, just not one that actually causes the system to stop.