Status: WIP
This is my attempt to describe the multi paxos algorithm in a programmer friendly way.
The multi-paxos algorithm is an extension of the basic-paxos algorithm described in here.
The key enhancements are as follows:
- Each process maintains multiple decrees in its ledger, and the highest consecutive committed decree number is tracked as
commitNum
. - During the protocol, each process passes its
commitNum
with the message - this allows the receiving process to inform the sender of any decrees committed greater than the sender'scommitNum
. - When a new ballot is initiated, the conductor obtains agreement on all pending decrees, not just one of them. Pending decrees are those that are not yet committed but have been voted in some ballot.
In this version of multi-paxos there is no explicit leader election. Rather we assume that clients decide who to contact - and are sticky to a preferred process for a period of time, and the process that is contacted, tries to become leader if it is not already the leader.
The overall algorithm follows closely the description of multi-paxos in the PTP paper, but has the later enhancement that in phase 2, an acceptor can respond even if it has never participated in the ballot before. This requires the acceptor to catch-up on commits it doesn't know about, and we introduce a new message type PendingVote
to support this requirement.
Process (p)
- a Paxos participant (priest in PTP paper).Ballot
- a referendum on a single decree. Each Ballot is identified by a unique ballot number, and ballots are ordered by the ballot number.Decree
- represents the value being agreed upon. If we were replicating a log, then the decree would be the log record.Decree number (dnum)
- decrees are numbered from 0, and sequentially allocated, eachdnum
identifies adecree
.Process pid
- a unique id given to each process. First process gets0
, second1
, etc. The ordering of processes is not specified, however it must be a constant.Proposal number (pnum)
- a monotonically increasing sequence number maintained by each process,-1
indicates none, valid values are >=0
.Ballot Number (b)
- a unique number made up of a pair - proposal numberpnum
, and process idpid
. Ballot Numbers are sorted by proposal number, followed by process id.Vote (v)
- a vote cast by a process, is a tuple containing thepid
of the voter, the ballot number, and the decree being voted. Votes are ordered by ballot numbers.Ledger
- each process must maintain some data in persistent storage - the ledger represents the storage data structure.quorumSize
-(Number of Participants + 1) / 2
, where Number of participants is an odd number.commitNum
- The last sequentially committeddnum
, i.e. alldnum
s <=commitNum
are also committed.
MaxVote(votes)
- function that returns the max vote cast, wheremax()
is based on the ordering of votes by ballot number.owner(b)
- returns the process id that initiated the ballot numberb
.
outcome(dnum)
- this is value of thedecree
in the ledger for givendnum
, or NULL if there is nothing written yet. When an outcome is saved, the ledger must check and updatecommitNum
to be the highest sequentialdnum
that is committed.lastTried
- The ballot number that the processp
last tried to initiate, or(-1,p.id)
if none.maxBal
- The maximum ballot number that processp
ever agreed to participate in, or(-1,p.id)
ifp
has never agreed to participate in a ballot.maxVBal(dnum)
- For decree numbereddnum
, the ballot number in whichp
last voted or(-1,p.id)
ifp
never voted.maxVal(dnum)
- For decree numbereddnum
, the value of the decree associated withmaxVBal
, i.e. the decree thatp
last voted, or NULL ifp
never voted.commitNum
- Decree number of last sequential committed decree, all decrees withdnum
<=commitNum
must have been committed
Notes: for a given dnum
, the decree is initially undefined, then goes into in-ballot
status, and finally into committed
or noop
status.
Thus, maxVal(dnum)
and maxVBal(dnum)
only have meaning when the decree is in in-ballot
status.
status
- the status of processp
, which can be one of the following:idle
- not conducting or trying to begin a ballot - aka FOLLOWINGtrying
- trying to begin ballot numberledger.lastTried
- aka TRYING_TO_BE_LEADERpolling
- Conducting ballot numberledger.lastTried
- aka LEADING- On startup the status is assumed to be
idle
.
pid
- the process idprevVotes[dnum]
- For eachdnum
, the set of votes received inLastVote
(phase 1) messages for the current ballot (ballot number inledger.lastTried
).prevVoters
- the set of voters (in phase 1) who returnedLastVote
messages in response toNextBallot
.voters
- the set of processes (in phase 2) includingp
, from whom the ballot conductor has receivedVoted
messages in the current ballot, only meaningful whenstatus == polling
.chosenValues[]
- ifstatus == polling
, then the set of decrees in the current ballot, otherwise meaningless.chosenDnum
- ifstatus == polling
, then thednum
assigned to client decree assigned for current ballot, otherwise meaningless.
NextBallot
- aka PREPARE 1a - message sent by the ballot conductor.LastVote
- aka PROMISE 1b - message sent by participant to ballot conductor.BeginBallot
- aka ACCEPT 2a - messages sent by the ballot conductor.PendingVote
- message sent by a process in response toBeginBallot
that indicates willingness to vote, pending update of outcomes not known to the process. This triggers a resend ofBeginBallot
with outcomes for committed decrees not known to the voting process.Voted
- aka ACCEPTED 2b - message sent by participant to ballot conductor in response toBeginBallot
.Success
- message sent by ballot conductor to all processes once the ballot is successfully completed.
The content and timing of each message is described below, except for PendingVote
- the definition and purpose of each
message is as per the PTP paper. The PendingVote
message is an additional message used to bring a process up-to-date
before it responds with a Voted
message.
The following algorithm must be implemented for each process that is participating in a Multi Paxos run.
This step enabled by p
if there are no client requests in process and a new request has arrived.
- If
p.status == idle
doP101 Start New Ballot
. - Else if
p.status == polling
andledger.maxBal == ledger.lastTried
, doP200 Start New Vote
.
This step is enabled when a new ballot must be started, e.g. on client request.
- If
p.status
==idle
, i.e. processp
is a follower so must try to be leader.- Let
b = Ballot(max(p.ledger.lastTried.pnum, p.ledger.maxBal.pnum) + 1, p.pid)
, where1
is added to the proposal number. First valid proposal number is thus0
, as initial value ofledger.lastTried
andledger.maxBal
is(-1, p.pid)
. - Set
p.ledger.lastTried
tob
. - Set
p.status
totrying
. - Set
p.prevVotes
andp.prevVoters
to empty set. - Set
p.chosenValues
to the empty set. - Set
p.chosenDnum
to-1
. - To each process participating in multi paxos, send
NextBallot(p.ledger.lastTried, p.pid, p.ledger.commitNum)
PREPARE 1a message, including to itself.
- Let
PTP note: If process p
has all decrees with numbers less than or equal to commitnum
then he sends
a NextBallot(b,commitnum)
message in all instances of the Synod protocol for decree numbers larger than commitnum
.
Any process receiving the NextBallot
is required to inform p
of any commits it is missing before responding with
a LastVote
message.
- If
p.ledger.maxBal == p.ledger.lastTried
andp.status == polling
then and , i.e. process is already a leader- Set
p.chosenValues
to the empty set. - Set
p.chosenDnum
to-1
. - Jump to
P201 Start Polling
(Phase 2).
- Set
This is executed by each process q
that receives the NextBallot
message.
- Let
b
=NextBallot.b
. - To the sender of ballot
b
, i.e.owner(b)
, sendSuccess
message for all decrees with outcomes(dnum,outcome)
fromq.ledger
wherednum
>NextBallot.commitnum
. This step ensures that the ballot conductor is made aware of any commits it is missing. - if
b > q.ledger.maxBal
then- Set
q.ledger.maxBal
tob
- If
q
is not the owner of ballotb
then at this point it needs to reset its state toidle
as it is going to be a follower from here on. - To the sender of ballot
b
, i.e.owner(b)
, sendLastVote
(PROMISE 1b) message with following contents- all decrees that
q
voted in and whose outcome is not yetcommitted
inq.ledger
:pid
-q.id
(i.e. the id of the process sending theLastVote
)b
- ballot numberq
is responding tov[]
Votes containingdnum
ledger.maxVBal(dnum)
- max ballot numberq
voted inledger.maxVal(dnum)
- value associated with max ballot number above
commitNum
toq.ledger.commitNum
.
- all decrees that
- Set
PTP note: In his response to the NextBallot
message, q
informs p
of all decrees numbered greater than commitnum
that already appear in q's ledger
(in addition to sending the usual LastVote
information for decrees not in his ledger), and he asks p
to send him any
decrees numbered commitnum
or less that are not in his ledger.
Note that the last bit is somewhat different, because the way we manage commitnum
ensures that it is the highest
consecutive committed dnum
, the invariant being that all decrees below commitnum
already appear in q
's ledger at
this point. Instead, we send commitnum
so that p
can inform us of any commits greater than our commitnum
.
- if
LastVote.commitnum < p.ledger.commitnum
then sendSuccess
message toLastVote.owner
for all decrees with outcomes fromLastVote.commitnum+1
top.commitnum
. - if
b == ledger.lastTried
andp.status == trying
then- Add
LastVote.votes[]
to the setp.prevVotes[]
. Note that two votes that are from different participants ( i.e.LastVote.voter
is different), must be considered as distinct votes here. - Add
pid
of voter toprevVoters
- If count
prevVoters
>= toquorumSize
then doP201 Start Polling
.
- Add
Goal is that all dnum
s that are less than the max dnum
in LastVote
messages either get already voted values or
no-op if no value was voted. The client requested value is assigned a new dnum
greater than all previous dnum
s.
We need to ensure there are no gaps in dnum
s.
- If
p.prevVotes
is not empty- For all
dnum
greater thanp.ledger.commitNum
and less or equal tomax(dnum)
inp.prevVotes
, wherep.prevVotes[dnum]
is not an empty set, setp.chosenValues[dnum]
=MaxVote(p.prevVotes[dnum])
. - For all
dnum
greater thanp.ledger.commitNum
and less or equal tomax(dnum)
inp.prevVotes
, wherep.prevVotes[dnum]
is an empty set, setp.chosenValues[dnum]
=no-op
.
- For all
- Set
p.chosenDnum
tomax(max(dnum in p.prevVotes), p.ledger.commitNum)+1
- Set
p.chosenValues[chosenDnum]
to client provided decree
This step is enabled when status=trying
and there is a quorum of votes in prevVotes[*]
as described above.
- Set
p.status
topolling
- Do
P201a Determine chosen values
- Send
P202 BeginBallot(b,p.pid,p.commitNum,p.chosenValues,committedDecrees=[])
ACCEPT 2a to all the participants, including itself. NotecommittedDecrees
is set to the empty set.
P202 Receive BeginBallot(b,pid,commitNum,chosenValues,committedDecrees)
ACCEPT 2a message, conditionally send Voted
ACCEPTED 2b
This is executed by each process q
that receives the BeginBallot
message.
- If
BeginBallot.b >= ledger.maxBal
- Set
q.ledger.maxBal
toBeginBallot.b
- for each decree in
BeginBallot.committedDecrees
- Set
q.ledger.outcome[dnum]
toBeginBallot.committedDecrees[dnum]
- Advance
q.ledger.commitNum
to the highest consecutive committeddnum
.
- Set
- for each decree in
BeginBallot.chosenValues
- Set
q.ledger.maxVBal[dnum]
toBeginBallot.b
- Set
q.ledger.maxVal[dnum]
toBeginBallot.decree[dnum]
- Set
- If
q.ledger.commitNum
<BeginBallot.commitNum
- Send to the owner of ballot
owner(BeginBallot.b)
, aPendingVote(b, q.pid, q.ledger.commitNum)
PENDING ACCEPT 2ba message whereq.pid
is the process sending thePendingVote
message.
- Send to the owner of ballot
- Else
- Send to the owner of ballot
owner(BeginBallot.b)
, aVoted(b, q.pid)
ACCEPTED 2b message whereq.pid
is the process sending theVoted
message.
- Send to the owner of ballot
- Set
This message means that the sender is willing to vote but is lagging behind in commits, and needs to update its commits
prior to voting. This step is enabled when status=polling
.
- If
Voted.b == p.ledger.lastTried
andp.status == polling
- Let
p.committedDecrees
be the set of decrees withdnum
>PendingVote.commitNum
that are inp
's ledger - Send
BeginBallot(b,p.pid,p.commitNum,p.chosenValues,p.committedDecrees)
ACCEPT 2a toPendingVote.voter
.
- Let
This step is enabled when status=polling
.
- if
Voted.b == p.ledger.lastTried
andp.status == polling
- Add voter process
Voted.voter
to the set ofp.voters
. - If count of
p.voters
is>=
toquorumSize
then- For all
dnum
inp.chosenValues
ifp.ledger.outcome[dnum]
is NULL then setp.ledger.outcome[dnum]
top.chosenValues[dnum]
- Send
Success(p.chosenValues)
to all participants, including itself.
- For all
- Add voter process
- For all
Success.outcomes[]
, ifp.ledger.outcome[dnum]
is NULL then setp.ledger.outcome[dnum]
toSuccess.outcome[dnum]
- Advance
p.ledger.commitNum
to the highest consecutive committeddnum
.
- Almost all of the multi-paxos protocol is implemented in ThisPaxosParticipant.
- For more description of the implementation, refer to the README.