These are notes taken during CMSC 818e: Distributed And Cloud-Based Storage Systems. Course webpage and syllabus here.
Epaxos Big Ideas
- No leader: any replica can act as a leader at any time
- hope for non-conflicts (in the case of a key-value store, this would be a key equality)
- conflict is tough because not everybody votes every time
- need to agree on a way to de-conflict
- what are all the commands that can conflict? (need to account for all of them!)
- separate “commit” and “execute” (doesn’t exist in any other type of consensus protocol)
- this means that their claims about the latency to commit are a little bit specious, since a commit is not as meaningful in ePaxos
- “replicated log” is now a 2D set of subspaces (then you have to deal with ordering later on)
- thrifty optimization - not everyone has to vote on every command
- propose a command and all know dependencies
- optimistically hope we don’t learn any new conflicts
- If the receiver has no additional dependencies for that commit, they both continue on the fast path.
- if there are conflicts, we need a second round…
- a second round, only used if we learn new dependencies
- PreAccept messages include dependencies to tell whoever gets the message that all these things need to get executed first. If the receiver has additional dependencies, these are added to the next PreAccept and communicated back along the slow path.
- Once the leader gets acks back and is ready to commit, it immediately replies to the client first that the commit is committed.
- Then it sends commit messages to everyone in the entire system (so everyone always commits even if they haven’t all voted)
- Commits don’t have to be acknowledged, you want it to get there fast but you know that it will happen eventually.
- Writes can stack up, reads have to happen in order
- With quorum size 3, this is really fast, never a need for the slow path - BUT, it’s bad for failure.
- Size 3 is also good for geo-replication, because you can make use of local physical proximity, and avoid latency problems with the slow (further away) replicas (they only get commits, which are asynchronous)
- in Paxos, a write doesn’t always write what it originally was asked to write, this isn’t the case in ePaxos
- in ePaxos commit is a promise to execute
- execute only needed on read
- no incoming messages that tell you to execute the command (so need to do something similar to the asynchronous flusher)
- first sort all the commits using topological then total ordering
- algorithm: get set of all dependencies yet to be executed; use Tarjan to separate into strongly connected components; for each SCC in inverse topological order, sort commands in sequence number order, and execute in increasing order
- if a replica needs to read something that has dependencies it hasn’t seen yet, it blocks
Can this be implemented for N>5? If so, the dependencies start to be a problem. Dependencies are a function of how many replicas there are, they grow linearly. Dependencies will be N^2 in the worst case.