Unlike Spanner, which we explored in a recent post and which uses the strategy of replicating commits, Calvin coordinates distributed transactions across partitions and replicas by leveraging a fully replicated log.
Calvin pushes most of the work into the sequencing and scheduling of operations and the requisite locks. This means we don’t have to worry about the latencies that arise from a distributed commit protocol like two-phase commit. On the other hand, we have a lot more work to do upfront: as soon as we get word that a client wants to run some kind of transaction, even before we tell anyone else in the system, we have to start doing “recon” (i.e. checking for any dependencies), so that we can assess the transaction’s “contention footprint.” This lets us coordinate the ordering of all the reads and writes involved in the proposed transaction. But then we can reduce every transaction to a deterministic sequence of operations, which allows us to make not just synchronous but nearly synchronized updates across all replicas.
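To make that “deterministic sequence of operations” idea a bit more concrete, here’s a minimal sketch of how a replica can grant locks strictly in log order, so that every replica that sees the same log produces the same serial-equivalent schedule without any agreement at commit time. The names (`Txn`, `DeterministicScheduler`, `run_log`) are mine, not Calvin’s actual components, and I’m assuming every transaction already knows its read/write sets (via recon if necessary).

```python
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class Txn:
    txn_id: int
    read_set: set
    write_set: set
    logic: callable = None   # deterministic: same inputs -> same outputs on every replica

class DeterministicScheduler:
    def __init__(self):
        # One FIFO queue of waiting transactions per record.
        self.lock_queues = defaultdict(deque)

    def request_locks(self, txn):
        """Enqueue lock requests in log order; never reorder them."""
        for record in sorted(txn.read_set | txn.write_set):
            self.lock_queues[record].append(txn.txn_id)

    def ready(self, txn):
        """A txn may run once it is at the head of every queue it needs."""
        return all(self.lock_queues[record][0] == txn.txn_id
                   for record in txn.read_set | txn.write_set)

    def release_locks(self, txn):
        for record in txn.read_set | txn.write_set:
            self.lock_queues[record].popleft()

def run_log(log, storage):
    """Every replica runs this over the *same* log and ends in the same state."""
    sched = DeterministicScheduler()
    for txn in log:                  # lock requests go in strict log order
        sched.request_locks(txn)
    pending = deque(log)
    while pending:
        txn = pending.popleft()
        if sched.ready(txn):
            txn.logic(storage)       # apply the deterministic transaction logic
            sched.release_locks(txn)
        else:
            pending.append(txn)      # not at the head of its queues yet; retry
```

The point of the sketch is that the lock queues are populated in log order, so the earliest still-pending transaction in the log is always runnable; no replica ever has to ask another replica whether it’s safe to commit.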
I have a couple of questions about Calvin, particularly around the kinds of applications that would be best able to leverage something like it (and also the kinds that would be less than ideal, given the transactions they require). I’m also curious about the relevance of Bailis’ HAT paper to Calvin, particularly around quantifying reasonable bounds on latency.
The CalvinFS paper poses the question: how would a distributed transaction-log approach work for maintaining file metadata in a fully distributed file system? In Calvin, the distributed log is a schedule of operations that are agreed upon via Paxos first and executed afterward. In CalvinFS, the distributed log gives the file system a way to correctly schedule transactions that touch files and directories spread across multiple machines.
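In sketch form, that “agree on the log first, execute later” split might look something like this. `paxos_agree` is a stand-in for a real consensus round, and the short fixed-length epoch is just the batching interval the Calvin paper describes (on the order of 10 ms); none of these names are the actual interfaces.

```python
class Sequencer:
    def __init__(self, paxos_agree, epoch_ms=10):
        # `paxos_agree` is a placeholder: (epoch_no, batch) -> the batch in its
        # globally agreed-upon log position, replicated to every replica.
        self.paxos_agree = paxos_agree
        self.epoch_ms = epoch_ms
        self.pending = []   # transaction requests received this epoch
        self.log = []       # the replicated transaction log

    def submit(self, txn):
        # Clients hand us transaction *requests*; nothing executes yet.
        self.pending.append(txn)

    def close_epoch(self, epoch_no):
        batch, self.pending = self.pending, []
        # Agree on this batch's place in the global log before any execution.
        ordered = self.paxos_agree(epoch_no, batch)
        self.log.extend(ordered)
        return ordered      # handed off to the schedulers/executors downstream
```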
One insight of this paper is that most distributed file systems assume you’ll have a bunch of big files, and that your overall file system will be a shallow, relatively small hierarchy. In practice, there are a lot of file systems (we saw an example in GFS) where most files are small. But now imagine that those small files live in deeply nested directories, and that we open and close them constantly (the assumption in GFS was that we’d probably only open/write our small files a couple of times before forgetting them forever). “Modern file systems,” Thomson and Abadi say, “utilize a fundamentally unscalable design for metadata management in order to avoid high synchronization costs necessary to maintain traditional file system semantics for file and directory metadata.”
So here’s CalvinFS’s strategy: different parts of the system hold different files (or even chunks of files). Now someone wants to make some changes to a distributed directory. We’ve got a log that tells us about all of the read and write dependencies for that directory. Each of us reads the part of the directory we hold locally first, then broadcasts the results to everyone else involved in the transaction; now everyone has all the reads. Next we all “execute” the full transaction (though those of us who don’t store all the relevant data chunks locally are just going through the motions in some parts).
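Here’s how I picture that read-broadcast-execute dance for a single metadata transaction, from one participant’s point of view. `owns`, `broadcast`, and `collect` are placeholder helpers I’m assuming for partitioning and messaging, not CalvinFS APIs.

```python
def execute_metadata_txn(txn, local_store, owns, participants, broadcast, collect):
    """One participant's view of a multi-machine metadata transaction."""
    # 1. Perform the reads we can serve from our own metadata shard.
    local_reads = {key: local_store.get(key)
                   for key in txn.read_set if owns(key)}

    # 2. Ship those results to every other participant in the transaction.
    broadcast(participants, txn.txn_id, local_reads)

    # 3. Gather everyone else's reads; now every participant has the full read set.
    all_reads = dict(local_reads)
    for remote_reads in collect(participants, txn.txn_id):
        all_reads.update(remote_reads)

    # 4. Everyone runs the same deterministic logic over the same inputs...
    writes = txn.logic(all_reads)

    # 5. ...but each participant only applies the writes for keys it actually
    #    stores (the rest of us are just "going through the motions").
    for key, value in writes.items():
        if owns(key):
            local_store[key] = value
```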
You might wonder what happens when a machine goes down (since the metadata is stored in memory), but that’s taken care of: we take periodic snapshots of that memory and store them to disk for recovery purposes, after which we only have to replay the tail of the log.
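The recovery story, then, is basically “checkpoint plus log tail,” something like the sketch below. The file layout and serialization format are my own placeholders, not what CalvinFS actually does on disk.

```python
import pickle

def checkpoint(metadata, log_position, path="metadata.snapshot"):
    """Snapshot the in-memory metadata map along with the log position it reflects."""
    with open(path, "wb") as f:
        pickle.dump({"log_position": log_position, "metadata": metadata}, f)

def recover(log, apply_entry, path="metadata.snapshot"):
    """Load the latest snapshot, then replay only the log entries after it."""
    with open(path, "rb") as f:
        snap = pickle.load(f)
    metadata = snap["metadata"]
    for entry in log[snap["log_position"]:]:
        apply_entry(metadata, entry)
    return metadata
```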
My questions for this paper are somewhat similar to the ones I had for the Calvin paper: namely, what are reasonable bounds on latency, particularly for writes, and for recursive or nondeterministic (sorry, I’m an ML person) operations where we might not know the read and write sets in advance? I’d also be interested to hear what strategies were explored for variable block sizing, mainly since that’s something we encountered with Rabin-Karp chunking.