These are notes taken during CMSC 818e: Distributed And Cloud-Based Storage Systems. Course webpage and syllabus here.
Day Four
Elephant Paper
- differentiating between undo and long term history
- policies:
- keep one (browser cache, core, /tmp)
- keep all
- keep safe (just undo)
- keep landmarks <–
- landmarks - people write/edit in blocks; cluster the times together and take the last one; call that a landmark
- versioning for files but not for directories (why was that again?)
- should changes propagate all the way to the root?
- make it cheaper by using file chunking/hashing (like LBFS)
- set flag for “dirty” to signal that a node has been modified
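A minimal sketch of the landmark idea from the notes above: cluster write times into bursts and keep the last write of each burst. The fixed 30-minute `gap` threshold and the name `pick_landmarks` are assumptions for illustration, not the heuristic from the paper.

```python
from datetime import datetime, timedelta

def pick_landmarks(write_times, gap=timedelta(minutes=30)):
    """Return the last write time of each burst of edits.

    A burst ends when the pause before the next write exceeds `gap`
    (the threshold is an assumed, illustrative value).
    """
    landmarks = []
    previous = None
    for t in sorted(write_times):
        # A long pause closes the current burst; its last write is a landmark.
        if previous is not None and t - previous > gap:
            landmarks.append(previous)
        previous = t
    if previous is not None:
        landmarks.append(previous)  # last write of the final burst
    return landmarks

writes = [
    datetime(1999, 11, 12, 11, 0),
    datetime(1999, 11, 12, 11, 5),
    datetime(1999, 11, 12, 11, 10),
    datetime(1999, 11, 12, 14, 0),
    datetime(1999, 11, 12, 14, 2),
]
landmarks = pick_landmarks(writes)
```

Versions between landmarks would then be candidates for reclamation once they fall outside the undo window.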
cd foo/@12-nov-1999:11:30
tls `ls @v`
tgrep
- user-level process called when the cleaner comes across high-temp file
- Downsides: less locality in inodes, data blocks; pressure on buffer cache
- could use LBFS chunking to lessen the buffer cache pressure
- what about something like video editing?
- use diffs between versions in a chain leading to a landmark
- applications used to write to filesystems; once operating systems are sufficiently capable, you can take the load off the applications
- operational transforms
- incomplete description of results (not enough set-up for the graphs), for example:
- is the cleaner running? how often?
- what is the keep-safe window?
- challenges with duplicating code from papers (e.g. EPaxos)
- Inferno - Bell Labs/Rob Pike
- Network Application
- snapshot: store a pointer to a previous root, use that to drive policies that reclaim old versions no longer needed
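The LBFS-style chunking suggested above can be sketched as follows: split each version at content-defined boundaries, store each chunk once under its hash, and represent a version as a list of chunk hashes. This is a rough sketch only; it hashes each window with SHA-256 instead of using the rolling Rabin fingerprints of LBFS, and `WINDOW`, `MASK`, `split_chunks`, and `ChunkStore` are hypothetical names and values.

```python
import hashlib

WINDOW = 16   # bytes examined when deciding on a boundary (assumed value)
MASK = 0x1F   # boundary when the low 5 bits of the window hash are zero

def split_chunks(data: bytes):
    """Split data at boundaries chosen by content rather than by offset."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < WINDOW:
            continue  # keep every chunk at least WINDOW bytes long
        window = data[i + 1 - WINDOW:i + 1]
        window_hash = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if window_hash & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

class ChunkStore:
    """Stores each distinct chunk once; a version is a list of chunk digests."""

    def __init__(self):
        self.blobs = {}  # digest -> chunk bytes

    def put_version(self, data: bytes):
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.blobs.setdefault(digest, chunk)  # unchanged chunks are shared
            recipe.append(digest)
        return recipe

    def get_version(self, recipe):
        return b"".join(self.blobs[d] for d in recipe)

store = ChunkStore()
v1 = bytes(range(256)) * 8
v2 = v1 + b"appended in a later version"
r1 = store.put_version(v1)
r2 = store.put_version(v2)
```

Because boundaries depend only on nearby bytes, a new version tends to share most of its chunks with the old one, so keeping many versions costs far less than keeping full copies.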
Knockoff Paper
- an attempt to generalize operation shipping
- the specific details are in a previous paper on Arnold
- eidetic versioning: any past state in the file system or in application memory
- stores non-deterministic log for replay
- nondeterministic log: system call results (always happen at the same time), external data reads (references to other files in the FS), thread scheduling (sometimes less predictable), unexpected signals (how to recreate?)
- Store by value when the programs needed to reproduce the data are not in the cloud, or when computation is costlier than communication
- versioning policies: none; on close; on write; eidetic (system call)
- Note: cost comparison can be difficult for long-running applications. Greedy policy is good in the short term, but might not be the globally optimal solution. Uses per-application histories to catch long-running apps that might benefit from ops (multiple versions?)
- Problems: Reproducing a past file may need input data from other logs; version vectors show the dependency graph between files; materialization delay is the delay to reproduce the inputs and the file
- Costs: 7-8% recording overhead; up to a minute to reconstitute; doesn’t mention the word interactive
- SHA-1 deprecation
- creating images of the environment - e.g. Docker or Vagrant
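The by-value versus by-operation choice discussed above can be sketched as a per-version cost comparison. This is an illustrative model only: the cost formula, the field names, and `choose_representation` are assumptions, not the cost model from the paper.

```python
from dataclasses import dataclass

@dataclass
class VersionCosts:
    data_bytes: int          # size of the new file contents
    log_bytes: int           # size of the nondeterministic log needed for replay
    replay_seconds: float    # estimated compute time to regenerate the data
    replayable: bool = True  # False if the producing program is not in the cloud

def choose_representation(v, cost_per_byte, cost_per_cpu_second):
    """Greedy choice: return whichever representation is cheaper right now."""
    if not v.replayable:
        return "value"  # nothing to replay with, so the data itself must be kept
    by_value = v.data_bytes * cost_per_byte
    by_operation = v.log_bytes * cost_per_byte + v.replay_seconds * cost_per_cpu_second
    return "operation" if by_operation < by_value else "value"

big_output = VersionCosts(data_bytes=10_000_000, log_bytes=1_000, replay_seconds=2.0)
small_output = VersionCosts(data_bytes=100, log_bytes=1_000, replay_seconds=0.1)
no_program = VersionCosts(data_bytes=10_000_000, log_bytes=1_000,
                          replay_seconds=2.0, replayable=False)

choices = [
    choose_representation(v, cost_per_byte=1e-6, cost_per_cpu_second=1.0)
    for v in (big_output, small_output, no_program)
]
```

This compares each version in isolation, which is the greedy behavior the note above warns about: a long-running application may look expensive to replay for any single version yet be cheaper by operation across many versions.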