CMSC 818e Day 4

Reading time ~2 minutes

These are notes taken during CMSC 818e: Distributed And Cloud-Based Storage Systems. Course webpage and syllabus here.

Day Four

Elephant Paper

link to post

  • differentiating between undo and long term history
  • policies:
    • keep one (browser cache, core, /tmp)
    • keep all
    • keep safe (just undo)
    • keep landmarks <–
  • landmarks - people write/edit in blocks; cluster the times together and take the last one; call that a landmark
  • versioning for files but not for directories (why was that again?)
  • should changes propagate all the way to the root?
    • make it cheaper by using file chunking/hashing (like LBNFS)
    • set flag for “dirty” to signal that a node has been modified
cd foo/@12-nov-1999:11:30
tls    `ls @v`
tgrep
  • user-level process called when the cleaner comes across high-temp file
  • Downsides: less locality in inodes, data blocks; pressure on buffer cache
  • could use lbfs chunking to lessen the buffer cache pressure
  • what about something like video editing?
  • use diffs between versions of a chain to a land-mark
  • applications used to write to filesystems; once you have sufficient operating systems, you can take the load off the applications
  • operational transforms
  • incomplete description of results (not enough set-up for the graphs), for example:
    • is the cleaner running? how often?
    • what is the keep-safe window?
  • challenges with duplicating code from papers (e.g. epaxos)
  • Inferno - Bell Labs/Rob Pike
  • Network Application
  • snapshot: store a pointer to a previous root, use that to drive policies that reclaim old versions no longer needed

Knockoff Paper

link to post

  • an attempt to generalize operation shipping
  • the specific details are in a previous paper on Arnold
  • eidetic versioning: any past state in the file system or in application memory
  • stores non-deterministic log for replay
  • nondeterministic log: system call results (always happen at the same time), external data reads (references to other file in the FS), thread scheduling (sometimes less predictable), unexpected signals (how to recreate?)
  • Store by values when programs to produce are not in cloud, computation costlier than communication
  • versioning policies: none; on close; on write; eidetic (system call)
  • Note: cost comparison can be difficult for long-running applications. Greedy policy is good in the short term, but might not be the globally optimal solution. Uses per-application histories to catch long-running apps that might benefit from ops (multiple versions?)
  • Problems: Reproducing past file may need input data from other logs; version vector show dependency graph between files; materialization delay is delay to reproduce inputs an the file
  • Costs - 7-8% recording; up to a minute to re-constitute; doesn’t mention the word interactive
  • sha1 deprecation
  • creating images of the environment - e.g. docker or vagrant

A Parrot Trainer Eats Crow

In this post, we'll consider how it is that models trained on massive datasets using millions of parameters can be both "low bias" and al...… Continue reading

Embedded Binaries for Go

Published on February 06, 2021