These are notes taken during CMSC 818e: Distributed And Cloud-Based Storage Systems. Course webpage and syllabus here.

Day Five

Google File System

my post on the paper here

Assumptions:

failures are the norm
files are huge
most mutations by appends
co-design w/ apps

Features:

relaxed consistency
atomic record append (without locking)
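no data caches! anywhere (though the Linux buffer cache still does this underneath!) (works because of the type of app: streaming through huge files)
- consistency not much of an issue
- clients do cache chunk locations, which can go out-of-date, but a stale replica just returns a prefix (premature end of chunk), not wrong data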

By:

single master
- maintains all metadata in memory
- big log that is checkpointed
- shadow masters (for slightly stale read access)
- chunk locations not persisted (master asks the chunkservers at startup)
multiple chunkservers
- chunks replicated
- no caches (streaming!)

Write op:

leases from master to chunkserver
client gets chunk replica list from master
- sends data everywhere
- waits for acknowledgement
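
The write steps above can be sketched in Go. This is only an illustration under assumptions: `Replica`, `Push`, `Apply`, and `Write` are hypothetical names, the chunkservers are in-memory structs, and the master and its lease grant are reduced to "the caller already knows which replica is primary". The point is the two phases: data goes to every replica first, and nothing is applied until all of them have acknowledged.

```go
package main

import (
	"errors"
	"fmt"
)

// Replica is a hypothetical in-memory stand-in for a chunkserver.
type Replica struct {
	Name   string
	Down   bool
	staged map[string][]byte // data pushed but not yet applied
	Chunk  []byte            // applied chunk contents
}

func NewReplica(name string) *Replica {
	return &Replica{Name: name, staged: map[string][]byte{}}
}

// Push stages data on the replica (the "sends data everywhere" step).
func (r *Replica) Push(id string, data []byte) error {
	if r.Down {
		return errors.New(r.Name + ": unreachable")
	}
	r.staged[id] = data
	return nil
}

// Apply commits previously staged data to the chunk.
func (r *Replica) Apply(id string) error {
	if r.Down {
		return errors.New(r.Name + ": unreachable")
	}
	data, ok := r.staged[id]
	if !ok {
		return errors.New(r.Name + ": no staged data for " + id)
	}
	r.Chunk = append(r.Chunk, data...)
	delete(r.staged, id)
	return nil
}

// Write pushes data to every replica and waits for all acknowledgements.
// Only then is the mutation applied, primary (lease holder) first, then
// the secondaries in the same order.
func Write(primary *Replica, secondaries []*Replica, id string, data []byte) error {
	all := append([]*Replica{primary}, secondaries...)
	for _, r := range all {
		if err := r.Push(id, data); err != nil {
			return err // caller is expected to retry
		}
	}
	for _, r := range all {
		if err := r.Apply(id); err != nil {
			return err // replicas may now disagree: an inconsistent region
		}
	}
	return nil
}

func main() {
	primary := NewReplica("cs1")
	secondaries := []*Replica{NewReplica("cs2"), NewReplica("cs3")}
	err := Write(primary, secondaries, "w1", []byte("record"))
	fmt.Println("error:", err, "primary chunk:", string(primary.Chunk))
}
```

A failure during the apply phase leaves some replicas with the data and some without, which is how the inconsistent regions discussed under Guarantees come about.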

Guarantees:

namespace mutations atomic because single master
client caches may return stale data
- limited to timeout window
- mostly appends anyway
file region “consistent” if same on all replicas
file region “defined” if consistent and writes therein seen in entirety
- file broken into multiple writes if too big, or across chunk boundaries
- write might fail on some replicas, be re-tried
writes sent to replicas in same order, so “consistency is common” :)
Applications responsible for checking all data
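
One way an application can check its own data is to frame each record with a length and a checksum, so a reader can skip padding and partial records left behind by failed or retried appends. The sketch below is an assumption-laden illustration, not GFS code: the 8-byte header layout and the `Encode` and `Scan` names are made up for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const headerSize = 8 // 4-byte length + 4-byte CRC32 (hypothetical layout)

// Encode frames a payload as [length][crc32][payload].
func Encode(payload []byte) []byte {
	buf := make([]byte, headerSize+len(payload))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(payload)))
	binary.BigEndian.PutUint32(buf[4:8], crc32.ChecksumIEEE(payload))
	copy(buf[headerSize:], payload)
	return buf
}

// Scan walks a file region and returns every payload whose checksum
// verifies. On a bad frame (padding, fragment, garbage) it advances one
// byte and tries to resynchronize.
func Scan(region []byte) [][]byte {
	var out [][]byte
	for i := 0; i+headerSize <= len(region); {
		n := int(binary.BigEndian.Uint32(region[i : i+4]))
		sum := binary.BigEndian.Uint32(region[i+4 : i+8])
		end := i + headerSize + n
		// n > 0 rejects zero padding, whose length and CRC both read as 0.
		if n > 0 && end <= len(region) && crc32.ChecksumIEEE(region[i+headerSize:end]) == sum {
			out = append(out, region[i+headerSize:end])
			i = end
			continue
		}
		i++
	}
	return out
}

func main() {
	region := append(Encode([]byte("a")), make([]byte, 16)...) // record + padding
	region = append(region, Encode([]byte("bb"))...)
	for _, rec := range Scan(region) {
		fmt.Printf("%q\n", rec)
	}
}
```

Duplicates from retried appends would still get through this check; filtering those would need something like a unique ID per record, which is left out here.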

Inconsistencies

concurrent writes or failed writes lead to undefined regions
need to tell the difference between defined (each mutation in entirety) and undefined
no need for consistency

Locking

no per-dir state
lookup table mapping full pathnames to metadata
to modify a leaf node (metainfo for a file)
- start at root, read-locking all the way down
- exclusive lock on the file
SIX: Shared + intention exclusive lock
IX: Intention exclusive lock
X: Exclusive lock
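
The path-locking rule above (read locks on every ancestor, exclusive lock on the leaf) can be sketched as a function that computes the lock set for a pathname. `LockReq`, `LockPlan`, and `Conflicts` are hypothetical names, and locking "/" itself follows these notes ("start at root") rather than anything confirmed here.

```go
package main

import (
	"fmt"
	"strings"
)

// LockReq is one lock to take on a full pathname.
type LockReq struct {
	Path  string
	Write bool // false = read lock, true = exclusive lock
}

// LockPlan returns the locks needed to mutate the entry at path:
// read locks from the root down, then an exclusive lock on the path itself.
func LockPlan(path string) []LockReq {
	trimmed := strings.Trim(path, "/")
	if trimmed == "" {
		return []LockReq{{Path: "/", Write: true}}
	}
	parts := strings.Split(trimmed, "/")
	plan := []LockReq{{Path: "/", Write: false}}
	prefix := ""
	for i, p := range parts {
		prefix += "/" + p
		plan = append(plan, LockReq{Path: prefix, Write: i == len(parts)-1})
	}
	return plan
}

// Conflicts reports whether two plans need the same path with at least
// one of them wanting it exclusively.
func Conflicts(a, b []LockReq) bool {
	held := map[string]bool{}
	for _, r := range a {
		held[r.Path] = r.Write
	}
	for _, r := range b {
		if w, ok := held[r.Path]; ok && (w || r.Write) {
			return true
		}
	}
	return false
}

func main() {
	for _, r := range LockPlan("/home/user/foo") {
		fmt.Printf("%-16s write=%v\n", r.Path, r.Write)
	}
	fmt.Println(Conflicts(LockPlan("/home/a"), LockPlan("/home/b")))
}
```

Because there is no per-directory state, two mutations in the same directory only share read locks on the ancestors, so they do not conflict with each other.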

Errata

Multi-master is “Colossus” (CFS)
Quinlan: “in retrospect I think the consensus is that [record append] proved to be more painful than it was worth”
Borg
File counts became a problem (storage size not really a problem)
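MapReduce: starts by sharding data across a lot of different machines; keeping track of all those files ends up being a bottleneck for a single master on a single server
protocol buffer serialization format - why is it better than JSON? JSON is huge (text, field names repeated in every message). protobufs are a compact binary encoding that uses as few bytes as possible (numeric field tags, variable-length integers)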
zeroMQ
Google RPC
what changed when Google bought YouTube - latency became more important
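
To make the protobuf vs JSON size point concrete, here is a small sketch that hand-encodes a single integer field the way the protobuf wire format does (one tag byte, then a base-128 varint) and compares it with the JSON for the same value. It uses only the Go standard library, not a real protobuf library; `Msg`, `EncodeField`, and `JSONSize` are hypothetical names, and `EncodeField` only handles field numbers below 16 so the tag fits in one byte.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// Msg is a hypothetical one-field message.
type Msg struct {
	ID uint64 `json:"id"`
}

// EncodeField encodes one varint field: a tag byte (field number << 3,
// wire type 0) followed by the value as a base-128 varint.
// Assumes fieldNum < 16 so the tag fits in a single byte.
func EncodeField(fieldNum int, v uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, v)
	return append([]byte{byte(fieldNum << 3)}, buf[:n]...)
}

// JSONSize returns the length of the JSON encoding of m.
func JSONSize(m Msg) int {
	b, err := json.Marshal(m)
	if err != nil {
		return -1
	}
	return len(b)
}

func main() {
	m := Msg{ID: 150}
	wire := EncodeField(1, m.ID)
	fmt.Printf("binary: % x (%d bytes)\n", wire, len(wire))
	fmt.Printf("json:   %d bytes\n", JSONSize(m))
}
```

For the value 150 in field 1 the binary form is the three bytes 08 96 01, while `{"id":150}` is ten bytes, and the gap grows with longer field names.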

Ceph

paper here

Enormous HPC file system
- tens or hundreds of thousands of OSDs
- meant for scientific workloads (like GlusterFS or HadoopFS)
- OSD: intelligent object storage device (autonomous, like computers in their own right - think a Linux box with one big disk)
- Expensive!
Motivation
- metadata operations make up as much as half of FS workloads
- metadata operations don’t scale
- petabyte scale systems are inherently dynamic (usage will change dramatically from one job to the next)
- many different clients with different needs
What they did:
- huge engineering effort
- CRUSH data distribution function rather than state
- replication, failure, recovery handled by OSDs
- dynamic subtree partitioning
Consistency
- generally strict, but
- dirs + inodes sent at same time (ReadDirAll), inodes cached briefly
- O_LAZY (file open flag from the proposed HPC extensions to POSIX I/O) allows read and write buffering w/ multiple clients (set tolerance for staleness)
Data layout w/ CRUSH
- object names are just inumbers and stripes (sequences of objects)
  - files mapped to stripes, stripes map to objects
- objects assigned to “placement group” w/ hash
- placement groups mapped to OSDs w/ CRUSH and OSD cluster map
Any party can find any object in a completely distributed fashion (don’t have to go to metadata server)
Dynamic Subtree Partitioning - Ceph deals with hotspots by dynamically mapping subtrees of the directory hierarchy to metadata servers based on the current workload. Individual directories are hashed across multiple nodes only when they become hot spots
EBOFS - user level file system that runs on the different OSDs - “existing kernel interface limits our ability to understand when object updates are safely committed on disk”
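
The data layout steps above (file to objects, object to placement group by hash, placement group to OSDs by a function of the cluster map) can be sketched as follows. Big caveat: this is NOT the CRUSH algorithm. It swaps in plain rendezvous (highest-random-weight) hashing as a stand-in, ignores failure domains and device weights, and the object name format and all function names are assumptions. It only illustrates the property that any party holding the OSD list can compute placement without asking a metadata server.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// ObjectName combines an inode number and a stripe index.
// The exact format here is made up for illustration.
func ObjectName(ino uint64, stripe int) string {
	return fmt.Sprintf("%x.%08x", ino, stripe)
}

// PlacementGroup hashes an object name into one of numPGs groups.
func PlacementGroup(object string, numPGs int) int {
	return int(hash64(object) % uint64(numPGs))
}

// OSDsFor ranks every OSD by hash(pg, osd) and returns the top `replicas`.
// Rendezvous hashing stand-in for CRUSH: deterministic, no lookup table.
func OSDsFor(pg int, osds []string, replicas int) []string {
	score := func(osd string) uint64 { return hash64(fmt.Sprintf("%d/%s", pg, osd)) }
	ranked := append([]string(nil), osds...)
	sort.Slice(ranked, func(i, j int) bool {
		si, sj := score(ranked[i]), score(ranked[j])
		if si != sj {
			return si > sj
		}
		return ranked[i] < ranked[j] // tie-break keeps the order deterministic
	})
	if replicas > len(ranked) {
		replicas = len(ranked)
	}
	return ranked[:replicas]
}

func main() {
	osds := []string{"osd0", "osd1", "osd2", "osd3", "osd4"}
	obj := ObjectName(0x1234, 0)
	pg := PlacementGroup(obj, 64)
	fmt.Println(obj, "-> pg", pg, "->", OSDsFor(pg, osds, 3))
}
```

With this stand-in, removing an OSD that holds no replica of a placement group leaves that group's placement unchanged, which is the kind of stability a stateless distribution function is after.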

Next Project! Serialization, Persistence, and Immutability

due Sept 30…

Use LBFS chunking to take data from file, store separately in a key-value store called LevelDB (local bindings in Go)

go run protoget.go db list

go run protoget.go db head

go run protoget.go db /

go run protoget.go db /main.go
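
For the chunking step of the project, a rough sketch of content-defined chunking follows. Assumptions to flag: LBFS uses Rabin fingerprints over a 48-byte window, while this sketch swaps in a simple polynomial rolling hash; the boundary test and size limits (low 13 bits, 2 KB minimum, 64 KB maximum) follow the LBFS paper's parameters but the magic value is arbitrary; and a plain Go map stands in for LevelDB so no LevelDB API is guessed at. `Chunk` and `Store` are hypothetical names, not part of the assignment's skeleton.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const (
	window  = 48        // bytes covered by the rolling hash
	base    = 257       // multiplier of the polynomial hash
	mask    = 1<<13 - 1 // boundary when the low 13 bits are all ones
	minSize = 2 << 10   // 2 KB
	maxSize = 64 << 10  // 64 KB
)

// Chunk splits data at content-defined boundaries: positions where the
// hash of the previous `window` bytes matches the mask pattern.
func Chunk(data []byte) [][]byte {
	var pow uint64 = 1 // base^window, used to drop the outgoing byte
	for i := 0; i < window; i++ {
		pow *= base
	}
	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = h*base + uint64(b)
		if i >= window {
			h -= pow * uint64(data[i-window])
		}
		size := i - start + 1
		if (size >= minSize && h&mask == mask) || size >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// Store writes each chunk under the hex SHA-256 of its contents and
// returns the ordered key list (the file's "recipe").
// The stored slices alias data; a real store would copy them.
func Store(kv map[string][]byte, data []byte) []string {
	var keys []string
	for _, c := range Chunk(data) {
		sum := sha256.Sum256(c)
		key := hex.EncodeToString(sum[:])
		kv[key] = c
		keys = append(keys, key)
	}
	return keys
}

func main() {
	kv := map[string][]byte{}
	keys := Store(kv, []byte("a small file fits in one chunk"))
	fmt.Println(len(keys), "chunk(s), first key:", keys[0])
}
```

Because keys are content hashes, storing the same data twice adds nothing to the store, and the chunks are immutable once written.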

CMSC 818e Day 5

September 12, 2018