These are notes taken during CMSC 818e: Distributed And Cloud-Based Storage Systems. Course webpage and syllabus here.
Day Five
Google File System
Assumptions:
- failures are the norm
- files are huge
- most mutations by appends
- co-design w/ apps
Features:
- relaxed consistency
- atomic record append (without locking)
- no caches! anywhere (though the Linux buffer cache still does this underneath); a consequence of the type of apps GFS targets
- consistency not much of an issue
- clients cache chunk locations, so they can be out of date, but a stale replica just returns a premature end of chunk (a prefix), not wrong data
By:
- single master
- maintains all metadata in memory
- big log that is checkpointed
- shadow masters (for slightly stale read access)
- chunk locations not persisted (master polls chunkservers at startup and via heartbeats)
- multiple chunkservers
- chunks replicated
- no caches (streaming!)
Write op:
- master grants a lease on each chunk to one chunkserver (the primary)
- client gets the chunk's replica list (and current primary) from the master
- client pushes the data to all replicas
- primary picks a serial order, forwards the write to the secondaries, and the client waits for acknowledgement (see the sketch below)
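Roughly, that write path in Go; the Master and Replica interfaces and their method names here are invented stand-ins for the real GFS RPCs, just to pin down the order of steps:

```go
package gfs

import "fmt"

// Master and Replica are hypothetical stand-ins for the real GFS RPCs.
type Master interface {
	// GetReplicas returns the current lease holder (primary) and the
	// secondary replicas for a chunk.
	GetReplicas(chunkHandle uint64) (primary Replica, secondaries []Replica, err error)
}

type Replica interface {
	// PushData stages the data in the chunkserver's buffer; nothing is applied yet.
	PushData(writeID string, data []byte) error
	// Commit (primary only) picks a serial order, applies the mutation,
	// and forwards that same order to the secondaries.
	Commit(writeID string, secondaries []Replica) error
}

// Write follows the GFS write path: ask the master for replicas, push data
// to every replica, then ask the primary to commit and wait for the ack.
func Write(m Master, chunkHandle uint64, writeID string, data []byte) error {
	primary, secondaries, err := m.GetReplicas(chunkHandle)
	if err != nil {
		return err
	}
	for _, r := range append([]Replica{primary}, secondaries...) {
		if err := r.PushData(writeID, data); err != nil {
			return fmt.Errorf("data push failed: %w", err)
		}
	}
	if err := primary.Commit(writeID, secondaries); err != nil {
		// Some replicas may have applied the write; the client retries,
		// which is why regions can end up duplicated or inconsistent.
		return fmt.Errorf("commit failed, retry: %w", err)
	}
	return nil
}
```

Note that the data flow (push to all replicas) is decoupled from the control flow (commit through the primary), which is how GFS keeps the mutation order identical on every replica.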
Guarantees:
- namespace mutations atomic because single master
- client caches may return stale data
- limited to timeout window
- mostly appends anyway
- file region “consistent” if same on all replicas
- file region “defined” if it is consistent and clients see what each mutation wrote in its entirety
- file broken into multiple writes if too big, or across chunk boundaries
- a write might fail on some replicas and be retried
- writes sent to replicas in same order, so “consistency is common” :)
- applications are responsible for validating the data they read (e.g., checksums, self-identifying records)
Inconsistencies
- concurrent writes leave regions consistent but undefined; failed writes leave them inconsistent
- need to tell the difference between defined (each mutation in entirety) and undefined
- most such applications don't need strict consistency anyway
Locking
- no per-dir state
- lookup table mapping full pathnames to metadata
- to modify a leaf node (metainfo for a file; sketched below)
- start at root, read-locking all the way down
- exclusive lock on the file
- SIX: shared and intention exclusive lock
- IX: intention exclusive lock
- X: Exclusive lock
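A minimal sketch of the pathname locking above, assuming a flat lock table keyed by full pathname (the type and function names are made up, not GFS code):

```go
package namespace

import (
	"strings"
	"sync"
)

// LockTable maps full pathnames to locks; there is no per-directory structure.
type LockTable struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func New() *LockTable {
	return &LockTable{locks: make(map[string]*sync.RWMutex)}
}

func (t *LockTable) lockFor(path string) *sync.RWMutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.locks[path] == nil {
		t.locks[path] = &sync.RWMutex{}
	}
	return t.locks[path]
}

// LockForMutation read-locks /a and /a/b for path /a/b/c, then write-locks
// /a/b/c itself; the returned func releases everything in reverse order.
func (t *LockTable) LockForMutation(path string) (unlock func()) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	var held []func()
	prefix := ""
	for i, p := range parts {
		prefix += "/" + p
		l := t.lockFor(prefix)
		if i == len(parts)-1 {
			l.Lock() // exclusive lock on the leaf (the file's metadata)
			held = append(held, l.Unlock)
		} else {
			l.RLock() // shared lock on each ancestor directory
			held = append(held, l.RUnlock)
		}
	}
	return func() {
		for i := len(held) - 1; i >= 0; i-- {
			held[i]()
		}
	}
}
```

Because ancestors are only read-locked, many mutations in the same directory can proceed in parallel; only operations on the same leaf serialize.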
Errata
- Multi-master is “Colossus” (CFS)
- Quinlan: “in retrospect I think the consensus is that [record append] proved to be more painful than it was worth”
- Borg
- File counts became a problem (storage size not really a problem)
- Map-reduce starts by sharding data across a lot of different machines; keeping track of all those shards becomes a bottleneck for a single master on a single server
- protocol buffer serialization format: why is it better than JSON? JSON text is huge; protobufs use as few bits as possible on the wire (see the size comparison after this list)
- ZeroMQ
- Google RPC
- what changed when Google bought YouTube: latency became more important
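To make the protobuf-vs-JSON size point concrete, a tiny Go comparison; the protobuf bytes are hand-assembled from the wire format (field 1, varint) rather than produced by generated code:

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

func main() {
	// JSON carries the field name and digits as text: {"id":150} is 10 bytes.
	j, _ := json.Marshal(map[string]int{"id": 150})
	fmt.Printf("JSON:     %s (%d bytes)\n", j, len(j))

	// The protobuf wire format for the same one-field message is a tag byte
	// (field number 1, varint wire type) plus a varint-encoded value: 3 bytes.
	pb := []byte{1<<3 | 0} // tag = (field_number << 3) | wire_type
	pb = binary.AppendUvarint(pb, 150)
	fmt.Printf("protobuf: % x (%d bytes)\n", pb, len(pb))
}
```

The field name never appears on the wire; the schema lives in the .proto file instead, which is where most of the savings over JSON come from.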
Ceph
- Enormous HPC file system
- tens or hundreds of thousands of OSDs
- meant for scientific workloads (like GlusterFS or HadoopFS)
- OSDs: intelligent object storage devices (autonomous, like computers in their own right - think a Linux box with one big disk)
- Expensive!
- Motivation
- metadata operations make up as much as half of FS workloads
- metadata operations don’t scale
- petabyte scale systems are inherently dynamic (usage will change dramatically from one job to the next)
- many different clients with different needs
- What they did:
- huge engineering effort
- CRUSH: data placement computed by a distribution function rather than stored as per-object state
- replication, failure, recovery handled by OSDs
- dynamic subtree partitioning
- Consistency
- generally strict, but
- dirs + inodes sent at same time (ReadDirAll), inodes cached briefly
- O_LAZY (an open flag from the proposed POSIX HPC extensions, not standard POSIX) allows read and write buffering w/ multiple clients (application sets its tolerance for staleness)
- Data layout w/ CRUSH
- object names are just the file's inumber plus a stripe number
- files are striped across sequences of objects
- objects assigned to “placement group” w/ hash
- placement groups mapped to OSDs w/ CRUSH and the OSD cluster map (toy version of the whole mapping sketched after this list)
- Any party can find any object in a completely distributed fashion (don’t have to go to metadata server)
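A toy Go version of that file -> object -> placement group -> OSD mapping; a plain FNV hash stands in for the real CRUSH function, and the stripe size, PG count, and replica count are assumed values:

```go
package layout

import (
	"fmt"
	"hash/fnv"
)

const (
	stripeSize = 4 << 20 // assumed 4 MB objects
	pgCount    = 1024    // assumed number of placement groups
	replicas   = 3       // assumed replication factor
)

// ObjectName: an object is named by the file's inode number plus a stripe index.
func ObjectName(ino, offset uint64) string {
	return fmt.Sprintf("%x.%08x", ino, offset/stripeSize)
}

// PlacementGroup hashes an object name into one of the placement groups.
func PlacementGroup(object string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(object))
	return h.Sum32() % pgCount
}

// OSDsFor maps a placement group to a set of OSDs. Real Ceph runs CRUSH over
// the OSD cluster map here; this toy just derives deterministic
// pseudo-random OSD ids from the PG number.
func OSDsFor(pg uint32, numOSDs int) []int {
	osds := []int{}
	seen := map[int]bool{}
	for try := 0; len(osds) < replicas && try < 10*numOSDs; try++ {
		h := fnv.New32a()
		fmt.Fprintf(h, "%d-%d", pg, try)
		osd := int(h.Sum32() % uint32(numOSDs))
		if !seen[osd] {
			seen[osd] = true
			osds = append(osds, osd)
		}
	}
	return osds
}
```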
- Dynamic Subtree Partitioning - Ceph deals with hotspots by dynamically mapping subtrees of the directory hierarchy to metadata servers based on the current workload. Individual directories are hashed across multiple nodes only when they become hot spots
- EBOFS - user level file system that runs on the different OSDs - “existing kernel interface limits our ability to understand when object updates are safely committed on disk”
Next Project! Serialization, Persistence, and Immutability
due Sept 30…
Use LBFS chunking to split the data from a file into content-defined chunks and store each chunk separately in a key-value store called LevelDB (Go bindings available); a rough sketch of the chunking follows the example commands below.
go run protoget.go db list
go run protoget.go db head
go run protoget.go db /
go run protoget.go db /main.go
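A rough sketch of the chunking half of the project, with a simple polynomial rolling hash standing in for LBFS's Rabin fingerprint and an in-memory map standing in for LevelDB; the 48-byte window and 13-bit boundary mask are assumptions (LBFS-ish, ~8 KB average chunks):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	windowSize = 48            // bytes in the rolling-hash window
	boundary   = (1 << 13) - 1 // low 13 bits -> ~8 KB average chunks
	prime      = 31
)

// chunk splits data at content-defined boundaries: wherever the rolling hash
// of the last windowSize bytes matches the boundary mask.
func chunk(data []byte) [][]byte {
	var pow uint64 = 1 // prime^(windowSize-1), to drop the outgoing byte
	for i := 0; i < windowSize-1; i++ {
		pow *= prime
	}
	var chunks [][]byte
	var hash uint64
	start := 0
	for i := 0; i < len(data); i++ {
		if i >= start+windowSize {
			hash -= uint64(data[i-windowSize]) * pow
		}
		hash = hash*prime + uint64(data[i])
		if i >= start+windowSize-1 && hash&boundary == boundary {
			chunks = append(chunks, data[start:i+1])
			start, hash = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// store puts each chunk under its SHA-256 key; the map stands in for
// LevelDB's Put(key, value).
func store(db map[[32]byte][]byte, chunks [][]byte) [][32]byte {
	keys := make([][32]byte, len(chunks))
	for i, c := range chunks {
		k := sha256.Sum256(c)
		db[k] = c
		keys[i] = k
	}
	return keys
}

func main() {
	db := map[[32]byte][]byte{}
	data := []byte("contents of some file ...")
	keys := store(db, chunk(data))
	fmt.Printf("%d chunks, %d unique in the store\n", len(keys), len(db))
}
```

Because boundaries depend only on content, an edit near the start of a file only changes the chunks around the edit; unchanged chunks hash to the same keys and dedupe in the store.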