Optimizing cloud storage (but sacrificing privacy?)

Reading time ~2 minutes

What if we thought of files not as data but as recipes for creating data? In the log-structured file system paper, we were asked to think of files not as complete entities assigned some fixed place on disk, but as pieces that could be stored separately and later aggregated on demand. In “Knockoff: Cheap versions in the cloud,” Dou et al. present a new approach: a file system that examines how files are constructed and uses that information to decide whether to send and store a file in the traditional way (i.e., represented as its data) or as “recipes,” logs of all the system calls, operations, mouse movements, and clicks that went into producing it.

Lazy Cloud Users

Up until now, we’ve been reading a lot of papers from the ’90s. Flash forward to 2017. Cloud storage is convenient, but it isn’t cheap. The authors begin by explaining that there are essentially two ways to optimize cloud-based storage: we can either send fewer messages, thereby consuming less bandwidth, or we can keep less data (or fewer copies), thereby reducing storage costs. But doesn’t that call into question the whole convenience part of the cloud? For users, deciding what and how much to send or store requires a lot of effort, thought, and maintenance. Dou et al.’s Knockoff file system is essentially an attempt to intelligently automate that decision-making.

Files that are Data, and Files that are Not

What’s unique about the Knockoff file system is that it distinguishes between files that should be sent and stored as data, and those that can be represented more efficiently as logs: “In lieu of the actual file data, we selectively represent a file as a log of the nondeterminism needed to recompute the data (e.g. system call results, thread scheduling, and external data read by a process).” Knockoff can estimate how long that re-computation will take. For files that contain a lot of data but required few steps to create, Knockoff achieves greater efficiency by shipping and storing the log. If the estimated re-computation time is above some threshold, it ships and stores the data instead.
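
To make the decision concrete, here is a minimal sketch in Python. This is my own illustration, not code from the paper: the threshold constant, the `estimated_replay_seconds` field, and the function names are all assumptions about how such a policy could be expressed.

```python
# Hypothetical sketch of the data-vs-log decision; not the paper's actual code.

from dataclasses import dataclass

# Assumed cutoff: if replaying the log would take longer than this, ship the data.
REPLAY_THRESHOLD_SECONDS = 5.0

@dataclass
class FileVersion:
    data: bytes                       # the materialized file contents
    log: list[str]                    # recorded nondeterminism (syscall results, inputs, ...)
    estimated_replay_seconds: float   # predicted time to recompute the data from the log

def representation_to_ship(version: FileVersion) -> str:
    """Choose the cheaper representation to send/store for this version."""
    # If replaying the log is fast enough, the log is usually far smaller than the data.
    if version.estimated_replay_seconds <= REPLAY_THRESHOLD_SECONDS:
        return "log"
    # Otherwise fall back to shipping the raw bytes, as a conventional file system would.
    return "data"
```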

The implementation is very clever, and leverages several of the ideas we’ve already encountered, including “operation shipping” (from the Coda file system) and chunk-based deduplication (from the low-bandwidth network file system paper). I particularly liked the approach to computing the cost of storing data versus a log: the authors build a graph whose nodes are successive versions of a file (and their requisite logs), connected by edges wherever the difference between versions isn’t already materialized in the database. The longest path through this graph then bounds the cost of reconstruction, on the logic that the depth of the graph is proportional to the time needed to reconstitute a file from its logs.
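
Here is a rough sketch of how that longest-path bookkeeping could look, again in Python. The graph representation, edge weights, and version names are my assumptions for illustration; the paper’s actual data structures and cost model differ.

```python
# Hypothetical version graph: edges go from an older version to a newer one that is
# stored only as a log, weighted by the estimated time to replay that log.

from collections import defaultdict
from functools import lru_cache

edges: dict[str, list[tuple[str, float]]] = defaultdict(list)

def add_log_edge(parent: str, child: str, replay_seconds: float) -> None:
    """Record that `child` exists only as a log relative to `parent`."""
    edges[parent].append((child, replay_seconds))

@lru_cache(maxsize=None)
def longest_replay_cost(version: str) -> float:
    """Worst-case time to reconstitute any version reachable from `version`.

    The version graph is acyclic, so a memoized depth-first search suffices.
    (Assumes the graph is fully built before querying.)
    """
    return max(
        (cost + longest_replay_cost(child) for child, cost in edges[version]),
        default=0.0,
    )

# Example: v0 is materialized; v1 and v2 exist only as logs on top of it.
add_log_edge("v0", "v1", 2.0)
add_log_edge("v1", "v2", 3.5)
print(longest_replay_cost("v0"))  # 5.5 seconds in the worst case
```

If that worst-case replay time grows past the threshold, the system can simply materialize a version as data, which resets the chain of logs that must be replayed.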

At the Cost of Privacy?

I suppose my main concern with this approach is the amount of access Knockoff would require to user-level processes, including things like browsing sessions; to me this seems slightly creepy at best, and potentially problematic from a legal and privacy standpoint.
