Sharding the Shards

August 17, 2019

In “Sharding the Shards: Managing Datastore Locality at Scale with Akkio”, Annamalai et al. present Akkio, a locality management service that can be layered between an application (like a social media platform) and its distributed datastore to strategically migrate data to the places where it is being accessed. The premise of Akkio is that for a subset of applications, data access is driven much more by locality (specifically, the locations of the users who are going to access the data) than by other metadata features like time (e.g. how recent the social media post is) or type (e.g. whether the content is a picture or a video). Essentially, Akkio tracks accesses to determine how to distribute data and when to move it around. From a business standpoint, the goals of Akkio are to (1) improve the user experience by reducing app response time and (2) make the backend more efficient, avoiding expensive caching schemes (or even worse, the cost of full replication for an enormous amount of data) and better adapting to shifting access patterns as daylight hours shift across the globe.

Model Selection Tutorial with Yellowbrick

April 05, 2019

Module Main has No Attribute... (on Pipelines and Pickles)

March 07, 2019

It’s no secret that data scientists love scikit-learn, the Python machine learning library that provides a common interface to hundreds of machine learning models. But aside from the API, the useful feature extraction tools, and the sample datasets, two of the best things that scikit-learn has to offer are pipelines and (model-specific) pickles. Unfortunately, using pipelines and pickles together can be a bit tricky. In this post I’ll present a common problem that occurs when serializing and restoring scikit-learn pipelines, as well as a solution that I’ve found to be both practical and not hacky.

The Georeplication Bake-off

March 02, 2019

In this post, I’ll present a comparison of the experimental results of several published implementations of consensus methods for wide-area/geo replication. For each, I’ll attempt to capture which experiments were reported, what quorum sizes were used, and what the throughput and latency numbers were.

Boxing and Unboxing - Kubernetes for ML

February 24, 2019

There is, at best, a tenuous relationship between the emerging field of DevOps and the more prevalent but still incipient one of Data Science. Data Science (when it works) works best by contributing exploratory data analysis and munging, hypothesis tests, rapid prototypes, and non-deterministic features. But then who is responsible for transforming ad hoc EDA and data cleaning steps into ETL pipelines? Who containerizes the experimental code and models into packages for deployment? Who sets up CI tools to monitor deployed code as new predictive features are integrated? Increasingly, such responsibilities are becoming the domain of the DevOps Specialist. And if the mythical Data Scientist (as the world imagined her 5 years ago during Peak Data Science Mysticism) was a mage capable of squeezing predictive blood from data stones, the lionhearted DevOps Specialist is now the real hero of the story — Moses holding back the Red Sea while the Data Scientists sit in tidepools digging in the sand for little crabs.