August 17, 2019
In “Sharding the Shards: Managing Datastore Locality at Scale with Akkio”, Annamalai et al. present Akkio, a locality management service that can be tacked onto an application (like a social media platform) and its distributed datastore to strategically migrate data to the places where it is being accessed. The premise of Akkio is that for a subset of applications, data access is much more strongly correlated with locality (specifically, the locations of the users who will access the data) than with other metadata features like time (e.g. how recent the social media post is) or type (e.g. whether the content is a picture or a video). Essentially, Akkio tracks accesses to determine how to distribute data and when to move it around. From a business standpoint, the goals of Akkio are (1) to improve the user experience by reducing app response time and (2) to make the backend more efficient, avoiding expensive caching schemes (or even worse, the cost of fully replicating an enormous amount of data) and better adapting to shifting access patterns as daylight hours move across the globe.
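The track-then-migrate idea can be illustrated with a toy policy. This is only a sketch, not Akkio's actual algorithm; the `AccessTracker` class, the majority-traffic threshold, and the datacenter names are all assumptions made for illustration:

```python
from collections import Counter, defaultdict

class AccessTracker:
    """Toy locality policy: count accesses per (shard, datacenter) and
    propose migrating a shard once a remote site dominates its traffic.
    Illustrative only -- not Akkio's real design."""

    def __init__(self, migrate_threshold=0.7, min_accesses=10):
        self.counts = defaultdict(Counter)  # shard -> {datacenter: hits}
        self.threshold = migrate_threshold  # fraction of traffic a remote site must own
        self.min_accesses = min_accesses    # don't migrate on thin evidence

    def record(self, shard, datacenter):
        self.counts[shard][datacenter] += 1

    def placement(self, shard, current_dc):
        """Return the datacenter this shard should live in right now."""
        hits = self.counts[shard]
        total = sum(hits.values())
        if total < self.min_accesses:
            return current_dc  # not enough signal yet
        top_dc, top_hits = hits.most_common(1)[0]
        # Migrate only when another site clearly dominates the traffic.
        if top_dc != current_dc and top_hits / total >= self.threshold:
            return top_dc
        return current_dc

tracker = AccessTracker()
for _ in range(9):
    tracker.record("user:42", "eu-west")  # most reads come from Europe
tracker.record("user:42", "us-east")
print(tracker.placement("user:42", current_dc="us-east"))  # -> eu-west
```

A real system would also have to weigh migration cost against the expected latency win, which is where the interesting engineering lives.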
April 05, 2019
Model Selection Tutorial with Yellowbrick
March 07, 2019
It’s no secret that data scientists love scikit-learn, the Python machine learning library that provides a common interface to hundreds of models. But aside from the API, the useful feature extraction tools, and the sample datasets, two of the best things that scikit-learn has to offer are pipelines and (model-specific) pickles. Unfortunately, using pipelines and pickles together can be a bit tricky. In this post I’ll present a common problem that occurs when serializing and restoring scikit-learn pipelines, as well as a solution that I’ve found to be both practical and not hacky.
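As background, round-tripping a fitted pipeline through `pickle` looks roughly like this (a minimal sketch with toy data; the post's actual problem and solution are its own):

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a small pipeline (scaler + classifier) on toy data.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [2.0, 2.0]])
y = np.array([0, 1, 0, 1])
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Serialize the entire pipeline -- fitted transformers and model together.
blob = pickle.dumps(pipe)

# Restore it later, e.g. in a prediction service, and use it directly.
restored = pickle.loads(blob)
assert (restored.predict(X) == pipe.predict(X)).all()
```

The appeal is that one object carries both the preprocessing and the model, so predictions stay consistent between training and serving.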
March 02, 2019
In this post, I’ll present a comparison of the experimental results of several published implementations of consensus methods for wide-area/geo replication. For each, I’ll attempt to capture which experiments were reported, what quorum sizes were used, and what the throughput and latency numbers were.
February 24, 2019
There is, at best, a tenuous relationship between the emerging field of DevOps and the more prevalent but still incipient one of Data Science. Data Science (when it works) works best by contributing exploratory data analysis and munging, hypothesis tests, rapid prototypes, and non-deterministic features. But then who is responsible for transforming ad hoc EDA and data cleaning steps into ETL pipelines? Who containerizes the experimental code and models into packages for deployment? Who sets up CI tools to monitor deployed code as new predictive features are integrated? Increasingly, such responsibilities are becoming the domain of the DevOps Specialist. And if the mythical Data Scientist (as the world imagined her 5 years ago during Peak Data Science Mysticism) was a mage capable of squeezing predictive blood from data stones, the lionhearted DevOps Specialist is now the real hero of the story — Moses holding back the Red Sea while the Data Scientists sit in tidepools digging in the sand for little crabs.