February 22, 2019
Recently I joined a new team, which consists of a mixture of Python and Clojure programmers. While this has caused me to spend a little time thinking about Clojure and its application to the kind of work I do, I’m still mainly using Python. Python, together with the third-party libraries available via PyPI, makes a particularly good tool for building NLP, video processing, and machine learning prototypes and microservices in the service of my company’s larger application. But, even though I’m primarily writing Python and reviewing code from data scientists who write in Python, the position change has definitely required some time for mutual acclimation!
February 08, 2019
This past Fall, I took a course in distributed systems at the University of Maryland College Park. As someone working in software development but who didn’t study computer science in school, I went into this class with less of a theoretical and more of an applied understanding of reliability, scalability, and maintainability. I also had a healthy skepticism about the buzzwords that characterize most conversations in industry about the latest tools and techniques. Having now finished the class and covered topics from eventually consistent systems to distributed consensus algorithms and many in between, I thought it would be interesting to revisit reliability, scalability, and maintainability from an industry perspective by reading Martin Kleppmann’s book Designing Data Intensive Applications. In this post I’ll discuss some takeaways and favorite quotes from the first part of the book.
February 04, 2019
Even with a modest dataset, the hunt for the most effective machine learning model is hard. Getting to that optimal combination of features, algorithm, and hyperparameters frequently requires significant experimentation and iteration. This leads some of us to stay inside algorithmic comfort zones, some to trail off on random walks, and others to resort to automated processes like grid search. But whatever path we take, many of us are left in doubt about whether our final solution really is the optimal one. And as our datasets grow in size and dimension, so too does this ambiguity.
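At its core, a grid search is just an exhaustive sweep over every combination of hyperparameter values. Here is a minimal, dependency-free sketch of that idea; the parameter grid and the `score` function are hypothetical stand-ins for a real cross-validated model evaluation:

```python
from itertools import product

# Hypothetical hyperparameter grid to sweep exhaustively.
grid = {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1]}

def score(params):
    # Stand-in for cross-validated model evaluation; this toy
    # function happens to favor max_depth=4 and a higher learning rate.
    return -abs(params["max_depth"] - 4) + params["learning_rate"]

# Enumerate every combination in the grid and keep the best-scoring one.
keys = list(grid)
best = max(
    (dict(zip(keys, combo)) for combo in product(*grid.values())),
    key=score,
)
print(best)  # {'max_depth': 4, 'learning_rate': 0.1}
```

In practice a library routine such as scikit-learn's `GridSearchCV` wraps this same loop with cross-validation and parallelism, but the combinatorial blow-up it implies is exactly why the search gets harder as dimensionality grows.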
January 25, 2019
Happy New Year and welcome to 2019, the final year of Python 2 support! In honor of the new year, here’s a short post on how to convert pickles from your legacy Python 2 codebase to Python 3.
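The core of the conversion hinges on one detail: Python 2 `str` objects come out of a pickle as raw bytes, so Python 3's `pickle.load`/`pickle.loads` accept an `encoding` argument to control how they are decoded. A minimal sketch (the dict contents here are hypothetical; protocol 2 is used because it is the highest protocol Python 2 supports):

```python
import pickle

# Simulate a legacy pickle: Python 2 str values appear as bytes,
# and protocol 2 is the highest protocol Python 2 can write.
legacy_bytes = pickle.dumps({b"name": b"orca", b"count": 3}, protocol=2)

# Under Python 3, pass encoding="bytes" to keep Python 2 str objects
# as bytes (or encoding="latin1" to decode them to str losslessly).
data = pickle.loads(legacy_bytes, encoding="bytes")
print(data)  # {b'name': b'orca', b'count': 3}
```

`encoding="latin1"` is the usual choice for pickles containing NumPy arrays or datetime objects, since it round-trips every byte value without raising `UnicodeDecodeError`.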
December 09, 2018
While batching in distributed systems is great for performance, it adds complexity and can make diagnostics tough. For example, trying to find a small error inside a file inside a commit that has been aggregated together with many other commits is a needle-in-haystack problem. Orca is a search engine for finding such needles.