Capsicum

RDF Basics

July 02, 2018

The Resource Description Framework, or RDF, is a standard model for data interchange that allows structured and semi-structured data to be shared across different applications. RDF expresses relationships between entities as triples: essentially a graph that links unique URIs via edges that describe their relationships. In this post, we’ll use the Python library rdflib to build a graph from RDF data about products and to extract information about individual products.
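To make the triple model concrete, here is a minimal sketch in plain Python. It does not use rdflib; the namespace, the product data, and the `match` helper are all hypothetical and only illustrate how a set of (subject, predicate, object) triples can be queried like a graph.

```python
# A graph is just a set of (subject, predicate, object) triples.
# URIs are represented as plain strings; EX is a made-up namespace.
EX = "http://example.org/"

triples = {
    (EX + "product/1", EX + "name", "Red bell pepper"),
    (EX + "product/1", EX + "category", EX + "category/vegetables"),
    (EX + "product/2", EX + "name", "Jalapeno"),
    (EX + "product/2", EX + "category", EX + "category/vegetables"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything known about one product (all edges leaving that node).
product_1 = match(triples, s=EX + "product/1")

# Every subject linked to the vegetables category.
vegetables = [
    s for s, _, _ in match(
        triples, p=EX + "category", o=EX + "category/vegetables"
    )
]

for triple in product_1:
    print(triple)
print(vegetables)
```

A dedicated library adds parsing, serialization, and typed terms on top of this idea, but the wildcard pattern match is the core of extracting information about an individual resource.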

Introduction to Document Similarity with Elasticsearch

June 25, 2018

In a text analytics context, document similarity relies on reimagining texts as points in space that can be close (similar) or far apart (different). However, it’s not always straightforward to determine which document features should be encoded into a similarity measure (words/phrases? document length/structure?). Moreover, in practice it can be challenging to find similar documents quickly and efficiently given some input document. In this post I’ll explore some of the similarity tools implemented in Elasticsearch, which can enable us to improve search speed without having to sacrifice too much in the way of nuance.
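As a rough illustration of the "texts as points in space" idea, the sketch below turns each document into a bag-of-words count vector and compares vectors with cosine similarity. This is an assumption chosen for illustration, not a description of how Elasticsearch scores documents; the sample texts and function names are made up.

```python
import math
from collections import Counter


def vectorize(text):
    """Encode a text as word counts (one dimension per distinct word)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = {
    "a": "the cat sat on the mat",
    "b": "the cat lay on the rug",
    "c": "stock prices fell sharply today",
}

# Score every document against document "a".
query = vectorize(docs["a"])
scores = {key: cosine_similarity(query, vectorize(text))
          for key, text in docs.items()}

for key, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(key, round(score, 3))
```

Comparing the query against every document in this way scales linearly with the corpus, which is the cost that an index-backed search engine is meant to avoid.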

Visualizing High-Performance Gradient Boosting with XGBoost and Yellowbrick

June 13, 2018

In this post we’ll explore how to evaluate the performance of a gradient boosting classifier from the xgboost library on the poker hand dataset using visual diagnostic tools from Yellowbrick. Even though Yellowbrick is designed to work with scikit-learn, it turns out that it works well with any machine learning library that provides a sklearn wrapper module.

Creating Categorical Values from Continuous Values

March 17, 2018

A lot of machine learning problems in the real world suffer from the curse of dimensionality: you’ve got fewer training instances than you’d like, and predictive signal is distributed (often unpredictably!) across many different features. Sometimes when your target is continuous, there simply aren’t enough instances to predict its values with the precision that regression demands. In this case, we can sometimes transform the regression problem into a classification problem by binning the continuous values into makeshift classes. But how do we pick the bins? In this post, I’ll walk through a case study, starting with a naive approach and moving to a more informed strategy using the visual diagnostics library Yellowbrick.
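The choice of bins matters because it determines how balanced the resulting classes are. The sketch below contrasts two common options on made-up, skewed target values: equal-width bins and quantile-based bins. Both are assumptions chosen for illustration and are not necessarily the strategies the post arrives at.

```python
import bisect
import statistics

# Made-up continuous target values with one large outlier.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 20.0]


def equal_width_edges(data, n_bins):
    """Interior bin edges that split the value range into equal widths."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def quantile_edges(data, n_bins):
    """Interior bin edges that put roughly equal counts in each bin."""
    return statistics.quantiles(data, n=n_bins)


def assign_bins(data, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect.bisect_right(edges, v) for v in data]


equal_width_labels = assign_bins(values, equal_width_edges(values, 4))
quantile_labels = assign_bins(values, quantile_edges(values, 4))

# Class sizes per bin index show how balanced each scheme is.
print([equal_width_labels.count(b) for b in range(4)])
print([quantile_labels.count(b) for b in range(4)])
```

With skewed data, equal-width bins tend to crowd most instances into one class and leave others nearly empty, while quantile bins keep the classes balanced at the cost of uneven bin widths.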

Colorizing text based on part-of-speech tags

March 07, 2017

In this post, I’ll describe the use case and preliminary implementation of a new Yellowbrick feature that lets the user print colorized text in which each color marks a different part of speech.
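The general idea can be sketched with ANSI terminal escape codes. This is not the Yellowbrick implementation: the tag names, the color mapping, and the `colorize` helper are hypothetical, and the tokens are tagged by hand rather than by a real tagger.

```python
# ANSI escape codes for a few foreground colors (arbitrary mapping).
COLORS = {
    "NOUN": "\033[34m",  # blue
    "VERB": "\033[32m",  # green
    "ADJ": "\033[31m",   # red
}
RESET = "\033[0m"


def colorize(tagged_tokens):
    """Wrap each (word, tag) pair in the color assigned to its tag."""
    parts = []
    for word, tag in tagged_tokens:
        color = COLORS.get(tag)
        # Words with an unmapped tag are left uncolored.
        parts.append(f"{color}{word}{RESET}" if color else word)
    return " ".join(parts)


# Hand-tagged example sentence; a real pipeline would use a POS tagger.
tagged = [("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB")]
colored = colorize(tagged)
print(colored)
```

The colors only render in terminals that interpret ANSI escape sequences; elsewhere the raw codes appear in the output.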