Snake_Bytes #5: Vector Distances

Last time, we looked at the application of the dot product in a simple classifier, and near the end Bryan set the stage for the next post in this mini-series: a deeper dive into the differences between vector distances and their downstream effect on classification models. In the previous post, we classified two-dimensional vectors into two classes: apples or bananas. But what happens when we work in higher-dimensional feature spaces and/or introduce sparsity into our problem? Which vector distance is the preferred approach for labeling inputs in higher-dimensional vector spaces?

We've already seen one distance metric: cosine similarity. Mathematically speaking, cosine similarity measures the cosine of the angle between two vectors, indicating how similar they are in direction (it ignores their magnitudes). But there are other ways to measure the distance between two entities. To illustrate this, let's introduce three additional ways to calculate the distance between vectors.
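As a quick refresher, SciPy exposes this as a distance rather than a similarity: scipy.spatial.distance.cosine returns 1 minus the cosine similarity. A minimal sketch (the vectors here are made up for illustration):

    from scipy.spatial import distance

    # two arbitrary vectors pointing in the same direction
    v = [1.0, 2.0, 3.0]
    w = [2.0, 4.0, 6.0]

    # SciPy returns the cosine *distance*, i.e. 1 - cosine similarity
    print(distance.cosine(v, w))      # ~0.0, the directions are identical
    print(1 - distance.cosine(v, w))  # cosine similarity, ~1.0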

Vector Distance Functions

For all of the definitions below, let's assume that we have two vectors of real numbers of the same length:

\[v, w \in \mathbb{R}^n\]

Euclidean Distance

Euclidean distance is one of the most widely used distance functions (and debatably one of the worst for evaluating vectors in high dimensions...). Regardless, it deserves a shout out. Euclidean distance calculates the direct, measurable distance between two points. This can be easily visualized in two dimensions: imagine you and a buddy are standing far away from each other at a local park. The measurable distance in a straight line between you and your friend is the Euclidean distance between your two positional vectors (assuming you two spend the time to map out a Cartesian coordinate system).

For any vector, this is calculated as follows:

\[EuclideanDist(v, w) = \sqrt{\sum_{i=0}^{n-1}(v_i - w_i)^2}\]

In Python, we reach out to SciPy to get this done:
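A minimal sketch using scipy.spatial.distance.euclidean (the coordinates are made up to stand in for you and your friend at the park):

    from scipy.spatial import distance

    # arbitrary 2-dimensional position vectors
    you    = [0.0, 0.0]
    friend = [3.0, 4.0]

    # straight-line distance: sqrt((0 - 3)^2 + (0 - 4)^2) = 5.0
    print(distance.euclidean(you, friend))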

City Block Distance

This time, you and your friend have traveled to New York City and are trying to walk to the Empire State Building. The Euclidean distance between you and the ESB does not represent how far you will have to walk to get there because you are forced to use the street system. This type of distance is referred to as the city block distance because you can't just cut through buildings to get from point A to point B. Formally,

\[CityBlockDist(v, w) = \sum_{i=0}^{n-1}\left|v_i - w_i\right|\]

And in SciPy:
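A similar sketch with scipy.spatial.distance.cityblock (the grid coordinates are made up, standing in for avenue/street positions):

    from scipy.spatial import distance

    # arbitrary grid positions: (avenue, street)
    you = [2.0, 5.0]
    esb = [5.0, 34.0]

    # |2 - 5| + |5 - 34| = 3 + 29 = 32 blocks of walking
    print(distance.cityblock(you, esb))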

Minkowski Distance

Cosine, Euclidean, and City Block distances can be mentally visualized (relatively easily) in different scenarios in two dimensions. However, vector space models can introduce a high-dimensional space (much larger than 2), and understanding the concept of distance in these high-dimensional models is integral to deciphering their results. Try visualizing the same distance problems above (you in a park vs. you in a city), but this time apply the scenarios to a 3rd dimension. Okay, maybe that isn't too bad. But what about a 4th dimension? 5th? 10000th? At some point, it becomes difficult to mentally grasp the concept of distance in higher dimensions, and the interpretation of the previously discussed metrics becomes debatable.

To understand and calculate distance in these higher dimensions, we need a new metric: Minkowski Distance. The Minkowski Distance metric is widely regarded as one of the best-performing distance metrics in distance-based classification models, like k-Nearest Neighbors (k-NN). This metric can be thought of as a generalization of the Euclidean and City Block distances, parameterized by an order p: in the definition and example below, p=1 gives the City Block distance and p=2 gives the Euclidean distance.

Formally,

\[MinkowskiDist(v, w) = \left(\sum_{i=0}^{n-1}\left|v_i - w_i\right|^p\right)^{1/p}\]

And in SciPy:
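A sketch with scipy.spatial.distance.minkowski, showing how p=1 and p=2 recover the two metrics above (the vectors are arbitrary):

    from scipy.spatial import distance

    # two arbitrary 5-dimensional vectors
    v = [1.0, 0.0, 2.0, 3.0, 1.0]
    w = [4.0, 1.0, 0.0, 1.0, 2.0]

    # p = 1 reproduces the City Block distance
    print(distance.minkowski(v, w, p=1))  # same value as distance.cityblock(v, w)

    # p = 2 reproduces the Euclidean distance
    print(distance.minkowski(v, w, p=2))  # same value as distance.euclidean(v, w)

    # larger p puts more weight on the largest coordinate difference
    print(distance.minkowski(v, w, p=4))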

These few distance metrics are just the tip of the iceberg. SciPy's spatial distance package has everything under the sun for calculating the distance between two vectors. Check it out.

Given all of these different distance metrics, it would be interesting to see how they would affect (or not affect) the fruit classification model from our previous post. Additionally, how do these metrics influence the classification of input data as the number of features or classes grows? These are some of the questions we'll dive into as we continue this mini-series on distance metrics and classification models.

In the meantime, check out this paper to get an idea of how these metrics perform on a k-NN model.

About Denise Gosnell, PhD

Dr. Gosnell, a driving member of the PokitDok Data Science team since 2014, has brought her research in applied graph theory to help architect the graph database while also serving as an analytics thought leader. Her work with the Data Science team aims to extract insight from the trenches of hidden data in healthcare and build products to bring the industry into the 21st century. She also helps organize the local chapter of Charleston Data Analytics, a Meetup PokitDok now sponsors, and has represented PokitDok's Data Science Team at numerous conferences including PyData, KDD (Knowledge Discovery & Data Mining), and the inaugural GraphDay.

Prior to PokitDok, she earned her Ph.D. in Computer Science from the University of Tennessee, where she founded a branch of Sheryl Sandberg's Lean In Circle. The goal of this impressive organization is to guide women interested in computer science careers, as TechCrunch noted, and Denise has done that and more.
