Snake_Bytes #4: A Simple Classifier

In the first installment of Snake_Bytes, Ted introduced our readers to the dot product, which prompted a reader question about vector multiplication, and... fruit. To illustrate an application of the dot product in statistical modeling, it is useful to think of a vector as a collection of coordinates (also sometimes referred to as attributes, parameters, or features) that define the location of a point in \( \mathbf{R}^n \). With a little rearrangement of equation 2 from Ted's post, we see that the dot product of two L2-normed vectors is equal to the cosine of the angle between those vectors:

\[\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}\]

The normalization bounds the inner product in [-1, 1], which makes this metric convenient for interpreting object (point) similarities.
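As a quick check of that bound, here is a minimal sketch that computes the cosine similarity as the dot product of two L2-normed vectors. The helper name cosine_similarity and the example vectors are assumptions made for illustration; they do not come from Ted's post.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the L2-normed versions of a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

# Vectors pointing the same way, opposite ways, and at right angles.
same_direction = cosine_similarity([3, 4], [6, 8])
opposite = cosine_similarity([3, 4], [-3, -4])
orthogonal = cosine_similarity([1, 0], [0, 1])

print(same_direction, opposite, orthogonal)
```

The three cases land at (approximately) the top, the bottom, and the middle of the [-1, 1] range, regardless of how long the input vectors are.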

To take this a step further, rather than thinking about quantities of fruits as posed in Ms. Dalton's question, imagine instead we need to create a classifier that allows a computer to easily distinguish between two very different kinds of fruits: apples and bananas.  Since the typical colors of bananas and apples differ so strongly, we can use a simple device that measures only 2 properties – the "redness" and the "yellowness" – of a large sample of known apples and bananas.  Let's use Python to generate a table of what such measurements may look like:
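The original listing is not reproduced here, so what follows is only a minimal sketch of how such a table might be simulated. The means, the spread, the clipping to [0, 1], and the fixed seed are all assumptions chosen for illustration; only the name fruit_samples and its layout (1000 rows, banana columns first, then apple columns) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumed) so the table is repeatable
n_samples = 1000

# Assumed [red, yellow] means: bananas mostly yellow, apples mostly red.
bananas = rng.normal(loc=[0.2, 0.8], scale=0.1, size=(n_samples, 2))
apples = rng.normal(loc=[0.8, 0.2], scale=0.1, size=(n_samples, 2))

# Columns 1-2: banana [red, yellow]; columns 3-4: apple [red, yellow].
# Clipping keeps the simulated readings inside an assumed 0-to-1 sensor range.
fruit_samples = np.clip(np.hstack([bananas, apples]), 0.0, 1.0)

print(fruit_samples.shape)
print(fruit_samples[:5])
```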


Note that in this highly simplistic 2-dimensional model we are assuming that (ripe) bananas are mostly yellow, whereas most apples are red. For the sake of convenience, we've structured the fruit_samples matrix with the 1000 observations going down the rows, the [red, yellow] measurements for bananas in the first two columns, and the measurements for apples in columns 3 and 4. With these measurements, we can define the 'exemplar' banana and the 'exemplar' apple as the average [red, yellow] vector over all samples of that fruit.
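A minimal sketch of that averaging step might look like the following. The simulated data repeats assumed parameters (means, spread, seed), and storing the exemplars as L2-normed rows of fruit_model is also an assumption about how the model matrix was laid out; the 'a banana' entry in fruit_names is assumed by analogy with the 'an apple' label used later in the post.

```python
import numpy as np

# Simulated measurements with assumed parameters (not the original data).
rng = np.random.default_rng(0)
bananas = rng.normal(loc=[0.2, 0.8], scale=0.1, size=(1000, 2))
apples = rng.normal(loc=[0.8, 0.2], scale=0.1, size=(1000, 2))
fruit_samples = np.clip(np.hstack([bananas, apples]), 0.0, 1.0)

# Average down the rows, separately for each fruit's pair of columns.
banana_exemplar = fruit_samples[:, 0:2].mean(axis=0)
apple_exemplar = fruit_samples[:, 2:4].mean(axis=0)

# One exemplar per row, L2-normed so that a dot product with another
# L2-normed vector is a cosine similarity.
fruit_model = np.vstack([banana_exemplar, apple_exemplar])
fruit_model = fruit_model / np.linalg.norm(fruit_model, axis=1, keepdims=True)

# Row order of fruit_model; the labels are assumed.
fruit_names = {0: 'a banana', 1: 'an apple'}

print(fruit_model)
```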


Using the average of our measurements allows us to easily use dot products when we want to classify new [red, yellow] measurements as either bananas or apples: intuitively, if a test vector is more yellow than it is red, it should be 'closer' to the exemplar banana, which in turn will be reflected in the cosine similarity score. The following plot shows all 1000 samples of bananas and apples plotted according to the [red, yellow] vector measured on each, along with the exemplar banana and apple shown as a large diamond and circle, respectively.

[Figure: plot of all simulated bananas (yellow) and apples (red)]

Now we can create a function that takes as input a test [red, yellow] color vector, the dictionary that identifies the order and names of our sample fruits, and our model fruit matrix. We use the numpy library's dot function to compute the cosine similarity, fruit_similarity, between the input test_fruit vector and each exemplar in the fruit_model matrix (for the dot product to be a true cosine, both sides must be L2-normed):
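Since the original function is not shown, here is a minimal sketch of what it could look like. The function name classify_fruit is hypothetical, the exemplar values in fruit_model are assumed rather than taken from the post's data, and the sketch assumes one L2-normed exemplar per row.

```python
import numpy as np

def classify_fruit(test_fruit, fruit_names, fruit_model):
    """Return (fruit_similarity, fruit_estimate) for a [red, yellow] vector.

    Assumes fruit_model holds one L2-normed exemplar per row, in the
    order given by the keys of fruit_names.
    """
    test_fruit = np.asarray(test_fruit, dtype=float)
    test_fruit = test_fruit / np.linalg.norm(test_fruit)  # L2-norm the input
    fruit_similarity = np.dot(fruit_model, test_fruit)    # one cosine per exemplar
    fruit_estimate = fruit_names[int(np.argmax(fruit_similarity))]
    return fruit_similarity, fruit_estimate

# Assumed exemplars for illustration: a mostly yellow banana, a mostly red apple.
fruit_names = {0: 'a banana', 1: 'an apple'}
fruit_model = np.array([[0.2, 0.8], [0.8, 0.2]])
fruit_model = fruit_model / np.linalg.norm(fruit_model, axis=1, keepdims=True)

fruit_similarity, fruit_estimate = classify_fruit([0.9, 0.1], fruit_names, fruit_model)
print(fruit_similarity, fruit_estimate)
```

Because these exemplars are assumed, the similarity scores this sketch prints will not match the values quoted in the post, although the winning label should be the same.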


Let's have a look at the resulting fruit_similarity array given a test_fruit input of an apple with [red, yellow] vector [0.9, 0.1]:

The numbers in this two-element array are the cosine similarity scores between our test vector and each exemplar: the banana in the 0th element (0.263) and the apple in the first element (0.999). We then simply use the numpy argmax function to return the index of the largest value in this array and pass that index to our fruit_names dictionary, which returns the expected fruit_estimate: 'an apple'.
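The lookup step can be sketched on its own using the two scores quoted above. The dictionary contents are partly assumed: 'an apple' appears in the text, while 'a banana' is filled in by analogy.

```python
import numpy as np

fruit_names = {0: 'a banana', 1: 'an apple'}  # 'a banana' label is assumed

# Scores quoted in the text: banana first, apple second.
fruit_similarity = np.array([0.263, 0.999])

best_index = int(np.argmax(fruit_similarity))  # index of the largest score
fruit_estimate = fruit_names[best_index]

print(fruit_estimate)
```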


With only two dimensions, it is a bit difficult to appreciate the power of this method... one could just as easily pick the index of the larger color value and call that the winner. But what if we had yellow and green apples in addition to red, or unripe bananas? Better yet, what if we wanted to create a program to distinguish between all different kinds of fruits? We would clearly need a more sophisticated model, based on a higher-dimensional set of observed features such as color, shape, texture, size, etc. Now the power of a dot product becomes more evident: a single dot product compares a test vector against every feature of an exemplar simultaneously, so the same classifier carries over to this larger feature space unchanged. However, if the space gets too large, and/or the sample vectors are sparse, some interesting problems arise. We will explore the consequences of these effects in future posts.
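To make the higher-dimensional case concrete, here is a minimal sketch with three fruits and four features. Everything in it is hypothetical: the feature set [red, yellow, green, elongation], the exemplar values, the added lime class, and the test vector are invented for illustration and are not measurements from this post.

```python
import numpy as np

# Hypothetical feature order: [red, yellow, green, elongation]
fruit_names = {0: 'a banana', 1: 'an apple', 2: 'a lime'}
fruit_model = np.array([
    [0.1, 0.9, 0.2, 0.9],  # banana exemplar (assumed values)
    [0.9, 0.1, 0.2, 0.1],  # apple exemplar (assumed values)
    [0.1, 0.2, 0.9, 0.2],  # lime exemplar (assumed values)
])
# L2-norm each exemplar so the dot products below are cosine similarities.
fruit_model = fruit_model / np.linalg.norm(fruit_model, axis=1, keepdims=True)

# A mostly green, roundish test fruit.
test_fruit = np.array([0.2, 0.3, 0.8, 0.1])
test_fruit = test_fruit / np.linalg.norm(test_fruit)

# One matrix-vector product scores the test fruit against every exemplar.
fruit_similarity = np.dot(fruit_model, test_fruit)
fruit_estimate = fruit_names[int(np.argmax(fruit_similarity))]

print(fruit_similarity, fruit_estimate)
```

The code that does the classification is the same two lines as in the 2-dimensional case; only the shape of fruit_model has changed.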

About W Bryan Smith

Bryan is a research scientist with expertise in computational statistics and machine learning. As Chief Scientist at PokitDok, Bryan ensures that a scientific approach is applied across the business, from operations to analytics product development. He holds a PhD in Neuroscience from Caltech.
