Snake_Byte #28: Term x Document Matrices via Sklearn's DictVectorizer

One of the places I frequently visit for fun, open-source data sets is Kaggle. It is where data scientists go when they are looking for new and different data sets to play around with. Last year, one of the competitions opened up a labeled data set of Amazon fine food product reviews. For those who enjoy exploring semantic modeling or natural language processing, this is a great data set to get your hands on.

Using the ~500,000 reviews, let's dig in and explore the concept of creating a term by document matrix. At its core, a term by document matrix is a very large, sparse table which records all of the unique words seen within any of the documents. Specifically, each row in the table (matrix) represents one of the documents and each column represents a unique word. A number in the table indicates how many times the word for that column is seen in the document for that row.

As an example, since this data set is about food, suppose the word "apple" is assigned to column 1. If review #1 uses the word "apple" once, then the matrix will have a 1 in the cell at row 1, column 1. If review #2 does not use the word "apple", then the matrix will have a 0 in the cell at row 2, column 1.

Let's build a small term by document matrix with a concrete example. Consider 3 short food reviews:

Review 1: "These red apples are delicious."
Review 2: "I hate these red grapes."
Review 3: "These bananas are delicious."

Typically, when creating these types of data structures, data scientists ignore common words like are, these, and I. These words are called stop words and are not included as columns in the matrix. (stop-words is a Python package that contains typical stop words for many languages.) After filtering out the stop words in our example, the resulting term by document matrix is:

          apples  bananas  delicious  grapes  hate  red
Review 1       1        0          1       0     0    1
Review 2       0        0          0       1     1    1
Review 3       0        1          1       0     0    0

Looking at this small table, it is possible to see why researchers model semantic information with this structure. Reading down the columns, we can start to infer which reviews are similar (or dissimilar) according to which words they share (or do not share). For example, reviews 1 and 2 both contain a 1 in the last column, indicating that they share the word red. Review 3 shares only one word with any other review: delicious (column 3), which also appears in review 1. From this data structure, we can quantify how similar (or dissimilar) any two rows are by computing scores such as the cosine similarity between them.
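As a quick sketch, we can compute cosine similarity between the rows of the table above with NumPy (the function name here is illustrative; sklearn ships its own `cosine_similarity` as well):

```python
import numpy as np

# Row vectors from the small term by document matrix above
# (columns: apples, bananas, delicious, grapes, hate, red).
reviews = np.array([
    [1, 0, 1, 0, 0, 1],  # Review 1
    [0, 0, 0, 1, 1, 1],  # Review 2
    [0, 1, 1, 0, 0, 0],  # Review 3
])

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors: 0 means no shared words."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(reviews[0], reviews[1]))  # share one word: "red"
print(cosine_similarity(reviews[0], reviews[2]))  # share one word: "delicious"
print(cosine_similarity(reviews[1], reviews[2]))  # share nothing
```

Reviews 2 and 3 share no words, so their score is exactly 0, while the pairs that share a single word get small positive scores.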

For the Amazon fine foods review data set, the term by document matrix will have about ~500,000 rows; one row for each product review. As for the number of columns, I counted almost 34,000 unique words in the first 10,000 reviews. Vocabulary growth slows as more documents are added, but the full data set could still contain well over a million unique words, making the matrix roughly ~500,000 rows by up to ~2,000,000 columns. Since each review has only about 100 words, the overwhelming majority of the entries in this matrix will be 0.
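A quick back-of-the-envelope calculation makes that sparsity concrete; the numbers below are the rough estimates quoted above, not exact counts:

```python
# Rough sparsity estimate for the full term by document matrix.
n_rows = 500_000        # one row per review (estimate)
n_cols = 2_000_000      # upper-bound guess at the vocabulary size
words_per_review = 100  # at most ~100 distinct words per review

density = words_per_review / n_cols   # fraction of cells that can be non-zero
total_cells = n_rows * n_cols
nonzero_cells = n_rows * words_per_review

print(density)        # at most 0.005% of the cells are non-zero
print(total_cells)    # a dense matrix would need a trillion cells
```

Even with generous estimates, fewer than one cell in ten thousand holds a non-zero count, which is why the sparse representation discussed below matters.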

Once you have downloaded the data set, you can create a term by document matrix in Python via the sklearn module. The first step is to build a dictionary for each food review with the structure {word: count}. In Python:
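A minimal sketch of that first step, using a toy stop-word list and a simple regex tokenizer; the real stop-words package and the Kaggle CSV are left as comments, since the file and column names may differ in your download:

```python
import re
from collections import Counter

# Toy stop-word list; in practice use the stop-words package instead.
STOP_WORDS = {"a", "an", "and", "are", "i", "is", "it", "of", "the", "these", "this", "to"}

def word_counts(text):
    """Lower-case the review, pull out the words, drop stop words, return {word: count}."""
    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(w for w in words if w not in STOP_WORDS))

# For the real data set, load the review text first, e.g.:
#   import pandas as pd
#   texts = pd.read_csv("Reviews.csv", usecols=["Text"])["Text"]  # assumed file/column names
texts = ["These red apples are delicious.",
         "I hate these red grapes.",
         "These bananas are delicious."]

# One {word: count} dictionary per review.
review_dicts = [word_counts(t) for t in texts]
print(review_dicts[0])  # {'red': 1, 'apples': 1, 'delicious': 1}
```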

With this list of dictionaries, we can use sklearn's DictVectorizer to transform it into a term by document matrix; the details are covered in the DictVectorizer entry of the scikit-learn documentation.

The sparse parameter of DictVectorizer involves a significant trade-off. It defaults to sparse=True, which returns the term by document matrix as a compressed scipy.sparse matrix that takes up far less memory: under the hood, only the non-zero elements are stored, essentially as {column: value} pairs for each row. If you set sparse=False, the term by document matrix is a dense NumPy array, a giant rectangle full of zeros, each of which still occupies 8 bytes as a 64-bit float. Some downstream analysis algorithms run faster on the dense array, but you will need a large amount of memory to hold it.
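A small demonstration of the difference, using a couple of toy documents:

```python
from sklearn.feature_extraction import DictVectorizer

docs = [{"apples": 1, "red": 2}, {"grapes": 1}]

sparse_mat = DictVectorizer(sparse=True).fit_transform(docs)
dense_mat = DictVectorizer(sparse=False).fit_transform(docs)

print(type(sparse_mat))  # a scipy.sparse matrix: stores only the non-zero entries
print(type(dense_mat))   # a numpy.ndarray: stores every cell, zeros included
print(sparse_mat.nnz)    # 3 stored entries in the sparse version...
print(dense_mat.size)    # ...versus 6 cells (2 rows x 3 columns) in the dense one
```

Both versions describe the same matrix; only the storage differs, and the gap grows enormously at 500,000 x 1,000,000+ scale.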

From here, you can inspect the term_doc_mat object to examine properties of the term by document matrix:
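For example, on the three toy reviews from earlier (rebuilt here so the snippet is self-contained):

```python
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer(sparse=True)
term_doc_mat = vectorizer.fit_transform([
    {"red": 1, "apples": 1, "delicious": 1},
    {"hate": 1, "red": 1, "grapes": 1},
    {"bananas": 1, "delicious": 1},
])

print(term_doc_mat.shape)             # (rows, columns) = (reviews, unique words)
print(term_doc_mat.nnz)               # number of non-zero entries actually stored
print(term_doc_mat.toarray())         # densify a small matrix to eyeball it
print(vectorizer.vocabulary_["red"])  # column index assigned to a given word
```

Only call .toarray() on small matrices; on the full data set it would materialize every zero.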

With this matrix, the world is now open for different analysis methods. You can explore dimensionality-reduction techniques like PCA or NMF. There is a post on Kaggle that explores the use of K-means clustering and Latent Dirichlet Allocation to predict review helpfulness, and another discussion of the different sentiment classification algorithms that can be applied to this data set.

Using and understanding term by document matrices is one of the first steps in semantic modeling and natural language processing. For a full tutorial and additional code built on this matrix, check out my Python notebook on GitHub. The notebook loads the first 10,000 reviews from the data set and dives into several types of semantic analysis.