## I can see clearly now the dimensionality is gone…

Often, when investigating a new dataset, it is helpful to create visualizations of the data to get some intuition for what is going on. This is straightforward when the original data has 2 or 3 dimensions, but what can we do about higher dimensional data? One option is to reduce the dimensionality of the dataset to something that can be easily visualized (usually 2 or 3 dimensions). There is a slew of algorithms for this task, including principal component analysis (PCA), multidimensional scaling (MDS), Isomap, etc.

The manifold learning algorithm we want to explore today is called t-distributed stochastic neighbor embedding (t-SNE). t-SNE tries to find a nonlinear, lower dimensional representation of the data such that points that are “close” in the original data remain “close” in the lower dimensional space and points that are “far” stay “far”.

## Show me the data…

At PokitDok we process a lot of claims through our API, and determining what a given insurance carrier will pay for a given claim is far from trivial. One relationship we are often interested in modeling is the one between diagnosis codes (ICD-9/10) and insurance payments for a given procedure (CPT code). So let's look at a de-identified set of claim service lines submitted to a single payer for a single CPT code.

Let’s gather some dependencies:
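A minimal set of imports, assuming NumPy, pandas, and scikit-learn are installed (the post's exact import list is not shown here), might look like the following. Printing the versions is an addition for reproducibility, since t-SNE defaults differ between scikit-learn releases.

```python
# Assumed dependencies; the post's actual import list is not shown.
import numpy as np
import pandas as pd
import sklearn
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Record library versions: t-SNE defaults have changed across releases.
versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```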

We lean heavily on pandas at PokitDok for analyzing datasets that fit on a single machine, so let’s load up a pandas DataFrame. We start off with 5 columns: the provider’s NPI, the patient’s ID, the diagnosis code (ICD-9/10), the billed amount, and the payment amount. The NPI, patient ID, and ICD columns are categorical variables, so we can use pandas to encode them as numerical values by expanding the column space of the DataFrame.
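As a rough sketch of that encoding step, here is one way to expand categorical columns with `pandas.get_dummies`. The column names and every value below are made up for illustration; the real claims data and the post's actual column names are not shown.

```python
import pandas as pd

# Hypothetical, made-up stand-in for the de-identified claims data.
claims = pd.DataFrame({
    "npi": ["N1", "N2", "N1", "N3"],          # provider identifier (placeholder values)
    "patient_id": ["P1", "P2", "P3", "P1"],   # patient identifier (placeholder values)
    "icd": ["X1", "X2", "X1", "X3"],          # diagnosis code (placeholder values)
    "billed": [120.0, 95.0, 130.0, 110.0],
    "paid": [80.0, 60.0, 85.0, 70.0],
})

# One indicator column per distinct category; numeric columns pass through.
encoded = pd.get_dummies(claims, columns=["npi", "patient_id", "icd"], dtype=float)

print(encoded.shape)
print(list(encoded.columns))
```

Each categorical column becomes one indicator column per distinct value, which is why a handful of original columns can grow into hundreds and why most entries end up being zero.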

Now our DataFrame has 496 columns, and the resulting space is very sparse due to the categorical encodings. t-SNE works best when all features are on the same scale, and scaling a pandas DataFrame is very straightforward using scikit-learn's preprocessing module.
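The post does not say which scaler it used. A minimal sketch, assuming `StandardScaler` (zero mean, unit variance per column) and a small made-up feature table, could look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features: two dollar amounts and one indicator column.
features = pd.DataFrame({
    "billed": [120.0, 95.0, 130.0, 110.0],
    "paid": [80.0, 60.0, 85.0, 70.0],
    "icd_X1": [1.0, 0.0, 1.0, 0.0],
})

# StandardScaler is an assumption; any per-column scaler would fit the text.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(scaled.round(3))
```

Wrapping the result back into a DataFrame keeps the column names, which `fit_transform` would otherwise drop by returning a plain NumPy array.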

Finally, before seeing what t-SNE can tell us about our dataset, we define some basic helper functions and set up the environment for running the algorithm and plotting the resulting low dimensional space.
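The original helpers are not shown, so the following is only a sketch of what they might look like. The names `run_tsne` and `plot_embedding` are hypothetical, matplotlib is an assumed plotting library, and the demo data is random noise rather than claims.

```python
import numpy as np
from sklearn.manifold import TSNE


def run_tsne(features, random_state=0, **tsne_params):
    """Embed an (n_samples, n_features) array into 2 dimensions with t-SNE."""
    model = TSNE(n_components=2, random_state=random_state, **tsne_params)
    return model.fit_transform(np.asarray(features, dtype=float))


def plot_embedding(embedding, labels, title=""):
    """Scatter plot of a 2-D embedding, colored by a categorical label."""
    # Imported lazily so run_tsne stays usable without a plotting backend.
    import matplotlib.pyplot as plt

    # Map each distinct label (e.g. an ICD code) to an integer for coloring.
    _, codes = np.unique(np.asarray(labels), return_inverse=True)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(embedding[:, 0], embedding[:, 1], c=codes, cmap="tab20", s=10)
    ax.set_title(title)
    return fig


# Demo on random data; with no extra arguments the library defaults are used.
rng = np.random.default_rng(0)
demo = rng.normal(size=(60, 10))
embedding = run_tsne(demo)
print(embedding.shape)
```

Calling `plot_embedding(embedding, labels)` with one label per row would then draw the scatter plot; passing keyword arguments to `run_tsne` forwards them to `TSNE`.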

So let’s see what the algorithm produces without any hyperparameter tuning:

As you can see, all the data is contained in a single cluster, which is not too surprising since the charge and payment amounts are roughly normally distributed for this dataset. Also note that the ICD code colors appear uniformly distributed across this space.

t-SNE offers several hyperparameters that can be tuned to affect the embedding. The perplexity parameter, which defaults to 30, roughly sets how many neighbors each point pays attention to, and so controls the balance between local and global structure. Let’s lower it to 5 to favor preserving local neighborhoods. If the cost function increases during the initial optimization, the early exaggeration factor or the learning rate might be too high, so we reduce the learning rate from 1000 to 500. The running time is bounded by the number of optimization iterations, so we bump that up from the default of 1000 to 10,000. Finally, the initial embedding into 2 dimensions is random by default, but we can use PCA to bootstrap it instead.
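Those settings can be sketched with scikit-learn's `TSNE` as follows. This is an illustration on random stand-in data, not the post's original code. One assumption worth flagging: newer scikit-learn releases name the iteration argument `max_iter` while older ones use `n_iter`, so the sketch checks which one is available.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE

# The iteration argument is `n_iter` in older releases and `max_iter` in newer ones.
iter_arg = "max_iter" if "max_iter" in inspect.signature(TSNE).parameters else "n_iter"

tuned_params = {
    "perplexity": 5,        # favor local neighborhoods
    "learning_rate": 500,   # lower than the 1000 mentioned in the text
    "init": "pca",          # bootstrap the embedding with PCA
    iter_arg: 10000,        # upper bound on optimization iterations
}

model = TSNE(n_components=2, random_state=0, **tuned_params)

# Random stand-in data; the real input would be the scaled claims features.
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 8))
embedding = model.fit_transform(demo)
print(embedding.shape)
```

The iteration count is only an upper bound: the optimizer can stop earlier when it stops making progress, so raising it does not necessarily multiply the running time by ten.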

And here are the results:

While we have retained the single cluster shape, we now see a much clearer grouping by ICD code within the cluster. So even though the price, provider, and patient features pull the embedding into a single cluster, we can still see that ICD code is a relatively important feature of the claims space.
