Hola and welcome to this 20+ installment of Snake Bytes,
counter_bytes. Python is a favorite choice among data scientists, and a common need comes up: you are given a set of data records and want to extract the distribution of some attributes, or you want to analyze and confirm the results of a data mining algorithm run.
The Counter container in the collections module, added in Python 2.7, provides this functionality in a nifty way.
Given an iterable of items, a mapping of items to counts, or keyword arguments, it provides a tally of all the hashable objects present along with their counts. This can also be used as a mechanism for finding duplicates with matching identifiers.
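As a rough illustration of the duplicate-finding idea, here is a minimal sketch. The identifier values and the names record_ids, id_counts, and duplicates are made up for this example and are not from any real data set.

```python
from collections import Counter

# Hypothetical record identifiers; "MRN-1002" appears twice.
record_ids = ["MRN-1001", "MRN-1002", "MRN-1003", "MRN-1002"]

id_counts = Counter(record_ids)

# Any identifier seen more than once is a duplicate candidate.
duplicates = [rec_id for rec_id, count in id_counts.items() if count > 1]
print(duplicates)  # ['MRN-1002']
```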
When working with patient identity records, we often end up with a cluster of related source records. These records are collated together using an identity matching algorithm, and each algorithm will produce a different set of clusters from the same corpus of identity records.
Once each cluster is made, a reconciliation process is used to create a master identity. A prerequisite to that is learning the distribution of the different identifier and attribute values.
Let's see some of the variations in first names for the records supposedly belonging to the same identity:
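A small sketch of what that could look like. The names in first_names are invented sample values standing in for one cluster of source records.

```python
from collections import Counter

# Made-up first names from one cluster of source records.
first_names = ["Jon", "John", "John", "Jonathan", "John", "Jon"]

name_counts = Counter(first_names)
print(name_counts)  # Counter({'John': 3, 'Jon': 2, 'Jonathan': 1})
```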
A counter can be created with data in the arguments as well:
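For instance, assuming the same made-up name tallies, a counter could be built from keyword arguments or from a mapping of item to count. The variable names here are only illustrative.

```python
from collections import Counter

# From keyword arguments (keys must be valid Python identifiers).
from_kwargs = Counter(John=3, Jon=2, Jonathan=1)

# From a mapping of item -> count.
from_mapping = Counter({"John": 3, "Jon": 2, "Jonathan": 1})

print(from_kwargs == from_mapping)  # True
```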
The Counter object is itself a dict (a dict subclass, to be precise). The following operations can be performed on it:
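A quick sketch of the dict-like behavior, using invented sample counts. One difference from a plain dict is worth noting: looking up a missing key returns 0 instead of raising a KeyError.

```python
from collections import Counter

name_counts = Counter(John=3, Jon=2, Jonathan=1)

print(isinstance(name_counts, dict))  # True
print(sorted(name_counts.keys()))     # ['John', 'Jon', 'Jonathan']
print(name_counts["Johnny"])          # 0: a missing key counts as zero
print("Johnny" in name_counts)        # False: the lookup did not add the key
```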
To increase or decrease the count of an item, just as you would update a value in any Python dict:
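A minimal sketch with made-up counts. Item assignment works as on a dict, and update() adds counts from an iterable rather than replacing them.

```python
from collections import Counter

name_counts = Counter(John=3, Jon=2, Jonathan=1)

name_counts["John"] += 1            # one more record spelled "John"
name_counts["Johnny"] += 1          # new key: starts from 0, so this becomes 1
name_counts.update(["Jon", "Jon"])  # add counts from an iterable
name_counts["Jonathan"] -= 1        # count drops to 0 but the key remains

print(name_counts["John"], name_counts["Johnny"])    # 4 1
print(name_counts["Jon"], name_counts["Jonathan"])   # 4 0
```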
If we were to change the clustering criteria, then, much like incrementing a frequency, the count of a member can be decreased, set to a particular value, or the member removed altogether:
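Those three operations could be sketched as follows, again with invented sample counts:

```python
from collections import Counter

name_counts = Counter(John=3, Jon=2, Jonathan=1)

name_counts.subtract(["John"])  # decrease: John goes from 3 to 2
name_counts["Jon"] = 5          # set to a particular value
del name_counts["Jonathan"]     # remove the member entirely

print(name_counts)  # Counter({'Jon': 5, 'John': 2})
```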
To find the top value or the top 'N' picks by frequency, use most_common(), which returns the items ordered from most to least frequent:
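A short sketch using Counter.most_common() on made-up counts. Passing N limits the result to the N most frequent items, and omitting it returns all of them.

```python
from collections import Counter

name_counts = Counter(John=3, Jon=2, Jonathan=1)

print(name_counts.most_common(1))  # [('John', 3)]
print(name_counts.most_common(2))  # [('John', 3), ('Jon', 2)]

all_ranked = name_counts.most_common()  # every item, most to least frequent
```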
Got two different counters for the same attributes from different algorithms? No problem. You can analyze the variance by subtracting the respective frequencies of the keys:
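Here is one way this might look. The counters algo_a and algo_b are hypothetical results from two matching algorithms. Note that the - operator drops zero and negative results, while subtract() keeps them, which may be more useful when you care about differences in both directions.

```python
from collections import Counter

# Hypothetical first-name tallies from two different algorithms.
algo_a = Counter(John=3, Jon=2, Jonathan=1)
algo_b = Counter(John=2, Jon=3, Jonathan=1)

# The - operator keeps only positive differences.
print(algo_a - algo_b)  # Counter({'John': 1})

# subtract() works in place and keeps zero and negative counts.
variance = Counter(algo_a)
variance.subtract(algo_b)
print(variance["John"], variance["Jon"], variance["Jonathan"])  # 1 -1 0
```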
Similarly, you can also add counters together:
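A minimal sketch of addition, with two made-up counters. Counts for shared keys are summed, and keys present in only one counter carry over.

```python
from collections import Counter

# Hypothetical tallies from two different algorithms.
algo_a = Counter(John=3, Jon=2, Jonathan=1)
algo_b = Counter(John=2, Jon=3, Johnny=1)

combined = algo_a + algo_b
print(combined["John"], combined["Jon"])        # 5 5
print(combined["Jonathan"], combined["Johnny"])  # 1 1
```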
The counter can also be sorted on its keys, in ascending or descending order, to get the topmost or bottommost entry. The built-in sorted function handles this when applied to the counter's items:
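A sketch of sorting with the built-in sorted(), using invented sample counts. Sorting by count in ascending order is included as an extra variation, since it gives the least frequent value, which most_common() does not surface directly.

```python
from collections import Counter

name_counts = Counter(John=3, Jon=2, Jonathan=1)

# Sort (name, count) pairs alphabetically by key.
by_name = sorted(name_counts.items())
print(by_name)  # [('John', 3), ('Jon', 2), ('Jonathan', 1)]

# Reverse the key order for a descending listing.
by_name_desc = sorted(name_counts.items(), reverse=True)
print(by_name_desc[0])  # ('Jonathan', 1)

# Sort by count, ascending, to surface the least frequent value first.
by_count_asc = sorted(name_counts.items(), key=lambda pair: pair[1])
print(by_count_asc[0])  # ('Jonathan', 1)
```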
Hope you enjoyed the counter feature of Python and find it useful in your projects. 'Til next byte. . .