Snake_Byte #7: Luhn at First Byte

As mandated by HIPAA, every healthcare provider in the United States has a National Provider Identifier, or NPI. This ID is a ten digit number prefixed by 80840 ('80' refers to health, 840 refers to the U.S).  Not every ten digit number corresponds to a valid NPI.  Only ten digits numbers that pass a checksum algorithm called the Luhn Algorithm can be used.  The purpose of the checksum is to catch accidental transcription errors, such as swapping two adjacent digits or mistaking a “1” for a “7”, etc.

The algorithm is almost as simple as checking that the last digit of the sum of the digits is equal to zero (with 80840 prepended).  The twist is that every other digit does something special.  Instead of just adding its value to the total, you add twice its value to the total.  Well, almost.  If twice the value is a two digit number, instead of adding the doubled value to the total, we sum those two digits before adding the value to the total.

As an example, we will compute the Luhn checksum of the year that Hans Peter Luhn was born, 1896.  We start by adding the first (rightmost) digit, 6, to the tally.  The next digit is 9, which needs to be doubled, giving 18.  Since 18 is a two digit number, add 1+8 = 9 to the tally, giving 6 + 9 = 15.  For the next digit we are back to the simple case of just adding to the tally, leaving 15 + 8 = 23.  Back to doubling for the final digit, we add 2*1 to the tally, 23 + 2 = 25.  The last step is to reduce the tally to its last digit, which is 5.  Therefore the Luhn checksum of 1896 is 5.

NPI numbers will always have a Luhn checksum of zero when prepended with 80840.  Or, equivalently, they will always have a Luhn checksum of 4 without the prefix.

PokitDok validates a lot of NPIs for our APIs.  When our eligibility API is called with an invalid NPI, an error is returned: "Invalid NPI based on checksum calculation".  The amount of time our back end spends calculating this checksum is negligible compared to other concerns like network latency, so it's not necessary that we optimize the Luhn algorithm.  We do hit this particular section of our codebase a lot, though, so if we see an obvious improvement, we're happy to take it.  PokitDok engineer Brian Corbin talked about this at the most recent PyCon.

Wikipedia has a python implementation of the Luhn Algorithm:

This code example is great for understanding how the algorithm works, but if we care less about exposition and more about performance, we can make some improvements.

First of all, we don't need to compute digits_of(2*6) every time we see a six in an even position.  We can write the answer down once and be done with it!  The list of all such computations is just:

(where the computation for the digit i is the i-th element of this list).

Notice that this list is just a permutation of the digits.  From this we can already see some properties of the Luhn checksum.  The only digits fixed by this permutation are 0 and 9, so the only way you can swap two adjacent digits of a number and end up with the same Luhn checksum are if the digits are 0 and 9.  (Ok, well, you could swap two identical digits, but that's like the proverbial treefall...)  For example, 1091 and 1901 have the same Luhn checksum, but that only works since we swapped 0 and 9, no other two digits would have worked.

Using this lookup table instead of computing digits_of(2*n) for every evenly placed digit is one improvement, what else can we do to improve the algorithm?  We don’t need to collect the the even/odd digits into separate lists as in the Wikipedia snippet, we can just iterate through the digits instead of creating these intermediate lists.  There are several other things to think about that do not have obvious answers.  At what point in the algorithm should we convert from strings to integers (which is something we at PokitDok need to do for our APIs)?  How should we convert from strings to integers to strings?  Should we use recursion?  Let's attack this with science!  There's a great python library, timeit, for benchmarking code snippets.

If we decide to hang on to our string representations of the numbers, iterating character by character, how should we convert the character to an integer when it's time to do the arithmetic?
We could use str(number), or we could use the ordinal value of the ascii character
and subtract what's needed: ord(number)-48.  We'll test this with Python 2, Python 3, and PyPy.

In all cases, it was better to use ord(n)-48.

Python famously doesn't support tail-call optimization, but it wasn't clear to me if it would be better to use recursion as a way to get around continually checking if the digit in question is in an even or odd position.  More specifically, I was curious how well it would work to use a pair of mutually recursive functions, call them luhn_even and luhn_odd,
which do this:

  • luhn_odd lops off the last digit, adds this digit to the tally, then passes the truncated number to luhn_even.
  • luhn_even lops off the last digit, adds twice this digit (with digits summed if greater than ten) to the tally, then passes the truncated digit to luhn_odd.
  • When nothing remains of the digits, the tally is returned modulo ten.

We now have two questions: Should we work with the numbers as a string or an integer? Should we avoid the even/odd conditional by using recursion?

Using our even Luhn conversion table as described above:

we have four implementations to test:

strings with the even/odd check:

strings with recursion:

integers with the even/odd check:

integers with recursion:

And now we can test each of these with timeit:

Python 2 Python 3 PyPy
string with if check 2.30 2.78 0.103
string with recursion 2.51 2.89 0.00237
integer with if check 2.42 2.60 0.116
integer with recursion 2.68 3.25 0.000583

(usec per loop)

You can see that Python 2 and 3 are pretty unaffected by this particular choice of algorithms,
but PyPy runs much faster with the recursion and integer arithmetic.

We use PyPy at PokitDok, and in fact the “integer with recursion” algorithm above is what we have running in production.  It’s a great check for accidental transcription errors, and you can see that it comes at very little time cost to us.

About jaredcorduan

Ph.D. From Dartmouth in mathematical logic with a passion for all things Charleston.

View All Posts

Leave a Reply

Your email address will not be published.