Snake_Bytes #3: Statistics!

Welcome to the 3rd installment of Snake_Bytes! Today we are covering Python and statistics.

Many of you may remember your statistics class, or perhaps you chose not to remember it, or are simply unable to remember it - whatever the case, this Snake_Byte is for you. If you do anything with data, at some point you will need statistics. Alongside linear algebra and probability theory, statistics is one of the fundamental mathematical tools for working with data. Even the most basic statistical methods allow you to make inferences about the data and draw conclusions from the results.

A little mathematical theory: in this Snake_Byte we will focus on descriptive (or summary) statistics, which are most commonly seen in some form of visualization.

The same data set can be considered either a population or a sample, depending on the reason for its acquisition and analysis. In a clinical study, for example, if the purpose is to score outcomes and report the distribution of those scores, the data set is a population. If the purpose is to draw inferences about outcomes beyond the clinical data at hand, it is a sample.

Something I would like to emphasize here: just because something is popular and easy to implement does not make it correct. We will be talking about summary statistics - mean, median, standard deviation, and variance - which assume unimodal distributions. Make sure those initial conditions are met, because otherwise the numbers will look just fine and they will be wrong.

In most cases, central tendency, which is a measure of location, is the first statistic computed for a new data set. If the distribution has only a single peak, one intuitively asks: where is the peak located? (Note: in machine learning settings you also ask about the symmetry of the data and whether any outliers are present.)

The Mean

The mean (or average) is very intuitive: we sum up the values and divide by the number of values. The mean of a population is denoted by $\mu$ (mu) and the mean of a sample is typically denoted by $\bar{x}$ (x-bar).

Note: one way to reduce the influence of outliers is to calculate what is called the trimmed mean, the mean computed after discarding a fixed fraction of the lowest and highest values.

The mean of a population and of a sample, respectively:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
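As a quick sketch (values chosen only for illustration), here is the plain mean versus SciPy's trimmed mean, which discards a fraction of the values at each extreme:

import numpy as np
from scipy import stats

data = np.array([108, 124, 136, 140, 143, 160, 168, 179, 217, 227, 1000])  # 1000 is an outlier

print(np.mean(data))               # plain mean - pulled upward by the outlier
print(stats.trim_mean(data, 0.1))  # trimmed mean - drops 10% of values from each tail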


The Median

The median of a data set is the middle value when the values are ranked in ascending or descending order. The median is often a better measure of central tendency than the mean for asymmetric data or data that contains outliers, because it is not dragged around by extreme values.
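A quick sketch of why (values chosen only for illustration) - an outlier pulls the mean substantially but barely moves the median:

import numpy as np

data = [136, 140, 143, 160, 168]
with_outlier = data + [1000]

print(np.mean(data), np.median(data))                  # 149.4  143.0
print(np.mean(with_outlier), np.median(with_outlier))  # ~291.2  151.5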


The Mode

The mode is the most frequently occurring value in a data set. The mode is particularly useful for describing ordinal or categorical data.
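For categorical data such as the feed types used later in this post, the mode is a one-liner in pandas (a sketch with made-up values; .mode() returns a Series because a data set can have more than one mode):

import pandas as pd

feed = pd.Series(["horsebean", "horsebean", "linseed", "soybean", "horsebean"])

print(feed.mode()[0])  # horsebean - the most frequent value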


The Standard Deviation

The standard deviation measures how far the respective points are spread from the mean. We take the differences between the data points and the mean, square them, average the squared differences, and take the square root - in effect, a typical distance of a data point from the mean. The standard deviation is also the square root of the variance.

Standard deviation of a population and of a sample, respectively:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \qquad\qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
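In code, the population-versus-sample distinction is controlled by the ddof (delta degrees of freedom) argument; note that NumPy's np.std() defaults to the population formula while pandas' .std() defaults to the sample formula:

import numpy as np

data = np.array([108, 124, 136, 140, 143, 160, 168, 179, 217, 227])

print(np.std(data))          # population standard deviation (divides by N)
print(np.std(data, ddof=1))  # sample standard deviation (divides by n - 1)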


The Variance

The variance is a measure of distance from the mean, calculated as the average of the squared deviations from the mean. It is also a measure of spread. Measures of center and spread together form the basis for error functions.

Variance of a population and of a sample, respectively:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2 \qquad\qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
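The same ddof convention applies to the variance, and we can check the square-root relationship mentioned above:

import numpy as np

data = np.array([108, 124, 136, 140, 143, 160, 168, 179, 217, 227])

var_pop = np.var(data)           # population variance (divides by N)
var_samp = np.var(data, ddof=1)  # sample variance (divides by n - 1)

# the standard deviation is the square root of the variance
assert np.isclose(np.sqrt(var_pop), np.std(data))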


The Skew

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

The Kurtosis

Kurtosis is derived from the Greek root “kurtos,” which means curved or arching. Kurtosis is a measure of the tailedness of a distribution; it is based on the fourth moment about the mean, just as skewness is based on the third. Often when people talk about a fat tail or a long tail they are actually referring to kurtosis and probably don't realize it.
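Both are one-liners in pandas (a sketch with illustration values; pandas' .kurt() reports excess kurtosis, where a normal distribution scores 0):

import pandas as pd

weights = pd.Series([108, 124, 136, 140, 143, 160, 168, 179, 217, 227])

print(weights.skew())  # > 0 means a longer right tail, < 0 a longer left tail
print(weights.kurt())  # excess kurtosis: > 0 means heavier tails than a normal distribution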


As a last tidbit for descriptive statistics, chances are that you are familiar with percentiles. A percentile is the percent of cases occurring at or below a score.  The relationship between quartiles and percentiles is:

Q1 = marks the 1st quartile = 25th percentile

Q2 = marks the 2nd quartile = 50th percentile

Q3 = marks the 3rd quartile = 75th percentile

Q4 = marks the 4th quartile = 100th percentile

Percentiles are useful for showing how a particular score ranks with regard to other scores on the same variable.
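The quartiles fall straight out of numpy.percentile (a sketch, reusing the illustration values from above):

import numpy as np

data = np.array([108, 124, 136, 140, 143, 160, 168, 179, 217, 227])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # Q1, Q2 (the median), and Q3
print(q1, q2, q3)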


There are several pythonic options for computing statistical attributes of your data:

Pandas - one of the premier data manipulation libraries, written by Wes McKinney

Statsmodels - a comprehensive collection of statistical routines, with great documentation

scikit-learn - the machine learning library for Python

PyMC - for Bayesian data analysis

PyStan - the Python interface to Stan, a probabilistic programming language for Bayesian analysis


We will be using pandas in this example. We obtained some data from a really cool site: data site. The example data I am using is chicken feed data: an experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. The format is a data frame with 71 observations on 2 variables. In the .csv, Weight is a numeric variable giving the chick weight and Feed is a factor giving the feed type.

Let's load it up into a pandas DataFrame:
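A minimal sketch of the load, assuming the .csv was saved locally as chick_weights.csv (a hypothetical filename) with the row index in the first column:

import pandas as pd

# chick_weights.csv is a hypothetical local filename for the feed data described above;
# index_col=0 treats the first column of the .csv as the row index
df = pd.read_csv("chick_weights.csv", index_col=0)

print(df.head(11))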

The output for the dataframe is:

     Weight       Feed
1       179  horsebean
2       160  horsebean
3       136  horsebean
4       227  horsebean
5       217  horsebean
6       168  horsebean
7       108  horsebean
8       124  horsebean
9       143  horsebean
10      140  horsebean
11      309    linseed


The output is truncated for readability. OK, here is where the magic comes in. Ready to do some statistics?

Pandas has a method, DataFrame.describe(), which for numeric dtypes returns the count, mean, std, min, and max, as well as the lower (25%), 50%, and upper (75%) percentiles.
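The call itself is a one-liner (using the DataFrame loaded above):

# describe() summarizes the numeric Weight column of the loaded DataFrame
print(df.describe())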

That is it, folks. Here is the output:
            Weight
count    71.000000
mean    261.309859
std      78.073700
min     108.000000
25%     204.500000
50%     258.000000
75%     323.500000
max     423.000000

This will get you most of the way toward extracting meaningful (magical?) feedback from your company's data. Explore the other packages; trust me, they are just as easy. While this Snake_Byte was a little long in the fang, we hope it gave you something to think about.

@tctjr

