# Snake_Bytes #3: Statistics!

Welcome to the 3rd installment of Snake_Bytes! Today we are covering Python and statistics.

Many of you may remember your statistics class, or maybe you choose not to remember it, or maybe you are unable to remember it. Whatever the case, this Snake_Byte is for you. If you do anything with data, at some point you will need statistics. Alongside linear algebra and probability theory, statistics is one of the fundamental mathematical tools for working with data. Even the most basic statistical methods allow you to make inferences about the data and draw conclusions based on the results.

A little mathematical theory: in this Snake_Byte we will focus on descriptive (or summary) statistics, which are most commonly seen in some form of visualization.

The same data set can be considered either a population or a sample, depending on the reason for acquiring and analyzing it. Take a clinical study: if the purpose is to score the outcomes and describe the distribution of those scores, the data set is treated as a population. If the purpose is to draw inferences about outcomes beyond the patients in the study, it is treated as a sample.

One point I would like to make here is that just because something is popular and easy to implement does not make it correct. We will be talking about summary statistics: mean, median, standard deviation, and variance. These assume a unimodal distribution. Make sure that assumption holds, because if it does not, the numbers will still look just fine. They will be wrong.
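To see why the unimodal assumption matters, here is a small sketch using Python's standard-library `statistics` module. The numbers are made up purely for illustration (two clusters, one near 10 and one near 90) and do not come from any data set in this post.

```python
from statistics import mean, median

# Hypothetical bimodal data: one cluster near 10, another near 90.
bimodal = [10, 11, 12, 90, 91, 92]

center_mean = mean(bimodal)
center_median = median(bimodal)

# Both summaries land in the empty gap between the two clusters.
print(center_mean, center_median)

# Distance from the reported "center" to the closest actual observation.
nearest_gap = min(abs(x - center_mean) for x in bimodal)
print(nearest_gap)
```

Both numbers look perfectly reasonable on their own, yet no observation is anywhere near them.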

In most cases, a measure of location is the first statistic computed for a new data set. If the distribution has only a single peak, one intuitively asks: where is the peak located? (Note: in machine learning situations you also ask about data symmetry and whether, and where, any outliers are present.)

The Mean

The mean (or average) is very intuitive: we sum up the numbers, then divide by the number of values. The mean of a population is denoted by μ (mu) and the mean of a sample is typically denoted by x̄ (x-bar).

Note: one way to reduce the influence of outliers is to calculate what is called the trimmed mean.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
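As a rough sketch, the mean and a trimmed mean can be computed with the standard-library `statistics` module. The weights are the ten horsebean rows shown in the data frame output later in this post. The `trimmed_mean` helper and the 10% cut are illustrative assumptions, not a library function.

```python
from statistics import mean

# The ten horsebean chick weights shown later in this post.
weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]

def trimmed_mean(values, proportion):
    """Hypothetical helper: drop `proportion` of the values from each
    end of the sorted data, then average what is left."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut per end
    return mean(ordered[k:len(ordered) - k])

sample_mean = mean(weights)
trimmed = trimmed_mean(weights, 0.10)  # drops the smallest and largest value

print(sample_mean)
print(trimmed)
```

If SciPy is available, `scipy.stats.trim_mean` offers the same idea without a hand-written helper.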

The Median

The median of a data set is the middle value when the values are ranked in ascending or descending order. For asymmetric data, or data that contains outliers, the median is a better measure of central tendency than the mean.
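A quick sketch of that robustness, again with the standard-library `statistics` module. The weights are the horsebean rows shown later in this post. The outlier value of 1000 is invented for the demonstration.

```python
from statistics import mean, median

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]

# Hypothetical data-entry error appended to the real values.
with_outlier = weights + [1000]

median_clean = median(weights)
median_outlier = median(with_outlier)
mean_clean = mean(weights)
mean_outlier = mean(with_outlier)

# The median shifts by a few units; the mean jumps by far more.
print(median_clean, median_outlier)
print(mean_clean, mean_outlier)
```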

The Mode

The mode is the most frequently occurring value in a data set.  The mode is really useful in describing ordinal data or categorical data.
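For categorical data such as the feed type, a minimal sketch with the standard library looks like this. The feed labels are the eleven rows shown later in this post. The short list of letters is made up to show what happens with ties.

```python
from collections import Counter
from statistics import mode, multimode

# Feed labels for the eleven rows shown later in this post.
feeds = ["horsebean"] * 10 + ["linseed"]

most_common_feed = mode(feeds)
feed_counts = Counter(feeds)

# Hypothetical example with a tie: multimode returns every winner.
ties = multimode(["a", "a", "b", "b", "c"])

print(most_common_feed)
print(feed_counts)
print(ties)
```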

The Standard Deviation

The standard deviation measures how far the data points are spread from the mean. We take the difference between each data point and the mean, square it, average the squared differences, and take the square root of the result. In other words, the standard deviation is the square root of the variance.

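Standard deviation of a population:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Standard deviation of a sample:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$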
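The following sketch works through the standard deviation step by step and checks the result against the standard-library `pstdev` (population) and `stdev` (sample) functions. The weights are the horsebean rows shown later in this post. Treating them as either a population or a sample is just for illustration.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]

x_bar = mean(weights)

# Squared distance of every point from the mean.
squared_devs = [(x - x_bar) ** 2 for x in weights]

# Population version divides by N; sample version divides by n - 1.
population_sd = sqrt(sum(squared_devs) / len(weights))
sample_sd = sqrt(sum(squared_devs) / (len(weights) - 1))

print(population_sd, pstdev(weights))
print(sample_sd, stdev(weights))
```

The sample version is always slightly larger because it divides by a smaller number.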

The Variance

The variance is a measure of distance from the mean, calculated as the average of the squared deviations from the mean. Like the standard deviation, it is a measure of spread. Measures of center and spread together form the basis for error functions.

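Variance of a population:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Variance of a sample:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$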
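A short sketch of the two variances using the standard-library `statistics` module, with the horsebean weights shown later in this post as example data.

```python
from statistics import pvariance, stdev, variance

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]
n = len(weights)

pop_var = pvariance(weights)   # divides the squared deviations by N
samp_var = variance(weights)   # divides the squared deviations by n - 1

# Both are built from the same sum of squared deviations.
print(pop_var * n, samp_var * (n - 1))

# Squaring the sample standard deviation recovers the sample variance.
print(stdev(weights) ** 2, samp_var)
```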

The Skew

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
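One common way to put a number on skewness is the third standardized moment. The sketch below assumes that definition, and the `skewness` helper and both small data sets are invented for illustration. Library implementations may apply a bias correction, so their values can differ slightly from this one.

```python
from statistics import mean

def skewness(values):
    """Hypothetical helper: third central moment divided by the
    second central moment raised to the power 3/2."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]       # mirror image around 3
right_tailed = [1, 2, 3, 4, 20]   # one large value stretches the right side

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)

print(skew_symmetric)
print(skew_right)
```

A value of zero means the left and right sides balance out, while a positive value means the right tail is longer.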

The Kurtosis

Kurtosis is derived from the Greek root “kurtos,” which means curved or arching. It is a measure of the tailedness of a distribution, calculated from the fourth standardized moment (skewness uses the third). Many times when people talk about a fat tail or long tail, they are actually referring to kurtosis and probably don't realize it.
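As a sketch, excess kurtosis can be computed from the fourth standardized moment minus 3, so that a normal distribution scores zero. The `excess_kurtosis` helper and the two small data sets are illustrative assumptions. Library implementations may use a bias-corrected estimator and return slightly different values.

```python
from statistics import mean

def excess_kurtosis(values):
    """Hypothetical helper: fourth central moment over the squared
    second central moment, minus 3 (the value for a normal curve)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]          # evenly spread, light tails
heavy_tail = [1, 2, 3, 4, 20]   # one far-out value fattens the tail

kurt_flat = excess_kurtosis(flat)
kurt_heavy = excess_kurtosis(heavy_tail)

print(kurt_flat)
print(kurt_heavy)
```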

As a last tidbit for descriptive statistics, chances are that you are familiar with percentiles. A percentile is the percent of cases occurring at or below a score.  The relationship between quartiles and percentiles is:

Q1 = marks the 1st quartile = 25th percentile

Q2 = marks the 2nd quartile = 50th percentile

Q3 = marks the 3rd quartile = 75th percentile

Q4 = marks the 4th quartile = 100th percentile

Percentiles are useful for showing how a particular score ranks with regard to other scores on the same variable.
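A minimal sketch of quartiles and a percentile rank using the standard-library `statistics.quantiles` (Python 3.8+). The weights are the horsebean rows shown later in this post. The `method="inclusive"` setting and the `percent_at_or_below` helper are choices made for this example, and other interpolation methods give slightly different cut points.

```python
from statistics import quantiles

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(weights, n=4, method="inclusive")
print(q1, q2, q3)

def percent_at_or_below(values, score):
    """Hypothetical helper: percent of cases at or below `score`."""
    return 100 * sum(v <= score for v in values) / len(values)

rank_of_168 = percent_at_or_below(weights, 168)
print(rank_of_168)
```

Here Q2 is the median, and the maximum value plays the role of the 100th percentile.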

There are several Python options for computing statistical attributes of your data:

pandas - one of the premier data manipulation libraries, written by Wes McKinney

statsmodels - a comprehensive collection of statistical routines with great documentation

scikit-learn - the machine learning library in Python

PyMC - for Bayesian data analysis

PyStan - Bayesian analysis based on Stan, a probabilistic programming language

We will be using pandas in this example. We obtained some data from a really cool site: data site. The example data I am using is chicken feed data. An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. The format is a frame with 71 observations on 2 variables. In the .csv, Weight is a numeric variable giving the chick weight and Feed is a factor giving the feed type.

Let's load it up in a pandas data frame:
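A minimal sketch of the load step, under a few assumptions: the file name and exact layout of the real .csv are not given here, so the snippet reads an in-memory stand-in that contains only the eleven rows printed in this post. With the real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Stand-in for the real .csv: only the rows printed in this post.
csv_text = """Weight,Feed
179,horsebean
160,horsebean
136,horsebean
227,horsebean
217,horsebean
168,horsebean
108,horsebean
124,horsebean
143,horsebean
140,horsebean
309,linseed
"""

df = pd.read_csv(io.StringIO(csv_text))

# Number the rows from 1 to match the printed output (an assumption
# about how the original index was produced).
df.index = df.index + 1

print(df)
```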

The output for the dataframe is:

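| | Weight | Feed |
|---|---|---|
| 1 | 179 | horsebean |
| 2 | 160 | horsebean |
| 3 | 136 | horsebean |
| 4 | 227 | horsebean |
| 5 | 217 | horsebean |
| 6 | 168 | horsebean |
| 7 | 108 | horsebean |
| 8 | 124 | horsebean |
| 9 | 143 | horsebean |
| 10 | 140 | horsebean |
| 11 | 309 | linseed |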

It was truncated for concise readability.  Ok, here is where the magic comes in:  Ready to do some Statistics?

Pandas has a call, DataFrame.describe(), which for numeric dtypes returns the count, mean, std, min, max, and the lower, 50th, and upper percentiles (25%, 50%, and 75% by default).

That is it, folks. Here is the output:

| | Weight |
|---|---|
| count | 71.000000 |
| mean | 261.309859 |
| std | 78.073700 |
| min | 108.000000 |
| 25% | 204.500000 |
| 50% | 258.000000 |
| 75% | 323.500000 |
| max | 423.000000 |
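For reference, here is a hedged sketch of the `describe()` call. It builds the frame from only the eleven rows printed earlier in this post, so its numbers will not match the 71-row output above. The per-feed `groupby` at the end is an extra illustration, not something this post ran.

```python
import pandas as pd

# Only the eleven rows printed earlier, not the full 71-row data set.
df = pd.DataFrame({
    "Weight": [179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309],
    "Feed": ["horsebean"] * 10 + ["linseed"],
})

# Numeric columns only by default, so Feed is left out of the summary.
summary = df.describe()
print(summary)

# Illustrative extra: the mean weight for each feed type.
by_feed = df.groupby("Feed")["Weight"].mean()
print(by_feed)
```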

This will get you most of the way to extracting meaningful (magical?) feedback from your company's data. Explore the other packages; trust me, they are just as easy. While this Snake_Byte was a little long in the fang, we hope it gave you something to think about.

@tctjr