Welcome to the 3rd installment of Snake_Bytes! Today we are covering Python and statistics.
Many of you may remember your statistics class, or maybe you choose not to remember it, or maybe you are unable to remember it. Whatever the case, this Snake_Byte is for you. If you do anything with data, at some point you will need statistics. Alongside linear algebra and probability theory, statistics is one of the fundamental mathematical tools for working with data. Even the most basic statistical methods allow you to make inferences about the data and draw conclusions based on the results.
A little mathematical theory: In this Snake_Byte we will focus on Descriptive or Summary Statistics, which are most commonly seen in some form of visualization.
The same data set can be considered both a population and a sample depending on the reason for acquisition and analysis. So for, say, a clinical study, if the purpose is to do a scoring of outcomes with a distribution of those scores, then that would be a population. If the reasoning is to describe an inference of the outcomes of the clinical data, that would be a sample.
One point I would like to make here is that just because something is popular and easy to implement does not make it correct. We will be talking about summary statistics: mean, median, standard deviation, and variance. These assume a unimodal distribution. Make sure that condition is met, because if it is not, the numbers will still look just fine. They will be wrong.
In most cases, central tendency, which is a measure of location, is the first statistic computed for a new data set. If the distribution has only a single peak, one intuitively asks: where is the peak located? (Note: In machine learning situations you also ask whether the data is symmetric and whether and where any outliers are present.)
The Mean
The Mean (or average) is very intuitive. We sum up the numbers, then divide by the number of values. The mean of a population is denoted by $\mu$ (mu) and the mean of a sample is typically denoted by $\bar{x}$ (x-bar).
Note: one way to reduce the influence of outliers is to calculate what is called the trimmed mean, which discards a fixed fraction of the smallest and largest values before averaging.
mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
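As a rough illustration of the mean and the trimmed mean mentioned in the note, here is a minimal sketch using only Python's standard library. The ten weights are the horsebean rows of the chick data used later in this post; the helper name `trimmed_mean` and the 10% cut are assumptions made for the example.

```python
import statistics

# Ten chick weights (the horsebean rows of the data set used later in this post).
weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end (hypothetical helper)."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values to cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

mean_weight = statistics.mean(weights)        # sum of the values / number of values
trimmed_weight = trimmed_mean(weights, 0.10)  # drops the smallest and the largest value

print(mean_weight, trimmed_weight)
```

With only ten well-behaved values the two numbers are close; the trimmed mean earns its keep when a few extreme readings would otherwise drag the average around.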
The Median
The median of a data set is the middle value when the values are ranked in ascending or descending order (with an even number of values, it is the average of the two middle values). The median is often a better measure of central tendency than the mean for asymmetric data or data that contains outliers.
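To see why the median copes better with outliers, here is a small sketch with the standard library. The value 1000 is a made-up bad reading added purely for illustration; it is not in the real data.

```python
import statistics

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]
with_outlier = weights + [1000]  # hypothetical bad reading, not in the real data

median_clean = statistics.median(weights)          # average of the two middle values
median_outlier = statistics.median(with_outlier)   # middle value of the 11 sorted values
mean_clean = statistics.mean(weights)
mean_outlier = statistics.mean(with_outlier)

print("median:", median_clean, "->", median_outlier)
print("mean:  ", mean_clean, "->", round(mean_outlier, 1))
```

The single outlier shifts the median by a few units but moves the mean by a much larger amount.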
The Mode
The mode is the most frequently occurring value in a data set. The mode is really useful in describing ordinal data or categorical data.
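Because the mode only counts occurrences, it works on labels as well as numbers. A minimal sketch, assuming the feed labels of the eleven rows shown in the truncated output later in this post:

```python
import statistics
from collections import Counter

# Feed labels for the eleven rows shown in the truncated output later in this post.
feeds = ["horsebean"] * 10 + ["linseed"]

most_common_feed = statistics.mode(feeds)  # works on non-numeric data
counts = Counter(feeds)                    # frequency of every category

# multimode returns every value tied for the highest count.
tied = statistics.multimode([1, 1, 2, 2, 3])

print(most_common_feed, dict(counts), tied)
```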
The Standard Deviation
The standard deviation measures how far the data points are spread from the mean. We take the difference between each data point and the mean, square it, average the squared differences, and take the square root of the result. In other words, the standard deviation is the square root of the variance.
standard deviation of a population: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
standard deviation of a sample: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
The Variance
The variance is a measure of distance from the mean and is calculated as the average of the squared deviations from the mean. This is also a measure of spread. Measures of center and spread together form the basis for error functions.
variance of a population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
variance of a sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
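The population and sample versions of the variance and standard deviation differ only in the divisor. The sketch below computes both from the definition and checks them against Python's `statistics` module; the ten weights are the horsebean rows of the chick data used later in this post.

```python
import math
import statistics

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]
n = len(weights)
mean = sum(weights) / n

# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in weights)

pop_var = ss / n             # population variance: divide by N
sample_var = ss / (n - 1)    # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

# The standard library gives the same values.
assert math.isclose(pop_var, statistics.pvariance(weights))
assert math.isclose(sample_var, statistics.variance(weights))
assert math.isclose(pop_sd, statistics.pstdev(weights))
assert math.isclose(sample_sd, statistics.stdev(weights))

print(round(sample_var, 2), round(sample_sd, 2))
```

Note that pandas' `std()` and `var()` use the sample form (divide by n - 1) by default, so that is what shows up in `describe()`.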
The Skew
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
The Kurtosis
Kurtosis is derived from the Greek root “kurtos,” which means curved or arching. It is a measure of the tailedness of a distribution, computed from the fourth moment about the mean (skewness uses the third). Many times when people talk of a fat tail or a long tail they are actually referring to kurtosis and probably don't realize it.
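Skewness and kurtosis can both be written in terms of central moments. The sketch below uses the plain population forms with no bias correction; the helper names and the sample values are assumptions made for illustration. Libraries differ in which correction they apply (pandas' `skew()` and `kurt()` use bias-adjusted estimators), so expect slightly different numbers from them.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Third standardized moment (population form, no bias correction)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # made-up values with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

A symmetric data set has zero skewness, while a long right tail gives a positive value and a long left tail a negative one.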
As a last tidbit for descriptive statistics, chances are that you are familiar with percentiles. A percentile is the percent of cases occurring at or below a score. The relationship between quartiles and percentiles is:
Q1 = marks the 1st quartile = 25th percentile
Q2 = marks the 2nd quartile = 50th percentile
Q3 = marks the 3rd quartile = 75th percentile
Q4 = marks the 4th quartile = 100th percentile
Percentiles are useful for showing how a particular score ranks with regard to other scores on the same variable.
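Quartiles and percentile ranks can be sketched with the standard library as well. The `percentile_rank` helper is a hypothetical name that follows the "at or below a score" definition given above; the weights are the horsebean rows of the chick data used later in this post.

```python
import statistics

weights = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]

# Quartile cut points Q1, Q2, Q3. The "inclusive" method interpolates linearly
# between data points and treats the data as covering the whole range.
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")

def percentile_rank(values, score):
    """Percent of values at or below `score` (hypothetical helper)."""
    at_or_below = sum(1 for x in values if x <= score)
    return 100 * at_or_below / len(values)

print(q1, q2, q3)
print(percentile_rank(weights, 168))
```

Different tools use different interpolation rules for quantiles, so small discrepancies between packages on tiny data sets are normal.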
There are several Pythonic options for computing statistical attributes of your data:
pandas - one of the premier data manipulation libraries, written by Wes McKinney
statsmodels - comprehensive statistical routines with great documentation
scikit-learn - the machine learning library in Python
PyMC - Bayesian data analysis
PyStan - Bayesian analysis based on Stan, a probabilistic programming language for statistical modeling
We will be using pandas in this example. We obtained some data from a really cool site: data site. The example data I am using is chicken feed data. An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. The format is a data frame with 71 observations on 2 variables. In the .csv file, Weight is a numeric variable giving the chick weight and Feed is a factor giving the feed type.
Let's load it into a pandas DataFrame:
    import pandas as pd
    import numpy as np
    from tabulate import tabulate
    import scipy as sp

    # Read the CSV, skipping its header row and supplying our own column names
    data = pd.read_csv('chickwts.csv', names=['Weight', 'Feed'], skiprows=1)
    df = pd.DataFrame(data, columns=['Weight', 'Feed'])
The output for the dataframe is:
   | Weight | Feed      |
1  | 179    | horsebean |
2  | 160    | horsebean |
3  | 136    | horsebean |
4  | 227    | horsebean |
5  | 217    | horsebean |
6  | 168    | horsebean |
7  | 108    | horsebean |
8  | 124    | horsebean |
9  | 143    | horsebean |
10 | 140    | horsebean |
11 | 309    | linseed   |
The output is truncated for readability. OK, here is where the magic comes in. Ready to do some statistics?
Pandas has a method, DataFrame.describe(), which for numeric dtypes returns count, mean, std, min, max, and the lower, 50th, and upper percentiles (by default the 25th, 50th, and 75th).
    df.describe()
That is it, folks. Here is the output:
Weight
count - 71.000000
mean - 261.309859
std - 78.073700
min - 108.000000
25% - 204.500000
50% - 258.000000
75% - 323.500000
max - 423.000000
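The individual statistics are also available as separate Series methods, and `describe()` can be combined with `groupby()` to summarize each feed type on its own. The sketch below builds a small frame from the eleven rows shown in the truncated output above rather than reading the CSV, so its numbers will not match the full 71-row result.

```python
import pandas as pd

# The eleven rows shown in the truncated output above, not the full data set.
df = pd.DataFrame({
    "Weight": [179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309],
    "Feed": ["horsebean"] * 10 + ["linseed"],
})

# Each summary statistic is available as its own Series method.
weight = df["Weight"]
summary = {
    "mean": weight.mean(),
    "median": weight.median(),
    "std": weight.std(),   # sample standard deviation (ddof=1) by default
    "var": weight.var(),   # sample variance (ddof=1) by default
    "skew": weight.skew(),
    "kurt": weight.kurt(),
}

# describe() applied per feed type.
per_feed = df.groupby("Feed")["Weight"].describe()

print(summary)
print(per_feed)
```

A group with a single row, like linseed here, reports a missing standard deviation because the sample formula divides by n - 1.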
This will get you most of the way to extracting meaningful (magical?) feedback from your company's data. Explore the other packages; trust me, they are just as easy. While this Snake_Byte was a little long in the fang, we hope it gave you something to think about.
- Snake_Bytes #3: Statistics! - September 23, 2016