Statistics


Statistics. If you want to gain deep insights from your data, then you're going to need them. In this module, the amazing Sam Ball gets you to dip your toe into stats.

Module Coordinator

Sam Ball




Introduction to Stats in Python



Why Python for Statistics? R, another programming language, is purpose-built for statistics. So why use Python instead of R? Python allows us more flexibility with our data: it has more options for importing data, working with controllers, and image analysis, which makes it more versatile than R. Perhaps more important is the size of the community: Python has a very large following, and its ecosystem sees regular updates and improvements, more so than R's.



Hypothesis Testing



Hypothesis testing is the most basic form of analytical statistics. We start by stating a null hypothesis, say, that the mean of a population is equal to 50. Then, using a sample from that population, we test whether the data give us enough evidence to reject the null hypothesis. Hypothesis testing is easy to do in Python using the scipy library, and the statsmodels library can give us some additional functionality when needed. By the end of this guide you should be comfortable using Python to perform simple hypothesis tests and be able to apply them to real-world problems.
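To give a flavour of what this looks like, here is a minimal sketch of a one-sample t-test using scipy's `ttest_1samp`. The sample is made up for illustration (simulated draws, not real data), and the 0.05 significance level is just the conventional choice, not something the guide prescribes.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 30 simulated measurements (illustrative only).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=52.0, scale=5.0, size=30)

# Null hypothesis: the population mean is equal to 50.
result = stats.ttest_1samp(sample, popmean=50.0)
t_stat, p_value = result.statistic, result.pvalue

# Assumed significance level; 0.05 is a common convention.
alpha = 0.05
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject null: {reject_null}")
```

A small p-value means the sample would be unlikely if the null hypothesis were true, so we reject it. A large p-value does not prove the null hypothesis; it only means the sample gives us no strong evidence against it.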



ANOVA



ANOVA (Analysis Of Variance) lets us test whether a number of samples come from populations with the same mean. It's similar to the independent 2-sample t-test, but doesn't restrict us to just 2 samples. We will look at one-way and two-way ANOVA, and by the end of this guide you should be confident in using these statistical methods in your own work.
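As a rough sketch of the one-way case, scipy provides `f_oneway`, which takes one sequence of observations per group. The three groups below are invented numbers used purely for illustration.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only).
group_a = [24.1, 25.3, 26.0, 24.8, 25.5]
group_b = [27.2, 28.1, 26.9, 27.8, 28.4]
group_c = [24.9, 25.1, 25.8, 24.6, 25.2]

# Null hypothesis: all three groups share the same population mean.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

With only two groups, the F statistic from `f_oneway` equals the square of the t statistic from an equal-variance independent 2-sample t-test, which is one way to see how the two methods are related.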



Statistical Regression



We've looked at hypothesis testing and ANOVA, which can give us insights into continuous data that's split into categories. What if we have two continuous variables and we want to see the relationship between the two? We need to use regression. Regression using statistical libraries like scipy and statsmodels differs from scikit-learn's approach because the output is used for different things. Statistical libraries output more analytical detail and summary statistics, whereas scikit-learn's approach is more about building a working model for predictions. By the end of this guide you should be comfortable with applying basic linear regression models to your own data.
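For a taste of the statistical flavour of regression, here is a minimal sketch using scipy's `linregress` on two made-up continuous variables. The data values are hypothetical and chosen only so there is a clear linear trend.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations of two continuous variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])

# Fit y = slope * x + intercept by ordinary least squares.
fit = stats.linregress(x, y)

print(f"slope     = {fit.slope:.3f}")
print(f"intercept = {fit.intercept:.3f}")
print(f"R squared = {fit.rvalue ** 2:.4f}")
print(f"p-value   = {fit.pvalue:.3g}")   # tests the null hypothesis that the slope is zero
print(f"std error = {fit.stderr:.4f}")   # standard error of the slope estimate
```

Notice that alongside the fitted line we get a p-value and a standard error, which is the kind of analytical output the statistical libraries focus on. statsmodels goes further with a full summary table, while scikit-learn is geared towards fitting a model and calling it to make predictions.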