I am not an engineer

2 Months into Data Analysis: Inferential Statistics Tamed!

Two months after hitting the concrete wall of statistical inference, I am now starting to play with its bricks. This is how I overcame the first major challenge in my very own “data science bootcamp”.

Have you ever been lost in a group conversation, with the impression that there is something you do not know and that you therefore cannot follow the talk? This is how I felt while taking my first class on statistical inference, at least at the beginning. Despite pausing and replaying the video lectures several times, I simply could not follow the lecturer’s explanations.

Simply following online classes and doing a couple of Google searches will not do the trick

Statistical inference did not come naturally to me while I was following the corresponding course in the data science specialization of Johns Hopkins University. So I decided to take a separate specialization focusing exclusively on statistics. I chose Duke’s specialization over the introductory statistics course offered by the University of Amsterdam because Duke’s explicitly stated that it uses R.


Tamed stats kitten

The OpenIntro Stats Book

As soon as I passed the introductory course of Duke’s specialization on probability and data, I got the confirmation: following online classes and doing a couple of Google searches will not do the trick. As recommended in the course syllabus, I downloaded the OpenIntro Stats textbook.

From then on, my learning experience changed radically. Before watching the video lectures, I read the corresponding chapter and solved all the odd-numbered exercises (for which the answers are provided in the appendix). It was less fun and games, and almost felt like going “back to school”.

Cheat sheets and notes on procedures

In addition to writing my usual set of flashcards for each chapter, I started producing notes (which you can find in the practice section) summing up the different theorems. It was the only way to make headway. The notes, written in R Markdown, are designed as proper cheat sheets, packed with the formulae for all the necessary calculations in R. After all, almost everything was new to me: the distributions, the conditions for inference, the test statistics, the methods to check the results, etc.

Whereas some of the notes simply present the course material in a reusable way, ready for copy-and-paste, others are designed to provide procedural guidance for each step in data analysis: What are the steps in exploratory data analysis? What needs to be checked before inferring from a given sample? Which procedure should be followed in a hypothesis test? And many more such questions…

Keep track of your calculations

Another helpful learning method was to keep track of the calculations made in R while going through the exercises in the OpenIntro book. It helped me find a more structured way of solving the statistical riddles. Rather than running the calculations one by one in the console, I wrote short scripts that start with the hypotheses (in the case of a hypothesis test) and the assumptions, followed by the “given data” stated in the exercise, then the “calculated data” required to compute the test statistic, and so on. I simply put a hash sign (#) in front of every line that does not contain R code: hypotheses, assumptions, etc.

This method gave me a good understanding of the steps to follow, for instance in a hypothesis test or when calculating a confidence interval. Writing a script for each exercise also helps you settle on a consistent way of naming your variables, and browsing through code written for previous exercises reminds you of variables that need to be defined and used. There is no need to save the code for each step in a separate file, as you can execute just part of a script by highlighting it in the editor and pressing Ctrl+Enter.
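To make this concrete, here is a minimal sketch of what such a script could look like. The exercise and all of its numbers are invented for illustration (they are not taken from the OpenIntro book), and the variable names are simply one possible naming scheme. It assumes a two-sided one-sample t-test for a mean.

```r
# Hypothetical exercise: all numbers below are made up for illustration.
#
# Hypotheses
# H0: mu = 3    (the population mean equals the null value)
# HA: mu != 3   (the population mean differs from the null value)
#
# Assumptions
# - observations are independent (random sample, n < 10% of the population)
# - n >= 30 and no strong skew, so the sample mean is nearly normal

# Given data (stated in the exercise)
n     <- 100    # sample size
x_bar <- 3.2    # sample mean
s     <- 1.0    # sample standard deviation
mu_0  <- 3.0    # null value
alpha <- 0.05   # significance level

# Calculated data
se <- s / sqrt(n)   # standard error of the sample mean
df <- n - 1         # degrees of freedom

# Test statistic and two-sided p-value
t_stat  <- (x_bar - mu_0) / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Confidence interval at the (1 - alpha) level
t_star <- qt(1 - alpha / 2, df = df)
ci     <- x_bar + c(-1, 1) * t_star * se

# Decision
reject_h0 <- p_value < alpha
```

Because the comment lines spell out the hypotheses and assumptions, the same skeleton can be copied for the next exercise and only the “given data” block needs to change.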

Going through the quizzes and writing the assignments became a mere continuum of the learning process

The “back to school” approach completely reversed the learning experience. I was no longer struggling to understand the video lectures, because I watched them last, after reading the whole corresponding chapter in the OpenIntro book and solving all the exercises. The quizzes became a mere continuation of the exercises in the book, and I finished the Coursera class with a score of 100%.

Still unanswered questions

However, after all the schoolwork, I still wonder about certain major elements of statistics, for instance how to check conditions such as the normality of a distribution. Of course, there is the normal probability plot, but the criteria for “reasonably assuming normality” still seem a little vague to me at this stage.
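For reference, this is roughly how a normal probability plot can be drawn in base R. It is only a sketch on simulated data (the sample is generated here, not taken from the course). The Shapiro-Wilk test at the end is my own addition as a numeric companion to the plot; it does not settle the question of what counts as “reasonably normal” either.

```r
set.seed(42)                        # make the simulated sample reproducible
x <- rnorm(50, mean = 10, sd = 2)   # simulated sample, assumed for illustration

# Normal probability plot: sample quantiles against theoretical normal quantiles
qqnorm(x, main = "Normal probability plot")
qqline(x)   # reference line through the first and third quartiles

# Optional numeric check (not a substitute for looking at the plot)
sw <- shapiro.test(x)
sw$p.value  # a small p-value would be evidence against normality
```

Points that hug the reference line suggest the normality assumption is plausible; systematic curvature at the ends points to skew or heavy tails.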

Moreover, I found contradictory answers online to my questions about pooling variances. I am well aware that the wisdom of the World Wide Web is often ambiguous, but I was still quite surprised to find different answers to what is, after all, a statistics question. Here again, it looks like I need to find out more about the spirit of statistics: which sources to trust and which hints to follow.
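To see what the choice amounts to in practice, the sketch below runs the same two-sample comparison both ways in R, using the `var.equal` argument of `t.test()`. The two samples are simulated for illustration. It does not answer which approach is right for a given data set; it only shows where the two procedures differ.

```r
set.seed(1)                           # reproducible simulated samples
a <- rnorm(20, mean = 5.0, sd = 1.0)  # group A (made-up data)
b <- rnorm(25, mean = 5.5, sd = 1.2)  # group B (made-up data)

n_a <- length(a)
n_b <- length(b)

# Pooled standard deviation: a weighted average of the two sample variances
s_pooled <- sqrt(((n_a - 1) * var(a) + (n_b - 1) * var(b)) / (n_a + n_b - 2))

# Pooled t-test: assumes both groups share one population variance
pooled <- t.test(a, b, var.equal = TRUE)

# Welch t-test: no equal-variance assumption (this is the t.test() default)
welch <- t.test(a, b, var.equal = FALSE)

# The pooled test statistic computed by hand, for comparison with t.test()
t_manual <- (mean(a) - mean(b)) / (s_pooled * sqrt(1 / n_a + 1 / n_b))

pooled$parameter  # degrees of freedom: n_a + n_b - 2
welch$parameter   # Welch-Satterthwaite degrees of freedom, never larger
```

The two tests use different standard errors and different degrees of freedom, so their p-values can differ, which is part of why advice on pooling varies from source to source.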
