It has now been one month since I switched 100% to this data science project, poised to reveal whether there is a future for “non-engineers” (read: non-mathematicians) in the world of data science. On my way to discovering data analysis, am I making any progress? How could I possibly evaluate whether I am making enough headway? To this day, I still cannot answer our ultimate question, but one month of intensive study has already shown me the great power of statistics, no pun intended.
Let’s get dirty
Obviously, actual “data scientists” and the like would have been able to provide answers themselves, and of course they will be invited to do so in the “meet the people” section. Our first step, however, is to get our hands dirty and practise “data science”, be it at a rudimentary stage.
My first steps in statsland
A couple of months ago, I enrolled in the Data Science specialization offered on Coursera by the Bloomberg School of Public Health at Johns Hopkins University. Alongside my full-time job as a consultant and later business developer, I slowly advanced through the 10-course programme, more or less keeping up with the suggested pace of one course per month.
Roughly speaking, this specialization programme first introduces students to the basic tools of the R environment, starting with cleaning and preparing data for initial exploration. After presenting the principles of reproducible research, it takes a deep dive into statistical concepts, including inference and linear regression. The last courses focus on machine learning and data product development.
A deep dive into statistics?
The course takes a deep dive? I should rather say it hits a concrete wall of statistical matter at light speed. At least, that is what my head feels like after my daily 8-hour fix of statistics at the library, in pubs, in cafés and wherever a Wi-Fi connection keeps answering my Google searches. Speaking of which, whereas I used to find the right snippet of code on Stack Overflow, it is now a variety of academic and educational websites that provide human-readable explanations of statistical newspeak.
Johns Hopkins’ introductory course on statistical inference overwhelmed me at first. After feeling quite confident with the basics of R and exploratory data analysis, I suddenly lost my grip on Brian Caffo’s short video lectures. It simply did not matter how many times I replayed the videos. I repeatedly lost track of his explanations, as if he were taking too many shortcuts. At a certain point, I knew that I didn’t know, at least not enough to follow his demonstrations. And there, the deep dive began.
As a human sciences postgraduate, or shall I say not-an-engineer, I was expecting this kind of apparent dead-end experience, where you simply cannot move on without doing some extra homework. Another specialization on Coursera caught my attention: Statistics with R, offered by Duke University. So, one Coursera specialization simply became two.
One Coursera Specialization Became Two
The first, introductory course of the Duke specialization turned out to be highly pedagogical and helpful to a beginner like me. The specialization follows the OpenIntro Statistics book, which is available online for free; to keep up with the Duke course, you work through the corresponding chapters. After passing the first course of the Duke specialization, I returned to Johns Hopkins’ course on inferential statistics, and the learning experience became considerably smoother.
But then again, a few videos later, I could not follow the explanations given in the online lectures. To give you a sense of what it felt like: have you ever listened to people talking in a language you barely master? You catch a word here and there, but somehow you never truly follow the conversation. Now armed with my OpenIntro book, Brian Caffo’s own Little Inference Book and other online sources, I had to change my working method. Back to school.
A glimpse of the beauty of “data science”
Getting familiar with a new programming language such as R has been quite a lot of fun so far. You can play in your sandbox, write your first scripts processing real data and producing telling graphics, and discover the grand variety of solutions to one and the same problem. Then, as soon as you trespass on statistical territory, there is no playing around anymore. There is only one answer and not too many ways of getting there. Yet there lies the beauty of data science: in the mix of competences and working cultures, somewhere between statistical accuracy and programming creativity.
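To give a flavour of that sandbox feeling, here is a minimal sketch in the spirit of such first scripts (not one of my actual files) using R’s built-in mtcars dataset: a few lines already yield a small summary table and a telling graphic.

```r
# A quick exploratory sketch on R's built-in mtcars dataset.
data(mtcars)

# Summarise fuel consumption (miles per gallon) by number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# One more line turns the same data into a graphic.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon",
        main = "Fuel consumption by engine size")
```

Several equally valid routes lead to the same summary (tapply, dplyr, data.table), which is exactly the “grand variety of solutions” mentioned above.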
In an attempt to keep myself happily playing around, I started to write a few scripts along the way. Of course, my creative urge was also driven by a reasonable amount of laziness, as most scripts simply help me remember the correct formula, conditions and procedure. In my defense, did I really start this exploration only to learn by heart a bunch of statistical formulae in record time? Do I really have to waste time decrypting R’s built-in help descriptions if I can write down my own, more practical examples? Is this not what data science is all about: using 21st-century tools to solve much older problems?
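As an illustration of what such a memory-jogging script might look like (a hypothetical sketch, not one of my actual files), here is a helper that spells out the t confidence interval step by step instead of hiding it inside t.test():

```r
# Hypothetical memory aid: the t confidence interval written out in full,
# so the formula sticks: mean +/- t* x (s / sqrt(n)).
t_conf_int <- function(x, conf_level = 0.95) {
  n      <- length(x)
  m      <- mean(x)
  se     <- sd(x) / sqrt(n)                            # standard error of the mean
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # critical t value
  c(lower = m - t_star * se, upper = m + t_star * se)
}

# Sanity check against R's built-in t.test().
x <- c(5.1, 4.9, 6.2, 5.8, 5.5, 5.9, 6.1, 5.2)
t_conf_int(x)
t.test(x)$conf.int
```

Writing the formula out once like this does more for my memory than replaying a lecture a third time.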
Making learning stats easieR…
First, most help descriptions in R are like poetry: highly precise, yet hardly understandable for non-initiates. How many times have I looked up the same command only to find clarity in one of the many examples on the big, wide web? From now on, whenever I look up a command twice or more in R, I create a customized help file containing a simple example. Quite simple, yet handy. It’s called the n00b_help function.
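The guts below are my guess at what such a function could look like (the real one lives in my practice files): store one worked example per command in a named list and print it back on demand.

```r
# A sketch of a n00b_help-style function: one worked, human-readable
# example per command, printed back on demand.
.n00b_notes <- list(
  tapply = "tapply(mtcars$mpg, mtcars$cyl, mean)   # mean mpg per cylinder group",
  grepl  = "grepl('^Merc', rownames(mtcars))       # TRUE for the Mercedes models"
)

n00b_help <- function(cmd) {
  note <- .n00b_notes[[cmd]]
  if (is.null(note)) {
    cat("No note for '", cmd, "' yet - time to write one!\n", sep = "")
  } else {
    cat(note, "\n")
  }
  invisible(note)
}

n00b_help("tapply")
```

Every second lookup of the same command feeds the list, so the help file grows exactly where my memory fails.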
Second, after introducing a set of basic rules, these online courses continuously present a seemingly never-ending series of “exceptions”. First, they tell you that every car drives. Then, they let you know that you need fuel to make the car drive. Finally, you find out that cars consume different amounts of fuel, so before driving, you might want to make sure you have the right amount in the tank. In other words, you need to adapt your analysis procedure as you advance in the learning process.
I therefore started setting up an analysis procedure that works as a decision tree, aiming to make sure that you meet all the necessary conditions and use the right formulae. I hope it won’t be called Frankenstein’s function by the time it covers all the major questions to be asked and precautions to be taken. More about both in the practice section.
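A toy version of the decision-tree idea might look like this (a hedged sketch, assuming a one-sample inference on a mean; the function name and checks are my own, not the Frankenstein-in-progress): check the conditions first, then pick the procedure.

```r
# Hypothetical decision-tree sketch for inference on a single mean:
# verify the conditions, then choose between a z and a t procedure.
choose_mean_procedure <- function(x, sigma_known = FALSE) {
  n <- length(x)
  # Small sample AND clear evidence against normality? Stop at the gate.
  if (n < 30 && shapiro.test(x)$p.value < 0.05) {
    return("conditions not met: small sample, clearly non-normal data")
  }
  if (sigma_known) {
    "z interval (population sd known)"
  } else {
    "t interval (population sd estimated from the sample)"
  }
}

set.seed(1)
choose_mean_procedure(rnorm(25))
```

Each new “exception” a course throws at me becomes one more branch in the tree, which is exactly how I plan to keep the car fuelled before driving.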