Friday, October 25, 2013

How to Learn Python and R, the Data Science Programming Languages, from Beginner to Intermediate and Advanced

The Data Science programming / analytics languages to know are R and Python. If you're in Operations Research or another analytics field that somewhat fits under the "Data Science" hat, you either a) already know them really well, b) want to brush up on them, or c) probably should learn them now. Here I compile my thinking on how to learn R and Python from the Beginner to the Intermediate and Advanced levels, based on having tried some of these course materials.

Beginner (doing basic analysis)

R:

Computing for Data Analysis on Coursera and Youtube (weeks 1, 2, 3, 4), by Roger Peng from Johns Hopkins University

  • Summary: It covers the basics of conditional and loop structures, R's syntax, debugging, Object Oriented Programming, and performing basic tasks with R, such as importing data, basic statistical analysis, plotting and regular expressions. See the syllabus for more.
  • Time commitment: 11~36 hours total, including: 
    • non-programmers: 4 weeks X [3 hours/week on video + 2~6 hours/week on exercises]
    • programmers: [3 hours of notes reading + 8~16 hours on exercises]
  • Advice for: 
    • non-programmers: Listen to all lectures (videos), make sure you understand all details, and do all the exercises to hone your skills. Programming is all about practicing. Doing the exercises is important. See below for "Advanced".
    • programmers: Don't bother with the videos, go straight to the lecture notes (link). Read the notes: it's much faster than watching the videos. If you don't understand something, look up the video and watch it, or google the topic. Then do all the exercises. You don't need me to tell you that practice is king (um, and cash too).

The swirl package within R, by the Biostatistics team at Johns Hopkins University
  • Summary: It aims to teach R and Statistics within the R environment itself, through a package called swirl. See the announcement here for more detailed info.
  • I haven't tried this, so I'm not sure how much time it takes or how good it is. However, it sounds pretty good and deserves a mention. I was never a fan of reading books to learn a programming language. Show me the code or, in this case, let me write the code: getting involved is much more, well, involving.

Python:

Google's Python course (link)
  • Summary: It's straight to the meat, no-nonsense, and covers all the important things. Suits my style. Enough said; see the course page for the syllabus.
  • Time commitment: 8-10 hours
    • including reading notes and doing exercises
  • Note: this is for experienced programmers. There are videos too, but don't bother. The notes on the course page cover the same material, and reading always takes less time than watching.

Intermediate (building analytical models)

R:

Data Analysis with R on Coursera and Youtube (plus class notes), by Jeff Leek from Johns Hopkins University
  • Summary: It covers the full modelling cycle, from getting data, to structuring the analysis pipeline, exploring with graphs and statistical analysis, modelling (clustering, regression and trees), and model checking with simulation. It also talks about important statistical watch-outs like p-values, confidence intervals, multiple testing and bootstrapping. More syllabus here.
  • Time commitment: 32~56 hours
    • including 8 weeks X [2~3 hours/week videos + 2~4 hours/week exercises]


Forecasting using R (link), by Rob Hyndman from Monash University in Australia and Revolution Analytics (the enterprise R solution)
  • Summary: topics include "seasonality and trends, exponential smoothing, ARIMA modelling, dynamic regression and state space models, as well as forecast accuracy methods and forecast evaluation techniques such as cross-validation. Some recent developments in each of these areas will be explored" (quoted from course site). Read more there.
  • Note: I haven't done this (just started), so I'm not sure about its time requirement or quality. I'm also not sure whether they are planning to make the lectures available. Time will tell on these questions.
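
To give a feel for one of the topics listed above, here is a minimal sketch of simple exponential smoothing. It is written in Python purely for illustration and is not taken from the course, which uses R. The function name, the demand numbers and the choice to initialise the level with the first observation are all assumptions of mine.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The last level doubles as the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")

    # Assumption: start the level at the first observation (one common choice).
    level = series[0]
    levels = [level]
    for y in series[1:]:
        # New level = weighted average of the new observation and the old level.
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels


# Made-up demand figures, for illustration only.
demand = [10.0, 12.0, 11.0, 13.0]
levels = simple_exponential_smoothing(demand, alpha=0.5)
forecast = levels[-1]
print(levels, forecast)
```

A larger alpha makes the forecast react faster to recent observations; a smaller one smooths more heavily.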

Python / Octave:

Machine Learning on Coursera, by Andrew Ng from Stanford University --> My Favourite!
  • Summary: The course actually teaches in the Octave language, but it can all be done in Python. I suppose you can do it twice, first in Octave, and then in Python, if you've got the time. It certainly would solidify your understanding of the material, and Andrew Ng is sure that Octave is rather important in Machine Learning. The course assumes some prior knowledge of linear algebra and probability, and refreshes you on some basics. "Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI)." (quoted from the course website)
  • Time commitment: 50~90 hours
    • including 10 weeks X [2~3 hours/week videos + 3~6 hours/week exercises]
  • Note: this course covers a subset of the statistical and modelling principles from the Data Analysis with R course above, but the overall level is more advanced. I enjoyed this course the most.
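
As a taste of what redoing this kind of material in Python might look like, here is a minimal sketch of a supervised-learning staple: fitting a straight line by batch gradient descent. It is my own illustration in plain Python, not code from the course; the toy data, the learning rate `alpha` and the iteration count are made-up assumptions.

```python
# Hypothetical toy data lying exactly on y = 1 + 2*x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit y = theta0 + theta1 * x by batch gradient descent.

    Minimises half the mean squared error over the training set.
    """
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for every training example under the current fit.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

On this toy data the fitted intercept and slope should come out very close to 1 and 2. In practice you would likely reach for a numerical library rather than plain lists, but the loop shows the mechanics.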

Advanced (you follow the drift from above)

Advanced = Experienced.
This is true for programming, analytics, and learning any foreign language.

"Just do it" is how you get experienced.

There is no course on this stuff (i.e. being advanced), not without a PhD _plus_ years of field work.

My best suggestion is to use your curiosity. Find a problem. Dig into it.

Plus, work with other people who are really good.



Happy learning!


5 comments:

Unknown said...

Great post! You should also check out http://learnxinyminutes.com if you're a programmer. Python and R are there with many others as well.

Anonymous said...

Thanks, Dawen, for this interesting article.

Do you have any opinion on https://www.edx.org/course/mitx/mitx-15-071x-analytics-edge-1416 , & how it relates to Coursera/Johns Hopkins' Data Analysis courses?

Also, do you have any opinion on the "ID verified certificate" fee option for these edx/Coursera courses, & do these "ID verified certificates" have any value to employers in the job market?

Dawen said...

The edX course content sounds right on. Their approach of teaching with real cases, and the methods they use, sound like a great idea. They obviously have the MIT name to make them credible too. Overall, it sounds like a good course. If you (anonymous) take it, I'd love to hear about your experience.

As for certificates, I personally didn't care for one, because I took the Coursera courses for personal interest. I have had interviewers mention that they noticed on my CV that I'd recently taken the course, so people do take notice, I suppose. I am not an early adopter of most things, so I wouldn't jump on the certificates just yet, because I think those skills can be tested/investigated in interviews, and they should be tested/verified too.

Wayward Listener said...

The edX MIT course is pretty good. I'd started the Johns Hopkins course on Coursera, but stopped midway due to a time crunch. The MIT course doesn't cover too much of the R syntax at first, unlike the Coursera course. But the case-based method is very engaging while being quite (practically) educational. I'd recommend it readily.

Also, thanks for linking to the courses on Python. I'm beginning to get comfortable with R so wanted to pick up Python for general purpose programming and text parsing.

P.S. The ID verified edX course starts at $100. I opted for the honour code :)

Anonymous said...

Greetings Dawen,

Re edx/MIT 15-071x Analytics Edge

"If you (anonymous) take it, I'd love to hear about your experience."

It is a good course. As Wayward says, the "case based method" shows the interesting practical application of the concepts. There are 3-4 homework problems each week, each of which is another practical case. Every week is a different topic, so the course is broad in subject matter. I found this course more interesting & educational than the Coursera/Johns Hopkins course.

One of the lecturers on the course Discussion Forum said it will be re-offered in Spring 2015. Of course, before then, one could "register" to access the Video Lectures & Quick Questions; however, the Homeworks & Final Exams may or may not be removed.

Thanks again for your interesting blog. Cheers!