Friday, October 25, 2013

How to Learn Python and R, the Data Science Programming Languages, from Beginner to Intermediate and Advanced

The Data Science programming / analytics languages to know are R and Python. If you're in Operations Research or another analytics field that somewhat fits under the "Data Science" hat, you: a) already know them really well, b) want to brush up on them, or c) probably should learn them now. Here I compile my thinking on how to learn R and Python from the Beginner to the Intermediate and Advanced levels, based on having tried some of these course materials.

Beginner (doing basic analysis)

R:

Computing for Data Analysis on Coursera and Youtube (weeks 1, 2, 3, 4), by Roger Peng from Johns Hopkins University

  • Summary: It covers the basics: conditional and loop structures, R's syntax, debugging, object-oriented programming, and basic tasks in R such as importing data, basic statistical analysis, plotting and regular expressions. See the syllabus for more.
  • Time commitment: 11~36 hours total, including: 
    • non-programmers: 4 weeks X [3 hours/week on video + 2~6 hours/week on exercises]
    • programmers: [3 hours reading the notes + 8~16 hours on exercises]
  • Advice for: 
    • non-programmers: Listen to all the lectures (videos), make sure you understand all the details, and do all the exercises to hone your skills. Programming is all about practice, so doing the exercises is important. See below for "Advanced".
    • programmers: Don't bother with the videos; go straight to the lecture notes (link). Reading the notes is much faster than watching the videos. If you don't understand something, look up the video and watch it, or google the topic. Then do all the exercises. You don't need me to tell you that practice is king (um, and cash too).
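
For the curious, here is how the 11~36 hour range above adds up, as a small Python sketch. The helper name total_hours is my own invention for illustration; the numbers are the ones from the bullets above, and I'm treating the programmers' route as a single one-off block of work rather than a weekly schedule.

```python
def total_hours(weeks, *weekly_ranges):
    """Return the (low, high) total hours for a course.

    Each entry in weekly_ranges is a (low, high) tuple of hours per week.
    """
    low = weeks * sum(r[0] for r in weekly_ranges)
    high = weeks * sum(r[1] for r in weekly_ranges)
    return low, high


# Non-programmers: 4 weeks x [3 h of video + 2~6 h of exercises]
non_programmers = total_hours(4, (3, 3), (2, 6))

# Programmers: one block of [3 h reading notes + 8~16 h of exercises]
programmers = total_hours(1, (3, 3), (8, 16))

# The overall range spans the fastest and the slowest route.
overall = (
    min(non_programmers[0], programmers[0]),
    max(non_programmers[1], programmers[1]),
)

print(non_programmers)  # (20, 36)
print(programmers)      # (11, 19)
print(overall)          # (11, 36)
```

The same helper reproduces the other estimates in this post, e.g. total_hours(8, (2, 3), (2, 4)) for an 8-week course with 2~3 hours of video and 2~4 hours of exercises per week.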

The swirl package within R, by the Biostatistics team at Johns Hopkins University
  • Summary: It aims to teach R and Statistics within the R environment itself, through a package called swirl. See the announcement here for more detailed info.
  • I haven't tried this, so I'm not sure how much time it takes or how good it is. However, I think it sounds pretty good and deserves a mention. I was never a fan of reading books to learn a programming language. Being shown the code, or in this case writing the code and getting involved, is much more, well, involving.

Python:

Google's Python course (link)
  • Summary: It's straight to the meat, no-nonsense stuff, and covers all the important things. Suits my style. Enough said, so see the course page for the syllabus.
  • Time commitment: 8-10 hours
    • including reading notes and doing exercises
  • Note, this is for experienced programmers. There are videos too, but don't bother. The notes on the course page are the same, and it always takes less time to read than watch.

Intermediate (building analytical models)

R:

Data Analysis with R on Coursera and Youtube (plus class notes), by Jeff Leek from Johns Hopkins University
  • Summary: It covers the full modelling cycle, from getting data, to structuring the analysis pipeline, exploring with graphs and statistical analysis, modelling (clustering, regression and trees), and model checking with simulation. It also talks about important statistical watch-outs like p-values, confidence intervals, multiple testing and bootstrapping. See more of the syllabus here.
  • Time commitment: 32~56 hours
    • including 8 weeks X [2~3 hours/week videos + 2~4 hours/week exercises]


Forecasting using R (link), by Rob Hyndman from Monash University in Australia and Revolution Analytics (the enterprise R solution)
  • Summary: topics include "seasonality and trends, exponential smoothing, ARIMA modelling, dynamic regression and state space models, as well as forecast accuracy methods and forecast evaluation techniques such as cross-validation. Some recent developments in each of these areas will be explored" (quoted from course site). Read more there.
  • Note: I have only just started this, so I'm not sure about its time requirement or quality. I'm also not sure if they are planning to make the lectures available. Time will tell on these questions.

Python / Octave:

Machine Learning on Coursera, by Andrew Ng from Stanford University --> My Favourite!
  • Summary: The course actually teaches in the Octave language, but it all can be done in Python. I suppose you can do it twice, first in Octave, and then in Python, if you've got the time. It certainly would solidify your understanding of the material, and Andrew Ng is sure that Octave is rather important in Machine Learning. It assumes some prior knowledge of linear algebra and probability, and refreshes you on some basics. "Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI)." (quoted from the course website)
  • Time commitment: 50~90 hours
    • including 10 weeks X [2~3 hours/week videos + 3~6 hours/week exercises]
  • Note: this course covers a subset of the statistical and modelling principles from the Data Analysis with R course above, but the overall level is more advanced. I enjoyed this course the most.
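
To give a flavour of what "doing it in Python" could look like, here is a minimal sketch of a standard supervised-learning exercise: fitting a straight line with batch gradient descent. This is my own toy example, not code from the course. The function name, the learning rate, the iteration count and the data are all assumptions made for illustration, and it uses plain Python lists so that it runs without any extra libraries.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit y = theta0 + theta1 * x by batch gradient descent.

    alpha (learning rate) and iterations are illustrative defaults,
    chosen to converge on the toy data below.
    """
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error cost (with the usual 1/2 factor).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should recover those values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # should be very close to 1.0 and 2.0
```

In practice you would probably vectorise this with a numerical library, which is closer to how the matrix-oriented Octave code reads, but the loop version shows the idea.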

Advanced (you follow the drift from above)

Advanced = Experienced.
This is true for programming, analytics, and learning any foreign language.

"Just do it" is how you get experienced.

There is no course on this stuff (i.e. being advanced), not without a PhD _plus_ years of field work.

My best suggestion is to use your curiosity. Find a problem. Dig into it.

Plus, work with other people who are really good.



Happy learning!


Monday, October 7, 2013

The most efficient pizza shop - Ugi's in Buenos Aires

This blog post by the Operations Room, one of my favourite operations blogs, reminded me that I should write about the Argentine pizza chain, Ugi's. While visiting Buenos Aires (BA), I found its pizza pretty good, but I was more fascinated by how it operates its stores and by its business model.

The business model


The product:
They sold one pizza. Exactly one type: mozzarella on tomato sauce on pizza dough.
No variations.

Size:
The whole pizza. Or by the slice.

Extras:
You can add condiments like chilli peppers and oregano, after you get the pizza, for free.
The cardboard box is extra.

Environment:
Basic and bare. No frills.
There is basically standing room only with very few seats and tables in the shops. Most people do take out.

--------------------------------------------------------------
The USP (Unique Selling Point):
Cheap.
Fast.

The result: Very popular! Probably the longest queue for food in BA.
--------------------------------------------------------------

The operation

Each shop had a big oven and 2 guys making the pizzas. That seems to be it.

Tasks of the pizza maker:
1. Work the dough and spin it out to lay onto a wooden pizza pan.
2. Ladle the tomato sauce from a big pot onto the dough, and spread it over the dough in a circular motion with the bottom of the ladle.
3. Cut a brick-sized block of mozzarella from a giant block. Split it in half, and chuck it in the middle of the sauced dough. (It melts nicely and spreads somewhat evenly all over.)
4. Put the pizza into the oven.
5. Check on the other pizzas in the oven. Take them out onto the table when ready.

Tasks of the pizza giver:
6. Slice the pizza.
7. Box it. Or put a slice on a plate.
8. Hand it to the customer.
9. Take money.

I could have stared at the guy making pizzas for hours. It was so well practiced and smooth, since it's the only thing he makes all day long, by the hundreds, every day. It reminded me of some of the best-run factories I've visited. Precise. Lean and Mean. The simple business model makes it possible.

Change / Improve?

Do you find yourself asking, "Given their popularity, why don't they add 1 or 2 more flavours, like pepperoni or something?" Or... why should they change anything?
  • I think for one, it would trade speed for variety.
  • Secondly, they are already maxing out their capacity, so why add more? There is unlikely to be more revenue to be had, and I can't comment on profitability.
  • Thirdly, given their popularity, why should they change any of it? The customers clearly like it the way it is.

You can read more about Ugi's here.

As much as I love the parrilladas Argentinas, do have a Ugi's pizza next time you're in BA!