Wednesday, January 22, 2014

Building Data Science Teams: Individuals vs Team - State of the Debate So Far

Since my last article on "Hiring 1 Data Science unicorn is hard enough, a team is impossible. To scale means to specialise", similar ideas have been expressed by InformationWeek, McKinsey/HBR, and KDnuggets (here, here, here and here).

There has been a ton of great discussion. I attempt to summarise the viewpoints so far:
  • Data Scientists are supposed to have some pretty deep expertise in some pretty hard areas (see diagram). 
  • Is it possible to close this talent gap when we seem to be chasing after superheroes or unicorns? (there are some, but very few)
  • Some (44%) think there should be data science sub-specialisations (which all exist today), and have them work together in a team.
  • Others (44% too) prefer the superhero approach - individuals who have it all

Opinions so far on whether to build out data science with a team or with individuals are as follows:

| For Team / against individuals | For Individuals / against team |
| --- | --- |
| for bigger companies | for smaller companies (can't afford) |
| Easier to find all necessary skill-sets | Easier to get things done (no coordination friction) |
| Don't fall apart if an individual leaves | |
| Jack-of-all trades, master of none; Deep expertise more possible in team | Automation tools will take over data engineering & cleaning from DS jobs, so can concentrate on modelling |
| Business domain expertise & soft skills are hard to find in math/quant majors | Higher-ed will turn out DS superstars soon, who will have the combined maths/computing skills |

A good team has both Specialists and Generalists
DS is a field that's evolving fast, and so will these opinions
You want an all-round DS guy/gal to get you started, or 2-3 of them who round each other off. As your team grows with demand, it will become increasingly difficult to find those all-encompassing individuals, so your team will naturally be people with 1-2 of the DS skills.

If you are still keen to know more about what data scientists do, and who they are, listen to these DS guys talk:
  • Amazon's principal engineer: John Rauser, "What is a career in big data?" - 17 minutes of a very good stepped-back view of data science.
  • Cloudera's director of data science: Josh Wills, "Life as a data scientist" - some good nuggets in there at minutes 10, 16, 25 and 52:
    • "I'm a competent statistician... I'm a competent programmer... I would not say I am good... I am capable of having a conversation at each of those fields with them..."
    • "Scientists get linear regression...but they don't get the difference between linear regression and logistic regression...or the assumptions that underlie the regression models", such as normally distributed errors (residuals) in linear regression; for them it's more of a "mechanical" exercise of turning the crank on the data without understanding the assumptions that support the model
    • Kaggle has "done most of the hard work [for the competitors]". In my opinion, the guys who are competing are good at using the ML tools on a clean-ish data set; but it doesn't exactly test their ability to go from a business problem, to a "mental model of the data required", to the type of problem to solve (segmentation, regression, etc.)
    • what stats to learn for someone from the computer engineering side of data science: "learn linear regression, t-tests, confidence intervals, binomial random variables, exponentially distributed random variables, ... the core stuff, really, really well"
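
To make the point about regression assumptions concrete, here is a minimal sketch (my own illustration, not something from the talk) using only the Python standard library. The data is synthetic and the helper names `fit_line` and `skewness` are made up for this example. It fits a straight line to a heavily skewed predictor with normally distributed noise, and suggests that the normality assumption in linear regression is about the errors, not about the variables themselves.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def skewness(values):
    """Sample skewness: near 0 for symmetric data, large for lopsided data."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Synthetic data (an assumption of this sketch): a skewed, clearly
# non-normal predictor, with normal noise around the line y = 2 + 3x.
rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(5000)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"intercept={a:.2f} slope={b:.2f}")
print(f"skewness of x:         {skewness(xs):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

The predictor is far from normal, yet the fit should recover the intercept and slope closely, and the residuals come out roughly symmetric. Checking the residuals like this, rather than just turning the crank, is the kind of habit the quote is asking for.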

P.S. After writing this all out, it sounds so obvious. But believe me, there has been so much debate around this topic, and I wanted some... sense. Go read those articles linked at the top if you want to know.
