Sunday, November 23, 2014

Spam gets a personal touch: Human 1, Machine 1

Blogging and spamming practically come hand in hand. Obvious spam has been pretty well controlled by the major blogging platforms' spam filters, thanks to advances in text analysis and machine learning algorithms. However, the filters are not perfect. Or are they? You be the judge in this case.

This could be an example of how creative spammers are at combating algorithms.

Or, it could be an example of a business owner trying to do his own selective SEO (search engine optimization).

An old post on mandatory school uniforms got the following spam:
I think school uniforms must be compulsory in schools because after one time-investment in the uniform, it prevents the child from the traits of social inequality,inferiority complex etc.And If you have decided to buy the uniform, buy it from Wang Uniforms (link removed)

I speculate that a human wrote the comment, because it is a sensible comment, and also because of the grammatical, punctuation and spacing errors.

However, the link, which I removed for this post, does point to a legitimate school uniform maker in the UAE. I suppose there are two possibilities:
1) The uniform business had legitimately read the article, had something genuine to say, and also wanted to promote its own business.
2) The uniform business hired a spammer / mass commenter to do the job for SEO purposes.

I had a bit of a hard time deciding whether this was spam or not. Since I cannot edit the comment to remove the link, I rejected it, especially after I found out that the commenter's profile belonged to a jewelry shop in South East Asia, which has nothing to do with uniforms.

Algorithms are never perfect. The underlying uncertainty is why we build algorithms at all. Given that I, the human, had trouble judging the authenticity of this comment, I'm glad the machine (spam filter) didn't just rule it out.

So... Human vs Machine: Human 1, Machine 1?

P.S. Unrelated, but this is quite funny. Don't be fooled by the title.
Visualizing Big Data

Tuesday, February 11, 2014

What I learned from a sabbatical year

I spent 2013 'overlanding' through South America with my partner. 1 year, 1 continent, 1 simple car, 2 people, 13 countries, 40,000 km. After moving from Canada to the UK 5 years ago and setting up a new life there, we gave up our jobs, salaries, friends and all the comforts of life in one of the greatest metropolises in the world. A lot to let go of, but we gained so much more.

Above all, I learned how little I need to live on to be happy, material-wise. We converted the back of our little van into a bed, so we slept in it on many nights, wild camping at some bizarre and cool spots: 24-hour gas stations and garages, roadsides somewhere in the country, a cliff edge by the sea, a lot of central plazas and town squares, in front of police stations (with permission), and once within a secure military compound. The living was rough, and it took some getting used to. I had very few possessions; I was happy; and my eyes were filled with wondrous things throughout the year.

Communities kick ass in supporting overlanding travellers of all modes: by car, motorcycle, bicycle, unicycle or even by donkey(!). A few hundred people gathered in a Facebook group were the best providers of near real-time information. Almost all overlanders are super eager to share information and help each other, because we've all known a few hard moments on the road. Most people in the online community have never met each other, but are ready to answer questions when asked.

One can have too much of a good thing. I loved travelling, and still do. 60+ countries later, my imaginary list is still quite long. Doing a year of pure travel is super fortunate, and I almost don't dare to say that I sometimes found it hard to drag myself, for the 11th time in 3 months, to drive through yet another beautiful wine country with breath-taking alpine scenery, or more Andean mountain villages, or serene beaches, etc. Managing the trip was a huge challenge, but I also missed work a lot, and the different challenges that come with it. So, in the evenings I:
  • brushed up on R and some Machine Learning techniques through Coursera (awesome!)
  • learned something new: Octave, Python, more Machine Learning techniques
  • read a lot of blogs on OR, analytics and data science
  • wrote a few blog articles here (definitely neglected when I was working a busy job)
  • thought long and hard about what I want to do when I get back

A bunch of random stuff I learned a bit about:
  • Navigating in places I've never been before. "Don't listen to the British lady (aka the GPS voice), she's never been to Venezuela", and she's leading us down a dead-end.
  • Spotting and dodging potholes, rocks, livestock, cowboys, donkey carts, tree stumps, burned tires (12-day riot aftermath), a flying ladder (kid you not, fallen from the truck 15 m in front at 90 km/h), alignment-breaking and bottom-scraping grooves in the road from heavy Brazilian trucks...
  • Swimming through potholes the size of a swimming pool, with muddy and seemingly endless bottoms, with a 2x4 car that had 6" clearance (nope, didn't get stuck even once! 4x4 is not a necessity for everyone)
  • Fixing cars and dealing with mechanics, and their other-worldly Spanish
  • Playing with the police to always avoid paying bribes (wasn't too often)
  • Finding out just how friendly people are (lots of home-stay invites)
  • Playing the Quena (Andean flute) is way harder than it seems - sticking with my uke instead
  • Optimising the journey in Travelling Salesman fashion (had to return to the origin to sell the car) - yes, Operations Research is useful in every walk, or drive, of life
  • Optimising decisions under uncertain conditions
  • And of course, learning Spanish, with all sorts of accents and idioms, and the 13 countries' history, culture, landscape, food and people (P.S. mechanics and old country farmers are really hard to understand)
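
For the curious, the round-trip routing mentioned in the list above can be sketched in a few lines of Python. This is only a toy illustration, not how the actual trip was planned: the stop names and coordinates are made up, distances are straight lines on a flat grid, and the brute-force search is only practical for a handful of stops.

```python
from itertools import permutations
from math import dist

# Hypothetical stops on a flat grid (arbitrary units, not real places).
STOPS = {
    "origin": (0.0, 0.0),
    "a": (0.0, 3.0),
    "b": (4.0, 3.0),
    "c": (4.0, 0.0),
}


def tour_length(order, stops):
    """Total length of a closed tour that visits the stops in the given order."""
    # Pair each stop with the next one; the last stop pairs with the first.
    legs = zip(order, order[1:] + order[:1])
    return sum(dist(stops[p], stops[q]) for p, q in legs)


def best_round_trip(stops, start="origin"):
    """Try every visiting order and keep the shortest tour back to `start`."""
    others = [name for name in stops if name != start]
    candidates = ([start] + list(perm) for perm in permutations(others))
    best = min(candidates, key=lambda order: tour_length(order, stops))
    return best + [start], tour_length(best, stops)


route, length = best_round_trip(STOPS)
print(route, length)
```

With more than ten or so stops the number of orderings explodes, which is why real route planning leans on heuristics rather than trying everything.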

Having finished the year-long journey over a month ago, I was inspired to write this article after reading "Why I put my company on a year-long sabbatical". This is not a PR article, but one to say that anyone can do this sabbatical thing, and you will learn a ton. You don't need the best car all decked out. You don't need to be young. You don't need to be retired. You don't need to be without kids (we met a lot of families, with kids from 6 months to 17 years old). You don't need to have a partner. You don't need to be rich (our all-in costs: £10,000 per person, assuming 2 people sharing). Actually, you will learn how little you need at all. All you need is a bit of discipline to save some money, a bit of guts to throw yourself at it, some luck and common sense to be safe, and a lot of curiosity to explore.

In case you are inspired to consider a sabbatical year, here are some great overlanding resources:

2014 is going to be great. I have never been more ready.
First step, land an awesome job.

Thursday, January 23, 2014

Finally Some Sense on Analytics & Data Science Job Ads

After yesterday's post on the state of the debate on building data science teams (individual vs team approach), it's so refreshing to stumble onto this careers page of Civis Analytics. Great example of Analytics & data science job ads done right. This page alone makes me want to apply to work there!

They actually divide their jobs into: data scientist, engagement analyst, project manager, software engineer!

data science analytics roles done right: software engineer, data scientist, engagement analyst, project manager

How sensible. I like it! 
Nothing like the typical data science job posts, asking for "everything and the kitchen sink".

Wednesday, January 22, 2014

Building Data Science Teams: Individuals vs Team - State of the Debate So Far

Since my last article on "Hiring 1 Data Science unicorn is hard enough, a team is impossible. To scale means to specialise", similar ideas have been expressed by InformationWeek, McKinsey/HBR, and KDnuggets (here, here, here and here).

There has been a ton of great discussion. I attempt to summarise the viewpoints so far:
  • Data Scientists are supposed to have some pretty deep expertise in some pretty hard areas (see diagram). 
  • Is it possible to close this talent gap when we seem to be chasing after superheroes or unicorns? (there are some, but very few)
  • Some (44%) think there should be data science sub-specialisations (which all exist today), and have them work together in a team.
  • Others (44% too) prefer the superhero approach - individuals who have it all

Opinions so far on the approach of team vs individuals to build out a data science team are as follows:

For Team / against individuals:
  • for bigger companies
  • Easier to find all necessary skill-sets
  • Don't fall apart if an individual leaves
  • Jack-of-all trades, master of none; Deep expertise more possible in team
  • Business domain expertise & soft skills are hard to find in math/quant majors

For Individuals / against team:
  • for smaller companies (can't afford)
  • Easier to get things done (no coordination friction)
  • Automation tools will take over data engineering & cleaning from DS jobs, so can concentrate on modelling
  • Higher-ed will turn out DS superstars soon, who will have the combined maths/computing skills

A good team has both Specialists and Generalists
DS is a field that's evolving fast, and so will these opinions
You want an all-round DS guy/gal to get you started, or 2-3 of them who round each other out. As your team grows with demand, it will become increasingly difficult to find those all-encompassing individuals, so your team will naturally consist of people with 1-2 of the DS skills.

If you are still keen to know more about what data scientists do, and who they are, listen to these DS guys talk:
  • Amazon's principal engineer: John Rauser, "What is a career in big data?" - 17 minutes of a very good stepped-back view of data science.
  • Cloudera's director of data science: Josh Wills, "Life as a data scientist" - some good nuggets in there at minutes 10, 16, 25 and 52:
    • "I'm a competent statistician... I'm a competent programmer... I would not say I am good... I am capable of having a conversation at each of those fields with them..."
    • "Scientists get linear regression...but they don't get the difference between linear regression and logistic regression...or the assumptions that underlie the regression models", such as normally distributed errors (residuals) for linear regression; it's more of a "mechanical" exercise to turn the crank on the data without understanding the assumptions that support the model
    • Kaggle has "done most of the hard work [for the competitors]". In my opinion, the guys who are competing are good at using the ML tools on a clean-ish data set, but that doesn't exactly test their ability to go from a business problem to a "mental model of the data required" to the type of problem to solve (segmentation, regression, etc.)
    • what stats to learn for someone from the computer engineering side of data science: "learn linear regression, t-tests, confidence intervals, binomial random variables, exponentially distributed random variables, ... the core stuff, really, really well"

P.S. After writing this all out, it sounds so obvious. But believe me, there has been so much debate around this topic, and I wanted some... sense. Go read those articles linked at the top if you want to know.