Monday, July 29, 2013

Learn R with Coursera for Data Analysis

Heads up: the Computing for Data Analysis course is running in September 2013.

It will teach you the R language for data analysis. The course is described as:
This course is about learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods. 
In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment, discuss generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing which includes programming in R, reading data into R, creating informative data graphics, accessing R packages, creating R packages with documentation, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.


Related articles:
Coursera and the Analytics Talent Gap
Starting up in Operational Research: What Programming Languages Should I Learn?

Sunday, July 28, 2013

Even Google can't get their numbers straight

Google has a great many entities and products, some grown within the organisation and some acquired externally. It appears that even Google, the leader in Data Science and Analytics, cannot get its numbers straight across its products: Google Analytics vs. Blogger.

Is this blog really that popular? Really?

While I was checking this blog's traffic numbers on Blogger's built-in "Stats" function, I was surprised to see that the blog seems to be really popular, even though I have not been good (sorry!) at writing much for some time. As an ex-SEO'er, I had an inkling that something was not right. Up comes Google Analytics.

Blogger Stats numbers are 4.5 times bigger than Google Analytics'.

After checking my Google Analytics (GA) numbers, I was really surprised to see that the Blogger Pageview numbers were 4.5 times bigger than the GA numbers. That is a staggering difference!

After some research on the web, I concluded that:
  1. GA is much closer to the truth (but not quite completely true, see 3 below).
  2. Blogger stats include all kinds of bot traffic, so they are heavily inflated (GA tries to filter most of it out).
  3. GA cannot count any traffic if the user has disabled JavaScript. Some folks suggest it undercounts traffic by 50%, but there is no hard evidence to back that up, so take it with a grain of salt.
  4. Blogger seems set on reporting only Pageviews, not any other useful metrics, such as Visits or Unique Visitors. Not sure why.
  5. This blog has probably been targeted by a spam bot. Upon closer look, one of the bots probably comes from a particular Dutch ISP.
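
To put points 1 and 3 together, here is a rough back-of-the-envelope sketch in Python. The GA pageview count is a made-up placeholder, and the 50% undercount is the unverified figure quoted above, so treat the result as illustrative only.

```python
# Rough sanity check: how much of the Blogger/GA gap could a GA undercount explain?
# ga_pageviews is a hypothetical placeholder, not this blog's real traffic.
ga_pageviews = 1_000
blogger_ratio = 4.5   # Blogger Stats vs GA, as observed above
ga_undercount = 0.5   # unverified claim: GA misses 50% of real traffic

blogger_pageviews = ga_pageviews * blogger_ratio
# If GA only sees (1 - undercount) of real traffic, scale it back up.
estimated_real = ga_pageviews / (1 - ga_undercount)
remaining_ratio = blogger_pageviews / estimated_real

print(f"Blogger would still be {remaining_ratio:.2f}x the corrected GA figure")
```

Even under that generous assumption, more than half of the Blogger pageviews would remain unexplained, which fits with the bot traffic in point 2.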

Share best practice and be consistent.

I would have expected Google, the leader in Data Science and Analytics, to share best practice amongst its entities and products, such as reporting on key metrics (not just Pageviews).

I would also have expected Google to be able to keep a consistent set of numbers across its entities and products. That doesn't appear to be the case either.


More often than not, the majority of a Business Intelligence (BI) analyst's time is spent verifying and reconciling numbers across various reports. Major BI tech giants sell BI applications that often claim to reduce such activities and increase business confidence in the numbers in the data warehouse. However, it is still a major challenge for most companies, as evidenced here. Without a good and reliable data source, the validity of any subsequent analysis is heavily undermined.

Let's try to stay consistent.
That goes for the metric choice, and the numbers.


FYI: if you want to find out whether someone is attacking your site with spam bots, and who, read this helpful post.

Saturday, April 6, 2013

7.2% raise for 1,000 best paid Ontario public sector employees




The top 1,000 employees with the highest package (salary + taxable benefits) in the Ontario Public Sector Salary Disclosure, the so-called “Sunshine List”, saw an average increase of almost $25,000 in 2012 compared to the previous year, an increase of 7.2%, much higher than the bottom half of the 80,000-strong list which saw an increase of only 2.2%.

Is this cause for alarm? Highly paid CEOs are fully in the public spotlight, and the many, many school principals have their pay closely monitored, but what about the highly paid individuals near, but not at, the top? The data shows that for them, 2012 was a good year.

Every year since 1996, the Ontario Ministry of Finance has released a list of all public sector employees who earned more than $100,000 in the previous year.

Oversight

We can all see that “Sunshine List” champion Thomas Mitchell, President & CEO of Ontario Power Generation, took a pay cut this year, but with the list approaching 100,000 names, more sophisticated, data-driven oversight is possible.

Government-friendly observers point out that the average salary on the list has decreased, just like last year, but that is a red herring. Anyone can add over 9,000 people earning just over $100k to a list with an average salary of $129k and bring down the average. As the list continues to grow from the bottom, we can expect the average salary to decline, without this being any indicator of public fiscal discipline.
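
The dilution effect is easy to check with a weighted average. The sketch below is illustrative only: the size of the existing list, the 2% raise and the newcomers' exact salary are assumptions I made up, while the $129k average and the 9,000 newcomers come from the paragraph above.

```python
def combined_average(n_old, avg_old, n_new, avg_new):
    """Average salary after merging two groups (a weighted mean)."""
    return (n_old * avg_old + n_new * avg_new) / (n_old + n_new)

# Assumptions for illustration only: the prior list size, the 2% raise and
# the exact newcomer salary are hypothetical, not taken from the disclosure.
n_existing = 79_000
avg_existing_last_year = 129_000
avg_existing_this_year = avg_existing_last_year * 1.02  # everyone gets a 2% raise
n_newcomers = 9_000
avg_newcomers = 101_000  # "just over $100k"

new_list_average = combined_average(
    n_existing, avg_existing_this_year, n_newcomers, avg_newcomers
)
print(f"existing members: {avg_existing_last_year:,.0f} -> {avg_existing_this_year:,.0f}")
print(f"whole list:       {avg_existing_last_year:,.0f} -> {new_list_average:,.0f}")
```

Under these assumptions every existing member is paid more than last year, yet the list-wide average still falls below $129k.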

Opposition partisans will lament the increasing growth of the list: 9,000 more this year and 7,500 the year before. This is again misleading. The pyramid shape of any organisation tells us that there are more people as you move down the salary brackets. With perfectly reasonable average salary growth of just over 2.5%, 9,600 employees graduated to the “Sunshine List” this year after having earned around $98k last year. Probably more than 9,600 employees currently earning around $98k will be new additions to the list next year, and more the year after. Inflation and economic growth will ensure that the list grows, and the pyramid shape will ensure that it grows faster.

Top 1,000

So who are these lucky 1,000 who on average made 7.2% more in 2012?

This year the top 1,000 packages on the list included:
  • 583 individuals working in hospitals
    • 176 Pathologists
    • 50 Chief Executive Officers
    • 66 Vice-Presidents (Senior, Executive, etc.)
    • 79 Psychiatrists
  • 86 employees in electricity
    • 56 Vice-Presidents (Senior, Executive, etc.)
  • 144 working at Universities
    • 100 Professors

Big raises

Of the 1,000, 737 can be matched exactly by name and organisation type to last year. 92 of those fortunate souls saw an increase of over 25%! At the top of the pack was Mohamed Abelaziz Elbestawi, Vice-President Research/Professor at McMaster University, whose reported paid salary was $266k in 2011 and $506k in 2012! Trung Kien Mai, a Pathologist at The Ottawa Hospital, saw his paid salary move from $306k in 2011 to $515k in 2012!
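
For anyone wanting to repeat the matching, here is a minimal sketch of the idea in Python, not the exact process used for this post. It uses only the two salaries quoted above; the organisation-type labels and the extra unmatched record are assumptions added for illustration.

```python
# Salaries in $k for the two examples quoted above. The organisation-type
# labels are assumed for illustration; a real run would load the full
# disclosure files for both years instead.
salaries_2011 = {
    ("Mohamed Abelaziz Elbestawi", "Universities"): 266,
    ("Trung Kien Mai", "Hospitals"): 306,
}
salaries_2012 = {
    ("Mohamed Abelaziz Elbestawi", "Universities"): 506,
    ("Trung Kien Mai", "Hospitals"): 515,
    ("Hypothetical Newcomer", "Hospitals"): 400,  # no 2011 match, so skipped
}

def percent_raises(previous, current):
    """Percent change for every (name, organisation type) key present in both years."""
    return {
        key: 100 * (current[key] - previous[key]) / previous[key]
        for key in current.keys() & previous.keys()
    }

raises = percent_raises(salaries_2011, salaries_2012)
big_raises = {key: pct for key, pct in raises.items() if pct > 25}

for (name, org_type), pct in sorted(big_raises.items(), key=lambda kv: -kv[1]):
    print(f"{name} ({org_type}): +{pct:.0f}%")
```

Exact matching on name and organisation type is deliberately strict: anyone whose name is spelled differently between years, or who changed sector, drops out, which is one reason only 737 of the 1,000 could be matched.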

Of those 92 with big raises:
  • 83 work in hospitals
    • 50 are Pathologists

More questions

At this point, this analysis raises more questions than it answers, but that is to be expected from an analysis of this salary disclosure data. The Public Salary Disclosure Act can help us find questions, not answers. What we do know is that:
  • Salaries near the top grew substantially
  • Those salaries grew much more, even on a % basis, than those at the bottom
  • Growth was higher than expected given slow economic growth
  • Some individuals can be shown to have experienced extraordinary raises
  • Pathologists do well, and 2012 was a particularly good year for some

Source: http://www.fin.gov.on.ca/en/publications/salarydisclosure/pssd/

Timberland customer care & operations - I approve!

Buying a brand is buying quality - that's especially true for outdoor equipment.

With this belief, I purchased a pair of Timberland hiking boots that said "Waterproof" on a piece of official-looking metal attached to them. I then ended up with wet feet during an 8-day trek in Patagonia where it often rains - that sucked.

With my toes literally swimming in water within the boots, after a sopping wet day of a 19km hike, I was not a happy camper. However, my perception of Timberland took a 180-degree turn for the better.

Having bought the boots in southern Chile in a Bata store, having used them extensively and been disappointed and upset by them, I ran into a Timberland brand store 2,500km away from where I bought them, still in Chile. I went and complained about my disappointment in these supposedly "waterproof" boots, and I was offered the chance to exchange them for a brand new pair that is indeed waterproof, paying only the small price difference between the two pairs.

This is operationally remarkable:

Different stores (Bata vs Timberland)
I bought them in Bata, which is a popular international brand that happens to carry the Timberland boots. However, I was able to exchange them in a Timberland own-brand store. Given that the receipt I got from the Timberland store says "Bata" on it, I suspect the two are operated by the same company. However, for a western audience: can you imagine buying something at Gap and then returning it at Banana Republic (same parent company)?

Different cities and provinces
I don't know what it's like in the US, but in Canada, returns and exchanges wouldn't be possible across provincial borders. Yet, in this case, it was not a problem.


After the 14-day exchange period without the paper receipt
It was at least 3 weeks after the original purchase date, while the receipt stated a 14-day exchange period. I also didn't keep the paper receipt (trying to travel light), but I had a photo of it on my phone. This I was able to email to them to enable the processing. Again, can you imagine this happening in a western country?


"Waterproof" ≠ "Gore-Tex"
Finally, for everyone's learning: apparently, if it only says "waterproof", it's not waterproof. Only if it says "Gore-Tex" is it actually waterproof.


I went into the Timberland store only to vent my frustration. I was positively flabbergasted when they offered to exchange the boots for a new pair. Not only is the customer care commendable, but I would never have expected this to be operationally possible. They basically went against all the rules I know of that would make this infeasible in western countries. Yet the teens who worked at the Timberland store were willing to find ways to help me, a foreigner with broken Spanish, so I would have this outstanding experience and be happy with a decently expensive pair of hiking boots. How they keep the books straight on this transaction is beyond me, 'cause surely they are running Bata and Timberland as two separate business entities.

The result: Timberland now has a new loyal customer. This is an outstanding example of great customer care made possible by some well-integrated and smooth operations.

Sunday, December 30, 2012

Coursera and the analytics talent gap

It's been a while, and ThinkOR is back to blogging about Operational Research and its related themes.

ThinkOR authors are about to start on 3 Coursera courses over the next couple of months:

I am not only learning about some new topics for my own benefit, but am also interested in assessing how such easily accessible courses could help close the so-called 'big data and analytics talent gap' in businesses. As a Business Analytics consultant, this is one of the biggest issues I see my clients facing in today's business world - one wouldn't think about it if they don't know about it, and once they know about it, they don't know how to get more of it. Obviously, there would need to be some sort of a step progression, such as (just an example without much research at this point):
  1. Statistics One
  2. Data Analysis (with R) and/or Computing for Data Analysis
  3. some sort of programming course, check the computing course catalogue
  4. Focus on one or several of the main OR techniques and their associated tools, such as Discrete Event Simulation, Monte Carlo Simulation, Optimisation, Forecasting, Machine Learning, and the good old Volumetric Modelling, as some examples
  5. and if you are going to work with humongous data sets, Intro to Data Science sounds like a reasonable way to become familiar with the various big data technologies used to apply data science (I suspect this often eludes traditional OR practitioners)
As ThinkOR goes along, we will be blogging about these courses and our learning experience. So far, there has only been very positive feedback. Let's get going!

Merry Christmas and Happy New Year!

Thursday, May 31, 2012

Consistent Education Divide in Cities

The Daily Viz brought this to my attention. It's a visual by the New York Times showing how the distribution of cities by proportion of adults with college degrees has changed over the last 40 years.

Nicely formatted and presented, though my ability to compare the distributions side-by-side is a little bit limited.

The key story that this visual is telling is that the average has moved from 12% to 32%, but that the number of cities more than 5% above or below the average has increased substantially. "College graduates are more unevenly distributed in the top 100 metropolitan areas now than they were four decades ago." But I'm not sure if it's as simple as that.

Suppose I was measuring trees. One species was 10 feet tall on average and species two was 100 feet tall. If the first tended to vary between 7 feet and 13 feet, but the latter tended to vary from 85 feet to 115 feet, I wouldn't remark at how much more variable these trees were. For species one, no tree was more than 3 feet from the average, but in species two, presumably many are. Is this a sign that species two is more unevenly distributed? Not really. Species one varies up and down by 30% where two does so by 15%.

So I asked myself, given that the average proportion of adults with college degrees has nearly tripled to 32%, has their variability increased proportionally? Now that these trees are 32 feet tall, it seems strange to still measure their "unevenness" by how many of them are between 27 and 37.

So I reached for a statistic, the Coefficient of Variation. Using my eyes to collect the data from the charts (so not precisely the correct data), I calculate a coefficient of 0.25 in 1970 and 0.22 in 2010. The variation in the data as a proportion of the average has gone down in the last four decades.
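
For readers who haven't met it, the coefficient of variation is just the standard deviation divided by the mean. Here is a minimal Python sketch on the tree example; the heights are numbers I made up to fit the ranges described earlier, not the NYT data, and I use the population standard deviation for simplicity.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

# Hypothetical tree heights in feet, made up to match the ranges in the tree example.
species_one = [7, 8.5, 10, 11.5, 13]
species_two = [85, 92.5, 100, 107.5, 115]

cv_one = coefficient_of_variation(species_one)
cv_two = coefficient_of_variation(species_two)
print(f"species one CV: {cv_one:.3f}")
print(f"species two CV: {cv_two:.3f}")
```

Species two has the larger spread in feet, but relative to its height it is the less variable of the two, which is the same effect I see in the college degree numbers.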

Again, the NYT concludes that "College graduates are more unevenly distributed in the top 100 metropolitan areas now than they were four decades ago", but I would argue that, if anything, they are slightly more evenly spread than before, though not remarkably so.

Saturday, April 28, 2012

Data Journalism



I've recently started following the Guardian's Data Blog, but I was a little disappointed with their recent article on grammar schools in the UK.

My understanding is that grammar schools are a subset of schools in the UK that supposedly offer entry on a meritocratic basis and deliver higher quality education. Depending on your political leanings, you either believe that grammar schools reinforce the class division in the UK by giving entry disproportionately to the already higher class and then giving them a better education, or you believe that grammar schools enable class mobility by delivering a better education to bright lower class students who could not otherwise afford such a thing. As an outsider in the UK I’m not qualified to hold an opinion here, but I suspect that naturally each extreme fails to appreciate some nuanced details.

The article appears to have pulled off a classic journalist's ploy:
  1. Present a statistical analysis of the data in a leading way without drawing conclusions
  2. Quote somebody else's opinion on the topic

Essentially you can deliver opinion supported by the apparent full weight of objective statistical analysis without having to put your name to the conclusions which might not hold up to rigorous challenge.

Notice also that one of the opinions is much stronger than the other. Notice also that Rosemary Joyce's note has very complex implications which are not at all explored for the reader. Even I’m not sure if she has a point.

I could offer a very different view of the same data:
  • 14 of 32 schools favoured those not privately educated, giving fewer than 6% of offers to the privately educated
  • Taking the 24 of the 32 schools (3/4) with the lowest privately educated proportions, the average was 6%, the same as the overall population
  • Removing the two clear outliers in the data, "Tonbridge Grammar School" and "The Judd School", the privately educated averaged 8.9% overall


I feel the key fact I’m missing is: What % of students in Kent who scored well on the 11-plus exam were privately educated? How does this compare to the 10.89%? How does this compare to the 8.9% removing outliers? Is there a social bias in the offers?

I’m also missing any information about how these numbers have been changing with time. Simon Murphy complains that the government is not taking steps to improve the chances of poor children, and yet for all I know that 10.89% was maybe 12% last year.

What about this “local context” anyhow? How do these percentages compare at a finer level of granularity than county-wide? How do these percentages compare to applications?

Is this a story of a county-wide bias, or just the story of two bad apples and a handful of not-so-good ones? I think I know what The Guardian wants me to think. Data Journalism is still Journalism I suppose.

For my readers, I ask, why do you suppose the 10.89% number is the only one in the text of the article to two decimal places?

Saturday, April 14, 2012

Thursday, March 29, 2012

My Jealous Supermarket II

Last week I wrote the Figure It Out article that was published to the Capgemini Consulting UK Operational Research team blog, My Jealous Supermarket.

I encourage you to click through and read the article. To summarise, my supermarket is targeting me with discount coupons in order to maintain my loyalty, which it mistakenly thinks it is losing because I am a travelling consultant and have shopped very little lately.

Anyhow, after returning from a week in the north of England followed by a weekend in Florence followed by another week in the north, it was immensely satisfying to see the "Spend £30, get £3 off" coupon roll off the receipt printer when I purchased ingredients for my meal this evening after making no purchases for two weeks. They DO care!

I wonder how this initiative is going for them? Are they successfully winning people back? Any proper initiative would have a benefits tracking element following implementation, but comparing before and after and asserting causality is always difficult. Consider myself. One day I will wrap up my project in the north and spend some time in London again. I will return to my supermarket and purchase lots of food. Success! After months of giving me coupons, they will have finally won back my favour and loyalty. Or not...

Monday, February 13, 2012

Numbers in 2011 - from More or Less podcast

One of my favourite podcasts is BBC's More or Less. At the start of 2012, they did a series on Numbers in 2011. I know it's a little late in sharing this, but here we go - enjoy.

I'm sharing with you a selection of the numbers from the 30min podcast. They are somewhat UK-centric, but still worth sharing.

Listen to the whole podcast here.

  1. 80%: developed world's debt to GDP ratio
  2. 1.37: cost of petrol in GBP on 9 May 2011 (highest in 2011), due to duty, value added tax (20%) & exchange rate (weaker GBP against USD)
  3. 1%: BBA LIBOR (interest to be paid in 3 months' time) crossed 1% on 10 Nov 2011, double the bank interest rate. BBA LIBOR indicates the risk of money not being paid back in 3 months - a show of lower confidence/trust between banks.
  4. 2.64m: unemployment in UK by December 2011 (highest in 17 years). Note UK population is just over 62m.
  5. 900k: people today working beyond 65 years old in the UK
  6. 12,500: people celebrated their 100th birthday in 2011 in the UK; this will rise to 100,000 over the next 25 years
  7. 7bn: world population
  8. 2.5: average fertility of women on earth (babies per lifetime, down from 6 sixty years ago), easing the pressure on the environment I suppose
  9. 3,000 GBP: cost of sequencing the human genome; in 2003, the first sequencing of the human genome cost 600m GBP - that's a 200,000-fold reduction in cost in 8 years
  10. 2 weeks: to sequence 5 human genomes in 2010; in 2003, it took 10 years for one
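
For anyone who wants to check the genome arithmetic in items 9 and 10, here is a quick Python sketch using the figures as quoted. The only assumption of mine is 52 weeks per year, used to put both durations in the same unit.

```python
# Figures as quoted in the list above.
cost_2003_gbp = 600_000_000
cost_now_gbp = 3_000
cost_reduction = cost_2003_gbp / cost_now_gbp

# Assumption: 52 weeks per year, to put both durations in the same unit.
weeks_per_genome_2003 = 10 * 52
genomes_2010, weeks_2010 = 5, 2
speedup = weeks_per_genome_2003 * genomes_2010 / weeks_2010

print(f"cost reduction: {cost_reduction:,.0f}-fold")
print(f"throughput speedup: about {speedup:,.0f}-fold")
```

The cost figure matches the 200,000-fold reduction quoted in the podcast; the speedup is my own rough derivation from item 10.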