Showing posts with label Data Text Mining.

Sunday, November 23, 2014

Spam gets a personal touch: Human 1, Machine 1

Blogging and spamming practically go hand in hand. The obvious spam has been pretty well controlled by the major blogging platforms' filters, thanks to advances in text analysis and machine learning algorithms. The filtering is not perfect, though. Or is it? You be the judge in this case.

This could be an example of how creative spammers are at combating algorithms.

Or, it could be an example of a business owner trying to do his own selective SEO (search engine optimization).

An old post on mandatory school uniforms got the following spam:
I think school uniforms must be compulsory in schools because after one time-investment in the uniform, it prevents the child from the traits of social inequality,inferiority complex etc.And If you have decided to buy the uniform, buy it from Wang Uniforms (link removed)

I speculate that a human wrote the comment, both because it is a sensible comment and because of its grammar, punctuation, and spacing errors.

However, the link, which I removed for this post, does point to a legitimate school uniform maker in the UAE. I suppose there are two possibilities:
1) The uniform business had legitimately read the article, had something genuine to say, and also wanted to promote its own business.
2) The uniform business hired a spammer / mass commenter to do the job for SEO purposes.

I had a hard time deciding whether this is spam or not. Since I cannot edit the comment to remove the link, I rejected it, especially after I found out that the commenter's profile was some jewelry shop in South East Asia with nothing to do with uniforms.

Algorithms are never perfect; the underlying uncertainty is why we build them at all. Given that I, the human, had trouble judging the authenticity of this comment, I'm glad the machine (the spam filter) didn't simply rule it out.
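A probabilistic filter can express exactly this kind of uncertainty instead of making a hard yes/no call. Here is a minimal sketch (not how any particular platform's filter actually works) of a multinomial naive Bayes classifier trained on a tiny, made-up corpus; a borderline comment like the one above gets a spam probability rather than a verdict, so it can be held for human review:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label) pairs; returns per-label word counts and doc totals."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Multinomial naive Bayes with add-one smoothing; returns P(spam | text)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        log_p = math.log(totals[label] / sum(totals.values()))  # class prior
        n = sum(counts[label].values())
        for w in text.lower().split():
            log_p += math.log((counts[label][w] + 1) / (n + len(vocab)))
        scores[label] = log_p
    # convert the two log scores into a probability (softmax over two classes)
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

# Hypothetical training data, for illustration only
corpus = [
    ("buy cheap uniforms now", "spam"),
    ("click here to buy now", "spam"),
    ("school uniforms reduce social inequality", "ham"),
    ("uniforms prevent visible inequality among children", "ham"),
]
counts, totals = train(corpus)
# A comment mixing genuine discussion with a sales pitch scores in between
p = spam_probability("uniforms prevent inequality so buy from our shop", counts, totals)
```

A real filter would use far more signal than bag-of-words counts (links, commenter profile, posting patterns), but the idea is the same: scores near the middle can be routed to a human instead of being ruled out.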

So... Human vs Machine: Human 1, Machine 1?


P.S. Unrelated, but this is quite funny. Don't be fooled by the title.
Visualizing Big Data

Sunday, December 30, 2012

Coursera and the analytics talent gap

It's been a while, and ThinkOR is back to blogging about Operational Research and its related themes.

ThinkOR authors are about to start on three Coursera courses over the next couple of months.

I am not only learning some new topics for my own benefit, but am also interested in assessing how such easily accessible courses could help close the so-called 'big data and analytics talent gap' in businesses. As a Business Analytics consultant, this is one of the biggest issues I see my clients facing in today's business world: they don't think about analytics talent until they know about it, and once they know about it, they don't know how to get more of it. Obviously, there would need to be some sort of step progression, such as (just an example, without much research at this point):
  1. Statistics One
  2. Data Analysis (with R) and/or Computing for Data Analysis
  3. some sort of programming course, check the computing course catalogue
  4. Focus on one or several of the main OR techniques and their associated tools, such as Discrete Event Simulation, Monte Carlo Simulation, Optimisation, Forecasting, Machine Learning, and good old Volumetric Modelling
  5. and if you are going to work with humongous data sets, Intro to Data Science sounds reasonable for becoming familiar with the various big data technologies used to apply data science (I suspect this often eludes traditional OR practitioners)
As ThinkOR goes along, we will be blogging about these courses and our learning experience. So far, there has only been very positive feedback. Let's get going!

Merry Christmas and Happy New Year!

Tuesday, October 14, 2008

Social Media and Operations Research

Social media and Web 2.0 have been buzzwords in the internet marketing world for a few years now. Of course, we can count on the Numerati (a new term for Operations Researchers, in reference to the title of Stephen Baker's new book) to scratch their heads and eventually come up with systematic ways of mining vast amounts of data (i.e. analytics), and then to apply knowledge harvested from other disciplines, such as psychology, to study people's behaviours (a departure from the traditional OR application field of mechanical processes). Claudia Perlich from IBM and Sanmay Das from Rensselaer Polytechnic Institute each explain ways they have used OR to dissect the world of blogs and Wikipedia: to provide insight to marketers, and to demonstrate the convergence of Wikipedia's highly edited pages to a stable and credible source.

Ever since blogs have existed, marketers have been nervous about their products' reputations. Luckily for the IBMers, when the marketers at IBM grew nervous about the client response to a product (namely Lotus), help was within reach from the IBM Research team. Marketers want to know: 1. What are the relevant blogs? 2. Who are the influencers? 3. What are they saying about us? 4. What are the emerging trends in the broad-topic space? Perlich's team tackled these four questions by starting with a small set of blogs identified by the marketers (the core "snowball" of blogs), then "rolling" the snowball twice to discover other blogs related to the core set (i.e. at most two degrees of separation from the core). To find the authoritative sources in the snowball, Google's PageRank algorithm came to the rescue. Using sentiment labeling, the team was able to say whether the overall message of a blog was positive or negative. Then, to let the users (i.e. the marketers) interact with and leverage the data, a visual representation was rendered to show the general trend in the blogosphere about the product in question (see image). At that stage, marketers can dig into each blog identified as positive or negative, and the problem seems much more manageable.
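The snowball-then-rank part of the pipeline can be sketched in a few lines. The snippet below uses a hypothetical link graph and a plain power-iteration PageRank; Perlich's actual BANTER system is of course far more elaborate:

```python
def snowball(links, seeds, hops=2):
    """Expand a seed set of blogs by following outgoing links for at most `hops` rounds."""
    found = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        nxt = set()
        for blog in frontier:
            nxt.update(links.get(blog, ()))
        frontier = nxt - found
        found |= frontier
    return found

def pagerank(links, nodes, damping=0.85, iters=50):
    """Plain power-iteration PageRank restricted to the discovered `nodes`."""
    nodes = list(nodes)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = [t for t in links.get(n, ()) if t in new]
            if not out:  # dangling node: spread its rank evenly
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
            else:
                for t in out:
                    new[t] += damping * rank[n] / len(out)
        rank = new
    return rank

# Hypothetical link graph: blog -> blogs it links to (names are made up)
links = {
    "core1": ["hub", "niche"],
    "core2": ["hub"],
    "hub": ["niche", "far"],
    "niche": ["hub"],
    "far": [],
}
blogs = snowball(links, ["core1", "core2"], hops=2)  # the rolled snowball
ranks = pagerank(links, blogs)                       # authority within it
```

Blogs with the highest rank (here, the heavily linked-to "hub") are the influencers worth reading first; sentiment labeling would then be run over their posts.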



Das’ talk fits rather nicely with Perlich’s, as he examines trends in blog comment activity and Wikipedia’s edit histories to try to demonstrate the eventual convergence of these information sources to a stable state. The underlying assumptions are that the more edits a Wikipedia page has, the more reliable its information and hence the higher its quality; and the higher a page’s quality, the less likely there will be further talk/edits on that page, because most of what needed to be said has already been said (the informational limit). Das obtained the data dump of all pages on Wikipedia in May 2008, and extracted the 15,000 pages (out of 13.5 million in total) that had over 1,000 edits. Using dynamic programming to model a stochastic process, Das found a model for the edit rate of an aggregation of these highly edited Wikipedia pages. He then applied the same model to blog comment activity. In both cases, the model fit the data extremely well, and, surprisingly, the shapes of the activity patterns over time looked very much alike for blog comments and Wikipedia page edits. An interesting inference Das made is that people contribute less of their knowledge to Wikipedia pages than to blogs.
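Das' actual stochastic model is not reproduced here, so the following is only a toy illustration of the informational-limit idea: each arriving editor knows a random handful of "facts" and edits only if at least one of them is not yet on the page, so the edit rate decays toward zero as the page converges to a stable state:

```python
import random

def simulate_edits(n_facts=100, n_editors=2000, window=100, seed=0):
    """Toy informational-limit process (an illustration, not Das' model):
    editors arrive one at a time, each knowing 5 random facts out of n_facts,
    and edit only if some known fact is missing from the page. Returns the
    number of edits in each window of `window` consecutive editors."""
    rng = random.Random(seed)
    on_page = set()
    edits_per_window = []
    edits = 0
    for i in range(1, n_editors + 1):
        known = set(rng.sample(range(n_facts), 5))
        new = known - on_page
        if new:            # something left to say: contribute it
            on_page |= new
            edits += 1
        if i % window == 0:
            edits_per_window.append(edits)
            edits = 0
    return edits_per_window

rates = simulate_edits()  # early windows busy, later windows near zero
```

Fitting a parametric curve to such decaying rates, for Wikipedia edits and blog comments alike, is the flavour of what Das did, though his model and fitting method were far more rigorous.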

This is the beauty of Operations Research. It is a horizontal plane that can cut through and be applied to many sciences and industries. Aren’t you proud to be a dynamic Numerati?

Credits: The talk was given at the INFORMS 2008 conference in Washington DC. The track session was MA12 & MB12. Speakers are Claudia Perlich, T.J. Watson IBM Research, and Sanmay Das, Assistant Professor, Rensselaer Polytechnic Institute, Department of Computer Science. The talks were titled "BANTER: Anything Interesting Going On Out There in the Blogosphere?", and "Why Wikipedia Works: Modeling the Stability of Collective Information Update Processes".

Sunday, October 12, 2008

Business Intelligence: Data Text Mining & Its Challenges

In the world of business intelligence (BI), data and text mining is a rising star, but it faces a lot of challenges. Seth Grimes points out the importance of having structured data in relational databases, and the need for statistical, linguistic and structural techniques to analyze the various dimensions of raw text. He also shares with the audience some useful open source tools in the field of text mining. John Elder, on the other hand, shares the top 5 lessons he has learned from mistakes in data mining, and reveals one of data miners' biggest secret weapons.

Grimes took the audience on a journey through traditional BI work, where data miners take raw CSV (comma-separated values) files and turn them into relational databases, which then get displayed as fancy monitoring dashboards in analytics tools – all very pretty and organized. However, most of the data that BI deals with is “unstructured”, with information hiding in pictures and graphs, or in documents stuffed with text. According to Grimes, 80% of enterprise information is in “unstructured” form. To process raw text, Grimes says the analysis needs three tiers: statistical/lexical, linguistic and structural. Statistical techniques help cluster and classify the text for ease of search (try ranks.nl). Syntactic analysis from linguistics works out sentence structure to provide relational information between clusters of words (try Machinese from connexor.eu to form a sentence tree). Finally, content analysis helps extract the right kind of data by tagging words and phrases for building relational databases and predictive models (try GATE from gate.ac.uk).
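As an illustration of the first (statistical) tier only, here is a small TF-IDF and cosine-similarity sketch in plain Python; the document texts are made up, and the linguistic and structural tiers need dedicated tools like the ones Grimes names:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw texts into sparse TF-IDF vectors, so documents can be
    clustered or matched by similarity. This is the statistical/lexical
    tier only; parsing and tagging are separate steps."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Made-up documents: two about earnings, one about sport
docs = [
    "quarterly revenue grew and profit margins improved",
    "revenue and profit beat quarterly forecasts",
    "the new stadium hosted a football match",
]
vecs = tfidf_vectors(docs)
```

Documents about the same topic end up close together in this vector space, which is the basis for the clustering and search-oriented classification Grimes describes.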

Elder’s list of top 5 data mining mistakes includes:
1. Focus on training (judging the model on the data it was trained on)
2. Rely on one technique
3. Listen (only) to data (not applying common sense to processing data)
4. Accept leaks from future
5. Believe your model is the best model (don’t be an artist and fall in love with your art)

In particular, Elder shares with the audience the biggest secret weapon of data mining: combining different techniques that each do well in one or two categories gives much better results (see Figure 1, "5 algorithms on 6 datasets", and Figure 2, "Essentially every bundling method improves performance"). Figure 3, "Median (and Mean) Error Reduced with each Stage of Combination", illustrates the combinatorial power of the methods with another example from his talk.
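The bundling effect is easy to demonstrate. In the simulated sketch below (the numbers are illustrative, not from Elder's talk), five classifiers are each right 70% of the time on their own, yet a simple majority vote over them is wrong far less often, because the models' mistakes rarely all line up on the same example:

```python
import random

def majority_vote(predictions):
    """Combine several models' binary predictions, example by example."""
    combined = []
    for preds in zip(*predictions):
        combined.append(1 if sum(preds) > len(preds) / 2 else 0)
    return combined

def error_rate(pred, truth):
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

# Hypothetical setup: 5 models with independent errors, each 70% accurate
rng = random.Random(42)
truth = [rng.randint(0, 1) for _ in range(2000)]
models = [
    [t if rng.random() < 0.7 else 1 - t for t in truth]
    for _ in range(5)
]
ensemble = majority_vote(models)
single_errors = [error_rate(m, truth) for m in models]   # each around 0.30
combined_error = error_rate(ensemble, truth)             # noticeably lower
```

The gain depends on the models' errors being at least partly independent, which is exactly why Elder recommends bundling *different* techniques rather than five copies of the same one.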





Data text mining is still in its early stages, and the “miners” have a lot of challenges to overcome. However, given the richness of information floating around on the internet and hiding in thick bound books in the library, data text mining could revolutionize the business intelligence field.


Credits: The talk was given at the INFORMS 2008 conference in Washington DC. The track session was SB13. Speakers are: Seth Grimes, Intelligent Enterprise, B-eye network; and John Elder, Chief Scientist, Elder Research, Inc. The talk was titled "Panel Discussion: Challenges Facing Data & Text Miners in 2008 and Beyond".