Business Intelligence: Data Text Mining & Its Challenges

Sunday, October 12, 2008

In the world of business intelligence (BI), data and text mining is a rising star, but it faces many challenges. Seth Grimes points out the importance of having structured data in relational databases, and the need for statistical, linguistic and structural techniques to analyze the various dimensions of raw text. He also shares with the audience some useful open-source tools for text mining. John Elder, on the other hand, shares the top 5 lessons he has learned from mistakes in data mining, and reveals one of the biggest secret weapons of data miners.

Grimes took the audience on a journey through traditional BI work, where data miners take raw CSV (comma-separated values) files and turn them into relational databases, which then get displayed as fancy monitoring dashboards in analytics tools, all very pretty and organized. However, most of the data that BI deals with is “unstructured”, with information hiding in pictures and graphs, or in documents stuffed with text. According to Grimes, 80% of enterprise information is in “unstructured” form. To process raw text, Grimes says the analysis needs three tiers: statistical/lexical, linguistic and structural. Statistical techniques help cluster and classify the text for ease of search (try ranks.nl). Syntactic analysis from linguistics parses sentence structure to provide relational information between clusters of words (try Machinese from connexor.eu to form a sentence tree). Finally, content analysis extracts the right kind of data by tagging words and phrases for building relational databases and predictive models (try GATE from gate.ac.uk).
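To make the first (statistical/lexical) tier a little more concrete, here is a minimal sketch of counting term frequencies in a piece of text, the kind of raw signal that keyword-density tools build on. It is not the method used by any of the tools named above; the stop-word list, the function names and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def term_frequencies(text):
    """Count lowercase words in the text, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def top_terms(text, n=3):
    """Return the n most frequent terms as (term, count) pairs."""
    return term_frequencies(text).most_common(n)

# Made-up sample text, not taken from the talk.
sample = (
    "Text mining turns raw text into data. "
    "Mining text for data is the first tier of text analysis."
)

for term, count in top_terms(sample):
    print(term, count)
```

Counts like these are only a starting point: clustering and classification would sit on top of them, and the linguistic and structural tiers need a parser and a tagger that a few lines of code cannot stand in for.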

Elder’s list of top 5 data mining mistakes includes:
1. Focus on training (judging a model by how well it fits the data it was trained on)
2. Rely on one technique
3. Listen (only) to the data (not applying common sense when processing it)
4. Accept leaks from the future (letting information that would not be available at prediction time slip into the model)
5. Believe your model is the best model (don’t be an artist and fall in love with your art)
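Mistakes 1 and 4 in the list above can be illustrated with a toy example. The sketch below is not from Elder's talk: the `Memorizer` model, the data and the split point are all hypothetical. It shows a model that looks perfect on its training data yet does no better than a coin flip on later, held-out records, and it splits the records chronologically so that nothing from the "future" is used for training.

```python
from collections import Counter

class Memorizer:
    """Toy model: recalls labels seen in training, otherwise guesses the majority label."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        self.default = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)

def accuracy(model, xs, ys):
    """Fraction of records whose predicted label matches the true label."""
    hits = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical time-ordered records: x is a time index, y is a label.
xs = list(range(20))
ys = [x % 2 for x in xs]

# Chronological split: train on the earlier records, evaluate on the later ones,
# so no information from the evaluation period leaks into training.
cut = 14
train_x, train_y = xs[:cut], ys[:cut]
test_x, test_y = xs[cut:], ys[cut:]

model = Memorizer().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
test_acc = accuracy(model, test_x, test_y)

print("accuracy on training data:", train_acc)
print("accuracy on held-out data:", test_acc)
```

Only the held-out number says anything about how the model would behave on new data, which is why judging a model on its training fit is listed as a mistake.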

In particular, Elder shares with the audience the biggest secret weapon of data mining: combining different techniques, each of which does well in only one or two categories, gives much better results than any single technique. See Figure 1 (5 algorithms on 6 datasets) and Figure 2 (essentially every bundling method improves performance). Figure 3 (median and mean error reduced with each stage of combination), from another example in his talk, also illustrates the power of combining methods.
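A minimal sketch of why combining can help, assuming the simplest possible bundling method (a plain average of predictions): when different models make different errors, the errors partly cancel. The three "models" and all of the numbers below are invented for illustration and are not Elder's data or results.

```python
from statistics import mean

def mean_abs_error(preds, truth):
    """Average absolute difference between predictions and true values."""
    return mean(abs(p - t) for p, t in zip(preds, truth))

def average_ensemble(prediction_lists):
    """Combine models by averaging their predictions item by item."""
    return [mean(column) for column in zip(*prediction_lists)]

# Made-up true values and predictions from three hypothetical models.
truth = [10, 20, 30, 40]
predictions = {
    "model_a": [13, 20, 27, 41],  # accurate on the second item
    "model_b": [10, 23, 31, 37],  # accurate on the first item
    "model_c": [8, 18, 33, 43],   # mediocre everywhere
}

individual_errors = {
    name: mean_abs_error(preds, truth) for name, preds in predictions.items()
}
combined = average_ensemble(list(predictions.values()))
ensemble_error = mean_abs_error(combined, truth)

for name, err in individual_errors.items():
    print(name, "mean absolute error:", err)
print("ensemble mean absolute error:", ensemble_error)
```

In this contrived case the averaged prediction beats every individual model. Real data is rarely this tidy, and the gain depends on how different the models' errors are from one another.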

Data and text mining is still in its early stages, and the “miners” have a lot of challenges to overcome. However, given the richness of information floating around on the internet and hiding in thickly bound books in the library, data and text mining could revolutionize the business intelligence field.

Credits: The talk was given at the INFORMS 2008 conference in Washington DC. The track session was SB13. Speakers are: Seth Grimes, Intelligent Enterprise, B-eye network; and John Elder, Chief Scientist, Elder Research, Inc. The talk was titled "Panel Discussion: Challenges Facing Data & Text Miners in 2008 and Beyond".

2 comments:

Anonymous said...: Thanks for this post - agree this will be a very hot field in the years ahead.

Any recommendations for further reading regarding data mining in the enterprise? This is something we are trying to get our hands around, as our software produces a lot of data in CSV format - now to make sense of it!; October 15, 2008 at 2:45 AM
Dawen said...: Hello Brett,

This is definitely a newer field that's been getting more and more research attention in recent years. To find out more about data and text mining, you could start with the program for the INFORMS 2008 conference and look at the abstracts of the talks related to data mining or text mining. You can search the program here:

https://informs.emeetingsonline.com/emeetings/WebSitePapersv2.asp?mmnno=176&pagename=SITE35953

Hope that helps (besides the obvious Google and contacting the researchers/authors of the papers you are interested in).; October 21, 2008 at 5:05 AM