In the world of business intelligence (BI), data and text mining is a rising star, but it faces many challenges. Seth Grimes points out the importance of having structured data in relational databases, and the need for statistical, linguistic and structural techniques to analyze various dimensions of raw text. He also shares with the audience some useful open source tools in the field of text mining. John Elder, on the other hand, shares the top 5 lessons he has learned through mistakes in data mining, and reveals one of the biggest secret weapons of data miners.
Grimes took the audience on a journey through traditional BI work, where data miners take raw CSV (comma-separated values) files and turn them into relational databases, which are then displayed as fancy monitoring dashboards in analytics tools, all very pretty and organized. However, most of the data that BI deals with is “unstructured”: the information is hiding in pictures and graphs, or in documents stuffed with text. According to Grimes, 80% of enterprise information is in “unstructured” form. To process raw text, Grimes says the analysis needs to be three-tiered: statistical/lexical, linguistic and structural. Statistical techniques help cluster and classify the text for ease of search (try ranks.nl). Syntactic analysis from linguistics works out sentence structure to provide relational information between clusters of words (try Machinese from connexor.eu to form a sentence tree). Finally, content analysis helps extract the right kind of data by tagging words and phrases for building relational databases and predictive models (try GATE from gate.ac.uk).
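To make the statistical/lexical tier a little more concrete, here is a minimal sketch of a keyword-density count, the kind of word-frequency statistic that this tier starts from. It is not taken from Grimes's talk or from any of the tools he mentioned: the function name `keyword_density`, the tiny stopword list and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools use far longer lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def keyword_density(text, top_n=3):
    """Return the top_n (word, share) pairs, where share is the word's
    fraction of all non-stopword tokens in the text."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count / total) for word, count in counts.most_common(top_n)]

# Made-up sample text, not from the talk.
sample = "Text mining turns raw text into data. Mining the text is the first step."
for word, share in keyword_density(sample):
    print(f"{word}: {share:.0%}")
```

Counts like these are only the first tier. They say which words dominate a document, not how the words relate to each other, which is where the linguistic and structural tiers come in.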
Elder’s list of top 5 data mining mistakes includes:
1. Focus on training (judging a model only by how well it fits the training data)
2. Rely on one technique
3. Listen (only) to the data (not applying common sense when processing it)
4. Accept leaks from the future
5. Believe your model is the best model (don’t be an artist and fall in love with your art)
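The first mistake on the list is easy to demonstrate. The sketch below is not from Elder's talk; it assumes a deliberately silly "model" that just memorizes its training rows, and the customer data is made up. The helper names `fit_memorizer` and `accuracy` are hypothetical.

```python
from collections import Counter

def fit_memorizer(rows):
    """'Train' by memorizing every features -> label pair,
    keeping the majority label as a fallback for unseen features."""
    table = {features: label for features, label in rows}
    fallback = Counter(label for _, label in rows).most_common(1)[0][0]
    return table, fallback

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    table, fallback = model
    hits = sum(table.get(features, fallback) == label for features, label in rows)
    return hits / len(rows)

# Made-up data: (customer_id, region) -> did the customer churn?
train = [((1, "east"), True), ((2, "west"), False),
         ((3, "east"), False), ((4, "west"), False)]
holdout = [((5, "east"), True), ((6, "west"), False),
           ((7, "east"), True), ((8, "west"), False)]

model = fit_memorizer(train)
print("training accuracy:", accuracy(model, train))    # 1.0
print("holdout accuracy:", accuracy(model, holdout))   # 0.5
```

The memorizer looks perfect on the data it was trained on and no better than a coin flip on rows it has never seen, which is why a score measured only on training data says little about how a model will do in practice.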
In particular, Elder shares with the audience the biggest secret weapon of data mining: combining different techniques, each of which does well in only one or two categories, gives much better results than any single technique. See Figure 1 (5 algorithms on 6 datasets) and Figure 2 (essentially every bundling method improves performance). Figure 3 (median and mean error reduced with each stage of combination) illustrates the same combinatorial power of methods on another example from his talk.
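Here is a toy sketch of that idea, assuming the simplest possible bundling method: averaging the predictions of several models. The numbers are made up and are not Elder's data, and plain averaging is only one of many ways to combine models. The names `model_a`, `model_b`, `model_c` and `mean_abs_error` are hypothetical.

```python
from statistics import mean

def mean_abs_error(predictions, truth):
    """Average absolute difference between predictions and true values."""
    return mean(abs(p - t) for p, t in zip(predictions, truth))

# Made-up true values and three models' predictions (illustrative only).
truth = [10, 20, 30, 40]
predictions = {
    "model_a": [12, 19, 33, 38],
    "model_b": [9, 23, 28, 43],
    "model_c": [11, 18, 29, 41],
}

# Bundle by averaging the three predictions for each case.
bundle = [mean(values) for values in zip(*predictions.values())]

errors = {name: mean_abs_error(p, truth) for name, p in predictions.items()}
bundle_error = mean_abs_error(bundle, truth)

for name, err in errors.items():
    print(f"{name}: {err:.2f}")
print(f"bundle: {bundle_error:.2f}")
```

In this example each model misses in a different direction on different cases, so the misses partly cancel when averaged and the bundle beats every individual model. That cancellation is not guaranteed in general; it depends on the models making different mistakes.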
Data and text mining is still in its early stages, and the “miners” have a lot of challenges to overcome. However, given the richness of information floating around on the internet and hiding in thickly bound books in the library, data and text mining could revolutionize the business intelligence field.
Credits: The talk was given at the INFORMS 2008 conference in Washington DC. The track session was SB13. Speakers are: Seth Grimes, Intelligent Enterprise, B-eye network; and John Elder, Chief Scientist, Elder Research, Inc. The talk was titled "Panel Discussion: Challenges Facing Data & Text Miners in 2008 and Beyond".