Cafe Cerebral - Text Mining

Text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), is the process of extracting interesting, non-trivial information and knowledge from unstructured text.

Text mining has been defined as “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”

Text mining is similar to data mining. The difference is that data mining tools are designed to handle structured data from databases or XML files, whereas text mining can work with unstructured or semi-structured data such as emails, full-text documents and HTML files. As a result, text mining is often the better fit for companies that must merge and manage large volumes of diverse information.

Purpose of text mining

  • To discover and use the knowledge contained in a document collection as a whole, extracting essential information from many documents and from a variety of sources.
  • To let executives ask questions of their text-based resources, extract information quickly and find answers they might not otherwise have found.

Steps to Text Mining:

  1. "Preprocessing" the text to distill the documents into a structured format.
  2. Reducing the results to a more practical size.
  3. Mining the reduced data with traditional data mining techniques.

Text preprocessing transforms text into an information-rich term-by-document matrix. This large grid records how often each term occurs in each document of the collection. During this stage, feature extraction is also used to locate specific pieces of information, such as customer names, organizations and addresses.
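
A term-by-document matrix can be sketched in a few lines of Python. This is a minimal illustration, not a production preprocessor: the sample documents, the simple letters-only tokenizer and the function names are all assumptions made for the example, and it skips steps such as stop-word removal and feature extraction.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in documents[j]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Hypothetical sample documents.
docs = [
    "The customer called about a late payment.",
    "Payment received from the customer.",
    "The shipment arrived late.",
]

terms, matrix = term_document_matrix(docs)

# One row per term, one column per document.
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Each row is a term and each column a document, so the row for "late" shows which documents mention late payments or late shipments.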

Next, a mathematical technique called singular value decomposition (SVD) is used to replace the original term-by-document matrix with a much smaller matrix. As part of this process, words that contribute little to the main patterns of term usage are effectively discarded or ignored, while more important or highly relevant words are singled out. The new matrix can be used to place associated terms and documents into categories.
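
The reduction step can be illustrated with a truncated SVD. The sketch below assumes NumPy is installed and uses a small hypothetical term-by-document matrix; the term list, the counts and the choice of k = 2 are invented for the example. It keeps only the k largest singular values, so each document is described by k numbers instead of one count per term.

```python
import numpy as np

# Hypothetical term-by-document counts: rows are terms, columns are documents.
terms = ["payment", "invoice", "customer", "shipment", "delivery"]
A = np.array([
    [2, 1, 0, 0],   # payment
    [1, 2, 0, 0],   # invoice
    [1, 1, 1, 1],   # customer
    [0, 0, 2, 1],   # shipment
    [0, 0, 1, 2],   # delivery
], dtype=float)

def truncated_svd(matrix, k):
    """Keep only the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

k = 2  # assumed number of dimensions to keep
U_k, s_k, Vt_k = truncated_svd(A, k)

# Each document becomes a k-dimensional vector instead of a column of term counts.
doc_vectors = (np.diag(s_k) @ Vt_k).T   # shape: (documents, k)
# Each term also gets a k-dimensional vector.
term_vectors = U_k * s_k                # shape: (terms, k)

# The rank-k product approximates the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k
print("reduced document vectors:\n", np.round(doc_vectors, 2))
print("approximation error:", round(float(np.linalg.norm(A - A_k)), 3))
```

In this toy matrix the first two documents share billing vocabulary and the last two share shipping vocabulary, so their reduced vectors end up close to each other within each pair.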

Lastly, clustering, classification and predictive methods are applied to the reduced data, using traditional data mining techniques. Conventional structured data sources can also be included in the analysis to enrich the discovery of underlying trends and patterns within the data.
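
As one example of the final step, the sketch below clusters reduced document vectors with a plain k-means loop. The two-dimensional points are hypothetical stand-ins for SVD output, and seeding the centroids with the first k points is an assumption made to keep the example deterministic; k-means is only one of many clustering methods that could be applied here.

```python
import math

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical reduced document vectors (for example, the output of an SVD step).
points = [(1.8, 1.5), (1.8, -1.5), (1.7, 1.4),
          (1.9, -1.4), (1.6, 1.6), (1.7, -1.6)]

labels, centroids = k_means(points, k=2)
for point, label in zip(points, labels):
    print(point, "-> cluster", label)
```

Documents that land in the same cluster can then be reviewed together or passed on to a classifier or predictive model.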

Uses/Examples of Text Mining:

  • Sales and marketing executives can use text mining tools to analyze the company descriptions in their prospect database. The results help them target customers for new sales and marketing campaigns.
  • Linguists at a university in Belgium use text mining to analyze summaries of ancient and modern texts. Researchers mine textual information in several languages and use the results to address philological and psychological questions.
  • A new text mining project at a university medical center will let doctors make better use of medical databases such as MEDLINE, PsycINFO and TOXLINE for evidence-based medicine. A search of these databases can often yield 2,000 matches, but advanced modeling with text mining technologies can reduce the results to 100 highly relevant documents and sort those 100 documents into smaller subgroups or categories.

The combination of data mining and text mining is referred to as “duo-mining”. Duo-mining gives companies consolidated information for better decision making, and it has proven especially useful to banking and credit card companies. Instead of analyzing only the structured data they collect from transactions, they can add call logs from customer service and use text mining to analyze customers and spending patterns further. These developments in text mining technology, which go beyond simple searching methods, are the key to information discovery.
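
A rough sketch of the duo-mining idea: structured transaction fields and features derived from call-log text are merged into one record per customer. The customer IDs, field names, call notes and keyword list are all hypothetical, and a real project would derive its text features from the mining steps described earlier rather than from a fixed keyword list.

```python
import re
from collections import Counter

# Hypothetical structured data, e.g. aggregated from card transactions.
transactions = {
    "C001": {"monthly_spend": 1200.0, "late_payments": 0},
    "C002": {"monthly_spend": 310.0, "late_payments": 2},
}

# Hypothetical customer service call notes for the same customers.
call_logs = {
    "C001": "Asked about travel rewards and a higher credit limit.",
    "C002": "Complained about a late fee. Wants to cancel the card over the fee.",
}

# Illustrative keywords to count in the call notes (an assumption).
KEYWORDS = ["cancel", "fee", "limit", "rewards"]

def text_features(text):
    """Count how often each keyword appears in one call log."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {f"mentions_{word}": counts[word] for word in KEYWORDS}

def combine(transactions, call_logs):
    """Merge structured fields and text-derived fields per customer."""
    rows = {}
    for customer_id, structured in transactions.items():
        row = dict(structured)
        row.update(text_features(call_logs.get(customer_id, "")))
        rows[customer_id] = row
    return rows

combined = combine(transactions, call_logs)
for customer_id, row in combined.items():
    print(customer_id, row)
```

The combined rows can then be fed to the same clustering or predictive methods used for purely structured data.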

© 2005 - 2009 Mu Sigma. All rights reserved