Title:
Text analysis techniques
Document Type and Number:
Kind Code:
A1

Abstract:
One embodiment of the present invention includes means determining a concept representation for a set of text documents based on partial order analysis and modifying this representation if it is determined to be unidentifiable. Furthermore, the embodiment includes means for labeling the representation, mapping documents to it to provide a corresponding document representation, generating a number of document signatures each of a different type, and performing several data processing applications each with a different one of the document signatures of differing types.

Inventors:
Willse, Alan R. (Richland, WA, US)
Hetzler, Elizabeth G. (Herndon, VA, US)
Hope, Lawrence L. (Surry, ME, US)
Tanasse, Theodore E. (West Richland, WA, US)
Havre, Susan L. (Richland, WA, US)
Turner, Alan E. (Herndon, VA, US)
Macgregor, Margaret (Piedmont, CA, US)
Nakamura, Grant C. (Kennewick, WA, US)
Naucarrow, Catherine (Piedmont, CA, US)

Application Number:
10/252984
Publication Date:
03/25/2004
Filing Date:
09/23/2002
Primary Class:
Other Classes:
707/E17.094
International Classes:
(IPC1-7): G06F017/00
Attorney, Agent or Firm:
Bank One, Center/tower Woodard Emhardt Naughton Moriarty And Mcnett (Suite 3700, Indianapolis, IN, 46204-5137, US)
Claims:

What is claimed is:



1. A method, comprising: selecting a set of text documents; selecting a number of terms included in the set; establishing a multidimensional document space with a computer system as a function of the terms; performing a bump hunting procedure with the computer system to identify a number of document space features, the features each corresponding to a composition of two or more concepts of the documents; and deconvolving the features with the computer system to separately identify the concepts.

2. The method of claim 1, which includes providing a concept representation corresponding to an acyclic graph with a number of nodes each corresponding to one of the concepts and different levels to represent related concepts of differing degrees of specificity.

3. The method of claim 2, which includes identifying a number of different multilevel groups in accordance with a mathematically determined degree of desired fit of the different multilevel groups.

4. The method of claim 1, which includes determining the multidimensional document space in accordance with frequency of each of the terms in each of the text documents.

5. The method of claim 1, which includes determining a plurality of different signature vectors from the concepts for different text processing applications.

6. The method of claim 1, wherein said deconvolving includes performing a latent variable analysis as a function of the features and the terms to identify the concepts.

7. The method of claim 6, wherein said deconvolving includes: identifying one of a number of first level concepts of the text documents by determining each of the terms that is associated with one of the features; and establishing one of several second level concepts of the text documents by identifying at least one of the terms found in each member of a subset of the first level concepts.

8. The method of claim 7, which includes: providing a concept representation of the text documents, the representation including the first level concepts and the second level concepts with the subset of the first level concepts being subordinate to the one of the second level concepts; testing identifiability of the concept representation; and providing a modified concept representation in response to said testing if the concept representation is nonidentifiable.

9. A method, comprising: performing a routine with a computer system, including: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of the features, the representation corresponding to an arrangement of several levels to indicate different degrees of concept specificity; and identifying a number of different related groups of the concepts, the groups each being mathematically determined as a function of a degree of separateness from the concept representation.

10. The method of claim 9, wherein the groups are determined with an iterative gradient descent procedure.

11. The method of claim 9, wherein the degree of separateness is determined relative to a likelihood function for the concept representation.

12. The method of claim 9, which includes providing a visualization of the concept representation in the form of an acyclic directed graph, the graph including a number of nodes each corresponding to one of the concepts, the nodes being selectively linked to indicate relationships between the concepts.

13. The method of claim 9, wherein the groups each correspond to a different facet of the concept representation, and which includes preparing a number of different document signatures each from a different one of the groupings.

14. A method, comprising: performing a routine with a computer system, including: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of the terminological features, the representation hierarchically indicating different degrees of specificity among related members of the concepts and corresponding to an acyclic graph organization; determining the representation is nonidentifiable; in response to said determining, constraining one or more processing parameters of the routine; and providing a modified concept representation after said constraining, the modified concept representation being identifiable.

15. The method of claim 14, wherein said constraining one or more processing parameters of the routine includes limiting the modified concept representation to a quantity of levels.

16. The method of claim 14, wherein said constraining one or more processing parameters of the routine includes limiting the modified concept representation to a strict hierarchy form in which each one of the concepts is subordinate to at most one other of the concepts.

17. The method of claim 14, wherein said constraining one or more processing parameters of the routine includes mapping the representation into a number of multilevel subgroupings each corresponding to an acyclic graph arrangement.

18. The method of claim 14, wherein said extracting is performed by executing a bump hunting procedure and the concepts are determined by executing a deconvolution procedure with respect to the features.

19. A method, comprising: performing a routine with a computer system, including: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of the terminological features, the representation hierarchically indicating different degrees of specificity among related ones of the concepts in correspondence to different levels of an acyclic graph organization; evaluating a selected document relative to the representation; and generating a number of different document signatures for the selected document with the representation.

20. The method of claim 19, wherein said extracting is performed by executing a bump hunting procedure and the concepts are determined by executing a deconvolution procedure with respect to the features.

21. The method of claim 19, which includes identifying several different groups of related concepts, the groups each corresponding to several of the different levels of the representation.

22. The method of claim 21, wherein said generating includes preparing each of the different document signatures in accordance with a different one of the groups.

23. The method of claim 19, wherein said generating includes preparing each of the different document signatures for a different text data processing application.

24. The method of claim 23, wherein the different text data application is one or more of the group consisting of event detection, document summarization, document clustering, document filtering, querying, and synonym analysis.

25. The method of claim 19, wherein: said extracting includes determining the terminological features as a function of a set of terms contained in the set of text documents; and said evaluating includes mapping the selected document to the concept representation as a function of any terms of the selected document contained in the set of terms.

26. A method, comprising: selecting a set of text documents; representing the documents with a number of terms; identifying a number of multiterm features of the text documents as a function of frequency of each of the terms in each of the documents; relating the multiterm features and the terms with one or more data structures corresponding to a sparse matrix; performing a latent variable analysis as a function of the terms to determine a number of concepts of the text documents from the one or more data structures; and providing a concept representation corresponding to a multilevel acyclic graph organization in which each node of the graph corresponds to one of the concepts.

27. The method of claim 26, wherein the latent variable analysis includes deconvolving the features to determine the concepts.

28. The method of claim 26, wherein the latent variable analysis includes: identifying one of the concepts in a first level of the concept representation by determining each of the terms that is associated with one of the features; and establishing one of the concepts in a second level of the concept representation by identifying at least one of the terms found in each member of a subset of the concepts in the first level.

29. The method of claim 28, wherein the concept representation indicates the one of the concepts in the first level is related and subordinate to the one of the concepts in the second level.

30. The method of claim 26, which includes: determining a number of related subsets of the concepts, the subsets each spanning several levels of the concept representation and each corresponding to a different facet of the representation; testing identifiability of the concept representation; and providing several different document signatures from the concept representation.

31. A method, comprising: performing a routine with a computer system, including: determining a number of multiterm features of a set of text documents as a function of a number of terms included in the set of text documents; identifying one of a number of first level concepts of the text documents by determining each of the terms that is identified with one of the features; establishing one of several second level concepts of the text documents by identifying one of the terms found in each member of a subset of the first level concepts; and providing a concept representation of the text documents, the representation including the first level concepts and the second level concepts with the subset of the first level concepts being subordinate to the one of the second level concepts.

32. The method of claim 31, which includes establishing one of several third level concepts of the text documents by identifying at least one of the terms found in each member of a subset of the second level concepts.

33. The method of claim 32, wherein the subset of the second level concepts are subordinate to the one of the third level concepts in the concept representation.

34. The method of claim 31, which includes labeling the one of the first level concepts with each of the terms that is only identified with the one of the features.

35. The method of claim 31, wherein the subset of first level concepts includes the one of the first level concepts.

36. The method of claim 35, wherein said establishing includes determining a number of other subsets each including the one of the first level concepts by correspondingly identifying other of the terms that are included in each member of a respective one of the other subsets.

37. The method of claim 36, which includes labeling the one of the second level concepts with the one of the terms and the other of the terms.

38. The method of claim 36, wherein said determining includes executing a bump hunting procedure to determine the multiterm features.

39. The method of claim 36, which includes determining a number of different document signatures from the concept representation with the computer system.

40. The method of claim 36, which includes determining a number of related subsets of the concepts, the subsets each spanning several levels of the concept representation and each corresponding to a different representation facet.

41. A method, comprising: identifying a number of events; providing a visualization of the events with a computer system, the visualization including a number of visualization objects each representing a different one of the events; positioning each of the visualization objects along a first axis to indicate timing of each of the events relative to one another with a corresponding initiation time and a corresponding termination time of each of the events being represented by an initiation point and termination point of each of the visualization objects along the first axis; and dimensioning each of the visualization objects between the corresponding initiation point and the corresponding termination point along the first axis to indicate event duration and along a second axis to indicate relative strength of the different one of the events.

42. The method of claim 41, wherein said identifying includes relating each of the events to a combination of terms.

43. The method of claim 42, wherein the visualization objects are each comprised of a number of components, the components each corresponding to one of the terms of the combination for the respective one of the visualization objects.

44. The method of claim 43, wherein the components of the respective one of the visualization objects are each differently colored in the visualization.

45. The method of claim 41, which includes providing graphic user interfacing with the visualization to select a time window for display of event details.

46. The method of claim 45, wherein the event details include a display of the combination of terms for each of the visualization objects included in the time window.

47. The method of claim 41, wherein the events are determined from a set of text documents.

48. The method of claim 47, wherein the events are determined with a concept representation of text documents.

49. A method, comprising: providing a set of text documents; evaluating time variation of a number of terms included in the documents; generating a number of clusters corresponding to the documents with a computer system as a function of the terms; and identifying a number of events as a function of a time variation of the clusters.

50. The method of claim 49, wherein said evaluating includes: determining presence of a word in the documents for each of several different segments of a time period; and establishing a degree of time variation of the word as a function of the one of the segments with a frequency of the word greater than other of the segments and a quantity of the documents including the word.

51. The method of claim 49, wherein said generating is performed with the terms having a selected level of time variation.

52. The method of claim 49, wherein said generating adjusts term weighting in accordance with said evaluating.

53. The method of claim 49, which includes: displaying the events in a visualization including a number of visualization objects each representative of a different one of the events; positioning each of the visualization objects along a first axis to indicate timing of each of the events relative to one another with a corresponding initiation time and a corresponding termination time of each of the events being represented by an initiation point and termination point of each of the visualization objects along the first axis; and dimensioning each of the visualization objects between the corresponding initiation point and the corresponding termination point along the first axis to indicate event duration and along a second axis to indicate relative strength of the different one of the events.

54. The method of claim 49, wherein said generating is performed based on document signatures determined from a concept representation.

55. A method, comprising: providing a number of textual documents arranged relative to a period of time; identifying a feature with a time varying distribution among the documents; evaluating presence of the feature for each of several different segments of the time period; and detecting an event as a function of the one of the segments with a frequency of the feature greater than other of the segments and a quantity of the documents corresponding to the feature.

56. The method of claim 55, wherein the feature is a term.

57. The method of claim 55, wherein the feature is a document cluster.

58. The method of claim 55, which includes identifying a number of other events.

59. The method of claim 55, which includes providing an event visualization.

60. The method of claim 55, which includes preparing a document signature from a concept representation for event detection processing.

61. A method, comprising: selecting a set of text documents; designating several different dimensions of the documents; characterizing each of the dimensions with a corresponding set of words; for each of the dimensions, performing a clustering analysis of the documents based on the corresponding set of words; and visualizing the clustering analysis for each of the dimensions.

62. The method of claim 61, wherein the different dimensions are selected based on several different document signatures representative of each of the documents.

63. The method of claim 61, wherein said selecting includes receiving input regarding the dimensions from an operator.

64. The method of claim 61, which includes: pairing a number of the documents to provide a number of document pairs; for each of the document pairs, comparing a first pair member to a second pair member; and determining a degree of similarity based on said comparing for each of the document pairs.

65. A method, comprising: in response to an input of one or more words in a computer system, providing a list of words with the computer system as a function of a number of context vectors for a set of text documents and the one or more words; receiving another input responsive to the list; reweighting a number of different entries corresponding to the context vectors with the computer system based on the second input; providing an output of related words with the computer system based on said reweighting; and repeating said receiving, said reweighting, and said providing with the computer system.

66. The method of claim 65, which includes locating one or more of the text documents based on the related words after said repeating.

67. The method of claim 65, wherein said reweighting includes determining a profile of words of interest based on cooccurrence and variance.

68. The method of claim 65, wherein said reweighting includes applying a statistical discrimination test.

69. The method of claim 65, wherein the context vectors are comprised of different types, at least one of the types being provided as a document signature based on a concept representation.

70. The method of claim 65, which includes generating the context vectors based on a cooccurrence measure.

71. An apparatus, comprising: means for determining a concept representation for a set of text documents based on partial order analysis; means for modifying the concept representation if it is determined to be unidentifiable; means for labeling the concept representation; means for mapping documents to the concept representation to provide a corresponding document representation; means for generating several document signature types from the document representation; and means for performing a number of data processing applications each with a document signature of a different one of the document signature types determined from the document representation.

72. The apparatus of claim 71, wherein the different applications include at least one of event detection, event visualization, term relationship discovery, and clustering.

73. An apparatus, comprising: a device carrying logic executable with a computer system to extract terminological features from a set of text documents; establish a representation of a number of concepts of the text documents as a function of the terminological features, the representation hierarchically indicating different degrees of specificity among related ones of the concepts in correspondence to different levels of an acyclic graph organization; evaluate a selected document relative to the representation; and generate a number of different document signatures for the selected document with the representation.

74. The apparatus of claim 73, wherein the device includes one or more components of a computer network.

75. The apparatus of claim 73, wherein the device includes a memory device accessible by the computer system.

76. The apparatus of claim 75, wherein the memory device is in the form of a removable disk.

Description:

BACKGROUND

[0001] The present invention relates to data processing and more particularly, but not exclusively, relates to text analysis techniques.

[0002] Recent technological advancements have led to the collection of a vast amount of electronic data. These collections are sometimes arranged into corpora each comprised of millions of text documents. Unfortunately, the ability to quickly identify patterns or relationships which exist within such collections, and/or the ability to readily perceive underlying concepts within documents of a given corpus remain highly limited. Common text analysis applications include information retrieval, document clustering, and document classification (or document filtering). Typically, such operations are preceded by feature extraction, document representation, and signature creation, in which the textual data is transformed to numeric data in a form suitable for analysis. In some text analysis systems, the feature extraction, document representation, and signature creation are the same for all applications. The Battelle SPIRE system provides an example in which each document is represented by a numeric vector called the SPIRE ‘signature’; all SPIRE applications then work directly with this signature vector.

[0003] In other text analysis systems (e.g., IBM's Intelligent Miner for Text), approaches for feature extraction, document representation or signature creation vary with the application. Desired features often differ for document clustering and document classification applications. In classification, a ‘training’ set of documents with known class labels is used to ‘learn’ rules for classifying future documents; features can be extracted that show large variation or differences between known classes. In clustering, documents are organized into groups with no prior knowledge of class labels; features can be extracted that show large variation or clumping between documents; however, because ‘true’ class labels are unknown, they cannot be exploited for feature extraction.

[0004] While generic systems facilitate the layering of multiple applications once a generic ‘signature’ is obtained, they may not perform as well in specific applications as systems that were developed specifically for that application. In contrast, the disadvantage of specialized systems is that they require separate development of feature extraction, document representation, or signature creation algorithms for each application, which can be time consuming, and impractical for small research groups.

[0005] Furthermore, current schemes tend to group documents according to a unitary measure of semantic similarity; however, documents can be similar in different ‘respects’. For example, in an assessment of retrieval of aviation safety incident reports related to documents describing the Cali accident (M. W. McGreevy and I. C. Statler, NASA/TM-1998-208749), analysts judged incident reports as related or not to the Cali accident (based on NTSB investigative reports of the Cali accident) according to six different ‘respects’ exemplified by the questions asked of the analysts: (1) in some ways, the context of this incident is similar to the context of the Cali accident; (2) some of the events of this incident are similar to some of the events of the Cali accident; (3) some of the problems of this incident are similar to some of the problems of the Cali accident; (4) some of the human factors of this incident are similar to some of the human factors of the Cali accident; (5) some of the causes of this incident are similar to some of the causes of the Cali accident; and (6) in some ways, this incident is relevant to the Cali accident. Many existing systems do not account for these different dimensions of similarity.

[0006] Moreover, typical systems do not account for the confidence in observed relationships, the potential for multiple levels of meaning, and/or the context of observed relationships. Thus, there is an ongoing need for further contributions in this area of technology.

SUMMARY

[0007] One embodiment of the present invention is a unique data processing technique. Other embodiments include unique apparatus, systems, and methods for analyzing collections of text documents or records.

[0008] A further embodiment of the present invention is a method that includes selecting a set of text documents; selecting a number of terms included in the set; establishing a multidimensional document space with a computer system as a function of these terms; performing a bump-hunting procedure with the computer system to identify a number of document space features that each correspond to a composition of two or more concepts of the documents; and deconvolving these features with the computer system to separately identify the concepts.
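
By way of nonlimiting illustration, the step of establishing a multidimensional document space as a function of selected terms might be sketched as follows. The sketch assumes whitespace tokenization and raw term counts, neither of which is specified above, and the function name `document_space` is hypothetical. It does not attempt the bump-hunting or deconvolution steps.

```python
from collections import Counter

def document_space(documents, terms):
    """Return one term-frequency vector per document, ordered like `terms`.

    Assumption: tokens are lowercase whitespace-separated words and each
    coordinate is a raw count; other weightings are equally possible.
    """
    vectors = []
    for text in documents:
        counts = Counter(text.lower().split())
        vectors.append([counts[t] for t in terms])
    return vectors

# Hypothetical three-document set with three selected terms.
docs = ["engine fire engine", "runway fire", "runway closed"]
terms = ["engine", "fire", "runway"]
space = document_space(docs, terms)
```

Each document then occupies a point in a space with one dimension per selected term, which is the form of input a density-based feature search would operate on.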

[0009] Still a further embodiment of the present invention is a method that includes extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of the features; and identifying a number of different related groups of the concepts. The representation may correspond to an arrangement of several levels to indicate different degrees of concept specificity.

[0010] Yet another embodiment of the present invention includes a method comprising: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of these features; determining the representation is non-identifiable; and in response, constraining one or more processing parameters of the routine to provide a modified concept representation. In one form, the representation hierarchically indicates different degrees of specificity among related members of the concepts and corresponds to an acyclic graph organization.

[0011] Still a further embodiment relates to a method which includes: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the documents as a function of these features; evaluating a selected document relative to the representation; and generating a number of document signatures for the selected document with the representation.

[0012] In another embodiment of the present invention, a method comprises: selecting a set of text documents; representing the documents with a number of terms; identifying a number of multiterm features of the text documents as a function of frequency of each of the terms in each of the documents; relating the multiterm features and terms with one or more data structures corresponding to a sparse matrix; and performing a latent variable analysis to determine a number of concepts of the text documents from the one or more data structures. This method may further include providing a concept representation corresponding to a multilevel acyclic graph organization in which each node of the graph corresponds to one of the concepts.

[0013] Yet another embodiment of the present invention includes a method for performing a routine with a computer system that includes: determining a number of multiterm features of a set of text documents as a function of a number of terms included in those documents; identifying one of a number of first level concepts of the text documents based on one or more terms associated with one of the features; establishing one of several second level concepts of the documents by identifying one of the terms found in each member of a subset of the first level concepts; and providing a concept representation of the documents based on the first level and second level concepts.
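
One possible reading of the two-level construction just described is sketched below. It assumes the multiterm features are already available as a mapping from a feature identifier to its associated terms. The function name `build_two_level_concepts`, the feature identifiers, and the sample terms are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def build_two_level_concepts(feature_terms):
    """feature_terms: dict mapping a feature id to the set of its terms."""
    # First level: one concept per feature, holding every associated term.
    first_level = {f: set(ts) for f, ts in feature_terms.items()}

    # Index which first level concepts contain each term.
    term_to_concepts = defaultdict(set)
    for f, ts in first_level.items():
        for t in ts:
            term_to_concepts[t].add(f)

    # Second level: a term found in every member of a subset of two or more
    # first level concepts defines a concept to which that subset is
    # subordinate. Terms sharing the same subset are grouped together.
    second_level = defaultdict(set)
    for t, members in term_to_concepts.items():
        if len(members) >= 2:
            second_level[frozenset(members)].add(t)
    return first_level, dict(second_level)

# Hypothetical features and terms.
features = {
    "b1": {"engine", "fire", "aircraft"},
    "b2": {"runway", "aircraft"},
    "b3": {"storm", "weather"},
    "b4": {"fog", "weather"},
}
first, second = build_two_level_concepts(features)
```

In this example the term "aircraft" is found in both b1 and b2, so those two first level concepts become subordinate to a second level concept labeled with that term.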

[0014] A further embodiment involves a method that comprises: identifying a number of events; providing a visualization of the events with a computer system; and dimensioning each of a number of visualization objects relative to a first axis and a second axis. The visualization objects each represent a different one of the events and are positioned along the first axis to indicate timing of each of the events relative to one another with a corresponding initiation time and a corresponding termination time of each of the events being represented by an initiation point and a termination point of each of the objects along the first axis. The extent of each object along the second axis is indicative of relative strength of the event represented thereby.
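
The dimensioning of visualization objects described above can be sketched as a simple layout computation. The `Event` fields, the `layout` function, and the linear scaling of strength to height are assumptions made for illustration and are not taken from the description.

```python
from dataclasses import dataclass

@dataclass
class Event:
    label: str
    start: float     # initiation time along the first axis
    end: float       # termination time along the first axis
    strength: float  # relative strength, shown along the second axis

def layout(events, max_height=100.0):
    """Return one box per event: first-axis span plus second-axis extent."""
    strongest = max(e.strength for e in events)
    return [
        {
            "label": e.label,
            "x0": e.start,
            "x1": e.end,
            # Assumption: height scales linearly with relative strength.
            "height": max_height * e.strength / strongest,
        }
        for e in events
    ]

# Hypothetical events.
events = [Event("a", 0.0, 4.0, 2.0), Event("b", 2.0, 3.0, 1.0)]
boxes = layout(events)
```

The span from `x0` to `x1` conveys event duration, and `height` conveys strength relative to the strongest event in the set.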

[0015] In another embodiment of the present invention, a method includes: providing a set of text documents; evaluating time variation of a number of terms included in these documents; generating a number of clusters corresponding to the documents with a computer system as a function of these terms; and identifying a number of events as a function of a time variation of the clusters.

[0016] For a further embodiment of the present invention, a method includes: providing a number of textual documents arranged relative to a period of time; identifying a feature with a time varying distribution among the documents; evaluating presence of this feature for each of several different segments of the time period; and detecting an event as a function of the one of the segments with a frequency of the feature greater than other of the segments and a quantity of the documents corresponding to the feature.
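
A minimal sketch of such segment-based detection follows. The description states only that detection is a function of the peak segment and of the quantity of documents corresponding to the feature. The particular rule used here (a minimum document count combined with a minimum share of occurrences in the peak segment), the thresholds, and the name `detect_event` are assumptions.

```python
def detect_event(segment_counts, min_docs=5, min_peak_share=0.5):
    """segment_counts: documents containing the feature, per time segment.

    Returns (peak segment index, peak share) when an event is flagged,
    otherwise None. The scoring rule is illustrative only.
    """
    total = sum(segment_counts)
    if total < min_docs:
        return None  # too few documents correspond to the feature
    # Segment in which the feature is most frequent.
    peak = max(range(len(segment_counts)), key=segment_counts.__getitem__)
    share = segment_counts[peak] / total
    return (peak, share) if share >= min_peak_share else None

burst = detect_event([1, 0, 8, 1])  # concentrated in the third segment
flat = detect_event([3, 3, 2, 2])   # spread evenly over time
rare = detect_event([0, 1, 0])      # too few documents overall
```

A feature concentrated in one segment and present in enough documents is flagged, while evenly distributed or rare features are not.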

[0017] Still another embodiment includes a method, comprising: selecting a set of text documents; designating several different dimensions of the documents; characterizing each of the dimensions with a corresponding set of words; performing a cluster analysis of the documents based on the set of words for each of the dimensions; and visualizing the clustering analysis for each of the dimensions.
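
The per-dimension clustering described above might be sketched as follows. The clustering algorithm is not specified, so this sketch substitutes a simple single-pass grouping by cosine similarity. The function names, the threshold, and the sample dimensions are hypothetical, and the visualization step is omitted.

```python
import math
from collections import Counter

def vectorize(text, words):
    """Count occurrences of each dimension word in a document."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in words]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_by_dimension(documents, dimensions, threshold=0.5):
    """Return {dimension name: clusters}, each cluster a list of doc indices."""
    result = {}
    for name, words in dimensions.items():
        vectors = [vectorize(d, words) for d in documents]
        clusters = []  # each entry: (seed vector, [document indices])
        for i, vec in enumerate(vectors):
            for seed, members in clusters:
                if cosine(seed, vec) >= threshold:
                    members.append(i)  # join the first sufficiently similar cluster
                    break
            else:
                clusters.append((vec, [i]))  # start a new cluster
        result[name] = [members for _, members in clusters]
    return result

# Hypothetical documents and dimensions.
docs = ["engine fire at night", "engine failure at dawn", "runway fire at night"]
dimensions = {"problem": ["engine", "runway"], "time": ["night", "dawn"]}
groups = cluster_by_dimension(docs, dimensions)
```

The same three documents group differently under each dimension, which reflects the point made in the background that documents can be similar in different respects.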

[0018] Yet another embodiment is directed to a method which includes: providing a list of words with a computer system as a function of a number of context vectors for a set of text documents and one or more words; receiving input responsive to this list; reweighting a number of different entries corresponding to the context vectors with the computer system based on this input; providing an output of related words with a computer system based on the reweighting; and repeating receipt of the input, reweighting, and provision of the output with a computer system as desired.

[0019] In other embodiments, a unique system is provided to perform one or more of the above-indicated methods and/or at least one device is provided carrying logic executable by a computer system to perform one or more of the above-indicated methods.

[0020] Accordingly, one object of the present invention is to provide a unique data processing technique.

[0021] Another object is to provide a unique apparatus, system, device, or method for analyzing textual data.

[0022] Further objects, embodiments, forms, features, aspects, benefits, and advantages of the present invention will become apparent from the drawings and detailed description contained herein.

BRIEF DESCRIPTION OF THE VIEWS OF THE DRAWING

[0023] FIG. 1 is a diagrammatic view of a computing system.

[0024] FIG. 2 is a flowchart illustrating details of a routine that can be executed with the system of FIG. 1.

[0025] FIG. 3 is a flowchart illustrating details of a subroutine for the routine of FIG. 2.

[0026] FIG. 4 is a flowchart illustrating details of a procedure included in the subroutine of FIG. 3.

[0027] FIG. 5 is an illustration of a term-by-bump matrix.

[0028] FIG. 6 is a diagram of a term tree corresponding to the matrix of FIG. 5.

[0029] FIG. 7 is a diagram of a concept representation formed from the matrix of FIG. 5 and diagram of FIG. 6 that can be provided with the routine of FIG. 2.

[0030] FIG. 8 is another concept representation that can be provided with the routine of FIG. 2.

[0031] FIG. 9 is a flowchart illustrating details of a multidimensional clustering procedure that can be performed as part of the routine of FIG. 2.

[0032] FIG. 10 is a flowchart illustrating details of an event detection and visualization procedure that can be performed as part of the routine of FIG. 2.

[0033] FIG. 11 is a visualization of events detected in accordance with the procedure of FIG. 10 .

[0034] FIG. 12 is a diagram of a visualization object from the visualization of FIG. 11 showing greater detail.

[0035] FIG. 13 is a flowchart illustrating details of a procedure for identifying term relationships.

DETAILED DESCRIPTION OF SELECTED EMBODIMENTS

[0036] For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates.

[0037] In accordance with one embodiment of the present invention, text analysis is performed to create a hierarchical, multifaceted document representation that enables multiple distinct views of a corpus based on the discovery that it can be desirable to consider similarity of documents in different ‘respects’. The hierarchical feature provides the potential for multiple levels of meaning to be represented; where the desired ‘level of meaning’ to use in a given application often depends on the user and the level of confidence for the different representation levels. For example, in one document there might be a relatively high degree of confidence that the topic “Sports” is discussed, but confidence might be low regarding the type of sport; in another document confidence might be high that the sport of tennis is discussed. In one form, this concept representation is created automatically, using machine learning techniques. It can be created in the absence of any ‘outside’ knowledge, using statistically derived techniques. Alternatively or additionally, outside knowledge sources can be used, such as predefined document categorizations and term taxonomies, just to name a few.

[0038] The construction of a concept representation is typically based on identifying ‘concepts’ in documents. Frequently, documents do not contain express concepts—instead they contain words from which concepts can often be inferred. By way of nonlimiting example, terms and their juxtapositions within documents can serve as indicators of latent concepts. Accordingly, latent concepts can often be estimated using a statistical latent variable model. In one approach, a latent variable analysis is applied to determine the concepts by deconvolving a document feature space created with a bump-hunting procedure based on a set of terms extracted from the document set. The resulting concept representation can be organized with different concept levels and/or facets. In one form, the concept representation is provided as one or more data structures corresponding to an acyclic directed graph and can be visualized as such.

[0039] A document representation is provided by mapping documents of a given corpus to the above-indicated concept representation. Alternatively or additionally, an initial concept representation can be restructured by equivalence mapping before a document representation is provided. From the document representation, different document signatures can be generated specific to various text analysis applications, such as: (a) information retrieval: retrieve ‘relevant’ documents in response to a query, such as a boolean or ‘query by example’; (b) document clustering: organize documents into groups according to semantic similarity; (c) document categorization, routing, and filtering: classify documents into predefined groups; (d) summarization: provide synopses of individual documents or groups of documents; (e) information extraction: extract pre-defined information pieces from text, such as company names, or sentences describing terrorist activity; (f) entity linkage: find relationships between entities, such as recognizing that “Joe Brown is President of The Alfalfa Company” or identifying linkages between airlines in the context of a merger, to name just a few examples; (g) event detection: automatically detect and summarize significant events (usually in real time), and deliver summary and supporting evidence to interested parties; (h) corpus navigation: browse a corpus; (i) topic discovery and organization: organize topics or concepts within a corpus; and/or (j) question answering: provide answers to questions. Question answering can go beyond retrieving documents that are ‘relevant’ to a question. In some applications, the answer can be directly extracted from a relevant document. In others, it is acknowledged that the answer to a question might not be contained in a single document; instead, different parts of the answer might occur in different documents, which could be identified and combined by the application.

[0040] Accordingly, these and other embodiments of the present invention provide a combination of generic and application-specific components that are better-suited to current text mining objectives. FIG. 1 diagrammatically depicts computer system 20 of another embodiment of the present invention. System 20 includes computer 21 with processor 22. Processor 22 can be of any type, and is configured to operate in accordance with programming instructions and/or another form of operating logic. In one embodiment, processor 22 is integrated circuit based, including one or more digital, solid-state central processing units each in the form of a microprocessor.

[0041] System 20 also includes operator input devices 24 and operator output devices 26 operatively coupled to processor 22. Input devices 24 include a conventional mouse 24a and keyboard 24b, and alternatively or additionally can include a trackball, light pen, voice recognition subsystem, and/or different input device type as would occur to those skilled in the art. Output devices 26 include a conventional graphic display 26a, such as a color or noncolor plasma, Cathode Ray Tube (CRT), or Liquid Crystal Display (LCD) type, and color or noncolor printer 26b. Alternatively or additionally output devices 26 can include an aural output system and/or different output device type as would occur to those skilled in the art. Further, in other embodiments, more or fewer operator input devices 24 or operator output devices 26 may be utilized.

[0042] System 20 also includes memory 28 operatively coupled to processor 22. Memory 28 can be of one or more types, such as solid-state electronic memory, magnetic memory, optical memory, or a combination of these. As illustrated in FIG. 1, memory 28 includes a removable/portable memory device 28a that can be an optical disk (such as a CD ROM or DVD); a magnetically encoded hard disk, floppy disk, tape, or cartridge; and/or a different form as would occur to those skilled in the art. In one embodiment, at least a portion of memory 28 is operable to store programming instructions for selective execution by processor 22. Alternatively or additionally, memory 28 can be arranged to store data other than programming instructions for processor 22. In still other embodiments, memory 28 and/or portable memory device 28a may not be present.

[0043] System 20 also includes computer network 30, which can be a Local Area Network (LAN); Wide Area Network (WAN), such as the Internet; another type as would occur to those skilled in the art; or a combination of these. Network 30 couples computer 40 to computer 21, where computer 40 is remotely located relative to computer 21. Computer 40 can include a processor, input devices, output devices, and/or memory as described in connection with computer 21; however, these features of computer 40 are not shown to preserve clarity. Computer 40 and computer 21 can be arranged as client and server, respectively, in relation to some or all of the data processing of the present invention. For this arrangement, it should be understood that many other remote computers 40 could be included as clients of computer 21, but are not shown to preserve clarity. In another embodiment, computer 21 and computer 40 can both be participating members of a distributed processing arrangement with one or more processors located at a different site relative to the others. The distributed processors of such an arrangement can be used collectively to execute routines according to the present invention. In still other embodiments, remote computer 40 may be absent.

[0044] Operating logic for processor 22 is arranged to facilitate performance of various routines, subroutines, procedures, stages, operations, and/or conditionals described hereinafter. This operating logic can be of a dedicated, hardwired variety and/or in the form of programming instructions as is appropriate for the particular processor arrangement. Such logic can be at least partially encoded on device 28a for storage and/or transport to another computer. Alternatively or additionally, the logic of computer 21 can be in the form of one or more signals carried by a transmission medium, such as network 30.

[0045] System 20 is also depicted with computer-accessible data sources or datasets generally designated as corpora 50. Corpora 50 include datasets 52 local to computer 21 and remotely located datasets 54 accessible via network 30. Computer 21 is operable to process data selected from one or more of corpora 50. The one or more corpora 50 can be accessed with a data extraction routine executed by processor 22 to selectively extract information according to predefined criteria. In addition to datasets 52 and 54, corpora data may be acquired live or in realtime from local source 56 and/or remote source 58 using one or more sensors or other instrumentation, as appropriate. The data mined in this manner can be further processed to provide one or more corresponding data processing outputs in accordance with the operating logic of processor 22.

[0046] Referring to FIG. 2, a flowchart of document processing routine 100 is presented. Routine 100 can be performed with system 20 in accordance with operating logic of processor 22. Routine 100 begins with concept representation subroutine 200. Subroutine 200 is directed to the construction of a concept representation that is used in later stages and procedures of routine 100.

[0047] Referring to FIG. 3, subroutine 200 starts with document preprocessing stage 210, which includes selection of a set of text documents for training purposes in operation 202. These documents can be selected from among corpora 50 with system 20. Typically the documents are selected to be representative of a single corpus or collection that has some aspect of commonality, such as document type, overall topic, or the like; however, documents from diverse collections/corpora can alternatively be selected.

[0048] In one form, it is desirable that the set of documents selected for training is representative of documents expected to be used when applying the concept representation to various applications. Alternatively or additionally, it may be desirable to select a training set of documents that is relatively large to make it more likely to ‘discover’ infrequent or ‘rare’ concepts. In one instance of this approach, concept representation construction is based on a training set of at least 100,000 text documents, although in other instances more or fewer training documents could be used.

[0049] Preprocessing stage 210 also includes term standardization operation 204 in which a set of terms S is determined for processing in later stages. Such standardization can include typical stemming, identification of phrases (i.e., word sequences that should be treated as one unit), and mapping known synonyms to a common canonical form. Typically, functional words or ‘stop’ words will be removed when determining this standardized lexicon. Functional words include modifiers such as ‘a’, ‘the’, and ‘this’ that are necessary for grammatical comprehension but do not directly contribute to a concept. Functional words can be removed by comparing them with a list of known functional terms (a ‘stop-word’ list). Alternatively, if a stop-word list is not available (for example, if a foreign language is being analyzed for which a stop-word list is not known), functional words can be identified automatically via a topicality calculation executed with system 20. In such a calculation for a given term, let A be the number of documents that contain the term, let N be the number of documents in the test collection, and let T be the total number of times the term occurs in the collection. Then if the term is distributed randomly T times across the N documents, we would expect it to occur in E = N − N(1 − 1/N)^T documents. If the term occurs in significantly more documents than expected by chance, it is considered to be regularly distributed, typical of a functional word. Thus, functional terms can be automatically identified as those terms for which A/E > 1 + λ, where λ is a threshold that may have been selected based on previous experience, or based on statistical considerations. In one embodiment, λ = 0.25 has been found to be adequate for English documents. A. Bookstein, S. T. Klein, and T. Raita, “Clumping Properties of Content-Bearing Words,” Journal of the American Society for Information Science (published on the world wide web 1998), is cited as a source of background information concerning such approaches.
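The topicality calculation above can be sketched in a few lines. This is a minimal illustration rather than the implementation used with system 20: the function names are hypothetical, and the document counts in the example are invented.

```python
def expected_document_count(n_docs: int, total_occurrences: int) -> float:
    """E = N - N(1 - 1/N)^T: documents expected to contain a term
    whose T occurrences are scattered at random over N documents."""
    return n_docs - n_docs * (1.0 - 1.0 / n_docs) ** total_occurrences


def is_functional_term(doc_count: int, n_docs: int, total_occurrences: int,
                       lam: float = 0.25) -> bool:
    """Flag a term as functional when A / E > 1 + lambda."""
    expected = expected_document_count(n_docs, total_occurrences)
    return doc_count / expected > 1.0 + lam


# Invented example: 1,000 documents and a term that occurs 1,200 times.
# A term spread over 950 documents looks functional; one packed into 300 does not.
spread_evenly = is_functional_term(doc_count=950, n_docs=1000, total_occurrences=1200)
clumped = is_functional_term(doc_count=300, n_docs=1000, total_occurrences=1200)
```

The default of 0.25 for `lam` mirrors the value reported above for English documents; other languages or collections may call for a different threshold.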

[0050] From term standardization operation 204, subroutine 200 exits preprocessing stage 210 and proceeds to stage 212. In stage 212, a document feature space is generated as a function of the term set S selected during operation 204. In one embodiment, the document feature space is provided in the form of a term-by-document frequency matrix, where the (i,j)th entry contains the frequency of the ith term in the jth document, an example of which follows in Table I:

TABLE I
Term        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Doc 9
Football      3      1      0      2      0      0      1      0      0
Ball          0      5      0      0      0      3      3      0      0
Sports        2      0      3      3      0      2      5      3      2
Basketball    0      0      4      1      3      0      0      1      2
Game          0      0      1      1      0      0      0      2      0
Skate         0      0      0      0      1      0      0      0      0
Skate 0 0 0 0 1 0 0 0 0

[0051] It should be understood that in other embodiments, a term-by-document frequency matrix can include fewer, but typically, many more documents and/or terms. Alternatively or additionally, the frequency can be weighted based on one or more criteria, such as an information-theoretic measure of content or information contained in a given term and/or document. In one such form, term frequencies are weighted by a measure of their content relative to their prevalence in the document collection. To standardize for documents of varying sizes, the columns of a weighted term-by-document frequency matrix might also be normalized prior to analysis.
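As a rough illustration of the weighting and normalization just described, the sketch below applies them to the Table I counts. The text does not name a specific weighting, so the inverse-document-frequency weight and the unit-length (Euclidean) column normalization used here are assumptions, as are all function names.

```python
import math

# Rows are terms, columns are Doc 1 .. Doc 9, copied from Table I.
TERMS = ["Football", "Ball", "Sports", "Basketball", "Game", "Skate"]
COUNTS = [
    [3, 1, 0, 2, 0, 0, 1, 0, 0],
    [0, 5, 0, 0, 0, 3, 3, 0, 0],
    [2, 0, 3, 3, 0, 2, 5, 3, 2],
    [0, 0, 4, 1, 3, 0, 0, 1, 2],
    [0, 0, 1, 1, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
]


def idf_weights(counts):
    """Assumed weighting: log(N / document frequency) for each term."""
    n_docs = len(counts[0])
    weights = []
    for row in counts:
        doc_freq = sum(1 for x in row if x > 0)
        weights.append(math.log(n_docs / doc_freq) if doc_freq else 0.0)
    return weights


def weight_and_normalize(counts):
    """Weight each term row, then scale every document column to unit length."""
    weights = idf_weights(counts)
    weighted = [[weights[i] * x for x in row] for i, row in enumerate(counts)]
    for j in range(len(counts[0])):
        norm = math.sqrt(sum(row[j] ** 2 for row in weighted))
        if norm > 0:
            for row in weighted:
                row[j] /= norm
    return weighted


normalized = weight_and_normalize(COUNTS)
```

Under this assumed weighting, a rare term such as Skate receives a larger weight than a prevalent term such as Sports, and long and short documents become comparable after normalization.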

[0052] A term-by-document frequency matrix is often useful in discovering co-occurrence patterns of terms, which can often correspond to underlying concepts. First-order co-occurrence patterns relate terms that frequently occur together in the same documents; second-order co-occurrence patterns relate terms that have similar first-order co-occurrence patterns, so that two terms can be related by second-order co-occurrence even if they never occur together in a document.
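The distinction between first-order and second-order co-occurrence can be seen in the Table I data, where Ball and Game never share a document yet both co-occur with Sports and Football. The sketch below is illustrative only: comparing co-occurrence profiles by cosine similarity is an assumed choice, and the function names are hypothetical.

```python
import math

# Documents (numbered 1-9) that contain each term, read off Table I.
DOCS_WITH_TERM = {
    "Football": {1, 2, 4, 7},
    "Ball": {2, 6, 7},
    "Sports": {1, 3, 4, 6, 7, 8, 9},
    "Basketball": {3, 4, 5, 8, 9},
    "Game": {3, 4, 8},
    "Skate": {5},
}
TERMS = sorted(DOCS_WITH_TERM)


def first_order(a, b):
    """Number of documents in which terms a and b occur together."""
    return len(DOCS_WITH_TERM[a] & DOCS_WITH_TERM[b])


def cooccurrence_profile(a):
    """First-order co-occurrence of a with every other term (0 for a itself)."""
    return [0 if other == a else first_order(a, other) for other in TERMS]


def second_order(a, b):
    """Cosine similarity between the first-order profiles of a and b."""
    pa, pb = cooccurrence_profile(a), cooccurrence_profile(b)
    dot = sum(x * y for x, y in zip(pa, pb))
    norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pb))
    return dot / norm if norm else 0.0


ball_game_first = first_order("Ball", "Game")
ball_game_second = second_order("Ball", "Game")
```

Here `ball_game_first` is zero while `ball_game_second` is clearly positive, which is the situation the paragraph describes.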

[0053] As an addition or alternative to a term-by-document frequency matrix, terminological patterns can be identified through application of a statistical language model that accounts for the order in which terms occur. In one nonlimiting example, a trigram model is utilized. For this trigram model approach, the probability of the next word given all previous words depends only on the previous two words (it satisfies a second order Markov condition). Correspondingly, the probability of a sentence of length ‘n’ is given by the following equation:

\[ \Pr(w_{1,n}) = \prod_{i=1}^{n} \Pr(w_i \mid w_{i-2}, w_{i-1}) \]

[0054] The bigram and trigram probabilities can be estimated using sparse data estimation techniques, such as backing off and discounting.
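A minimal sketch of the trigram sentence probability follows. It uses plain maximum-likelihood estimates with no backing off or discounting, so it only illustrates the product formula and is not a usable sparse-data estimator. The start-of-sentence padding symbol and the two training sentences are assumptions of the sketch.

```python
from collections import Counter

START = "<s>"  # assumed padding symbol so the first two words have a context


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts."""
    trigram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        words = [START, START] + sentence.split()
        for i in range(2, len(words)):
            trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1
            context_counts[(words[i - 2], words[i - 1])] += 1
    return trigram_counts, context_counts


def sentence_probability(sentence, trigram_counts, context_counts):
    """Product over i of Pr(w_i | w_{i-2}, w_{i-1}), estimated by relative frequency."""
    words = [START, START] + sentence.split()
    prob = 1.0
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        if context_counts[context] == 0:
            return 0.0  # unseen context: a smoothed model would back off here
        prob *= trigram_counts[(words[i - 2], words[i - 1], words[i])] / context_counts[context]
    return prob


tri, ctx = train_trigrams(["the pilot reported fatigue",
                           "the pilot reported turbulence"])
p_seen = sentence_probability("the pilot reported fatigue", tri, ctx)
p_unseen = sentence_probability("the crew rested", tri, ctx)
```

The zero probability assigned to the unseen sentence is exactly the weakness that backing off and discounting are meant to address.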

[0055] Another embodiment may alternatively or additionally employ co-occurrence statistics from windows of “n” words in length within documents. A further embodiment may alternatively or additionally employ natural language processing techniques to extract from each sentence the triple (S,V,O) representing the subject, verb, and object of the sentence. The (S,V,O) triple might additionally be mapped to a canonical form. The (S,V,O) triple would then replace the term in the term-by-document matrix. In still other embodiments, a different type of terminological model suitable to define a desired type of document feature space for concept realization may be utilized as would occur to one skilled in the art. For the sake of clarity and consistency, the term-by-document frequency matrix model is utilized hereinafter unless otherwise indicated. It should be understood that the term-by-document frequency matrix can be represented by one or more data structures with system 20 that characterize a multidimensional document feature space as a function of the terms selected during operation 204. Optionally, some or all of the documents can be associated with one or more predefined groups and/or some or all of the terms can be associated with one or more predefined groups.
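For the windowed co-occurrence statistics mentioned at the start of this paragraph, one simple reading is to count every pair of distinct terms whose positions fit inside a single n-word window. The sketch below follows that reading, which is an assumption; the function name and example sentence are hypothetical.

```python
from collections import Counter


def window_cooccurrence(tokens, n):
    """Count unordered pairs of distinct terms lying fewer than n positions apart,
    i.e. pairs that fit inside one n-word window. Each position pair counts once."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + n]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


tokens = "crew fatigue led to altitude deviation".split()
pairs = window_cooccurrence(tokens, n=3)
```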

[0056] Subroutine 200 proceeds from stage 212 to stage 220. Stage 220 evaluates the term-by-document feature space generated by stage 212 to identify document and term relationships of statistical significance. In one implementation of stage 220, a bump-hunting procedure is utilized to identify feature space regions (or “bumps”) of relatively high density that correspond to local maxima of the feature space. One form of this procedure is based on a generalized finite mixture clustering model. The paper, Heckman and Zamar, Comparing the Shapes of Regression Functions, University of British Columbia (Dated 2000) provides an example of bump-hunting analysis. In other embodiments, a different bump-hunting procedure and/or a different type of evaluation to identify statistically significant document and term relationships for concept recognition can be utilized.

[0057] Stage 220 outputs significant document features in relation to term set S. This relationship can be characterized as a term-by-bump matrix. For the bump-hunting implementation, features are the discovered bumps in the document feature space, and the corresponding matrix M representation is of a binary type, having entries only of either one to represent a strong association between a term (row) and a bump (column) or zero to represent the absence of a significant term/bump association. Entries of one or zero in matrix M can be determined by applying one or more statistical tests which indicate where the terms independently tend to statistically “clump together.”

[0058] One nonlimiting example includes comparing a relevant characteristic or parameter of the term t for bump b with that for the set of all other bumps by using a statistical hypothesis test. For this test, let θ_tb be the parameter of interest for term t in bump b, and let θ_tb~ be the parameter of interest for term t in the set of other bumps (where b~ corresponds to a Boolean inversion to represent “not bump b”); then the hypothesis test becomes:

H_0: θ_tb = θ_tb~

H_A: θ_tb > θ_tb~

[0059] Rejecting H_0 in favor of H_A at some level α suggests clumping of term t in bump b. The threshold α is selected to control the number of false positives. In one form, values of α=0.01 or α=0.001 were found to be desirable and the ‘parameter of interest’ was defined by reference to one of three simple models:

[0060] 1. Bernoulli: θ_tb = proportion of documents in bump b that contain term t;

[0061] 2. Poisson: θ_tb = average number of occurrences of term t in documents in bump b;

[0062] 3. Multinomial: θ_tb = average proportion of terms in documents that are t.

[0063] Hypotheses are tested using standard likelihood ratio tests. Because the likelihood ratio test statistics turn out to be the same as mutual entropy scores between t and b, this approach could also be called an entropy test.
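For the Bernoulli model, the likelihood ratio statistic can be computed from a 2x2 table of document counts (term present or absent, inside or outside bump b). The sketch below is one way to do so and makes several assumptions: the function names are hypothetical, the critical value is supplied by the caller, and the one-sided direction is handled by a simple comparison of proportions. The statistic equals 2N times the mutual information (in nats) between term presence and bump membership, which is the entropy connection noted above.

```python
import math


def g_statistic(in_bump_with, in_bump_total, out_bump_with, out_bump_total):
    """Likelihood ratio (G) statistic for independence in a 2x2 count table."""
    observed = [
        [in_bump_with, in_bump_total - in_bump_with],
        [out_bump_with, out_bump_total - out_bump_with],
    ]
    n = in_bump_total + out_bump_total
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][k] + observed[1][k] for k in range(2)]
    g = 0.0
    for r in range(2):
        for k in range(2):
            count = observed[r][k]
            if count > 0:  # empty cells contribute nothing
                expected = row_totals[r] * col_totals[k] / n
                g += 2.0 * count * math.log(count / expected)
    return g


def term_clumps_in_bump(in_bump_with, in_bump_total, out_bump_with,
                        out_bump_total, critical_value):
    """Mark the term/bump entry as 1 when the term is more common inside the
    bump and the statistic exceeds the caller's critical value."""
    higher_inside = (in_bump_with / in_bump_total
                     > out_bump_with / out_bump_total)
    g = g_statistic(in_bump_with, in_bump_total, out_bump_with, out_bump_total)
    return higher_inside and g > critical_value


# 6.635 is the upper 1% point of a chi-square distribution with one degree
# of freedom; using it here as the cutoff is an assumption of this sketch.
entry = term_clumps_in_bump(10, 10, 0, 10, critical_value=6.635)
```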

[0064] From matrix M, a corresponding document-by-bump matrix D can be constructed. The columns of matrix D are the same bumps as in matrix M, and the rows of matrix D are the training documents. As in the case of matrix M, matrix D is binary with an entry of one indicating a significant association between the document (row) and bump (column) and entries of zero indicating the absence of a significant association. For a given document, document/bump associations can be determined by considering the term/bump associations for terms included in the given document, and applying one or more statistical tests of the type used in establishing matrix M by reversing the roles of term and document. In bump-hunting, a document might be assigned to one bump or no bump. A bump is highly specific, and likely a composition of multiple concepts (e.g., a collection of reports describing altitude deviation due to fatigue). So, though a document is initially assigned to one ‘bump’ in bump-hunting, it is likely related to multiple bumps.

[0065] The bump-hunting based binary form of matrices D and M is typically sparse. As used herein, a “sparse matrix” means a matrix in which five percent or less (≤5%) of the entries are nonzero. A sparse matrix has been found, surprisingly, to improve the performance of the deconvolution procedure to be described hereinafter.
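The sparsity definition can be checked directly on a matrix stored as a list of rows. This is a small sketch with hypothetical function names.

```python
def nonzero_fraction(matrix):
    """Fraction of entries that are not zero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


def is_sparse(matrix, threshold=0.05):
    """True when five percent or less of the entries are nonzero."""
    return nonzero_fraction(matrix) <= threshold


# A 10 x 10 binary matrix with ones on half of the diagonal (5 of 100 entries).
example = [[1 if (i == j and i < 5) else 0 for j in range(10)] for i in range(10)]
```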

[0066] From stage 220, subroutine 200 continues with parallel processing branches 220a and 220b. In branch 220a, operation 230 associates terms with features. For the bump-hunting implementation, the bump features can each be characterized by a term or combination of terms that best distinguish them from one another using a multivariate discrimination algorithm. In one example based on an analysis of aviation safety reports, one bump was characterized by the terms: crew, rest, fatigue, duty time, altimeter, altitude deviation. This bump identified a series of reports in which the pilot made an altitude deviation because he or she was fatigued. Two low-level concepts can be gleaned from these reports: experiencing an altitude deviation and experiencing fatigue. These concepts can be discovered from matrix M by deconvolving the bumps into their component concepts.

[0067] Deconvolution is performed in branch 220b. Branch 220b begins with conditional 222 that tests whether concept recognition processing is to be supervised or not. If the test of conditional 222 is true, supervisory information or outside knowledge is input in stage 224. In one example, outside knowledge is input in stage 224 by providing a vocabulary taxonomy (domain inspired or generic). The taxonomy can be groups of words that ‘go together’ such as a controlled vocabulary. For instance, in aviation safety, controlled vocabularies have been constructed for maintenance-related terms, weather terms, human factor terms, etc. Additionally or alternatively, a predefined vocabulary hierarchy could be utilized.

[0068] Further forms of outside input that could be used alone or in combination with others include providing examples of documents that belong to different categories of interest (for example, maintenance related or weather related in the aviation field) and/or providing structured external knowledge, such as the knowledge that one event is always preceded by another event. In one implementation, the outside knowledge is mathematically represented as a Bayesian prior opinion. For this implementation, the strength of the prior ‘opinion’ can also be provided, which determines the relative weight given to the prior opinion compared to evidence discovered in the documents of the corpus. In other implementations, the outside knowledge is differently represented alone or in combination with the Bayesian prior opinion form. From stage 224, branch 220b proceeds to deconvolution procedure 250. Likewise, if the test of conditional 222 is negative, branch 220b bypasses the input of outside knowledge in stage 224 to continue with deconvolution procedure 250. Accordingly, procedure 250 is executed in an unsupervised mode when stage 224 is bypassed.

[0069] Referring to the flowchart of FIG. 4, further details of deconvolution procedure 250 for a bump-hunting based binary matrix M are next described. Procedure 250 begins with the analysis of matrix M to remove any duplicate rows or columns in stage 252. The identity and quantity of row and column duplication is recorded for optional use in weighting certain aspects of the results in a later stage. After stage 252, matrix M has TR number of different terms (rows) and BC number of different bumps (columns). The removal of redundant rows/columns can also be performed for matrix D, recording the removal information for optional use in weighting, etc. Procedure 250 proceeds from stage 252 to operation 260.
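The duplicate removal of stage 252 can be sketched as grouping identical rows, then identical columns, while keeping the labels of every duplicate so that their identity and quantity remain available for later weighting. The function names and the small example matrix are hypothetical.

```python
def collapse_duplicate_rows(labels, rows):
    """Group labels that share an identical row; returns (members, pattern) pairs."""
    groups = {}
    for label, row in zip(labels, rows):
        groups.setdefault(tuple(row), []).append(label)
    return [(members, list(pattern)) for pattern, members in groups.items()]


def transpose(rows):
    return [list(column) for column in zip(*rows)]


def deduplicate(term_labels, bump_labels, matrix):
    """Remove duplicate rows, then duplicate columns, recording what was merged."""
    term_groups = collapse_duplicate_rows(term_labels, matrix)
    reduced_rows = [pattern for _, pattern in term_groups]
    bump_groups = collapse_duplicate_rows(bump_labels, transpose(reduced_rows))
    reduced = transpose([pattern for _, pattern in bump_groups])
    return ([members for members, _ in term_groups],
            [members for members, _ in bump_groups],
            reduced)


term_classes, bump_classes, reduced = deduplicate(
    ["t1", "t2", "t3"],
    ["b1", "b2", "b3"],
    [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]],
)
```

In this example the reduced matrix has TR = 2 rows and BC = 2 columns, and the recorded groups show that t1 and t2 were duplicates, as were b1 and b2.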

[0070] Deconvolution is based on identifying partial orders in M. Given that T1 and T2 are two sets of terms, a partial order T1 ≤ T2 exists if whenever a term in T1 is associated with a bump, every term in T2 is associated with the bump; equality holds if and only if terms in T1 and T2 are associated with exactly the same bumps. T2 is said to subsume T1 if the partial ordering is strict, i.e., if T1 < T2.

[0071] During operation 260, equivalence and subsumptive relationships among the rows (terms) of matrix M are identified. Equivalence relationships are grouped together into term equivalence classes and treated as a common unit in subsequent analyses. Subsumption indicates relationships between different hierarchical levels. The subsumptive relationships between term (or term equivalence class) pairs are considered to determine a corresponding directed graph. In constructing the directed graph, an arrow is drawn from A to B (i.e., A→B) if and only if A>B and there exists no term or term equivalence class C such that B<C and C<A. For example, for terms A, B, C, D, and E with the subsumptive relationships A>C, A>E, and C>E, the resulting path is A→C→E.
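The arrow rule amounts to keeping only those subsumption pairs that have no intermediate term between them. A minimal sketch, assuming the input set of strict subsumption pairs is already transitively closed (as it is when derived from set inclusion); the function name is hypothetical.

```python
def cover_arrows(subsumes):
    """subsumes holds (a, b) pairs meaning a > b. An arrow a -> b is kept
    only if no c exists with a > c and c > b."""
    nodes = {term for pair in subsumes for term in pair}
    arrows = set()
    for a, b in subsumes:
        has_intermediate = any((a, c) in subsumes and (c, b) in subsumes
                               for c in nodes)
        if not has_intermediate:
            arrows.add((a, b))
    return arrows


# The example from the text: A>C, A>E, and C>E give the path A -> C -> E.
arrows = cover_arrows({("A", "C"), ("A", "E"), ("C", "E")})
```

The pair (A, E) is dropped because C lies between A and E, leaving the two arrows that form the path.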

[0072] Referring to FIG. 5, a nonlimiting example of a term-by-bump matrix M is shown as matrix 300 to aid in understanding operation 260. Six rows corresponding to terms t1-t6 are shown in FIG. 5 with four columns corresponding to bumps b1-b4. For the FIG. 5 example, the relatively infrequent entries of 1 in matrix 300 for terms t4-t6 and the association of terms t4-t6 with bumps that are also associated with other terms suggest that terms t4-t6 are subsumed by one or more of terms t1-t3. In particular, the subsumptive relationships are t1>t2, t1>t3, t1>t4, t1>t5, t1>t6, t2>t4, t2>t6, t3>t5, and t3>t6. The resulting directed paths are t1→t2→t4, t1→t2→t6, t1→t3→t5, and t1→t3→t6. These paths are presented as term tree 305 in FIG. 6.

[0073] In one nonlimiting approach to efficiently construct the directed graph, the concept hierarchy is constructed from the bottom up. First, all terms are identified from matrix M that indicate base or lowest level concepts. Terms may be associated with more than one lowest level concept. Term equivalence class Ti indicates a base level concept if there is no equivalence class Tj such that Tj<Ti. Let S1 denote the set of all such terms or term equivalence classes. It follows that each remaining term subsumes at least one term in S1. Of the remaining terms, identify those terms Tk for which there is no term or term equivalence class Tj not in S1 such that Tj<Tk. Let S2 denote the set of all such terms. Repeat the process to identify sets S3, S4, etc. until no more terms remain. This process yields a collection of disjoint sets of terms or term equivalence classes S1, S2, . . . , Sm. The directed graph is readily constructed subject to the following constraint: arrows into terms in Sn are only allowed from terms in S(n+1). Thus, for term Ti in Sn, Tj→Ti if and only if Tj>Ti and Tj is in S(n+1). From the example in FIGS. 5 and 6, three different lowest level concepts can be identified corresponding to the term groups (t1, t2, t4); (t1, t2, t3, t6); and (t1, t3, t5). These concepts are identified as c(1,1), c(1,2), c(1,3), respectively.
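The bottom-up construction can be sketched as repeatedly peeling off the minimal elements of the partial order. The bump associations below are hypothetical: they were chosen only to reproduce the subsumptive relationships listed for terms t1-t6 and are not the actual entries of matrix 300. The sketch also assumes equivalent terms have already been merged, and its function names are illustrative.

```python
# Hypothetical term-to-bump associations consistent with the stated
# relationships (t1>t2, t1>t3, t2>t4, t2>t6, t3>t5, t3>t6, ...).
BUMPS = {
    "t1": {"b1", "b2", "b3", "b4"},
    "t2": {"b1", "b2"},
    "t3": {"b2", "b3"},
    "t4": {"b1"},
    "t5": {"b3"},
    "t6": {"b2"},
}


def strictly_below(a, b, bumps):
    """a < b: every bump of a is a bump of b, and b has at least one more."""
    return bumps[a] < bumps[b]  # proper-subset test on sets


def build_levels(bumps):
    """Return [S1, S2, ...]: each level holds the terms with nothing below
    them among the terms not yet assigned to an earlier level."""
    remaining = set(bumps)
    levels = []
    while remaining:
        level = {t for t in remaining
                 if not any(strictly_below(u, t, bumps) for u in remaining)}
        levels.append(level)
        remaining -= level
    return levels


def level_arrows(levels, bumps):
    """Arrows Tj -> Ti only where Ti is in Sn, Tj is in S(n+1), and Tj > Ti."""
    arrows = set()
    for lower, upper in zip(levels, levels[1:]):
        for ti in lower:
            for tj in upper:
                if strictly_below(ti, tj, bumps):
                    arrows.add((tj, ti))
    return arrows


levels = build_levels(BUMPS)
arrows = level_arrows(levels, BUMPS)
```

With these assumed associations, the levels come out as S1 = {t4, t5, t6}, S2 = {t2, t3}, and S3 = {t1}, and the arrows reproduce the four directed paths described for the term tree.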

[0074] From operation 260, procedure 250 continues with operation 270 in which the hierarchical structure of the concepts is determined. In one approach, a concept structure can be provided by comparing the content of the term groups for these lowest concepts and utilizing the corresponding term tree structure. For the example of FIGS. 5 and 6, the occurrence of terms t1-t3 in more than one of these groups indicates correspondence to higher level concepts based on frequency. Second level concepts c(2,1) and c(2,2) correspond to terms t1 and t2, and t1 and t3, respectively, and the third (highest) level concept c(3,1) corresponds to term t1. FIG. 7 presents the resulting concept representation 310 with nodes n1-n6 corresponding to the concepts c(1,1), c(1,2), c(1,3), c(2,1), c(2,2), c(3,1), respectively. Notably, through partial order analysis, operations 260 and 270 can be performed generally at the same time. In the general case, an m-level concept structure is formed, with each node in the term tree (corresponding to a term equivalence class) corresponding to a concept. The concept is ‘indicated’ by the set of terms that are descendants of the corresponding node in the term tree, i.e., there is a path from the node to each descendant. Thus, terms that are high on the term tree tend to represent more general concepts, and they tend to indicate multiple low level concepts; conversely, terms that are low on the term tree tend to represent specific concepts, and they tend to indicate few low level concepts.

[0075] Procedure 250 proceeds from operation 270 to stage 282 to refine concept relationships. This refinement has been found to frequently reduce noise in the process. Because of potential noise in matrix D, and possible errors in constructing M, the concept structure can often contain too many highly overlapping concepts. Stage 282 includes evaluating the nodes for candidates to merge. Such merging can be determined in accordance with a sequence of statistical hypothesis tests that start at the lowest level of the representation by identifying each term with its concept connectors, and then testing whether two equivalence classes can be merged. Such refinements can be based on a measurement error model, the goal of which is to identify a smaller set of equivalence classes. For this model, let α be the error of commission (M_ij = 1 in error) in associating terms with bumps, and let β be the error of omission (M_ij = 0 in error), where M_ij is the (i,j) entry of matrix M. The parameters α and β can be specified by the user, or they can be estimated from the data by maximizing a likelihood function. Let m be a response vector (row in M) for an equivalence class. We can compute p(m) by reference to the measurement error model, for example:

$$p\big(m = (0\ 0\ 1\ 1) \mid \text{eq. class } c = (0\ 1\ 1\ 1)\big) = (1-\alpha)\,\beta\,(1-\beta)^{2}$$

[0076] The conditional probability mass function for m given c is:

$$p(m \mid c) = \prod_{j=1}^{p} \left[\beta^{1-m_j}(1-\beta)^{m_j}\right]^{c_j} \left[\alpha^{m_j}(1-\alpha)^{1-m_j}\right]^{1-c_j}$$
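The measurement error model above can be illustrated with a minimal Python sketch. The function name `response_probability` and the numeric values of α and β are assumptions chosen for illustration and do not appear in the original disclosure.

```python
def response_probability(m, c, alpha, beta):
    """p(m | c) under the measurement error model.

    m, c  : equal-length sequences of 0/1 (observed row of M, class pattern)
    alpha : probability of an error of commission (m_j = 1 although c_j = 0)
    beta  : probability of an error of omission  (m_j = 0 although c_j = 1)
    """
    if len(m) != len(c):
        raise ValueError("m and c must have the same length")
    prob = 1.0
    for m_j, c_j in zip(m, c):
        if c_j == 1:
            # Term belongs to the bump: observed with 1 - beta, missed with beta.
            prob *= (1.0 - beta) if m_j == 1 else beta
        else:
            # Term does not belong: falsely observed with alpha.
            prob *= alpha if m_j == 1 else (1.0 - alpha)
    return prob


alpha, beta = 0.05, 0.10  # illustrative values only
p = response_probability((0, 0, 1, 1), (0, 1, 1, 1), alpha, beta)
# Agrees with the closed form (1 - alpha) * beta * (1 - beta)**2 given above.
assert abs(p - (1 - alpha) * beta * (1 - beta) ** 2) < 1e-12
```

The per-position branching is simply the product form of the probability mass function written out case by case.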

[0077] Because some equivalence classes are more populated than others, classes may be merged on the basis of posterior probability, via the following relation:

$$\Pr(\text{eq. class } C \mid M) \propto \Pr(\text{eq. class } C)\; p(M \mid \text{eq. class } C)$$

[0078] with M assigned to the most probable equivalence class. Generally, the effect is to remove some nodes and their connectors from the term tree. In an alternative implementation, the likelihood function is computed for the collection of term equivalence classes:

$$L = \prod_{h=1}^{n} \sum_{c} p(c)\, p(m_h \mid c)$$

[0079] The two equivalence classes c_i and c_j whose merger yields the smallest change in the likelihood function are then merged. The process is continued until the change from the original likelihood (before any mergers) is large enough to be statistically significant. Other measurement error models can be exploited in a similar manner for different embodiments.
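The posterior assignment described in connection with the measurement error model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the candidate class patterns, and the use of the fraction of terms in each class as the prior are hypothetical and are not taken from the original disclosure.

```python
from math import prod


def likelihood(m, c, alpha, beta):
    """p(m | c) for binary sequences m and c under the error model."""
    return prod(
        ((1 - beta) if mj else beta) if cj else (alpha if mj else (1 - alpha))
        for mj, cj in zip(m, c)
    )


def most_probable_class(m, class_priors, alpha, beta):
    """Return (class pattern, posterior) maximizing Pr(c) * p(m | c).

    class_priors maps each candidate class pattern (a tuple of 0/1) to a
    prior probability; here it is assumed to be the share of terms that
    currently fall in that class.
    """
    scores = {c: prior * likelihood(m, c, alpha, beta)
              for c, prior in class_priors.items()}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total


# A sparsely populated class (prior 0.02) next to a heavily populated one.
priors = {(0, 1, 1, 1): 0.98, (0, 0, 1, 1): 0.02}
best, posterior = most_probable_class((0, 0, 1, 1), priors, alpha=0.05, beta=0.10)
```

In this example the response vector (0 0 1 1) matches the sparse class exactly, yet the posterior favors the populated class (0 1 1 1), which is the sense in which a thinly populated class can be absorbed by a neighbor.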

[0080] After connector removal, a further refinement is performed by adding weights to the remaining connectors. These weights can correspond to probabilities, i.e.,

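$$\alpha_{t,\,C_{Ai}} = \Pr(\text{term } t \text{ occurs in } n\text{-word span} \mid \text{concept } C_{Ai} \text{ is present}), \text{ and}$$

$$\alpha_{C_{Ai};\,C_{Bj}} = \Pr(\text{Level } A \text{ concept } C_{Ai} \text{ is present} \mid \text{Level } B \text{ concept } C_{Bj} \text{ is present});$$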

[0081] where A and B designate different hierarchical levels of the representation.

[0082] Generally, individual features (e.g., terms) of a concept representation generated in accordance with procedure 250 are directly associated with low level concepts through weights, and are indirectly (and nonlinearly) associated with high level concepts by association with low level concepts. The representation is typically sparse, having 95% or more of the weights set to zero. In procedure 250, the bumps are deconvolved by reference to a multi-level latent variable model, where the latent variables are identified as concepts. The latent variable model is used to construct layers of concepts, and to infer associations between higher order concepts and lower order concepts. The concept representation is constructed one layer or level at a time in a hierarchical fashion from the lowest to highest level concepts. Representation 310 determined from matrix 300 is merely an example to aid in understanding the present application. In practice, the term-by-bump matrix and corresponding representation would typically be much larger. A visualization of the concept representation may be presented in an acyclic directed graph form, a different form, or may not be visually represented at all. In one form, the concept representation and term-by-bump matrix are each represented by one or more data records/structures stored with system 20.

[0083] Returning to FIG. 3 from procedure 250 , branches 220 a and 220 b join at stage 240 in which the nodes of the concept representation are labeled. Concept labels can be acquired in the construction of the concept hierarchy as rows of terms are identified with different nodes. Typically, more general terms (e.g., medical) provide labels for higher-order concepts, and specific terms (e.g., cortical dysplasia) provide labels for lower-order concepts.

[0084] Stage 240 further includes evaluating the separability of different subsets of the concepts. For the type of concept representation visualization depicted in FIG. 6, this separability is akin to the ease with which different hierarchical portions can be cleaved apart along vertical lines to provide different facets of the representation. Referring additionally to FIG. 8, a visualization of concept representation 400 of another embodiment of the present invention is illustrated. Relative to representation 310, representation 400 includes several more nodes and is arranged to better illustrate the potential to separate the representation structure into different groups or facets. Concept representation 400 includes lowest level nodes 400a (Level 1) connected to the next lowest level of concept nodes 400b (Level 2) by connectors 402a. Level 3 nodes 400c and Level 4 node 400d are also shown, linked by connectors 402b and 402c, respectively. Only a few of these features are designated by reference numerals to enhance clarity. FIG. 8 further illustrates a division or separation of concept representation 400 into two hierarchical, multilevel subsets 404a and 404b that are respectively to the left and right of a vertical line through connector 404. Connector 404 is shown in broken line form to better illustrate that it is broken by the separation. For this depiction, only one connector is “broken” by the separation, indicating a relatively high degree of independence between subsets 404a and 404b compared to other groupings. In contrast, separation along horizontal lines (between different levels) separates concepts based on the degree of relative subordination. The identification of such multilevel hierarchical subsets of a concept representation, or “facets”, can provide an unsupervised approach to efficiently compare documents across correspondingly different ‘respects’.
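The notion of a separation that breaks few connectors can be illustrated with a short Python sketch. This is a simplification made for illustration: it only counts broken connectors, whereas the disclosure evaluates groupings by their effect on the likelihood of the representation. The node names, the connector list, and the function name `broken_connectors` are hypothetical.

```python
def broken_connectors(connectors, left_nodes):
    """Return the connectors with exactly one endpoint in left_nodes."""
    left = set(left_nodes)
    return [(a, b) for a, b in connectors if (a in left) != (b in left)]


# Hypothetical four-level representation (a = Level 1 ... d = Level 4).
connectors = [
    ("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b2"),  # Level 1 to 2
    ("b1", "c1"), ("b2", "c2"),                              # Level 2 to 3
    ("c1", "d1"), ("c2", "d1"),                              # Level 3 to 4
]

# A vertical split that keeps each branch together breaks one connector.
facet_split = {"a1", "a2", "b1", "c1", "d1"}
# A split that interleaves the branches breaks several connectors.
interleaved_split = {"a1", "a3", "b1", "c1", "d1"}

n_facet = len(broken_connectors(connectors, facet_split))
n_interleaved = len(broken_connectors(connectors, interleaved_split))
```

Under this simplified count, the first grouping would be preferred as a candidate facet because it disturbs the structure least.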

[0085] To identify such subsets in stage 240, different hierarchical groupings are evaluated to find those that minimally disrupt the ‘goodness-of-fit’ as measured by the likelihood function of the representation. This evaluation can be performed for each hierarchical level of the representation. In one form, an iterative gradient descent procedure is executed to determine the best separations for a predefined number of groupings. In other embodiments, different approaches of a supervised and/or unsupervised nature can be utilized to determine desired subgroupings.

[0086] From stage 240, subroutine 200 returns to conditional 110 of routine 100. Conditional 110 tests whether the concept representation is identifiable or not. This determination can be made empirically. For example, a model is nonidentifiable if it has multiple “best” solutions that are approximately equally likely. Applying the test of conditional 110 to the type of concept representations determined according to the present invention, such a representation could be nonidentifiable if there were one or more different representations that each explained the data approximately just as well. In such a situation, one cannot determine which representation should be applied. One specific empirical test for identifiability is based on the empirical observed information matrix:

$$I_{L \times L} = \sum_{h=1}^{N} \left( \frac{\partial L_h}{\partial \Psi} \right)_{\hat{\Psi}} \left( \frac{\partial L_h}{\partial \Psi} \right)^{\top}_{\hat{\Psi}}$$

[0087] where L_h is the contribution of the h-th observation to the log likelihood function, and Ψ is the set of all parameters not constrained to be zero. The representation is identifiable if I is full rank; otherwise, it is not.
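The rank test on the empirical observed information matrix can be sketched in plain Python as follows. The sketch assumes that the per-observation score vectors (the gradients of each L_h with respect to the free parameters, evaluated at the estimate) have already been computed. The function names, the numeric tolerance, and the example score vectors are hypothetical.

```python
def observed_information(scores):
    """Sum over observations of the outer product of each score vector."""
    k = len(scores[0])
    info = [[0.0] * k for _ in range(k)]
    for s in scores:
        for a in range(k):
            for b in range(k):
                info[a][b] += s[a] * s[b]
    return info


def matrix_rank(matrix, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no rank
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank


def is_identifiable(scores, tol=1e-9):
    """True when the empirical information matrix is full rank."""
    info = observed_information(scores)
    return matrix_rank(info, tol) == len(info)


# Third score component is always the sum of the first two, so one
# parameter direction carries no independent information.
redundant = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (2.0, 1.0, 3.0)]
independent = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
```

With floating-point scores the rank decision depends on the tolerance, so the threshold used here is an implementation choice rather than part of the test as described.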

[0088] Upon the discovery that the representation is nonidentifiable, several surprising solutions have been discovered that may be utilized separately or in combination. These solutions include selection of a procedure, such as bump-hunting, to increase sparseness of the resulting term-concept weights of the representation. Using outside knowledge sources also serves to impose constraints on the weights in a manner likely to increase identifiability. If the result is still nonidentifiable, further solutions include simplifying the model by applying one or more of the following: restricting the number of levels permitted; mapping the nonidentifiable representation to a strict hierarchical representation, where each subordinate concept (child) can only be associated with one concept (parent) of the next highest level; or mapping the nonidentifiable representation to two or more identifiable representations, such as those groupings provided in stage 240.

[0089] Accordingly, if the test of conditional 110 is not true (the representation is nonidentifiable), the concept representation is modified in stage 120 by applying one or more of these solutions, and routine 100 then proceeds to stage 130. If the test of conditional 110 is affirmative, then stage 120 is bypassed and stage 130 is reached directly. In stage 130, a document representation is created by mapping one or more documents of the collection/corpus of interest to the concept representation.

[0090] In one example, let d be a row in the document-by-bump matrix D. For a two-level concept hierarchy, the following equations apply:

$$P(\underline{d}; \theta) = \sum_{t_2=1}^{n_2} \eta_{t_2} \sum_{t_1=1}^{n_1} \alpha_{t_1,t_2} \prod_{j=1}^{J} p(d_j \mid C_{t_1})$$

[0091] where n_2 is the number of level 2 concepts, n_1 is the number of level 1 concepts, J is the number of bumps (the length of d), and

$$\eta_{t_2} = \Pr(C_{t_2})$$

$$\alpha_{t_1,t_2} = \Pr(C_{t_1} \mid C_{t_2})$$

$$p(d_j \mid C_{t_1}) = \pi_{t_1,j}^{\,d_j} \left(1 - \pi_{t_1,j}\right)^{1 - d_j}$$

[0092] with {η_t2}, {α_t1,t2}, and {π} being parameters that are estimated. However, it should be noted that π_t1,j is constrained to be zero when no terms in bump j define concept C_t1. Indeed, most of the parameters in {α} and {π} are constrained to be 0 by the concept representation.
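The two-level model above can be evaluated with a short Python sketch. The function name `document_probability`, the list-of-lists layout of the parameters, and all numeric values are assumptions made for illustration. In practice the parameters would be estimated from the data.

```python
from math import prod


def document_probability(d, eta, alpha, pi):
    """P(d; theta) for a two-level concept hierarchy.

    d     : list of 0/1 bump indicators for one document (a row of D)
    eta   : eta[t2] = Pr(C_t2) for each level 2 concept
    alpha : alpha[t1][t2] = Pr(C_t1 | C_t2)
    pi    : pi[t1][j] = Pr(d_j = 1 | C_t1)
    """
    def bump_likelihood(t1):
        # Product over bumps of pi^d_j * (1 - pi)^(1 - d_j).
        return prod(pi[t1][j] if d_j else 1.0 - pi[t1][j]
                    for j, d_j in enumerate(d))

    return sum(
        eta[t2] * sum(alpha[t1][t2] * bump_likelihood(t1)
                      for t1 in range(len(pi)))
        for t2 in range(len(eta))
    )


# Hypothetical parameters: two level 2 concepts, two level 1 concepts,
# two bumps. Zeros reflect weights constrained by the representation.
eta = [0.6, 0.4]
alpha = [[1.0, 0.0],
         [0.0, 1.0]]
pi = [[0.9, 0.0],
      [0.0, 0.8]]

p_doc = document_probability([1, 0], eta, alpha, pi)
```

Because most entries of `alpha` and `pi` are zero in a sparse representation, a practical implementation would likely skip those terms rather than multiply through them as this sketch does.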

[0093] In an alternative mapping approach, each document is associated with one of the bumps. For example, bump b might contain two concepts: fatigue and altitude deviation. Consider part of the term-by-bump matrix that follows in Table II:

TABLE II
                     bump 1   bump 2   bump 3   bump 4
Fatigue                 1        0        1        0
Altitude_deviation      1        1        0        1
Altimeter               0        1        0        1

[0094] Then documents in b are mapped to the concepts that are indicated by terms in bump 1. This provides a direct mapping of documents, without the need to create the document-by-bump matrix.
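The direct mapping from bumps to concepts can be sketched using the entries of Table II. As a simplifying assumption, each term stands in for the concept it indicates. The document identifiers and function names are hypothetical.

```python
# Term-by-bump matrix from Table II: rows are terms, columns are bumps 1-4.
TERM_BY_BUMP = {
    "Fatigue":            (1, 0, 1, 0),
    "Altitude_deviation": (1, 1, 0, 1),
    "Altimeter":          (0, 1, 0, 1),
}


def concepts_for_bump(bump, term_by_bump=TERM_BY_BUMP):
    """Return the terms flagged for a bump (bump numbers start at 1)."""
    return [term for term, row in term_by_bump.items() if row[bump - 1] == 1]


def map_documents(doc_to_bump):
    """Map each document to the concepts indicated by its bump."""
    return {doc: concepts_for_bump(bump) for doc, bump in doc_to_bump.items()}


# Hypothetical assignment of documents to bumps.
mapping = map_documents({"doc_17": 1, "doc_42": 2})
```

Here a document assigned to bump 1 is mapped to fatigue and altitude deviation without any document-by-bump matrix being built.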

[0095] New documents (i.e., documents not used in the training set) can be mapped to the concept representation in the same manner as the training set documents. Typically, the mapping is sparse—a new document is mapped to only a small fraction of all possible concept nodes, which facilitates storage and additional advanced computations with the document representation.

[0096] In the case that outside knowledge is available, such outside knowledge can be exploited in the analysis by imposing constraints, or by including the outside knowledge as covariates or Bayesian prior opinions in the analysis. To explain how supervision can influence the concept or document representation, two nonlimiting examples are described as follows. In the first example, suppose documents are preassigned to one or more of g groups. Such groups might correspond to categorical metadata describing the document. Let G be the length g indicator vector for a document indicating to which groups the document is assigned. Then G can be included in any one of several places in the hierarchical model used to map documents. Including G in the model can influence how documents are mapped to concepts; documents that belong to similar groups are more likely to be mapped to the same concepts. In the second example, suppose some terms (not necessarily all terms) are preassigned to one or more facets. Then the iterative algorithm used to identify ‘facets’ in the concept structure is subject to the constraints imposed by the preassignments.

[0097] Routine 100 continues with stage 140. In stage 140, one or more document signatures desired for corresponding applications are determined from the document representation. A document representation according to the present invention is typically directed to the recognition and organization of a wide range of salient information in a document. In contrast, a document signature represents only a portion or a condensation of a document representation that can be based on a particular application and/or user interests and interactions. Further, because documents can often be similar in different respects, no single document signature is typically ‘best’ for all applications. Several different document signatures can be utilized according to different applications and/or user inputs. Alternatively or additionally, an unsupervised approach can be utilized to provide several plausible document signatures.

[0098] A few examples of different approaches to document signature generation are as follows. In one form, a document representation is ‘flattened’ into a vector representing C number of concepts (that is, the elements of the vector are the document's weights for the topics). Because of our sparse representation, most weights are zero. In many applications, documents contain about one to ten concepts, including only concepts from the most appropriate (or representative) levels of the representation. Thus, one nonlimiting strategy is to “flatten” the document representation into concepts such that each document contains between one and ten concepts, and each concept is represented in, at most, a certain percentage of the documents (say p %). In the context of a comparative evaluation of documents based on such signatures, the probabilities of the concepts for each of two documents can be expressed as a vector of corresponding numbers to provide a measure of similarity of the two documents. Considering the criteria of whether a concept is jointly present (or not present) in both documents and whether a concept is important, four subsets can be created according to the following Table III:

TABLE III
Jointly Present Concept?   Important Concept?
No                         No
No                         Yes
Yes                        No
Yes                        Yes

[0099] A common distance measure, such as a cosine similarity calculation, can be applied to each subset, and the results merged into a linear combination. This combination can be weighted in accordance with user input, empirical information, and/or predefined parameters. This approach addresses both general and specific similarity. As to specific similarity, high weights can be given to the distance calculation involving those “important” concepts. General similarity can be treated as similarity in the absence of any identification of important concepts. Alternatively, general similarity could eventually use a stored corpus-independent sense of the importance of different concepts. This is the notion that “terrorism” is a more important concept than “football”.
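The weighted combination over the four subsets of Table III can be sketched as follows. Several assumptions are made for illustration: a concept is treated as jointly present when both documents give it a nonzero weight, and Euclidean distance is used in place of the cosine measure mentioned above, since a cosine computed on a subset where at most one document has a nonzero weight is zero or undefined. The concept names, weights, and function names are hypothetical.

```python
from math import sqrt


def subset_distances(doc_a, doc_b, important):
    """Euclidean distance between two signatures on each Table III subset.

    doc_a, doc_b : dict mapping concept name -> weight (absent means 0)
    important    : set of concept names flagged as important
    Returns a dict keyed by (jointly_present, is_important).
    """
    sums = {(jp, imp): 0.0 for jp in (False, True) for imp in (False, True)}
    for concept in set(doc_a) | set(doc_b):
        wa, wb = doc_a.get(concept, 0.0), doc_b.get(concept, 0.0)
        key = (wa > 0 and wb > 0, concept in important)
        sums[key] += (wa - wb) ** 2
    return {key: sqrt(total) for key, total in sums.items()}


def combined_distance(doc_a, doc_b, important, weights):
    """Linear combination of the four subset distances."""
    parts = subset_distances(doc_a, doc_b, important)
    return sum(weights[key] * parts[key] for key in parts)


# Hypothetical weights: important concepts count most, and concepts that
# are neither jointly present nor important count least.
weights = {(True, True): 1.0, (True, False): 0.5,
           (False, True): 1.0, (False, False): 0.1}
important = {"fatigue"}
query = {"fatigue": 1.0}
doc_1 = {"fatigue": 0.8}
doc_2 = {"fatigue": 0.8, "weather": 0.5}

dist_1 = combined_distance(query, doc_1, important, weights)
dist_2 = combined_distance(query, doc_2, important, weights)
```

With these weights the extra, unimportant concept in the second document adds only a small amount to the distance, and setting the (No, No) weight to zero would remove its contribution entirely.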

[0100] In a query application, the terms of the query are treated as one of the documents. Furthermore, a query can be thought of as identifying the important concepts, so that if the other document contains concepts that are not in the query, then the first row of Table III applies (No, No). Accordingly, the contribution for such “superset” concepts can be reduced. Assuming a nonzero weighting, the effect is that distance increases as more such concepts are added.

[0101] In another example of document signature generation, several alternatives can be generated in an unsupervised fashion based on the groupings (facets) identified during stage 240 . Separate signatures are obtained for each grouping, based on concepts identified therein. The user may then visualize or otherwise analyze the signatures separately and select one most suitable to the problem at hand. Note that a portion of the documents will not be relevant to most of the facets or groupings (for example, many aviation safety reports do not address the aviation safety dimension).

[0102] Routine 100 continues with the performance of one or more applications in stage 150 by system 20. Examples of such applications include document filtering (queries), information retrieval, clustering, relationship discovery, event processing, and document summarization, to name just a few. Such applications can be facilitated by the outputs of stages 130 and 140. The query approach described in connection with Table III is only one example of a document filtering application.

[0103] Another application is to perform document clustering. The previously described document signatures can be submitted to standard clustering algorithms to obtain different types of clustering. Indeed, many text analysis and visualization applications begin with clustering. Typically, the clustering is completely unsupervised such that the analyst has no influence on the types of clusters he or she would like to see. For example, in a collection of documents related to aviation safety, the analyst might want to direct clustering to compare and contrast maintenance problems with communication problems that precipitate an aviation incident or accident. Thus, there is a desire to provide for ways to supervise clustering. The selection among different types of document signatures upon which to base clustering is but one example that addresses this need.

[0104] Alternatively or additionally, clustering can be at least partially supervised by entering external knowledge during stage 224 of subroutine 200. Another approach includes starting with an unsupervised cluster analysis, but allowing the analyst to “correct” the cluster analysis by reallocating documents between clusters. A related, less restrictive approach has the analyst evaluate whether two documents are similar or not and provide the results of this evaluation as input. This approach does not require allocating documents to clusters or pre-defining clusters; it only requires assessing relative similarity. In one implementation, after an unrestricted cluster analysis, a panel of experts quantifies similarity with a number between 0 and 1 for a series of paired documents (1 if they definitely belong together, 0 if they definitely do not belong together). The document pairs are presented with varying degrees of similarity according to the initial cluster analysis so the experts see documents that occur in the same cluster as well as documents that do not occur in the same cluster. The results of the paired comparison experiment are used to adjust the clustering. Alternatively or additionally, document signatures generated in the manner previously described could provide input.

[0105] The similarity sought by clustering can be multidimensional—such that documents can be similar in different respects. As an example, consider the aviation safety domain, where four dimensions of aviation safety have been well documented: 1) Mechanical/maintenance, 2) Weather, 3) Communication problems, and 4) Pilot error. In comparing two aviation incident reports, an aviation safety expert might believe that the reports are similar on the maintenance dimension but different on the weather dimension. Thus, in this case a unidimensional similarity measure does not meet the analyst's information needs.

[0106] Referring to the flowchart of FIG. 9 , multiple dimension clustering procedure