[0001] The present invention relates to data processing and more particularly, but not exclusively, relates to text analysis techniques.
[0002] Recent technological advancements have led to the collection of a vast amount of electronic data. These collections are sometimes arranged into corpora each comprised of millions of text documents. Unfortunately, the ability to quickly identify patterns or relationships which exist within such collections, and/or the ability to readily perceive underlying concepts within documents of a given corpus remain highly limited. Common text analysis applications include information retrieval, document clustering, and document classification (or document filtering). Typically, such operations are preceded by feature extraction, document representation, and signature creation, in which the textual data is transformed to numeric data in a form suitable for analysis. In some text analysis systems, the feature extraction, document representation, and signature creation are the same for all applications. The Battelle SPIRE system provides an example in which each document is represented by a numeric vector called the SPIRE ‘signature’; all SPIRE applications then work directly with this signature vector.
[0003] In other text analysis systems (e.g., IBM's Intelligent Miner for Text), approaches for feature extraction, document representation or signature creation vary with the application. Desired features often differ for document clustering and document classification applications. In classification, a ‘training’ set of documents with known class labels is used to ‘learn’ rules for classifying future documents; features can be extracted that show large variation or differences between known classes. In clustering, documents are organized into groups with no prior knowledge of class labels; features can be extracted that show large variation or clumping between documents; however, because ‘true’ class labels are unknown, they cannot be exploited for feature extraction.
[0004] While generic systems facilitate the layering of multiple applications once a generic ‘signature’ is obtained, they may not perform as well in a specific application as systems that were developed specifically for that application. In contrast, the disadvantage of specialized systems is that they require separate development of feature extraction, document representation, or signature creation algorithms for each application, which can be time consuming and impractical for small research groups.
[0005] Furthermore, current schemes tend to group documents according to a unitary measure of semantic similarity; however, documents can be similar in different ‘respects’. For example, in an assessment of retrieval of aviation safety incident reports related to documents describing the Cali accident (M. W. McGreevy and I. C. Statler, NASA/TM-1998-208749), analysts judged incident reports as related or not to the Cali accident (based on NTSB investigative reports of the Cali accident) according to six different ‘respects’ exemplified by the questions asked of the analysts: (1) in some ways, the context of this incident is similar to the context of the Cali accident; (2) some of the events of this incident are similar to some of the events of the Cali accident; (3) some of the problems of this incident are similar to some of the problems of the Cali accident; (4) some of the human factors of this incident are similar to some of the human factors of the Cali accident; (5) some of the causes of this incident are similar to some of the causes of the Cali accident; and (6) in some ways, this incident is relevant to the Cali accident. Many existing systems do not account for these different dimensions of similarity.
[0006] Moreover, typical systems do not account for the confidence in observed relationships, the potential for multiple levels of meaning, and/or the context of observed relationships. Thus, there is an ongoing need for further contributions in this area of technology.
[0007] One embodiment of the present invention is a unique data processing technique. Other embodiments include unique apparatus, systems, and methods for analyzing collections of text documents or records.
[0008] A further embodiment of the present invention is a method that includes selecting a set of text documents; selecting a number of terms included in the set; establishing a multidimensional document space with a computer system as a function of these terms; performing a bump-hunting procedure with the computer system to identify a number of document space features that each correspond to a composition of two or more concepts of the documents; and deconvolving these features with the computer system to separately identify the concepts.
[0009] Still a further embodiment of the present invention is a method that includes extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of the features; and identifying a number of different related groups of the concepts. The representation may correspond to an arrangement of several levels to indicate different degrees of concept specificity.
[0010] Yet another embodiment of the present invention includes a method comprising: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the text documents as a function of these features; determining the representation is non-identifiable; and in response, constraining one or more processing parameters of the routine to provide a modified concept representation. In one form, the representation hierarchically indicates different degrees of specificity among related members of the concepts and corresponds to an acyclic graph organization.
[0011] Still a further embodiment relates to a method which includes: extracting terminological features from a set of text documents; establishing a representation of a number of concepts of the documents as a function of these features; evaluating a selected document relative to the representation; and generating a number of document signatures for the selected document with the representation.
[0012] In another embodiment of the present invention, a method comprises: selecting a set of text documents; representing the documents with a number of terms; identifying a number of multiterm features of the text documents as a function of frequency of each of the terms in each of the documents; relating the multiterm features and terms with one or more data structures corresponding to a sparse matrix; and performing a latent variable analysis to determine a number of concepts of the text documents from the one or more data structures. This method may further include providing a concept representation corresponding to a multilevel acyclic graph organization in which each node of the graph corresponds to one of the concepts.
[0013] Yet another embodiment of the present invention includes a method for performing a routine with a computer system that includes: determining a number of multiterm features of a set of text documents as a function of a number of terms included in those documents; identifying one of a number of first level concepts of the text documents based on one or more terms associated with one of the features; establishing one of several second level concepts of the documents by identifying one of the terms found in each member of a subset of the one of the first level concepts; and providing a concept representation of the documents based on the first level and second level concepts.
[0014] A further embodiment involves a method that comprises: identifying a number of events; providing a visualization of the events with a computer system; and dimensioning each of a number of visualization objects relative to a first axis and a second axis. The visualization objects each represent a different one of the events and are positioned along the first axis to indicate timing of each of the events relative to one another with a corresponding initiation time and a corresponding termination time of each of the events being represented by an initiation point and a termination point of each of the objects along the first axis. The extent of each object along the second axis is indicative of relative strength of the event represented thereby.
[0015] In another embodiment of the present invention, a method includes: providing a set of text documents; evaluating time variation of a number of terms included in these documents; generating a number of clusters corresponding to the documents with a computer system as a function of these terms; and identifying a number of events as a function of a time variation of the clusters.
[0016] For a further embodiment of the present invention, a method includes: providing a number of textual documents arranged relative to a period of time; identifying a feature with a time varying distribution among the documents; evaluating presence of this feature for each of several different segments of the time period; and detecting an event as a function of the one of the segments with a frequency of the feature greater than other of the segments and a quantity of the documents corresponding to the feature.
[0017] Still another embodiment includes a method, comprising: selecting a set of text documents; designating several different dimensions of the documents; characterizing each of the dimensions with a corresponding set of words; performing a cluster analysis of the documents based on the set of words for each of the dimensions; and visualizing the clustering analysis for each of the dimensions.
[0018] Yet another embodiment is directed to a method which includes: providing a list of words with a computer system as a function of a number of context vectors for a set of text documents and one or more words; receiving input responsive to this list; reweighting a number of different entries corresponding to the context vectors with the computer system based on this input; providing an output of related words with a computer system based on the reweighting; and repeating receipt of the input, reweighting, and provision of the output with a computer system as desired.
[0019] In other embodiments, a unique system is provided to perform one or more of the above-indicated methods and/or at least one device is provided carrying logic executable by a computer system to perform one or more of the above-indicated methods.
[0020] Accordingly, one object of the present invention is to provide a unique data processing technique.
[0021] Another object is to provide a unique apparatus, system, device, or method for analyzing textual data.
[0022] Further objects, embodiments, forms, features, aspects, benefits, and advantages of the present invention will become apparent from the drawings and detailed description contained herein.
[0036] For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates.
[0037] In accordance with one embodiment of the present invention, text analysis is performed to create a hierarchical, multifaceted document representation that enables multiple distinct views of a corpus based on the discovery that it can be desirable to consider similarity of documents in different ‘respects’. The hierarchical feature provides the potential for multiple levels of meaning to be represented; where the desired ‘level of meaning’ to use in a given application often depends on the user and the level of confidence for the different representation levels. For example, in one document there might be a relatively high degree of confidence that the topic “Sports” is discussed, but confidence might be low regarding the type of sport; in another document confidence might be high that the sport of tennis is discussed. In one form, this concept representation is created automatically, using machine learning techniques. It can be created in the absence of any ‘outside’ knowledge, using statistically derived techniques. Alternatively or additionally, outside knowledge sources can be used, such as predefined document categorizations and term taxonomies, just to name a few.
[0038] The construction of a concept representation is typically based on identifying ‘concepts’ in documents. Frequently, documents do not contain express concepts—instead they contain words from which concepts can often be inferred. By way of nonlimiting example, terms and their juxtapositions within documents can serve as indicators of latent concepts. Accordingly, latent concepts can often be estimated using a statistical latent variable model. In one approach, a latent variable analysis is applied to determine the concepts by deconvolving a document feature space created with a bump-hunting procedure based on a set of terms extracted from the document set. The resulting concept representation can be organized with different concept levels and/or facets. In one form, the concept representation is provided as one or more data structures corresponding to an acyclic directed graph and can be visualized as such.
[0039] A document representation is provided by mapping documents of a given corpus to the above-indicated concept representation. Alternatively or additionally, an initial concept representation can be restructured by equivalence mapping before a document representation is provided. From the document representation, different document signatures can be generated specific to various text analysis applications, such as: (a) information retrieval: retrieve ‘relevant’ documents in response to a query, such as a Boolean query or a ‘query by example’; (b) document clustering: organize documents into groups according to semantic similarity; (c) document categorization, routing, and filtering: classify documents into predefined groups; (d) summarization: provide synopses of individual documents or groups of documents; (e) information extraction: extract pre-defined information pieces from text, such as company names, or sentences describing terrorist activity; (f) entity linkage: find relationships between entities, such as recognizing that “Joe Brown is President of The Alfalfa Company” or identifying linkages between airlines in the context of a merger, to name just a few examples; (g) event detection: automatically detect and summarize significant events (usually in real time), and deliver summary and supporting evidence to interested parties; (h) corpus navigation: browse a corpus; (i) topic discovery and organization: organize topics or concepts within a corpus; and/or (j) question answering: provide answers to questions. Question answering can go beyond retrieving documents that are ‘relevant’ to a question. In some applications, the answer can be directly extracted from a relevant document. In others, it is acknowledged that the answer to a question might not be contained in a single document; instead, different parts of the answer might occur in different documents, which could be identified and combined by the application.
[0040] Accordingly, these and other embodiments of the present invention provide a combination of generic and application-specific components that are better-suited to current text mining objectives.
[0041] System
[0042] System
[0043] System
[0044] Operating logic for processor
[0045] System
[0046] Referring to
[0047] Referring to
[0048] In one form, it is desirable that the set of documents selected for training be representative of documents expected to be used when applying the concept representation to various applications. Alternatively or additionally, it may be desirable to select a training set of documents that is relatively large to make it more likely to ‘discover’ infrequent or ‘rare’ concepts. In one instance of this approach, concept representation construction is based on a training set of at least 100,000 text documents, although in other instances more or fewer training documents could be used.
[0049] Preprocessing stage
[0050] From term standardization operation
TABLE I
Term        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Doc 9
Football      3      1      0      2      0      0      1      0      0
Ball          0      5      0      0      0      3      3      0      0
Sports        2      0      3      3      0      2      5      3      2
Basketball    0      0      4      1      3      0      0      1      2
Game          0      0      1      1      0      0      0      2      0
Skate         0      0      0      0      1      0      0      0      0
[0051] It should be understood that in other embodiments, a term-by-document frequency matrix can include fewer or, more typically, many more documents and/or terms. Alternatively or additionally, the frequency can be weighted based on one or more criteria, such as an information-theoretic measure of content or information contained in a given term and/or document. In one such form, term frequencies are weighted by a measure of their content relative to their prevalence in the document collection. To standardize for documents of varying sizes, the columns of a weighted term-by-document frequency matrix might also be normalized prior to analysis.
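By way of nonlimiting illustration, the weighting and column normalization just described might be sketched as follows using the counts of Table I. The inverse-document-frequency weight, the Euclidean column normalization, and the names COUNTS, idf_weight, normalize_columns, and WEIGHTED are assumptions made for this sketch; the description above does not prescribe a particular weighting measure.

```python
import math

# Term-by-document counts from Table I (rows: terms, columns: Doc 1..Doc 9).
COUNTS = {
    "Football":   [3, 1, 0, 2, 0, 0, 1, 0, 0],
    "Ball":       [0, 5, 0, 0, 0, 3, 3, 0, 0],
    "Sports":     [2, 0, 3, 3, 0, 2, 5, 3, 2],
    "Basketball": [0, 0, 4, 1, 3, 0, 0, 1, 2],
    "Game":       [0, 0, 1, 1, 0, 0, 0, 2, 0],
    "Skate":      [0, 0, 0, 0, 1, 0, 0, 0, 0],
}


def idf_weight(counts):
    """Weight each count by log(N / df).

    This is an assumed stand-in for the 'content relative to prevalence'
    measure, which the text leaves open.
    """
    n_docs = len(next(iter(counts.values())))
    weighted = {}
    for term, row in counts.items():
        df = sum(1 for c in row if c > 0)  # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted[term] = [c * idf for c in row]
    return weighted


def normalize_columns(matrix):
    """Scale each document column to unit Euclidean length."""
    terms = list(matrix)
    n_docs = len(matrix[terms[0]])
    norms = [math.sqrt(sum(matrix[t][j] ** 2 for t in terms))
             for j in range(n_docs)]
    return {t: [matrix[t][j] / norms[j] if norms[j] else 0.0
                for j in range(n_docs)]
            for t in terms}


WEIGHTED = normalize_columns(idf_weight(COUNTS))
```

Under this assumed weighting, a rare term such as Skate receives more weight within Doc 5 than the more prevalent term Basketball, even though Basketball has the higher raw count in that document.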
[0052] A term-by-document frequency matrix is often useful in discovering co-occurrence patterns of terms, which can often correspond to underlying concepts. First-order co-occurrence patterns relate terms that frequently occur together in the same documents; second-order co-occurrence patterns relate terms that have similar first-order co-occurrence patterns, so that two terms can be related by second-order co-occurrence even if they never occur together in a document.
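The distinction between first-order and second-order co-occurrence can be illustrated with the terms of Table I. In the sketch below, the names DOCS, first_order, neighbors, and second_order are hypothetical, and the use of a Jaccard overlap of neighbor sets as the second-order measure is an assumption chosen for simplicity, not a measure named in this description.

```python
# Documents (numbered from 1) in which each term of Table I occurs at least once.
DOCS = {
    "Football":   {1, 2, 4, 7},
    "Ball":       {2, 6, 7},
    "Sports":     {1, 3, 4, 6, 7, 8, 9},
    "Basketball": {3, 4, 5, 8, 9},
    "Game":       {3, 4, 8},
    "Skate":      {5},
}


def first_order(a, b):
    """Number of documents in which both terms occur."""
    return len(DOCS[a] & DOCS[b])


def neighbors(term):
    """Terms that co-occur with `term` in at least one document."""
    return {t for t in DOCS if t != term and first_order(term, t) > 0}


def second_order(a, b):
    """Jaccard overlap of the two terms' first-order neighbor sets.

    The two terms themselves are excluded from the neighbor sets; this
    particular similarity measure is an assumption of the sketch.
    """
    na, nb = neighbors(a) - {b}, neighbors(b) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```

Here Skate and Game never occur in the same document, so their first-order co-occurrence is zero, yet both co-occur with Basketball, so they remain related at the second order.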
[0053] As an addition or alternative to a term-by-document frequency matrix, terminological patterns can be identified through application of a statistical language model that accounts for the order in which terms occur. In one nonlimiting example, a trigram model is utilized. For this trigram model approach, the probability of the next word given all previous words depends only on the previous two words (it satisfies a second order Markov condition). Correspondingly, the probability of a sentence of length ‘n’ is given by the following equation:
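For reference, a model satisfying the second-order Markov condition factorizes the probability of a sentence in the following standard way, where w_1, ..., w_n denote the words of the sentence (notation assumed here) and the first two factors handle the words that lack two predecessors:

```latex
P(w_1, \ldots, w_n) \;=\; P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```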
[0054] The bigram and trigram probabilities can be estimated using sparse data estimation techniques, such as backing off and discounting.
[0055] Another embodiment may alternatively or additionally employ co-occurrence statistics from windows of “n” words in length within documents. A further embodiment may alternately or additionally employ natural language processing techniques to extract from each sentence the triple (S,V,O) representing the subject, verb, and object of the sentence. The (S,V,O) triple might additionally be mapped to a canonical form. The (S,V,O) triple would then replace the term in the term-by-document matrix. In still other embodiments, a different type of terminological model suitable to define a desired type of document feature space for concept realization may be utilized as would occur to one skilled in the art. For the sake of clarity and consistency, the term-by-document frequency matrix model is utilized hereinafter unless otherwise indicated. It should be understood that the term-by-document frequency matrix can be represented by one or more data structures with system
[0056] Subroutine
[0057] Stage
[0058] One nonlimiting example includes comparing a relevant characteristic or parameter of the term t for bump b with the set of all other bumps by using a statistical hypothesis test. For this test, let θ
[0059] Rejecting H
[0060] 1. Bernoulli: θ
[0061] 2. Poisson: θ
[0062] 3. Multinomial: θ
[0063] Hypotheses are tested using standard likelihood ratio tests. Because the likelihood ratio test statistics turn out to be the same as mutual entropy scores between t and b, this approach could also be called an entropy test.
[0064] From matrix M, a corresponding document-by-bump matrix D can be constructed. The columns of matrix D are the same bumps as in matrix M, and the rows of matrix D are the training documents. As in the case of matrix M, matrix D is binary with an entry of one indicating a significant association between the document (row) and bump (column) and entries of zero indicating the absence of a significant association. For a given document, document/bump associations can be determined by considering the term/bump associations for terms included in the given document, and applying one or more statistical tests of the type used in establishing matrix M by reversing the roles of term and document. In bump-hunting, a document might be assigned to one bump or no bump. A bump is highly specific, and likely a composition of multiple concepts (e.g., a collection of reports describing altitude deviation due to fatigue). So, though a document is initially assigned to one ‘bump’ in bump-hunting, it is likely related to multiple bumps.
[0065] The bump-hunting based binary form of matrices D and M is typically sparse. As used herein, a “sparse matrix” means a matrix in which five percent or less (≤5%) of the entries are nonzero. A sparse matrix has surprisingly been found to improve the performance of the deconvolution procedure to be described hereinafter.
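The five-percent criterion can be checked directly. The following is a minimal sketch in which the function names and the 10-by-10 example matrix are hypothetical; only the threshold comes from the definition above.

```python
def nonzero_fraction(matrix):
    """Fraction of entries in a list-of-rows matrix that are nonzero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


def is_sparse(matrix, threshold=0.05):
    """True when at most `threshold` of the entries are nonzero."""
    return nonzero_fraction(matrix) <= threshold


# Hypothetical 10 x 10 binary matrix with four significant associations.
EXAMPLE = [[0] * 10 for _ in range(10)]
for r, c in [(0, 1), (2, 5), (5, 5), (9, 0)]:
    EXAMPLE[r][c] = 1
```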
[0066] From stage
[0067] Deconvolution is performed in branch
[0068] Further forms of outside input that could be used alone or in combination with others include providing examples of documents that belong to different categories of interest, for example, maintenance related, weather related, etc. in the aviation field and/or providing structured external knowledge, such as one event is always preceded by another event. In one implementation, the outside knowledge is mathematically represented as a Bayesian prior opinion. For this implementation, the strength of the prior ‘opinion’ can also be provided, which determines the relative weight given to the prior opinion compared to evidence discovered in the documents of the corpus. In other implementations, the outside knowledge is differently represented alone or in combination with the Bayesian prior opinion form. From stage
[0069] Referring to the flowchart of
[0070] Deconvolution is based on identifying partial orders in M. Given T1 and T2 are two sets of terms, then a partial order T
[0071] During operation
[0072] Referring to
[0073] In one nonlimiting approach to efficiently construct the directed graph, the concept hierarchy is constructed from the bottom up. First, all terms are identified from matrix M that indicate base or lowest level concepts. Terms may be associated with more than one lowest level concept. Term equivalence class Ti indicates a base level concept if there is no equivalence class Tj such that Tj<Ti. Let S1 denote the set of all such terms or term equivalence classes. It follows that each remaining term subsumes at least one term in S1. Of the remaining terms, identify those terms Tk for which there is no term or term equivalence class Tj not in S1 such that Tj<Tk. Let S2 denote the set of all such terms. Repeat the process to identify sets S3, S4, etc. until no more terms remain. This process yields a collection of disjoint sets of terms or term equivalence classes S1, S2, . . . , Sm. The directed graph is readily constructed subject to the following constraint: arrows into terms in Sn are only allowed from terms in S(n+1). Thus, for term Ti in Sn, Tj→Ti if and only if Tj>Ti and Tj is in S(n+1). From the example in
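The bottom-up layering described above might be sketched as follows. Because the definition of the partial order is not reproduced in this passage, the sketch assumes that Tj<Ti holds when the bumps associated with Tj form a proper subset of the bumps associated with Ti, and that terms with identical bump sets form one equivalence class. The function name build_hierarchy and the example data (borrowed from the term-by-bump associations of Table II) are likewise choices made for illustration.

```python
def build_hierarchy(term_bumps):
    """Build layers S1, S2, ... and edges from a term -> set-of-bumps mapping.

    Assumption (not stated in this passage): Tj < Ti when the bumps
    associated with Tj are a proper subset of those associated with Ti.
    """
    # Group terms with identical bump sets into equivalence classes.
    classes = {}
    for term, bumps in term_bumps.items():
        classes.setdefault(frozenset(bumps), []).append(term)

    remaining = set(classes)
    layers = []
    while remaining:
        # A class joins the current layer if no remaining class lies
        # strictly below it (frozenset '<' tests for a proper subset).
        layer = {c for c in remaining if not any(o < c for o in remaining)}
        layers.append(layer)
        remaining -= layer

    def name(cls):
        return tuple(sorted(classes[cls]))

    # Arrows into S_n are only allowed from S_(n+1): Tj -> Ti iff Tj > Ti.
    edges = []
    for lower, upper in zip(layers, layers[1:]):
        for tj in upper:
            for ti in lower:
                if tj > ti:
                    edges.append((name(tj), name(ti)))

    named_layers = [sorted(name(c) for c in layer) for layer in layers]
    return named_layers, sorted(edges)


LAYERS, EDGES = build_hierarchy({
    "Fatigue": {1, 3},
    "Altitude_deviation": {1, 2, 4},
    "Altimeter": {2, 4},
})
```

With these inputs, Fatigue and Altimeter fall into S1, Altitude_deviation falls into S2, and the only arrow runs from Altitude_deviation to Altimeter.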
[0074] From operation
[0075] Procedure
[0076] The conditional probability mass function for m given c is:
[0077] Because some equivalence classes are more populated than others, classes may be merged in the posterior probability via the following equation:
[0078] and assign M to most probable equivalence class. Generally, the effect is to remove some nodes and their connectors from the term tree. In an alternative implementation, the likelihood function is computed for the collection of term equivalence classes:
[0079] Then the two equivalence classes c_i and c_j whose merger yields the smallest change in the likelihood function are merged. The process is continued until the change from the original likelihood (before any mergers) is large enough to be statistically significant. Other measurement error models can be exploited in a similar manner for different embodiments.
[0080] After connector removal, a further refinement is performed by adding weights to the remaining connectors. These weights can correspond to probabilities, i.e.,
[0081] where A and B designate different hierarchical levels of the representation.
[0082] Generally, individual features (e.g., terms) of a concept representation generated in accordance with procedure
[0083] Returning to
[0084] Stage
[0085] To identify such subsets in stage
[0086] From stage
[0087] where L
[0088] Upon the discovery that the representation is nonidentifiable, several surprising solutions have been discovered that may be utilized separately or in combination. These solutions include selection of a procedure, such as bump-hunting, to increase sparseness of the resulting term-concept weights of the representation. Using outside knowledge sources also serves to impose constraints on the weights in a manner likely to increase identifiability. If the result is still nonidentifiable, further solutions include simplifying the model by applying one or more of the following: restricting the number of levels permitted; mapping the nonidentifiable representation to a strict hierarchical representation, where each subordinate concept (child) can only be associated with one concept (parent) of the next highest level; or mapping the nonidentifiable representation to two or more identifiable representations, such as those groupings provided in stage
[0089] Accordingly, if the test of conditional
[0090] In one example, let d be a row in the document-by-bump matrix D. For a two-level concept hierarchy, the following equations apply:
[0091] where n
[0092] with {η
[0093] In an alternative mapping approach, each document is associated with one of the bumps. For example, bump b might contain two concepts: fatigue and altitude deviation. Consider the part of the term-by-bump matrix that follows in Table II:
TABLE II
Term                 bump 1  bump 2  bump 3  bump 4
Fatigue                 1       0       1       0
Altitude_deviation      1       1       0       1
Altimeter               0       1       0       1
[0094] Then, with b taken to be bump 1, documents in b are mapped to the concepts indicated by the terms associated with bump 1 (here, fatigue and altitude deviation). This provides a direct mapping of documents without the need to create a document-by-bump matrix.
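A minimal sketch of this direct mapping, using the associations of Table II, is given below. The document identifiers and their bump assignments are hypothetical, as are the names TERM_BY_BUMP, concepts_for_bump, and map_documents; terms stand in for the concepts they indicate.

```python
# Term-by-bump associations from Table II (1 = significant association).
TERM_BY_BUMP = {
    "Fatigue":            [1, 0, 1, 0],
    "Altitude_deviation": [1, 1, 0, 1],
    "Altimeter":          [0, 1, 0, 1],
}


def concepts_for_bump(bump):
    """Terms (standing in for concepts) associated with a 1-based bump number."""
    return [term for term, row in TERM_BY_BUMP.items() if row[bump - 1] == 1]


def map_documents(doc_to_bump):
    """Map each document to the concepts of its single assigned bump.

    Documents assigned to no bump (None) map to no concepts.
    """
    return {doc: concepts_for_bump(b) if b is not None else []
            for doc, b in doc_to_bump.items()}


# Hypothetical bump assignments for three documents.
ASSIGNMENTS = {"report_a": 1, "report_b": 2, "report_c": None}
MAPPED = map_documents(ASSIGNMENTS)
```

Only the bump assignment of each document and the term-by-bump associations are consulted, which is why no document-by-bump matrix is needed in this approach.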
[0095] New documents (i.e., documents not used in the training set) can be mapped to the concept representation in the same manner as the training set documents. Typically, the mapping is sparse—a new document is mapped to only a small fraction of all possible concept nodes, which facilitates storage and additional advanced computations with the document representation.
[0096] In the case that outside knowledge is available, such outside knowledge can be exploited in the analysis by imposing constraints, or by including the outside knowledge as covariates or Bayesian prior opinions in the analysis. To explain how supervision can influence the concept or document representation, two nonlimiting examples are described as follows. In the first example, suppose documents are preassigned to one or more of g groups. Such groups might correspond to categorical metadata describing the document. Let G be the length g indicator vector for a document indicating to which groups the document is assigned. Then G can be included in any one of several places in the hierarchical model used to map documents. Including G in the model can influence how documents are mapped to concepts; documents that belong to similar groups are more likely to be mapped to the same concepts. In the second example, suppose some terms (not necessarily all terms) are preassigned to one or more facets. Then the iterative algorithm used to identify ‘facets’ in the concept structure is subject to the constraints imposed by the preassignments.
[0097] Routine
[0098] A few examples of different approaches to document signature generation are as follows. In one form, a document representation has been ‘flattened’ into a vector representing C number of concepts (or, the elements of the vector are the document's weights for the topics). Because of our sparse representation, most weights are zero. In many applications, documents contain about one to ten concepts, including only concepts from the most appropriate (or representative) levels of the representation. Thus, one nonlimiting strategy is to “flatten” the document representation into concepts such that each document contains between one and ten concepts, and each concept is represented in, at most, a certain percentage of the documents (say p %). In the context of a comparative evaluation of documents based on such signatures, the probabilities of the concepts for each of two documents can be expressed as a vector of corresponding numbers to provide a measure of similarity of the two documents. Considering the criteria of whether a concept is jointly present (or not present) in both documents and whether a concept is important, four subsets can be created according to the following Table III:
TABLE III
Jointly Present Concept    Important Concept?
No                         No
No                         Yes
Yes                        No
Yes                        Yes
[0099] A common distance measure, such as a cosine similarity calculation, can be applied to each subset, and the results merged into a linear combination. This combination can be weighted in accordance with user input, empirical information, and/or predefined parameters. This approach addresses both general and specific similarity. As to specific similarity, high weights can be given to the distance calculation involving those “important” concepts. General similarity can be treated as similarity in the absence of any identification of important concepts. Alternatively, general similarity could eventually use a stored corpus-independent sense of the importance of different concepts. This is the notion that “terrorism” is a more important concept than “football”.
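One nonlimiting way to sketch this combination is shown below. The sketch uses cosine distance (one minus cosine similarity) within each of the four subsets of Table III; the treatment of all-zero vectors, the rule that a concept is 'present' when its weight is positive, and the particular coefficients in WEIGHTS are assumptions of the sketch rather than values taken from this description.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity.

    Returns 0.0 if both vectors are all zero and 1.0 if exactly one is
    (a convention assumed here for the degenerate cases).
    """
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 and nv == 0.0:
        return 0.0
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(u, v)) / (nu * nv)


def combined_distance(doc_a, doc_b, important, weights):
    """Linear combination of per-subset cosine distances.

    doc_a, doc_b: concept -> weight; important: set of concepts;
    weights: (jointly_present, important) -> coefficient, with all four
    keys of Table III supplied.
    """
    concepts = sorted(set(doc_a) | set(doc_b))
    subsets = {key: [] for key in weights}
    for c in concepts:
        joint = doc_a.get(c, 0.0) > 0 and doc_b.get(c, 0.0) > 0
        subsets[(joint, c in important)].append(c)
    total = 0.0
    for key, members in subsets.items():
        u = [doc_a.get(c, 0.0) for c in members]
        v = [doc_b.get(c, 0.0) for c in members]
        total += weights[key] * cosine_distance(u, v)
    return total


# Hypothetical coefficients: jointly present, important concepts dominate.
WEIGHTS = {(True, True): 0.6, (True, False): 0.2,
           (False, True): 0.15, (False, False): 0.05}
DOC_A = {"fatigue": 0.7, "altitude_deviation": 0.3}
DOC_C = {"fatigue": 0.7, "altitude_deviation": 0.3, "weather": 0.5}
```

In this example the two documents agree on every jointly present concept, so the only contribution to the combined distance comes from the unimportant concept that appears in DOC_C alone, scaled by the small coefficient assigned to that subset.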
[0100] In a query application, the terms of the query are treated as one of the documents. Furthermore, a query can be thought of as identifying the important concepts, so that if the other document contains concepts that are not in the query, then the first row of Table III applies (No, No). Accordingly, the contribution of such “superset” concepts can be reduced. Assuming a nonzero weighting, the effect is that distance increases as more such concepts are added.
[0101] In another example of document signature generation, several alternatives can be generated in an unsupervised fashion based on the groupings (facets) identified during stage
[0102] Routine
[0103] Another application is to perform document clustering. The previously described document signatures can be submitted to standard clustering algorithms to obtain different types of clustering. Indeed, many text analysis and visualization applications begin with clustering. Typically, the clustering is completely unsupervised such that the analyst has no influence on the types of clusters he or she would like to see. For example, in a collection of documents related to aviation safety, the analyst might want to direct clustering to compare and contrast maintenance problems with communication problems that precipitate an aviation incident or accident. Thus, there is a desire to provide for ways to supervise clustering. The selection among different types of document signatures upon which to base clustering is but one example that addresses this need.
[0104] Alternatively or additionally, clustering can be at least partially supervised by entering external knowledge during stage
[0105] The similarity sought by clustering can be multidimensional—such that documents can be similar in different respects. As an example, consider the aviation safety domain, where four dimensions of aviation safety have been well documented: 1) Mechanical/maintenance, 2) Weather, 3) Communication problems, and 4) Pilot error. In comparing two aviation incident reports, an aviation safety expert might believe that the reports are similar on the maintenance dimension but different on the weather dimension. Thus, in this case a unidimensional similarity measure does not meet the analyst's information needs.
[0106] Referring to the flowchart of