This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2007-007947, filed on Jan. 17, 2007; the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to an indexing apparatus, an indexing method, and a computer program product that allocates an index to a speech signal.
2. Description of the Related Art
Speaker indexing (hereinafter, “indexing”) has been used to assist viewing of and listening to multiple speakers at conferences, TV or radio programs, panel discussions, etc. Indexing is a technology that allocates indexes to relevant portions of a speech signal representative of an utterance of a speaker. The index includes speech information, such as who made the utterance, when it was made, and how long it lasted. Such indexing is helpful in various ways. For example, it facilitates searching for an utterance of a particular speaker, and detecting a time period during which the particular speaker was actively participating in the discussion.
When performing the indexing, a speech signal is subdivided into numerous smaller strings, strings having the same or similar characteristic feature are grouped into a longer segment, and a segment is considered as an utterance of one speaker. JP-A 2006-84875 (KOKAI), for example, discloses a technique for calculating the characteristic feature. Concretely, JP-A 2006-84875 (KOKAI) teaches creating an acoustic model representative of speech features from each of the segments that are created by subdividing a speech signal. Subsequently, for each acoustic model, a likelihood is acquired for detecting a similarity of each subdivided speech signal. Then, a vector including the likelihood as a component is used as an index that indicates a speech feature of the speech signal. Accordingly, utterances of the same speaker have a high likelihood with respect to a specific acoustic model, so that similar vectors are obtained from such utterances. In other words, if the vectors are similar, it means that those vectors have originated from the same speaker.
However, the technology described in JP-A 2006-84875 (KOKAI) has a problem: when the speech signals used to create acoustic models include utterances of multiple speakers, utterances of different speakers sometimes erroneously indicate a high likelihood with respect to a common acoustic model. In this case, the resulting feature (vector) is not suitable for distinguishing utterances of different speakers, with the result that indexing accuracy is degraded.
According to an aspect of the present invention, there is provided an indexing apparatus including an extracting unit that extracts in a certain time interval, from among speech signals including utterances of a plurality of speakers, speech features indicating features of the speakers; a first dividing unit that divides the speech features into a plurality of first segments each having a certain time length; a first-acoustic-model creating unit that creates a first acoustic model for each of the first segments based on the speech features included in the first segments; a similarity calculating unit that sequentially groups a certain number of successive first segments into a region, and that calculates a similarity between regions based on first acoustic models of the first segments included in those regions; a region extracting unit that extracts a region having a similarity that is equal to or greater than a predetermined value as a learning region; a second-acoustic-model creating unit that creates, for the learning region, a second acoustic model based on speech features included in the learning region; a second dividing unit that divides the speech features into second segments each having a predetermined time length; a feature-vector acquiring unit that acquires feature vectors specific to the respective second segments, using the second acoustic model of the learning region and speech features of the second segments; a clustering unit that groups speech features of the second segments corresponding to the feature vectors, based on vector components of the feature vectors; and an indexing unit that allocates, based on a result of grouping performed by the clustering unit, relevant portions of the speech signals with speaker information including information for grouping the speakers.
According to another aspect of the present invention, there is provided a method of indexing including extracting in a certain time interval, from among speech signals including utterances of a plurality of speakers, speech features indicating features of the speakers; dividing the speech features into a plurality of first segments each having a certain time length; creating a first acoustic model for each of the first segments based on the speech features included in the first segments; sequentially grouping a certain number of successive first segments into a region; calculating a similarity between regions based on first acoustic models of the first segments included in the region; extracting a region having a similarity that is equal to or greater than a predetermined value as a learning region; creating, for the learning region, a second acoustic model based on speech features included in the learning region; dividing the speech features into second segments each having a predetermined time length; acquiring feature vectors specific to the respective second segments, using the second acoustic model of the learning region and speech features of the second segments; clustering speech features of the second segments corresponding to the feature vectors, based on vector components of the feature vectors; and allocating, based on a result of grouping performed at the clustering, relevant portions of the speech signals with speaker information including information for grouping the speakers.
A computer program product according to still another aspect of the present invention causes a computer to perform the method according to the present invention.
FIG. 1 is a schematic diagram of a hardware structure of an indexing apparatus according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a functional configuration of the indexing apparatus shown in FIG. 1;
FIG. 3 is a schematic diagram of a functional configuration of a learning-region extracting unit shown in FIG. 2;
FIG. 4 is a schematic diagram of an example of the operations performed by the learning-region extracting unit shown in FIG. 3;
FIG. 5 is a flowchart of the operations performed by the learning-region extracting unit shown in FIG. 3;
FIG. 6 is a schematic diagram of an example of the operations performed by a feature-vector acquiring unit shown in FIG. 2;
FIG. 7 is a flowchart of the operations performed by the feature-vector acquiring unit shown in FIG. 2;
FIG. 8A is a graph for explaining the operations performed by an indexing unit shown in FIG. 2;
FIG. 8B is a schematic view for explaining the operations performed by the indexing unit shown in FIG. 2;
FIG. 9 is a flowchart of an indexing process performed by the indexing apparatus shown in FIG. 2;
FIG. 10 is a schematic view of a functional configuration of an indexing apparatus according to a second embodiment of the present invention;
FIG. 11 is a schematic view of an example of the operations performed by a speaker-change detecting unit shown in FIG. 10;
FIG. 12 is a flowchart of the operations performed by the speaker-change detecting unit shown in FIG. 10; and
FIG. 13 is a flowchart of an indexing process performed by the indexing apparatus shown in FIG. 10.
Exemplary embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
FIG. 1 is a block diagram of a hardware structure of an indexing apparatus 100 according to a first embodiment of the present invention. The indexing apparatus 100 includes a central processing unit (CPU) 101 , an operating unit 102 , a displaying unit 103 , a read only memory (ROM) 104 , a random access memory (RAM) 105 , a speech input unit 106 , and a memory unit 107 , all of which are connected to a bus 108 .
The CPU 101 uses a predetermined area of the RAM 105 as a work area, and executes various processes in cooperation with various control computer programs previously stored in the ROM 104 . The CPU 101 centrally controls operations of all the units included in the indexing apparatus 100 .
FIG. 2 is a schematic diagram of a functional configuration of the indexing apparatus 100 . The CPU 101 realizes, in cooperation with predetermined computer programs previously stored in the ROM 104 , functions of a speech-feature extracting unit 11 , a speech-feature dividing unit 12 , a first-acoustic-model creating unit 13 , a learning-region extracting unit 14 , a second-acoustic-model creating unit 15 , a feature-vector acquiring unit 16 , a clustering unit 17 , and an indexing unit 18 shown in FIG. 2. These function units and their operations will be described in detail later.
The operating unit 102 includes various input keys. When a user enters information by operating those input keys, the operating unit 102 passes the entered information to the CPU 101 .
The displaying unit 103 , constituted by a display apparatus such as a liquid crystal display (LCD), displays various kinds of information based on display signals from the CPU 101 . A touch panel can be used to realize the operating unit 102 and the displaying unit 103 .
The ROM 104 stores therein various computer programs and configuration information in a non-rewritable manner. The CPU 101 uses the computer programs and the configuration information stored in the ROM 104 to control the indexing apparatus 100 .
The RAM 105 is a storage medium such as a synchronous dynamic random access memory (SDRAM), and it functions as a work area of the CPU 101 . Moreover, the RAM 105 serves as a buffer.
The speech input unit 106 converts an utterance of a speaker into electric signals, and sends them as speech signals to the CPU 101 . The speech input unit 106 can be a microphone or any other sound collector.
The memory unit 107 includes a magnetically or optically recordable storage medium. The memory unit 107 stores therein data of speech signals obtained via the speech input unit 106 and data of speech signals entered via another source such as a communicating unit or an interface (I/F) (both not shown), for example. Further, the memory unit 107 stores therein speech signals that are provided with a label (index) in an indexing process described later.
As shown in FIG. 2, the indexing apparatus 100 includes the speech-feature extracting unit 11 , the speech-feature dividing unit 12 , the first-acoustic-model creating unit 13 , the learning-region extracting unit 14 , the second-acoustic-model creating unit 15 , the feature-vector acquiring unit 16 , the clustering unit 17 , and the indexing unit 18 .
From the input speech signals, the speech-feature extracting unit 11 extracts speech features indicating speakers' features in a certain interval of a time length c 1 , and outputs the extracted speech features to the speech-feature dividing unit 12 and the feature-vector acquiring unit 16 . Cepstrum features such as LPC cepstrum or MFCC cepstrum can be considered as the speech features. Moreover, the speech features can be extracted in the certain interval of the time length c 1 from the speech signals within a certain time length c 2 , where c 1 <c 2 . Concretely, c 1 can be set to 10.0 milliseconds and c 2 can be set to 25.0 milliseconds.
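As an illustration only, the framing described above (features taken every c 1 from a window of length c 2 ) can be sketched as follows. The function name and the 16 kHz sample rate are assumptions made for the example and do not appear in the specification; the actual cepstrum computation is omitted.

```python
def frame_starts(num_samples, sample_rate, shift_s=0.010, window_s=0.025):
    """Return (start, end) sample indices of the analysis windows.

    shift_s plays the role of c1 and window_s the role of c2 (c1 < c2).
    """
    shift = int(round(shift_s * sample_rate))
    window = int(round(window_s * sample_rate))
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the signal.
    while start + window <= num_samples:
        frames.append((start, start + window))
        start += shift
    return frames


# One second of audio at an assumed 16 kHz sample rate.
frames = frame_starts(num_samples=16000, sample_rate=16000)
print(len(frames), frames[0], frames[1])
```

Each returned window would then be converted into one speech feature (for example, one cepstrum vector), so that successive features are c 1 apart in time.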
The speech-feature dividing unit 12 divides the speech features, which it has received from the speech-feature extracting unit 11 , into a plurality of first segments each having a fixed time length c 3 . The speech-feature dividing unit 12 then outputs speech features and time information (start time and end time) of each of the first segments to the first-acoustic-model creating unit 13 . For example, the time length c 3 is set to be shorter (e.g., 2.0 seconds) than the shortest time duration of an utterance that a person can make. If the time duration is set in this manner, then it can be assumed that each of the first segments includes speech features of only one speaker.
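The division into first segments can be sketched as below. This is a minimal sketch assuming one feature per 10.0 milliseconds and a segment length c 3 of 2.0 seconds (the value used later in the region-setting example); the function name and the dictionary layout are hypothetical.

```python
def split_into_segments(features, frame_shift_s, segment_s):
    """Group per-frame features into fixed-length segments with time info."""
    per_segment = max(1, int(round(segment_s / frame_shift_s)))
    segments = []
    for first in range(0, len(features), per_segment):
        chunk = features[first:first + per_segment]
        segments.append({
            "start": first * frame_shift_s,               # start time in seconds
            "end": (first + len(chunk)) * frame_shift_s,  # end time in seconds
            "features": chunk,
        })
    return segments


# 500 dummy features (5 seconds of speech at a 10 ms frame shift).
first_segments = split_into_segments(list(range(500)), 0.010, 2.0)
print(len(first_segments), first_segments[1]["start"], first_segments[1]["end"])
```

In this sketch the last segment may be shorter than c 3 when the signal length is not a multiple of c 3 ; whether such a remainder is kept or discarded is not stated in the text.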
The first-acoustic-model creating unit 13 , each time it receives the speech features of a first segment from the speech-feature dividing unit 12 , creates an acoustic model, i.e., a first acoustic model, based on the speech features. The first-acoustic-model creating unit 13 then outputs to the learning-region extracting unit 14 the created first acoustic model and specific information (speech features and time information) of the first segment used to create the first acoustic model. When the time length c 3 is set shorter than the shortest time duration of an utterance a person can make, it is preferable to create the acoustic model by using a vector quantization (VQ) codebook.
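The specification does not say how the VQ codebook is trained. One common choice is k-means, sketched below in plain Python; the training method, the function names, and the deterministic initialisation are all assumptions made for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_codebook(features, size, iterations=10):
    """Train a VQ codebook of `size` code vectors with plain k-means."""
    if not 0 < size <= len(features):
        raise ValueError("codebook size must be between 1 and len(features)")
    # Deterministic initialisation: evenly spaced feature vectors.
    step = len(features) // size
    codebook = [list(features[i * step]) for i in range(size)]
    for _ in range(iterations):
        buckets = [[] for _ in codebook]
        for f in features:
            nearest = min(range(size),
                          key=lambda j: squared_distance(f, codebook[j]))
            buckets[nearest].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old code vector if no feature maps to it
                codebook[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook


features = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
codebook = train_codebook(features, size=2)
print(codebook)
```

The codebook size corresponds to M in the similarity calculation described later.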
The learning-region extracting unit 14 , when it receives the first segments from the first-acoustic-model creating unit 13 , sequentially gathers a certain number of the first segments as one region. The learning-region extracting unit 14 then calculates a similarity for each of the regions based on the first acoustic models of the first segments within the region. Moreover, the learning-region extracting unit 14 extracts, as a learning region, each region having a similarity equal to or greater than a predetermined value, and outputs to the second-acoustic-model creating unit 15 the extracted learning region and specific information of the learning region (speech features and time information of the learning region).
FIG. 3 is a block diagram of a functional configuration of the learning-region extracting unit 14 . The learning-region extracting unit 14 includes a first-segment input section 141 , a region setting section 142 , a similarity calculating section 143 , a region-score acquiring section 144 , and a learning-region output section 145 .
The first-segment input section 141 is a function section that receives, from the first-acoustic-model creating unit 13 , an input including the first acoustic models and specific information of the first segments used to create the first acoustic models.
The region setting section 142 sequentially gathers a certain number of the first segments, which are successively received from the first-segment input section 141 , into one region.
The similarity calculating section 143 calculates similarities between the speech features in two first segments of all possible combinations selected from among the first segments included in the regions set by the region setting section 142 .
The region-score acquiring section 144 calculates, based on time information (time length) of each region set by the region setting section 142 and similarities calculated by the similarity calculating section 143 , a region score indicating a probability that the utterances included in the region are made by a single speaker.
From among region scores calculated by the region-score acquiring section 144 , the learning-region output section 145 extracts, as a learning region, a region having the maximum score. The learning-region output section 145 then outputs to the second-acoustic-model creating unit 15 the extracted learning region and specific information of the extracted learning region (speech features and time information of the region).
The operations performed by the learning-region extracting unit 14 will be described here in detail. FIG. 4 is a schematic diagram for explaining an example of the operations performed by the learning-region extracting unit 14 , and FIG. 5 is a flowchart of a learning-region extracting process performed by the learning-region extracting unit 14 .
To begin with, as shown in FIG. 5, the first-segment input section 141 receives a first acoustic model and specific information of the first acoustic model (Step S 11 ). As shown in FIG. 4, the first acoustic model includes a plurality of first segments a 1 to a K each of time length c 3 . Subsequently, the region setting section 142 sequentially gathers a plurality of the first segments in one region, thereby gathering all the first segments of the first acoustic model into a plurality of regions. Specifically, as shown in FIG. 4, the region setting section 142 sequentially gathers the first segments into regions b 1 to b R each having a time length c 4 (Step S 12 ). As shown in FIG. 4, some first segments of two or more adjoining regions can overlap. The time length c 4 is set empirically. For example, each region may be set to have a time length c 4 of 10.0 seconds because one speaker often continues to speak for about 10.0 seconds in conversation. Thus, if each of the first segments has a time length of 2.0 seconds, then five first segments can be gathered into one region.
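The region setting of Step S 12 can be sketched as a sliding window over first-segment indices. The hop of one first segment between successive regions is an assumption (the text only says that adjoining regions can overlap), and the function name is hypothetical.

```python
def set_regions(num_segments, segments_per_region, hop=1):
    """Return one list of first-segment indices per region."""
    regions = []
    for first in range(0, num_segments - segments_per_region + 1, hop):
        regions.append(list(range(first, first + segments_per_region)))
    return regions


# c4 = 10.0 s and c3 = 2.0 s give five first segments per region.
regions = set_regions(num_segments=7, segments_per_region=5)
print(regions)
```

With a hop equal to the number of segments per region, the regions would not overlap at all; smaller hops give the overlapping layout mentioned for FIG. 4.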
Subsequently, the similarity calculating section 143 sets 1 for reference numeral k used to count a region being processed (Step S 13 ), and then selects two first segments a x and a y included in the k-th region (initial region is k=1) (Step S 14 ).
The similarity calculating section 143 calculates a similarity S(a x , a y ) between the first segments a x and a y (Step S 15 ). Concretely, if the VQ codebook is used for the acoustic models created by the first-acoustic-model creating unit 13 , the similarity calculating section 143 first calculates vector quantization distortion D y (a x ) and vector quantization distortion D x (a y ), and then calculates the similarity S(a x , a y ). Concretely, the vector quantization distortion D y (a x ) is calculated by using Equation (1) with respect to a code vector of the first segment a y by using the speech features of the first segment a x . Similarly, the vector quantization distortion D x (a y ) is calculated with respect to a code vector of the first segment a x by using the speech features of the first segment a y . Finally, the similarity S(a x , a y ) is obtained by giving a minus sign to a mean of the distortion D y (a x ) and D x (a y ) as shown by Equation (2).
In Equations (1) and (2), d(x, y) is the Euclidean distance between the vectors x and y, C y is the codebook of the segment a y , C y (i) is the i-th code vector, M is the size of the codebook, and f i x is the i-th speech feature of the first segment a x . The higher the similarity S(a x , a y ) is, the smaller the vector quantization distortion between the first segments a x and a y is, and the more likely it is that the two segments were uttered by the same speaker.
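Equations (1) and (2) are not reproduced in this text. Based on the surrounding description (distortion of the features of a x against the code vectors of a y , and similarity as the negated mean of the two distortions), one consistent form would be the following. The symbol I x for the number of speech features in a x and the index l over code vectors are assumptions introduced here; Equation (2) follows directly from the description, whereas the averaging in Equation (1) is a conventional choice rather than something the text confirms.

```latex
D_y(a_x) = \frac{1}{I_x} \sum_{i=1}^{I_x} \min_{1 \le l \le M} d\bigl(f_i^{x},\, C_y(l)\bigr) \tag{1}

S(a_x, a_y) = -\,\frac{D_y(a_x) + D_x(a_y)}{2} \tag{2}
```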
The similarity calculating section 143 determines whether the processes at Steps S 14 to S 15 have been performed on all the first segments included in the region being processed, i.e., whether a similarity of two first segments of all combinations has been calculated (Step S 16 ). If the similarity has not been calculated for all the combinations (No at Step S 16 ), the system goes back to Step S 14 and a similarity between first segments of a new combination is calculated.
On the contrary, at Step S 16 , if the similarity has been calculated for all the combinations (Yes at Step S 16 ), the region-score acquiring section 144 calculates a region score of the k-th region being processed (Step S 17 ). The region score indicates a probability that utterances are made by the same speaker. For example, the region score may be the minimum similarity among the acquired similarities.
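Steps S 14 to S 17 can be sketched as follows for one region. The sketch assumes that the distortion is the mean Euclidean distance from each feature to its nearest code vector, and it uses the minimum pairwise similarity as the region score, as in the example above. The dictionary keys and function names are hypothetical.

```python
import math
from itertools import combinations


def vq_distortion(features, codebook):
    """Mean distance from each feature to its nearest code vector."""
    return sum(min(math.dist(f, c) for c in codebook)
               for f in features) / len(features)


def similarity(seg_x, seg_y):
    """Negated mean of the two cross distortions between two first segments."""
    d_y_of_x = vq_distortion(seg_x["features"], seg_y["codebook"])
    d_x_of_y = vq_distortion(seg_y["features"], seg_x["codebook"])
    return -(d_y_of_x + d_x_of_y) / 2.0


def region_score(segments):
    """Minimum similarity over all pairs of first segments in the region."""
    return min(similarity(a, b) for a, b in combinations(segments, 2))


seg_a = {"features": [[0.0, 0.0], [0.0, 2.0]], "codebook": [[0.0, 1.0]]}
seg_b = {"features": [[3.0, 1.0]], "codebook": [[3.0, 1.0]]}
seg_c = {"features": [[0.0, 0.0], [0.0, 2.0]], "codebook": [[0.0, 1.0]]}
print(similarity(seg_a, seg_c), region_score([seg_a, seg_b, seg_c]))
```

Taking the minimum makes the score sensitive to a single dissimilar pair, so a region containing even a short stretch of a second speaker receives a low score.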
The region-score acquiring section 144 determines whether the k-th region currently being processed is the last region (Step S 18 ). If the k-th region is not the last one (No at Step S 18 ), the region-score acquiring section 144 increments the reference numeral k by 1 (k=k+1), thereby setting the next region to be processed (Step S 19 ). The system control then goes back to Step S 14 .
On the contrary, at Step S 18 , if the region currently being processed is the last region (Yes at Step S 18 ), the learning-region output section 145 extracts, as a learning region, a region that meets specific extraction criteria (Step S 20 ). The learning-region output section 145 then outputs to the second-acoustic-model creating unit 15 the extracted learning region and specific information of the learning region (speech features and time information of the region) (Step S 21 ), and terminates the procedure.
Preferably, the extraction criteria used at Step S 20 include extracting a region that has the maximum similarity, provided that this similarity is equal to or greater than a threshold th 1 . This is because, near the region having the maximum similarity, utterances are most likely made by the same speaker. Further, a similarity equal to or greater than the threshold th 1 satisfies the criterion for determining that utterances are made by the same speaker. In this case, the threshold th 1 may be set empirically or may be, for example, a mean of the similarities of all the regions. Alternatively, to ensure extraction of multiple regions, one or more regions may be extracted in a certain time interval.
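One possible reading of the extraction criteria is sketched below. It treats "maximum" as a local maximum among neighbouring regions so that several learning regions can be returned, and it defaults th 1 to the mean of all region scores. The local-maximum interpretation and the function name are assumptions and are not confirmed by the text.

```python
def select_learning_regions(scores, threshold=None):
    """Return indices of regions whose score is a local maximum >= threshold."""
    if threshold is None:
        threshold = sum(scores) / len(scores)  # th1 as the mean region score
    selected = []
    for k, score in enumerate(scores):
        left = scores[k - 1] if k > 0 else float("-inf")
        right = scores[k + 1] if k + 1 < len(scores) else float("-inf")
        if score >= threshold and score >= left and score >= right:
            selected.append(k)
    return selected


scores = [-5.0, -1.0, -4.0, -6.0, -2.0, -3.0]
print(select_learning_regions(scores))
```

Raising the threshold reduces the number of learning regions, and therefore the number of second acoustic models created from them.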
It is possible to use different time lengths c 4 for different regions. Specifically, the extraction may be arranged such that several patterns are applied to the time lengths c 4 and all the regions of which scores have been calculated are subjected to the extracting process, regardless of the patterns. It has been known from experience that some speeches are long while some are short. To facilitate extraction of a region having a long time length c 4 or a region having a short time length c 4 , values set for the time lengths c 4 are preferably taken into consideration along with the acquired similarities. In the example shown in FIG. 4, a region b r has been extracted.
Referring back to FIG. 2, the second-acoustic-model creating unit 15 creates, for each learning region extracted by the learning-region extracting unit 14 , an acoustic model, i.e., a second acoustic model, based on the speech features of the region. The second-acoustic-model creating unit 15 then outputs the created second acoustic model to the feature-vector acquiring unit 16 . To acquire the second acoustic model, the Gaussian mixture model (GMM) is preferably used because the time length c 4 of one region is longer than the time length c 3 of one first segment.
The feature-vector acquiring unit 16 uses the second acoustic model of each region, which it has received from the second-acoustic-model creating unit 15 , and speech features corresponding to second segments (described later) included in the speech features, which it has received from the speech-feature extracting unit 11 , to acquire a feature vector specific to each second segment. Further, the feature-vector acquiring unit 16 outputs to the clustering unit 17 the acquired feature vector of each second segment and time information of the second segment, as specific information of the second segment.
The operations performed by the feature-vector acquiring unit 16 are described here in detail. FIG. 6 is a schematic diagram for explaining an example of the operations performed by the feature-vector acquiring unit 16 , and FIG. 7 is a flowchart of a feature-vector acquiring process performed by the feature-vector acquiring unit 16 .
As shown in FIG. 6, the feature-vector acquiring unit 16 establishes, for each time length c 5 , a second segment d k having speech features with a time length c 6 (Step S 31 ). The time lengths c 5 and c 6 may be set to be, for example, 0.5 seconds and 3.0 seconds, respectively. Note that, the time length c 5 can be equal to or less than the time length c 6 . Further, the time length c 6 is equal to or less than the time length c 4 and substantially the same as the time length c 3 .
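The layout of the second segments in Step S 31 can be sketched as windows of length c 6 started every c 5 . The function name is hypothetical, and dropping windows that would run past the end of the signal is an assumption.

```python
def second_segments(total_s, shift_s=0.5, length_s=3.0):
    """Return (start, end) times in seconds of the second segments d_k."""
    segments = []
    k = 0
    while k * shift_s + length_s <= total_s:
        segments.append((k * shift_s, k * shift_s + length_s))
        k += 1
    return segments


spans = second_segments(total_s=5.0)
print(len(spans), spans[0], spans[-1])
```

Because c 5 is shorter than c 6 in this example, neighbouring second segments overlap by 2.5 seconds.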
Further, the feature-vector acquiring unit 16 sets the initial second segment d k to have a reference numeral k=1 (Step S 32 ). From among second acoustic models s n received from the second-acoustic-model creating unit 15 , the feature-vector acquiring unit 16 sets the initial second acoustic model s n to have a reference numeral n=1 (Step S 33 ).
The feature-vector acquiring unit 16 calculates a likelihood P(d k |s n ) with respect to the n-th second acoustic model s n , using the speech features of the k-th second segment d k (Step S 34 ). When the GMM is used to create the second acoustic model s n , the likelihood is expressed by Equation (3):
where dim is the number of dimensions of the speech features; I k is the number of speech features of the second segment d k ; f i is the i-th speech feature of the second segment d k ; M n is the number of mixture components of the second acoustic model s n ; and c nm , u nm , and U nm respectively denote a weight factor, a mean vector, and a diagonal covariance matrix of the m-th mixture component of the second acoustic model s n .
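Equation (3) is not reproduced in this text. A form consistent with the symbols defined above is the average per-feature log-likelihood under a Gaussian mixture with diagonal covariance, shown below. The averaging over I k and the use of the logarithm are conventional choices for GMM scoring and are assumptions here, not details confirmed by the text.

```latex
P(d_k \mid s_n) = \frac{1}{I_k} \sum_{i=1}^{I_k} \log \sum_{m=1}^{M_n}
  \frac{c_{nm}}{\sqrt{(2\pi)^{\mathrm{dim}} \, \lvert U_{nm} \rvert}}
  \exp\!\left( -\frac{1}{2} (f_i - u_{nm})^{\mathsf{T}} U_{nm}^{-1} (f_i - u_{nm}) \right) \tag{3}
```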
Further, the feature-vector acquiring unit 16 determines whether the likelihood calculation has been performed at Step S 34 for all the second acoustic models received from the second-acoustic-model creating unit 15 (Step S 35 ). If the calculation has not been performed for some of the second acoustic models (No at Step S 35 ), the feature-vector acquiring unit 16 sets the next second acoustic model to have a reference numeral n=n+1, thereby setting the next second acoustic model to be processed (Step S 36 ). Accordingly, the system control goes back to Step S 34 .
On the contrary, at Step S 35 , if the likelihood calculation has been performed for all the second acoustic models (Yes at Step S 35 ), the feature-vector acquiring unit 16 creates, for the k-th second segment d k , a vector having the acquired likelihood as a component based on Equation (4):
Specifically, the vector is created as a feature vector v k indicating the features of the second segment (Step S 37 ). In Equation (4), the number of the second acoustic models is N. The feature vector v k may be processed such that its components are normalized.
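Equation (4) is not reproduced in this text. From the description (a vector whose components are the likelihoods of the k-th second segment with respect to each of the N second acoustic models), it can be written as:

```latex
v_k = \bigl( P(d_k \mid s_1),\; P(d_k \mid s_2),\; \ldots,\; P(d_k \mid s_N) \bigr) \tag{4}
```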
Further, the feature-vector acquiring unit 16 determines whether a feature vector has been created for each of the second segments (Step S 38 ). If a feature vector has not been created for each of the second segments (No at Step S 38 ), the feature-vector acquiring unit 16 sets the next second segment to have a reference numeral k=k+1, thereby setting the next second segment to be processed (Step S 39 ). Accordingly, the system control goes back to Step S 33 .
On the contrary, at Step S 38 , if a feature vector has been created for each of the second segments (Yes at Step S 38 ), the feature-vector acquiring unit 16 outputs to the clustering unit 17 specific information (feature vector and time information) of each of the second segments (Step S 40 ), and terminates the procedure.
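The loop of Steps S 33 to S 37 can be sketched as follows. The sketch assumes that each second acoustic model is a diagonal-covariance GMM given as a list of (weight, mean, variance) triples and that the likelihood is the average per-feature log-likelihood; this data layout and the function names are hypothetical.

```python
import math


def gmm_log_likelihood(features, gmm):
    """Average per-feature log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for f in features:
        mix = 0.0
        for weight, mean, var in gmm:
            log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
            log_exp = -0.5 * sum((x - m) ** 2 / v
                                 for x, m, v in zip(f, mean, var))
            mix += weight * math.exp(log_norm + log_exp)
        # Note: mix can underflow to 0.0 for features far from every
        # component; a production version would use log-sum-exp instead.
        total += math.log(mix)
    return total / len(features)


def feature_vector(segment_features, models):
    """One likelihood per second acoustic model s_1 .. s_N."""
    return [gmm_log_likelihood(segment_features, s_n) for s_n in models]


models = [[(1.0, [0.0], [1.0])], [(1.0, [5.0], [1.0])]]
v_k = feature_vector([[0.1], [-0.1]], models)
print(v_k)
```

In this toy example the segment lies close to the first model, so the first component of v k is larger than the second.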
Referring back to FIG. 2, the clustering unit 17 groups similar feature vectors out of the feature vectors of all the second segments received from the feature-vector acquiring unit 16 into a class. Further, the clustering unit 17 allocates the same ID (class number) to the second segments corresponding to the feature vectors that belong to one class. The ID allows handling of the segments as being made by the same speaker. The clustering unit 17 then outputs time information and ID of each second segment to the indexing unit 18 . Whether feature vectors are similar to each other may be determined, for example, by whether the Euclidean distance between them is small. Further, as an algorithm used for grouping, for example, the commonly known k-means may be used.
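A minimal k-means sketch that assigns a class ID to each feature vector is given below. The number of classes k, the deterministic initialisation, and the function name are assumptions; the text does not state how the number of speakers is chosen.

```python
def kmeans_labels(vectors, k, iterations=20):
    """Assign each feature vector a class ID (0 .. k-1) with plain k-means."""
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    # Deterministic initialisation: evenly spaced vectors as first centers.
    step = len(vectors) // k
    centers = [list(vectors[i * step]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        for idx, v in enumerate(vectors):
            labels[idx] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centers[j])),
            )
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if the class is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1], [5.0, 5.2]]
print(kmeans_labels(vectors, k=2))
```

Second segments that receive the same ID are then treated as utterances of the same speaker.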
Based on the time information and IDs of the second segments received from the clustering unit 17 , the indexing unit 18 divides the speech signals according to groups of second segments having the same IDs, i.e., by speakers. Further, the indexing unit 18 allocates a label (index) to each of the speech signals. Such a label indicates speaker information of each speaker.
FIGS. 8A and 8B are schematic drawings for explaining the operations performed by the indexing unit 18 . When the clustering unit 17 groups the second segments each having two components (likelihoods) as feature vectors into three classes as shown in FIG. 8A, the indexing unit 18 provides the second segments falling within a time period from 0 to 2×c5 with a label Class 1 , the second segments falling within a time period from 2×c5 to 5×c5 with a label Class 2 , and the second segments falling within a time period from 5×c5 to 7×c5+c6 with a label Class 3 as shown in FIG. 8B.
The second segments being close to each other may overlap depending on the value set for the time length c 5 . In this case, assuming that, for example, a second segment being closer to a mean of the class achieves higher reliability, a result indicating higher reliability may preferably be used. In the example shown in FIG. 8B, the second segment d 3 is determined more reliable than the second segment d 2 , and the second segment d 6 is determined more reliable than the second segment d 5 . Further, a portion having more than one result may further be divided into new segments each having a shorter time length c 7 , and feature vectors are found for the new segments thus divided. These feature vectors may be used to find a class to which each new segment belongs, and time for each segment.
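The labeling illustrated in FIG. 8B can be sketched as merging consecutive second segments that share a class ID. As a simplification, each segment is taken to own the interval from its start to the start of the next segment, and the last segment runs to its own end. This rule reproduces the boundaries in the example above but replaces the reliability-based choice with a fixed rule, which is an assumption; the function name is hypothetical.

```python
def label_spans(ids, shift_s, length_s):
    """Merge consecutive second segments with the same class ID into spans."""
    spans = []
    for k, class_id in enumerate(ids):
        start = k * shift_s
        # Each segment owns [start, next start); the last one keeps its full length.
        end = (k + 1) * shift_s if k + 1 < len(ids) else start + length_s
        if spans and spans[-1][2] == class_id:
            spans[-1] = (spans[-1][0], end, class_id)
        else:
            spans.append((start, end, class_id))
    return spans


# Eight second segments d1..d8 with c5 = 0.5 s and c6 = 3.0 s.
labels = label_spans([1, 1, 2, 2, 2, 3, 3, 3], shift_s=0.5, length_s=3.0)
print(labels)
```

The three spans correspond to 0 to 2×c5, 2×c5 to 5×c5, and 5×c5 to 7×c5+c6.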
FIG. 9 is a flowchart of an indexing process performed by the indexing apparatus 100 . To begin with, speech signals are received via the speech input unit 106 (Step S 101 ). The speech-feature extracting unit 11 extracts speech features indicating speakers' features, in a certain interval of the time length c 1 , from the received speech signals (Step S 102 ), and outputs the extracted speech features to the speech-feature dividing unit 12 and the feature-vector acquiring unit 16 .
The speech-feature dividing unit 12 divides the received speech features into first segments each having a predetermined interval of the time length c 3 (Step S 103 ). Then, speech features and time information of each of the first segments are output to the first-acoustic-model creating unit 13 .
The first-acoustic-model creating unit 13 , each time it receives the speech features of a first segment, creates an acoustic model based on the speech features (Step S 104 ). The created acoustic model together with specific information (speech features and time information) of the first segment used to create the acoustic model is output from the first-acoustic-model creating unit 13 to the learning-region extracting unit 14 .
At the subsequent step S 105 , the learning-region extracting unit 14 performs a learning-region extracting process (see FIG. 5), based on the acoustic models created at Step S 104 and specific information of the first segments of the acoustic models. The learning-region extracting unit 14 then extracts, as a learning region, a region where utterances are highly likely made by an identical speaker (Step S 105 ). The extracted learning region together with the specific information of the learning region (speech features and time information of the relevant region) is output from the learning-region extracting unit 14 to the second-acoustic-model creating unit 15 .
The second-acoustic-model creating unit 15 creates, for each learning region extracted at Step S 105 , a second acoustic model based on the speech features of the region (Step S 106 ). The created second acoustic model is then output from the second-acoustic-model creating unit 15 to the feature-vector acquiring unit 16 .
At the subsequent step S 107 , the feature-vector acquiring unit 16 performs the feature-vector acquiring process (see FIG. 7), based on the second acoustic models created at Step S 106 and the speech features of the second segments. Accordingly, the feature-vector acquiring unit 16 acquires specific information (feature vectors and time information) of the second segments in the feature-vector acquiring process (Step S 107 ). The acquired specific information is output from the feature-vector acquiring unit 16 to the clustering unit 17 .
From among all the feature vectors obtained at Step S 107 , the clustering unit 17 groups similar feature vectors into a class. Further, the clustering unit 17 provides second segments corresponding to the feature vectors included in the class with a specific ID allowing handling of the segments as being made by an identical speaker (Step S 108 ). Then, the time information (start time and end time) and ID of each of the second segments are output from the clustering unit 17 to the indexing unit 18 .
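The grouping of similar feature vectors and the assignment of IDs might look like the sketch below. The clustering method is not specified in this passage, so a simple greedy, distance-threshold scheme is used here purely as a stand-in; the function name `assign_speaker_ids` and the parameter `max_distance` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def assign_speaker_ids(
    segments: Sequence[Tuple[float, float, Sequence[float]]],
    max_distance: float,
) -> List[Tuple[float, float, int]]:
    """Give each (start, end, feature_vector) segment a class ID.

    Stand-in algorithm (an assumption, not the embodiment's method):
    a vector joins the nearest existing class whose running-mean
    centroid lies within max_distance; otherwise it opens a new class.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    labeled: List[Tuple[float, float, int]] = []
    for start, end, vec in segments:
        best_id, best_dist = None, max_distance
        for cid, centroid in enumerate(centroids):
            dist = math.dist(vec, centroid)
            if dist <= best_dist:
                best_id, best_dist = cid, dist
        if best_id is None:
            centroids.append(list(vec))
            counts.append(1)
            best_id = len(centroids) - 1
        else:
            # Update the running mean of the chosen class.
            counts[best_id] += 1
            n = counts[best_id]
            centroids[best_id] = [
                c + (v - c) / n for c, v in zip(centroids[best_id], vec)
            ]
        labeled.append((start, end, best_id))
    return labeled
```

The output keeps only the time information and the ID of each segment, which corresponds to what is handed to the indexing unit 18 .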
The indexing unit 18 divides the speech signals received at Step S 101 , based on the time information of the second segments and IDs given to the second segments. Further, the indexing unit 18 provides each of the divided speech signals with a relevant label (index) (Step S 109 ), and terminates the procedure.
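As a rough illustration of the labeling step, the sketch below turns time-ordered, ID-tagged segments into labeled regions. Merging consecutive segments that share an ID into one region is an assumption made for the example, and the function name `build_index` is hypothetical.

```python
from typing import List, Sequence, Tuple

Labeled = Tuple[float, float, int]  # (start time, end time, speaker ID)


def build_index(labeled_segments: Sequence[Labeled]) -> List[Labeled]:
    """Merge consecutive segments with the same ID into labeled regions.

    Assumption: labeled_segments is sorted by start time. Two segments
    are merged only when they carry the same ID and touch or overlap.
    """
    index: List[Labeled] = []
    for start, end, speaker_id in labeled_segments:
        if index and index[-1][2] == speaker_id and index[-1][1] >= start:
            prev_start, prev_end, _ = index[-1]
            index[-1] = (prev_start, max(prev_end, end), speaker_id)
        else:
            index.append((start, end, speaker_id))
    return index
```

Each resulting tuple gives the portion of the speech signal to cut out and the label (index) to attach to it.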
As described, according to the present embodiment, acoustic models are created from time periods during which the speech signals are generated by utterances of a single speaker. This reduces the possibility that an acoustic model is created from a time period in which utterances of multiple speakers are mixed, and eliminates the resulting difficulty in discriminating utterances of different speakers, thereby improving the accuracy of the acoustic models and hence of the indexing. Further, because a plurality of divided segments is used to create one acoustic model, a larger amount of information can be included in one model than with conventional methods. Thus, more accurate indexing is realized.
An indexing apparatus 200 according to a second embodiment of the present invention will be described here. Constituent elements identical to those described in the first embodiment are indicated by the same reference numerals, and their description is omitted. The indexing apparatus 200 has the same hardware structure as that shown in FIG. 1.
FIG. 10 is a block diagram of functional configuration of the indexing apparatus 200 according to the second embodiment. The indexing apparatus 200 includes a speech-feature extracting unit 21 , the speech-feature dividing unit 12 , the first-acoustic-model creating unit 13 , the learning-region extracting unit 14 , a second-acoustic-model creating unit 22 , a feature-vector acquiring unit 23 , a speaker-change detecting unit 24 , a feature-vector reacquiring unit 25 , the clustering unit 17 , and the indexing unit 18 .
The speech-feature extracting unit 21 , the second-acoustic-model creating unit 22 , the feature-vector acquiring unit 23 , the speaker-change detecting unit 24 , and the feature-vector reacquiring unit 25 are functional units realized through cooperation between the CPU 101 and predetermined computer programs previously stored in the ROM 104 , like the speech-feature dividing unit 12 , the first-acoustic-model creating unit 13 , the learning-region extracting unit 14 , the clustering unit 17 , and the indexing unit 18 .
The speech-feature extracting unit 21 extracts speech features, and outputs them to the feature-vector reacquiring unit 25 , the speech-feature dividing unit 12 , and the feature-vector acquiring unit 23 . The second-acoustic-model creating unit 22 creates an acoustic model for each region, and outputs it to the feature-vector reacquiring unit 25 and the feature-vector acquiring unit 23 . The feature-vector acquiring unit 23 outputs to the speaker-change detecting unit 24 specific information (feature vector and time information) of each second segment.
The speaker-change detecting unit 24 calculates a similarity of adjacent second segments based on their feature vectors, detects a time point when the speaker is changed, and then outputs information of the detected time to the feature-vector reacquiring unit 25 .
The operations performed by the speaker-change detecting unit 24 are described here in detail. FIG. 11 is a schematic drawing for explaining an example of the operations performed by the speaker-change detecting unit 24 , and FIG. 12 is a flowchart of a speaker-change detecting process performed by the speaker-change detecting unit 24 .
To begin with, the speaker-change detecting unit 24 sets a reference numeral p=1 to specific information of the initial second segment received from the feature-vector acquiring unit 23 (Step S 51 ). Specific information of a second segment is referred to as a second segment d p .
As shown in FIG. 11, the speaker-change detecting unit 24 selects a second segment d p and a second segment d q having a start time closest to the end time of the second segment d p (Step S 52 ). This selects the second segment d p together with the second segment adjacent to it. When the time length c 6 is set to be a constant multiple of the time length c 5 , the end time of the second segment d p matches the start time of the second segment d q .
The speaker-change detecting unit 24 calculates a time t that lies at the middle point between the end time of the second segment d p and the start time of the second segment d q (Step S 53 ). The speaker-change detecting unit 24 then calculates a similarity between a feature vector v p of the second segment d p and a feature vector v q of the second segment d q , and sets it as the similarity at the time t (Step S 54 ). The similarity may be obtained by, for example, giving a minus sign to a Euclidean distance.
The speaker-change detecting unit 24 determines whether the second segment d q being processed is the last one of all the second segments received from the feature-vector acquiring unit 23 (Step S 55 ). If the second segment d q being processed is not the last one (No at Step S 55 ), the speaker-change detecting unit 24 increments the reference numeral p by 1 (p=p+1), thereby setting the next second segment to be processed (Step S 56 ). The process then returns to Step S 52 .
On the other hand, if the second segment d q being processed is the last second segment (Yes at Step S 55 ), the speaker-change detecting unit 24 searches the calculated similarities for time points that meet the detection criteria for a speaker change, and detects each such time point as a point at which the speaker has changed (change time) (Step S 57 ). The speaker-change detecting unit 24 then outputs the detected change time to the feature-vector reacquiring unit 25 (Step S 58 ), and terminates the procedure.
Preferably, the detection criteria include detecting a time point at which the similarity takes a minimum value and that minimum is equal to or less than the threshold th 2 . This is because the speaker has most likely changed near the time point at which the similarity is at a minimum. Further, a similarity equal to or less than the threshold th 2 indicates that the compared second segments can be regarded as utterances made by different speakers. The threshold th 2 may be set empirically. In the example shown in FIG. 11, three time points are detected at which the speaker has changed.
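The speaker-change detecting process of Steps S 51 to S 58 can be sketched as follows. The sketch assumes that the second segments do not overlap, so that consecutive entries of a time-ordered list are the adjacent pair d p and d q ; it also reads "minimum similarity" as a local minimum of the similarity sequence, which is an interpretation suggested by FIG. 11 showing several detected points. The function name `detect_change_times` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def detect_change_times(
    segments: Sequence[Tuple[float, float, Sequence[float]]],
    th2: float,
) -> List[float]:
    """Return times at which the speaker is judged to have changed.

    segments: time-ordered (start, end, feature_vector) tuples.
    Similarity is the negated Euclidean distance between adjacent
    feature vectors, assigned to the midpoint between the end of one
    segment and the start of the next.
    """
    times: List[float] = []
    sims: List[float] = []
    for (_, end_p, v_p), (start_q, _, v_q) in zip(segments, segments[1:]):
        times.append((end_p + start_q) / 2.0)
        sims.append(-math.dist(v_p, v_q))

    changes: List[float] = []
    for i, sim in enumerate(sims):
        left = sims[i - 1] if i > 0 else math.inf
        right = sims[i + 1] if i + 1 < len(sims) else math.inf
        # Local minimum at or below th2; the strict comparison on the
        # left reports only the first point of a flat minimum.
        if sim <= th2 and sim < left and sim <= right:
            changes.append(times[i])
    return changes
```

Because the similarity is a negated distance, it is never positive, so th 2 would be a negative value in this sketch (for example, -1.0 to require a distance of at least 1.0).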
Referring back to FIG. 10, the feature-vector reacquiring unit 25 divides the speech features received from the speech-feature extracting unit 21 by using the change time received from the speaker-change detecting unit 24 . The feature-vector reacquiring unit 25 performs the process as performed by the feature-vector acquiring unit 23 by using, for example, acoustic models received from the second-acoustic-model creating unit 22 , so as to acquire feature vectors of third segments obtained by dividing the speech features. The feature-vector reacquiring unit 25 then outputs to the clustering unit 17 specific information (feature vectors and time information) of the third segments.
The feature vectors may be calculated in a different manner from the one performed by the feature-vector acquiring unit 23 . For example, if second segments are arranged in a way that their start time and end time are within the range from the start time to the end time of a third segment, a mean of the feature vectors of the arranged second segments may be set as the feature vector of the third segment.
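The averaging alternative described above can be sketched as below. The function name `mean_vector_for_third_segment` and the decision to return `None` when no second segment falls inside the third segment are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def mean_vector_for_third_segment(
    third_start: float,
    third_end: float,
    second_segments: Sequence[Tuple[float, float, Sequence[float]]],
) -> Optional[List[float]]:
    """Average the feature vectors of second segments inside a third segment.

    A second segment (start, end, feature_vector) is used only if both
    its start time and its end time lie within [third_start, third_end].
    """
    inside = [
        vec for start, end, vec in second_segments
        if third_start <= start and end <= third_end
    ]
    if not inside:
        return None  # assumption: no contained segment, no vector
    dim = len(inside[0])
    return [sum(v[k] for v in inside) / len(inside) for k in range(dim)]
```

This avoids recomputing likelihoods against the second acoustic models, at the cost of ignoring any speech features near the boundaries that are not covered by a fully contained second segment.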
FIG. 13 is a flowchart of an indexing process performed by the indexing apparatus 200 . To begin with, speech signals are received via the speech input unit 106 (Step S 201 ). The speech-feature extracting unit 21 extracts speech features indicating speakers' features, in a certain interval of the time length c 1 from the received speech signals (Step S 202 ). The extracted speech features are output from the speech-feature extracting unit 21 to the speech-feature dividing unit 12 , the feature-vector acquiring unit 23 , and the feature-vector reacquiring unit 25 .
The speech-feature dividing unit 12 divides the speech features into first segments, and outputs to the first-acoustic-model creating unit 13 the speech features and time information (start time and end time) of each of the first segments (Step S 203 ).
The first-acoustic-model creating unit 13 creates, for each first segment, an acoustic model based on the speech features of that first segment. The first-acoustic-model creating unit 13 then outputs to the learning-region extracting unit 14 the created acoustic model and specific information (speech features and time information) of each of the first segments used to create the acoustic model (Step S 204 ).
At the subsequent step S 205 , the learning-region extracting unit 14 performs the learning-region extracting process (see FIG. 5), based on the received acoustic models and specific information of the first segments used to create the acoustic models. The learning-region extracting unit 14 then extracts, as a learning region, a region where utterances are highly likely made by an identical speaker (Step S 205 ). The extracted learning region together with specific information of the learning region (speech features and time information of the relevant region) is output from the learning-region extracting unit 14 to the second-acoustic-model creating unit 22 .
The second-acoustic-model creating unit 22 creates, for each learning region extracted at Step S 205 , a second acoustic model based on the speech features of the region (Step S 206 ). The created second acoustic model is output from the second-acoustic-model creating unit 22 to the feature-vector acquiring unit 23 and the feature-vector reacquiring unit 25 .
At the subsequent step S 207 , the feature-vector acquiring unit 23 performs the feature-vector acquiring process (see FIG. 7), based on the second acoustic models created at Step S 206 and speech features of the second segments. Accordingly, the feature-vector acquiring unit 23 acquires specific information (feature vectors and time information) of the second segments in the feature-vector acquiring process (Step S 207 ). The acquired specific information is output from the feature-vector acquiring unit 23 to the speaker-change detecting unit 24 .
At Step S 208 , the speaker-change detecting unit 24 performs a speaker-change detecting process as described (see FIG. 12) based on the specific information of the second segments, which is acquired at Step S 207 . The speaker-change detecting unit 24 then outputs to the feature-vector reacquiring unit 25 a change time detected in the speaker-change detecting process (Step S 208 ).
Subsequently, the feature-vector reacquiring unit 25 divides the speech features of the time length c 2 extracted at Step S 202 , based on the change time detected at Step S 208 . Further, the feature-vector reacquiring unit 25 performs a process similar to the feature-vector acquiring process (see FIG. 7) based on the second acoustic models of the regions and the speech features of the relevant third segments, so as to acquire specific information of the third segments (Step S 209 ). The acquired specific information is output from the feature-vector reacquiring unit 25 to the clustering unit 17 .
From among the feature vectors acquired at Step S 209 for all the third segments, the clustering unit 17 groups similar feature vectors into one class. The clustering unit 17 then provides third segments corresponding to the feature vectors included in one class with a specific ID allowing handling of the segments as being made by an identical speaker (Step S 210 ). The time information (start time and end time) and ID of each of the third segments are output from the clustering unit 17 to the indexing unit 18 .
The indexing unit 18 divides the speech signals based on the received time information and IDs of the third segments, provides each of the divided speech signals with a relevant label (index) (Step S 211 ), and then terminates the process.
As described, the second embodiment yields the following advantages in addition to the advantage achieved in the first embodiment. In the second embodiment, the speaker-change detecting unit 24 is incorporated and the time at which the speaker changes is estimated. This structure enables more accurate identification of the boundary between different labels output from the indexing unit 18 . Further, the segments divided at each change time are subjected to clustering. Accordingly, the clustering can be performed on segments having a longer time length than the time length c 6 of a second segment. This arrangement yields more reliable feature vectors based on a larger amount of information, thereby realizing highly accurate indexing.
While two specific embodiments of the present invention have been described above, the present invention is not limited to those embodiments. In other words, it can be modified, changed, and supplemented with other features in various ways without departing from the spirit and scope of the present invention.
In the foregoing embodiments, computer programs executable by a user interface system are previously installed in the ROM 104 , the memory unit 107 , or the like. However, such programs can be recorded in other computer-readable recording media such as compact disk read only memories (CD-ROM), flexible disks (FD), compact disk recordable (CD-R) disks, or digital versatile disks (DVD) in an installable or executable file format. Further, these computer programs can be stored in a computer connected to the Internet or another network so as to be downloaded via the network, or may be provided or distributed via networks such as the Internet.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.