Algorithms

The most important parts of a speaker recognition system are the feature extraction and the classification method. The aim of the feature extraction step is to strip unnecessary information from the sensor data and to convert the signal properties that matter for the pattern recognition task into a format that makes the classes easier to distinguish. Usually, the feature extraction process reduces the dimension of the data in order to avoid the "curse of dimensionality". The goal of the classification step is to estimate, from a training set, the general extent of the classes within feature space.
  1. Feature Extraction

    Often, mel frequency cepstral coefficients (MFCC) are used. These features are also well known in the field of speech recognition, so they can be regarded as the "standard" features in speaker as well as speech recognition. However, experiments show that the MFCC parameterization which is best for discriminating speakers differs from the one usually used for speech recognition applications. For example, speaker recognition error rates might be reduced if the "standard" MFCC feature dimension for speech recognition is increased.

    The feature extraction process cuts the digitized audio signal, i.e. the sequence of sample values, into overlapping windows of equal length. The cut-out portions of the signal are called "frames". A new frame is extracted from the original signal every 10 or 20 ms, and the length of a frame is about 30 ms. For speaker recognition tasks, longer frames are sometimes used than in feature extraction for speech recognition in order to increase spectral resolution. Each frame in the time domain is transformed to an MFCC vector. The original speech signal is thus converted into a sequence of feature vectors, each vector representing cepstral properties of the signal within the corresponding window. The feature vector sequences of training and test utterances are the inputs of the classification step of a speaker recognition system, which is now described in more detail.
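
    The framing step described above can be sketched as follows. This is a minimal illustration in Python, not a reference implementation: the function name frame_signal, its parameters, the 16 kHz sampling rate, and the all-zero placeholder signal are assumptions made for the example. The conversion of each frame to an MFCC vector is not shown.

```python
def frame_signal(samples, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Cut a sequence of samples into overlapping frames of equal length.

    frame_ms is the frame length and shift_ms the distance between the
    starts of two consecutive frames (both hypothetical parameter names).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of a placeholder signal sampled at 16 kHz.
signal = [0.0] * 16000
frames = frame_signal(signal, sample_rate=16000)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

    With a 30 ms frame and a 10 ms shift, consecutive frames overlap by two thirds of their length. A longer frame_ms would correspond to the longer frames sometimes used for speaker recognition.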

  2. Classification

    The choice of the classification method depends crucially on the kind of application the speaker recognition system is intended for.

    For text independent recognition, speaker specific vector quantization codebooks or the more advanced Gaussian mixture models are used most often. For text dependent recognition, dynamic time warping or hidden Markov models are appropriate.

    • Text independent recognition

      Vector quantization is a technique which is also used for speech coding. The training material is used to estimate a codebook. The codebook contains the mean vectors of feature vector clusters, each of which is given an index in order to identify it. For compression of speech, the index of the nearest cluster is used instead of the original feature vector. In order to be able to reconstruct the original signal, an invertible feature computation method has to be chosen (so the MFCC features described above cannot be used for speech coding). The quantization error in feature space is the mean distance between the original feature vectors and their nearest mean vectors (i.e. the features used for reconstruction). Obviously, the quantization error depends on the similarity between the training material used for estimation of the codebook and the audio signal that is compressed. For example, if a codebook is trained using speech signals, the compression of music with this codebook will result in a poor reconstruction, both as perceived by a listener and in terms of the quantization error.

      This observation also holds for the speaker specific codebooks which are used for speaker recognition. The training material of a speaker is used to estimate a codebook, which is the model for that speaker. The classification of unknown test signals is based on the quantization error. For example, for an identification decision, the error of the test feature vector sequence with respect to each codebook is computed. The "winner" is the speaker whose codebook has the smallest error between the test vectors and the corresponding nearest codebook vectors.
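
      The identification decision based on the quantization error can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is assumed as the distance measure, the function names are hypothetical, and the two tiny codebooks are hand-made toy values rather than codebooks estimated from speech. Codebook training (for example by clustering) is not shown.

```python
import math


def quantization_error(vectors, codebook):
    """Mean distance between each vector and its nearest codebook entry."""
    total = 0.0
    for v in vectors:
        total += min(math.dist(v, entry) for entry in codebook)
    return total / len(vectors)


def identify(test_vectors, codebooks):
    """Return the speaker whose codebook gives the smallest error."""
    return min(
        codebooks,
        key=lambda name: quantization_error(test_vectors, codebooks[name]),
    )


# Toy two-dimensional codebooks (placeholder values, not real MFCC data).
codebooks = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
}
test_vectors = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]

winner = identify(test_vectors, codebooks)
print(winner)
```

      Because the order of the test vectors does not influence the error, this kind of model ignores what is said and when, which is why it suits text independent recognition.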

      Gaussian mixture models (GMM) are similar to codebooks in that clusters in feature space are estimated as well. In addition to the mean vectors, the covariances of the clusters are computed, resulting in a more detailed speaker model if there is a sufficient amount of training speech.
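
      How a GMM could score a test utterance is sketched below. Several assumptions go beyond the text: the covariances are restricted to diagonal matrices (one variance per dimension), the speakers are ranked by the average log-likelihood of the test vectors, and all names and model parameters are hypothetical toy values. Parameter estimation from training speech is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # subtract the maximum for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(vectors, gmm):
    """Mean per-vector log-likelihood; gmm = (weights, means, variances)."""
    return sum(gmm_log_likelihood(x, *gmm) for x in vectors) / len(vectors)


# Toy speaker models: same cluster means as a codebook, plus variances.
models = {
    "alice": ([0.5, 0.5], [(0.0, 0.0), (1.0, 1.0)], [(0.2, 0.2), (0.2, 0.2)]),
    "bob": ([0.5, 0.5], [(5.0, 5.0), (6.0, 4.0)], [(0.2, 0.2), (0.2, 0.2)]),
}
test_vectors = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]

winner = max(models, key=lambda n: average_log_likelihood(test_vectors, models[n]))
print(winner)
```

      Compared with a codebook, each cluster here carries a weight and a spread in addition to its mean, so more parameters have to be estimated from the training speech.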

    • Text dependent recognition

      Dynamic time warping (DTW) stores the labelled training vector sequences without any further processing. A test vector sequence is aligned to each of the training sequences such that a certain distance measure is minimized. Therefore, the classification algorithm can handle variations in the duration of the phonemes an utterance consists of.
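
      The alignment can be sketched with the classic dynamic programming recursion. This is a minimal illustration that assumes Euclidean distance between feature vectors and the simplest set of allowed steps, with no slope constraints or path length normalization. The function name and the one-dimensional toy sequences are hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated distance of the best alignment of two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i vectors
    # of seq_a with the first j vectors of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_b vector is repeated
                cost[i][j - 1],      # seq_a vector is repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# A stored training sequence and two test sequences (toy values).
template = [(0.0,), (1.0,), (2.0,)]
stretched = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,)]  # same content, slower
reversed_order = [(2.0,), (1.0,), (0.0,)]             # different content

print(dtw_distance(template, stretched))
print(dtw_distance(template, reversed_order))
```

      The stretched sequence aligns with the template at no cost although it is longer, while the sequence with a different order of sounds cannot be aligned without accumulating distance.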

      Finally, a hidden Markov model (HMM) is a statistical model which may be used for text dependent recognition of speakers. Roughly speaking, it can be viewed as a combination of the DTW and the GMM approach. An HMM has a number of states which model distinct parts of, for example, a user's password for a pass-phrase authentication system. The feature vectors which are observed for the corresponding part of the pass phrase in training are used to estimate a density function, e.g. a GMM. This is called the "output density" of the HMM state. A hidden Markov model is a more advanced representation of the pass phrase of a certain speaker, as the characteristic features of the phonemes that are present in the utterances are modelled statistically. Nevertheless, the DTW approach may be a better choice for a real-world speaker recognition system if the amount of available training data is not sufficient to reliably estimate the HMM's output densities.
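
      The way an HMM combines alignment and density estimation can be sketched with a Viterbi score. The example makes strong simplifying assumptions that are not taken from the text: a left-to-right topology that starts in the first state and ends in the last one, identical transition probabilities for all states, one-dimensional features, and a single Gaussian instead of a GMM as the output density. All names and parameter values are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian output density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_log_score(observations, states, log_stay, log_next):
    """Best-path log score of a left-to-right HMM.

    states is a list of (mean, variance) pairs, one per HMM state.
    log_stay and log_next are the log probabilities of staying in a
    state and of moving on to the following state.
    """
    neg_inf = float("-inf")
    n = len(states)
    score = [neg_inf] * n
    score[0] = log_gauss(observations[0], *states[0])  # start in state 0
    for x in observations[1:]:
        new_score = [neg_inf] * n
        for s in range(n):
            stay = score[s] + log_stay
            move = score[s - 1] + log_next if s > 0 else neg_inf
            new_score[s] = max(stay, move) + log_gauss(x, *states[s])
        score = new_score
    return score[-1]  # the path has to end in the last state


# Toy pass-phrase model with two states (placeholder parameters).
pass_phrase = [(0.0, 1.0), (5.0, 1.0)]
half = math.log(0.5)

matching = viterbi_log_score([0.1, -0.2, 4.9, 5.2], pass_phrase, half, half)
swapped = viterbi_log_score([4.9, 5.2, 0.1, -0.2], pass_phrase, half, half)
print(matching > swapped)
```

      As in DTW, the best path decides how many frames each part of the pass phrase occupies, while the output densities play the role of the GMM in scoring the individual frames.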