The most important parts of a speaker recognition
system are the feature extraction and the
classification method. The aim of the feature extraction
step is to strip unnecessary information from the sensor data and
convert the properties of the signal which are important for the
pattern recognition task to a format that simplifies the distinction of
the classes. Usually, the feature extraction process reduces the dimension
of the data in order to avoid
the "curse of dimensionality".
The goal of the classification step is to estimate the general extent of
the classes within feature space from a training set.
Feature Extraction
Often, mel frequency cepstral coefficients (MFCC)
are used. These features are also well known in the field of
speech recognition; they can therefore be regarded as
the "standard" features in speaker as well as speech recognition.
However, experiments show that the parameterization of the MFC
coefficients which is best for discriminating speakers is different
from the one usually used for speech recognition applications.
For example, speaker recognition error rates might be reduced if
the "standard" MFCC feature dimension for speech recognition is increased.
The feature extraction process cuts the digitized audio signal, i.e. the sequence
of sample values, into overlapping windows of equal length. The cut-out
portions of the signal are called "frames"; they are extracted from
the original signal every 10 or 20 ms. The length
of a frame is about 30 ms. For speaker recognition tasks,
frames longer than those used for speech recognition are sometimes
chosen in order to increase spectral
resolution. Each frame in the time domain is transformed into an
MFCC vector. Thus, the original speech signal is converted
into a sequence of feature vectors, each vector representing cepstral
properties of the signal within the corresponding window.
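
The framing step described above can be sketched in a few lines of Python. This is only an illustration: the function name frame_signal, the 16 kHz sample rate and the dummy signal are assumptions made for the example, and the conversion of each frame to an MFCC vector is left out.

```python
def frame_signal(samples, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Cut a sequence of samples into overlapping frames of equal length."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of a dummy signal at an assumed sample rate of 16 kHz.
signal = [0.0] * 16000
frames = frame_signal(signal, 16000)
print(len(frames), len(frames[0]))  # 98 frames of 480 samples each
```

With a 30 ms frame and a 10 ms shift, neighbouring frames overlap by 20 ms. A real front end would typically also apply a window function to each frame before computing the spectrum.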
The feature vector sequences of training and test utterances are the
inputs of the classification step of a speaker recognition system,
which is now described in more detail.
Classification
The choice of the classification method depends crucially on the kind of
application of the speaker recognition system.
For text independent recognition,
speaker specific vector quantization codebooks or the more advanced
Gaussian mixture models are used most often.
For text dependent recognition, dynamic time warping
or hidden Markov models are appropriate.
Text independent recognition
Vector quantization is a technique which is also
used for speech coding.
The training material is used to estimate a codebook.
The codebook consists of the mean vectors of feature vector clusters, which are
given indices in order to identify them. For compression of speech,
the index of the nearest cluster is used instead of the original
feature vector. In order to be able to reconstruct the original
signal, an invertible feature computation method has to be chosen
(so the MFCC features described above cannot be used for speech
coding). The quantization error in feature space
is the mean distance between the original feature vectors and their nearest
mean vectors (i.e. the vectors used for reconstruction). Obviously,
the quantization error depends on the similarity between training material
used for estimation of the codebook and the audio signal that is
compressed. For example, if a codebook is trained using speech
signals, the compression of music with this codebook will result
in a reconstruction that is poor both for a listener and in terms of
the quantization error.
This observation also holds for the speaker specific codebooks
which are used for speaker recognition. The training material of
a speaker is used to estimate a codebook, which is the model for
that speaker. The classification of unknown test signals is based on
the quantization error. For example, for an identification decision,
the error of the test feature vector sequence with respect to each codebook
is computed. The "winner" is the speaker whose codebook has
the smallest error between the test vectors and the corresponding
nearest codebook vectors.
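
A minimal sketch of codebook-based speaker identification follows. It assumes plain k-means for codebook estimation, which is one possible choice and not necessarily the method used in a given system. The speaker names, the two-dimensional toy vectors and all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iterations=20):
    """Estimate a codebook with plain k-means (assumes size <= len(vectors))."""
    # Initialise with evenly spaced training vectors (a choice made for this sketch).
    step = max(1, len(vectors) // size)
    codebook = [list(vectors[i * step]) for i in range(size)]
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(size), key=lambda i: sq_dist(v, codebook[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old mean vector if a cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def quantization_error(vectors, codebook):
    """Mean distance between each vector and its nearest codebook vector."""
    total = sum(math.sqrt(min(sq_dist(v, c) for c in codebook)) for v in vectors)
    return total / len(vectors)


def identify(test_vectors, codebooks):
    """Return the speaker whose codebook gives the smallest quantization error."""
    return min(codebooks, key=lambda name: quantization_error(test_vectors, codebooks[name]))


# Toy training data for two hypothetical speakers.
alice_train = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (1.2, 0.9)]
bob_train = [(5.0, 5.1), (5.2, 4.9), (6.0, 6.1), (6.2, 5.9)]
codebooks = {
    "alice": train_codebook(alice_train, 2),
    "bob": train_codebook(bob_train, 2),
}
winner = identify([(0.1, 0.0), (1.1, 1.0)], codebooks)
print(winner)  # alice
```

The same quantization_error function serves both purposes mentioned above: it measures how well a codebook fits the data it compresses, and it is the score on which the identification decision is based.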
Gaussian mixture models (GMM) are similar to codebooks
in that clusters in feature space are
estimated as well. In addition to the mean
vectors, the covariances of the clusters are computed, resulting
in a more detailed speaker model if there is a sufficient amount
of training speech.
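
The effect of the additional covariance information can be illustrated with a small scoring sketch. It assumes diagonal covariance matrices and hand-set model parameters. Parameter estimation, which is usually done with the EM algorithm, is omitted, and all names and values are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(vectors, weights, means, variances):
    """Average per-frame log-likelihood of a vector sequence under a GMM."""
    total = 0.0
    for x in vectors:
        terms = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Log-sum-exp keeps the sum over mixture components numerically stable.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total / len(vectors)


# Two hypothetical speaker models with identical mean vectors
# but different variances along the first feature dimension.
weights = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 4.0)]
narrow = [(0.1, 0.1), (0.1, 0.1)]
wide = [(1.0, 0.1), (1.0, 0.1)]

test_vectors = [(1.0, 0.0), (3.0, 4.0), (-1.0, 0.0)]
score_narrow = gmm_log_likelihood(test_vectors, weights, means, narrow)
score_wide = gmm_log_likelihood(test_vectors, weights, means, wide)
print(score_wide > score_narrow)  # True
```

A codebook built from the same mean vectors could not separate these two models, since both would yield the same quantization error. The GMM prefers the model whose variances match the spread of the test vectors.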
Text dependent recognition
Dynamic time warping (DTW) stores the labelled
training vector sequences without
any further processing. A test vector sequence is aligned to each of the
training sequences such that a certain distance measure is minimized.
As a result, the classification algorithm can handle variations
in the duration of the phonemes an utterance consists of.
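
The alignment can be sketched with the classic dynamic programming recursion. The Euclidean local distance, the unconstrained step pattern and the one-dimensional toy sequences are assumptions of this example, and the labels are hypothetical.

```python
import math


def euclidean(a, b):
    """Local distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated distance of the best alignment between two vector sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the best accumulated distance for aligning the
    # first i vectors of seq_a with the first j vectors of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(test_seq, templates):
    """Return the label of the stored training sequence with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(test_seq, templates[label]))


# Stored training sequences for two hypothetical speakers.
templates = {
    "alice": [(0.0,), (1.0,), (2.0,), (3.0,)],
    "bob": [(3.0,), (2.0,), (1.0,), (0.0,)],
}
# The same pattern as the "alice" template, spoken more slowly.
stretched = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (3.0,), (3.0,)]
print(classify(stretched, templates))  # alice
```

Although the test sequence has seven vectors and the template only four, the alignment maps repeated vectors onto the same template vector, so the accumulated distance to the "alice" template is zero.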
Finally, a hidden Markov model (HMM) is a statistical
model which may be used for text dependent recognition of speakers.
Roughly speaking, HMMs can be viewed as a combination of the
DTW and the GMM approach.
An HMM has a number of states which model distinct parts of, for
example, a user's password for a pass-phrase authentication system.
The feature vectors which are observed for the appropriate part
of the pass phrase in training are used to estimate a density
function, e.g. a GMM. This is called the "output density" of
the HMM state. A hidden Markov model is a more advanced representation
of the pass phrase of a certain speaker, as the characteristic
features of the phonemes that are present in the
utterances are modelled statistically. Nevertheless, the DTW
approach may be a better choice for a real-world speaker recognition
system if the amount of available training data is insufficient
to estimate the HMM's output densities reliably.
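
The combination of time alignment and statistical modelling can be illustrated by scoring an utterance against a small left-to-right HMM with the Viterbi algorithm. This sketch makes several simplifying assumptions: each state has a single Gaussian with diagonal covariance as output density instead of a full GMM, the transition probabilities are fixed at 0.5, and all parameters are set by hand instead of being estimated from training data.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi_score(vectors, means, variances, p_stay=0.5):
    """Log score of the best state path through a left-to-right HMM.

    The path has to start in the first state and end in the last state.
    """
    log_stay = math.log(p_stay)
    log_next = math.log(1.0 - p_stay)
    n_states = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n_states
    score[0] = log_gauss(vectors[0], means[0], variances[0])
    for x in vectors[1:]:
        new = [neg_inf] * n_states
        for s in range(n_states):
            stay = score[s] + log_stay
            move = score[s - 1] + log_next if s > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[s] = best + log_gauss(x, means[s], variances[s])
        score = new
    return score[-1]


# Hypothetical three-state model of a pass phrase.
means = [(0.0,), (1.0,), (2.0,)]
variances = [(0.1,), (0.1,), (0.1,)]

in_order = [(0.0,), (0.1,), (1.0,), (0.9,), (2.0,)]
wrong_order = [(2.0,), (1.0,), (1.0,), (0.0,), (0.0,)]
print(viterbi_score(in_order, means, variances) > viterbi_score(wrong_order, means, variances))  # True
```

As in DTW, the best path may stay in a state for several frames, which absorbs differences in speaking rate. As in the GMM approach, the match of each frame is measured by a probability density and not by a plain distance.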