Classifying speech segments as silence, voiced, or unvoiced is an important component of the speech timescale modification algorithm. This classification is accomplished using the short time magnitude and zero crossing rate of the speech signal. Many published algorithms for classifying speech segments as voiced or unvoiced focus on determining the exact endpoints of these segments. For example, in Digital Processing of Speech Signals, Rabiner & Schafer describe an algorithm that uses short time energy to set the inner bounds for the endpoints of an utterance. The zero crossing rate is then used, working outward from these bounds, to find the actual endpoints. Applications such as speech recognition or some types of speech compression require this level of precision. The speech timescale modification algorithm only needs to determine whether an endpoint of silence, voiced, or unvoiced speech occurred within a 10 to 20 millisecond speech segment. For this level of precision, the following simple logic is used:
| Magnitude | Zero crossing rate: low | Zero crossing rate: high |
|-----------|-------------------------|--------------------------|
| low       | silence                 | ?                        |
| high      | voiced                  | unvoiced                 |
Theoretically, a low magnitude with a high zero crossing rate (marked ? in the table) should never occur. Any real world application must deal with such possibilities, so the Speed Reader application treats these segments as transitions to be on the safe side.
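The decision table can be sketched as a small Octave function. The function name classify_segment, the threshold arguments mag_thresh and zcr_thresh, and the numbers in the usage lines are hypothetical placeholders rather than values from Speed Reader. The low magnitude, high zero crossing case is labeled a transition, as described above.

```octave
1;  # mark this file as a script so the function can be defined inline

# Classify one segment from its average magnitude and zero crossing rate.
# mag_thresh and zcr_thresh are assumed, caller-supplied thresholds.
function label = classify_segment(mag, zcr, mag_thresh, zcr_thresh)
  high_mag = mag > mag_thresh;
  high_zcr = zcr > zcr_thresh;
  if (high_mag && high_zcr)
    label = "unvoiced";
  elseif (high_mag)
    label = "voiced";
  elseif (high_zcr)
    label = "transition";   # the "?" cell of the table
  else
    label = "silence";
  endif
endfunction

# Illustrative values only
disp(classify_segment(0.20, 0.05, 0.02, 0.15));    # voiced
disp(classify_segment(0.001, 0.02, 0.02, 0.15));   # silence
```

Returning a label string keeps the sketch readable; a production version would more likely return a small integer code.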
The final problem is to determine the high and low thresholds for magnitude and zero crossing rate. The average magnitude is calculated for each 100 millisecond segment of the first second of speech. The 100 millisecond segment with the lowest average magnitude is considered silence. The mean and standard deviation of the magnitude and zero crossing rate are calculated for this segment to characterize the background noise level.
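A minimal sketch of this noise estimate follows. It assumes an 8 kHz sampling rate, splits the quietest segment into 10 millisecond frames (the frame size is my assumption), and uses a synthetic signal in place of a real recording. How the thresholds are then derived from these statistics is not covered by the sketch.

```octave
fs = 8000;                       # sampling rate used in this series
seg_len = round(0.1 * fs);       # 100 ms segment = 800 samples
frame_len = round(0.01 * fs);    # 10 ms frames (an assumed frame size)

# Synthetic stand-in for the first second of a recording:
# low-level noise followed by a louder 200 Hz tone
randn("state", 1);
x = [0.01 * randn(1, 4000), 0.5 * sin(2 * pi * 200 * (0:3999) / fs)];

# Average magnitude of each 100 ms segment in the first second
segs = reshape(x(1 : 10 * seg_len), seg_len, 10);   # one segment per column
seg_mag = mean(abs(segs));
[~, quiet] = min(seg_mag);                          # index of quietest segment

# Split the quietest segment into short frames and measure each one
frames = reshape(segs(:, quiet), frame_len, []);    # one frame per column
frame_mag = mean(abs(frames));
s = ones(size(frames));
s(frames < 0) = -1;                                 # zero counts as positive
frame_zcr = sum(abs(diff(s))) / (2 * frame_len);

noise_mag_mean = mean(frame_mag);
noise_mag_std = std(frame_mag);
noise_zcr_mean = mean(frame_zcr);
noise_zcr_std = std(frame_zcr);
printf("quietest segment: %d\n", quiet);
```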
Here is a prototype of the short time average zero crossing rate. Note that the zero crossing rate near the beginning of the phrase is high where the average magnitude is low. The combination of the zero crossing rate and average magnitude can be used in an algorithm to classify components of speech.

Average zero crossing rate plot for the phrase "Mister Maryk"
The Octave code for the prototype zero crossing rate function is shown below:
# Create a vector of speech samples
# (wavread is not available in recent Octave releases; audioread
# accepts the same file name and sample range)
samples = wavread("four_ways_linear.wav", [5000 6600]);
# absolute value function
function a = absv(b)
  if (b < 0)
    a = -b;
  else
    a = b;
  endif
endfunction
# sign function, octave's sign function isn't quite what we need
# because sign(0) returns 0; here zero counts as positive
function y = sgn(x)
  if (x < 0)
    y = -1;
  else
    y = 1;
  endif
endfunction
# zero crossing function for a vector; each sign change adds 2
function y = zc(x)
  y = 0;
  # compare every adjacent pair, including the last one
  for i = 1 : (length(x) - 1)
    y += absv(sgn(x(i)) - sgn(x(i + 1)));
  endfor
endfunction
# Calculate the average zero crossing rate for a window size
# of 160 samples every 40 samples
win_lngth = 160;
rate = 40;
j = 1;
for i = 1 : floor(length(samples) / rate)
  if (j + win_lngth - 1 <= length(samples))
    # j : j + win_lngth - 1 selects exactly win_lngth samples
    win = samples(j : j + win_lngth - 1);
    avgzcr(i) = zc(win) / (2 * win_lngth);
    j = j + rate;
  else
    # not enough samples left for a full window
    avgzcr(i) = 0;
  endif
endfor
plot(avgzcr);
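As a cross-check on the loop in zc, the same measure over every adjacent pair of samples can be sketched with vector operations. This is an alternative formulation, not the code used for the plot. The name zc_vec is hypothetical and the test vector is made up.

```octave
1;  # mark this file as a script so the function can be defined inline

# Vectorized zero crossing measure: sum of |sgn(x(i)) - sgn(x(i+1))|
# over adjacent samples, with zero treated as positive.
function y = zc_vec(x)
  s = ones(size(x));
  s(x < 0) = -1;
  y = sum(abs(diff(s)));
endfunction

x = [0.3 0.1 -0.2 -0.4 0.5 0.2 -0.1];
crossings = zc_vec(x) / 2;   # each sign change contributes 2 to the sum
printf("%d sign changes\n", crossings);   # prints "3 sign changes"
```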
Next I’ll show the complete speech classification algorithm prototype in Octave.
The short time average zero crossing rate of a speech signal can be used in conjunction with the short time average energy (or magnitude) to discriminate between voiced speech, unvoiced speech and silence. The short time average zero crossing rate of a digitally sampled speech signal is defined in Digital Processing of Speech Signals (Rabiner & Schafer) as:

Equation 1 Short time average zero crossing rate
Where:

Equation 2 sgn (sign) function
and w(n) is the windowing function with a window size of N samples:

Equation 3 Windowing function, window size N samples
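Written out, the three definitions take the following form. This is a reconstruction in the usual Rabiner & Schafer notation, chosen to agree with the prototype code (which treats zero as positive and divides by 2N), so treat the exact symbols as an assumption.

```latex
\begin{align}
Z_n &= \sum_{m=-\infty}^{\infty}
       \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| \, w(n-m) \\
\operatorname{sgn}[x(n)] &=
  \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases} \\
w(n) &=
  \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}
\end{align}
```

Each sign change contributes 2 to the sum, which is why the window carries the factor 1/(2N): the result is the number of zero crossings per sample.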

My Beagle Board arrived today from Texas Instruments. I just ordered a power supply and assorted cables for it from Digi-Key, so it will still be a few days before I try to bring up Linux on it.
Here is a quick prototype of the short time magnitude function in GNU Octave for the speech sample “Mister Maryk”. The plot below shows the average magnitude of the phrase using a window size of 320 samples, calculated every 80 samples.

Average magnitude function.
Here is the code that I used to generate the plot – just a quick prototype in GNU Octave:
# Create a vector with 4801 speech samples (samples 2700 through 7500)
# (wavread is not available in recent Octave releases; audioread
# accepts the same file name and sample range)
samples = wavread("four_ways_linear.wav", [2700 7500]);
# absolute value function
function a = absv(b)
  if (b < 0)
    a = -b;
  else
    a = b;
  endif
endfunction
# average magnitude function; note that x must
# be a vector.
function y = avgmag(x)
  v = arrayfun(@absv, x);
  y = sum(v) / length(v);
endfunction
# Calculate the average magnitude for a window size
# of 320 samples every 80 samples
win_lngth = 320;
rate = win_lngth / 4;
j = 1;
# stop 4 steps early so the last window stays inside the vector
for i = 1 : floor(length(samples) / rate) - 4
  # j : j + win_lngth - 1 selects exactly win_lngth samples
  win = samples(j : j + win_lngth - 1);
  mag(i) = avgmag(win);
  j = j + rate;
endfor
plot(mag);
The short time energy measurement of a speech signal can be used to determine voiced vs. unvoiced speech. Short time energy can also be used to detect the transition from unvoiced to voiced speech and vice versa. The energy of voiced speech is much greater than the energy of unvoiced speech.

Equation 1 Short time energy
Equation 1 defines the short time energy for a sampled signal where h(n-m) is a windowing function. For simplicity a rectangular windowing function is used as defined in equation 2.

Equation 2 Windowing function.
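Written out, equations 1 and 2 take the following form. This is a reconstruction from the description in the text and the usual Rabiner & Schafer notation, so the exact symbols are an assumption.

```latex
\begin{align}
E_n &= \sum_{m=-\infty}^{\infty} x^2(m) \, h(n-m) \\
h(n) &=
  \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}
\end{align}
```

With the rectangular window, the sum reduces to the squares of the N most recent samples, x(n-N+1) through x(n).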
N in equation 2 is the length of the window in samples. The window must be long enough to encompass several pitch periods to produce a smooth representation of the amplitude of the signal. At the same time, the window must be short enough to reflect the rapid changes in amplitude that occur at the voiced/unvoiced boundaries. The selection of the window size is a compromise, since the pitch period at an 8 kHz sampling rate may range from as few as 16 samples for a high pitched female or child's voice up to 200 samples for a low pitched male voice. A window size of 160 samples, or about 20 msec, is a good compromise. One of the advantages of using a tool like Octave to prototype algorithms is that it makes it easier to experiment with parameters like the window size.
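The numbers behind this compromise are easy to check in Octave. The variable names below are illustrative; the values are the ones quoted above.

```octave
fs = 8000;                    # sampling rate in Hz
win = 160;                    # window length in samples
win_ms = 1000 * win / fs;     # window duration in milliseconds

short_period = 16;            # samples, high pitched voice
long_period = 200;            # samples, low pitched voice

printf("window: %g ms\n", win_ms);
printf("pitch range: %g Hz down to %g Hz\n", fs / short_period, fs / long_period);
printf("periods per window: %g to %g\n", win / short_period, win / long_period);
```

The window spans 10 periods of the highest pitch but only 0.8 of a period of the lowest, which is where the compromise shows.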
One problem with the short time energy function is that it is very sensitive to large signal levels since the sample values are squared. This isn’t a problem in Octave since Octave scales audio samples to the range +/- 1. In a production application using fixed point math this is a disadvantage. In addition, a multiply operation is required for each sample. So instead of using the short time energy function, I’ll use a related function, the short time magnitude function shown in equation 3.
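To put rough numbers on the fixed point concern, the sketch below estimates how many bits an unsigned accumulator would need for each measure. It assumes 16 bit samples and a 160 sample window; the variable names are illustrative.

```octave
max_sample = 2^15;     # largest magnitude of a 16 bit sample
win = 160;             # window length in samples

# Worst case sums over one window
energy_bits = ceil(log2(win * max_sample^2));   # sum of squared samples
mag_bits = ceil(log2(win * max_sample));        # sum of absolute values

printf("energy: %d bits, magnitude: %d bits\n", energy_bits, mag_bits);
# prints "energy: 38 bits, magnitude: 23 bits"
```

Under these assumptions the magnitude sum fits comfortably in a 32 bit register, while the energy sum does not.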

Equation 3 Short time magnitude.
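Written out, equation 3 takes the following form. This is a reconstruction that reuses the window h(n) from equation 2; Rabiner & Schafer write the window as w(n), so the symbol choice is an assumption.

```latex
\begin{equation}
M_n = \sum_{m=-\infty}^{\infty} \left| x(m) \right| \, h(n-m)
\end{equation}
```

The squaring in the energy sum is replaced by an absolute value, so no multiply is needed per sample.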
Next post I’ll show the short time magnitude calculation in Octave and a plot for our speech segment, “Mister Maryk”.
My project, Speed Reader, has been approved for the BeagleBoard Sponsored Projects Program. Now I’ll receive a BeagleBoard to prototype a Speech Timescale Modification application for playing audio books. I’m one step closer to building a real application!
I’m transcribing my paper notes on short-time measurement of the speech signal and needed a way to insert some equations into my blog post. I found that an easy way to do this is to enter the equations in GNU TeXmacs, which can produce PostScript files. Then I use GIMP, the GNU Image Manipulation Program, to crop the PostScript page and convert the result to a PNG image. I’m barely scratching the surface of TeXmacs’ capabilities. It can be used to write complete scientific papers. Despite its name, TeXmacs requires no knowledge of Donald Knuth’s TeX language. From downloading TeXmacs to generating a PNG image of an equation took me less than an hour.
The plot from last time doesn’t reveal much about the nature of speech so I’ve been looking at some smaller bits. First I isolated the first phrase from the sample speech: “Mister Maryk”:

Plot of the phrase "Mister Maryk"
This is a good chunk of speech for experimenting with classification algorithms, but it is still too large a sample to see the fine grain structure of the sound wave form. Next let’s take a look at the “eh” sound from the phrase “Mister Maryk”.
You can clearly see a periodic wave form, though it is a bit noisy. This is voiced speech, and I’ll use a pitch detection algorithm on sounds like this to determine the pitch period. Just judging by eye, it looks like the period is about 75 samples, which at an 8 kHz sample rate works out to be about 107 Hz. Yes, Humphrey Bogart had a pretty low voice. Note that all of the plots were made using GNU Octave and gnuplot.

Closer look at the "eh" sound from the phrase "Mister Maryk"
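The pitch estimate is just the sampling rate divided by the period in samples. The lines below redo that arithmetic; the 75 sample period is the by-eye estimate from the plot, not a measured value.

```octave
fs = 8000;                  # sampling rate in Hz
period = 75;                # pitch period in samples, estimated by eye
f0 = fs / period;           # fundamental frequency in Hz
period_ms = 1000 * period / fs;

printf("pitch: %.1f Hz, period: %.3f ms\n", f0, period_ms);
```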
Two gross measurements are used to classify the component sounds of speech in this application: the short term average magnitude and the short term zero crossing rate. I’ll talk about those in my next post.
The first step is to get some sample speech to work with. I found this clip on the Web. The first problem is that all of the speech samples that I could find on the Web were encoded in mp3 format. Speech processing requires linear encoding. I also wanted to sample at 8kHz which is about the minimum required for quality speech; 8kHz was the standard used in the public phone system for digitized speech. The lower sampling rate reduces the amount of computation. There is an open source tool, ffmpeg, that is primarily used to convert video formats but since most video also includes an audio track, ffmpeg can be used to convert audio file formats. The following command will convert an audio file to signed (little endian) 16 bit linear samples at 8kHz sampling rate:
ffmpeg -i websample.wav -ar 8000 -acodec pcm_s16le sample-1.wav
After conversion the same sample sounds like this. The difference is subtle, but if you listen carefully you’ll notice some loss at high frequencies. In an actual application I would use codec plugins like those used by mplayer and ffmpeg to convert the audio into linear samples in real time.
The following graph shows this sample read using Octave’s wavread function and then plotted using Octave’s plot function:

There is a small problem here if we take a look at some samples from 10000 to 10500, an area that appears to be silence:

A section of "silence" from the same sample.
You’ll notice that zero is near the top of the plot, offset from where you would expect to see it. This represents a DC offset in the speech signal. The source of the offset is the wavread function in Octave itself. Digging into the documentation on SourceForge, I found the explanation:
Note that translating the asymmetric range [-2^n,2^n-1] into the symmetric range [-1,1] requires a DC offset of 2/2^n. The inverse process used by ausave requires a DC offset of -2/2^n, so loading and saving a file will not change the contents. Other applications may compensate for the asymmetry in a different way (including previous versions of auload/ausave) so you may find small differences in calculated DC offsets for the same file.
It looks like the wavread function does the same conversion. This offset works out to be 1/16384 so it can probably be ignored.
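The size of the offset follows from the quoted formula with n = 15 for 16 bit samples. The sketch below checks the arithmetic and, as an assumption about what one could do if the offset ever mattered, removes it from a synthetic signal by subtracting the mean.

```octave
n = 15;                  # 16 bit samples span [-2^15, 2^15 - 1]
offset = 2 / 2^n;        # DC offset from the quoted documentation
printf("offset = 1/%d\n", 1 / offset);   # prints "offset = 1/16384"

# Synthetic signal: ten full cycles of a sine plus the offset
x = 0.1 * sin(2 * pi * (0:799) / 80) + offset;
x_centered = x - mean(x);                # subtracting the mean removes the offset
printf("residual mean: %g\n", mean(x_centered));
```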