Handling Speech Samples with GNU Octave

The first step is to get some sample speech to work with.  I found this clip on the Web.  The first problem is that all of the speech samples I could find on the Web were encoded in mp3 format, and speech processing requires linear encoding.  I also wanted to sample at 8 kHz, which is about the minimum required for quality speech; 8 kHz was the standard used in the public phone system for digitized speech.  The lower sampling rate also reduces the amount of computation.  There is an open source tool, ffmpeg, that is primarily used to convert video formats, but since most video also includes an audio track, ffmpeg can be used to convert audio file formats as well.  The following command converts an audio file to signed 16-bit little-endian linear samples at an 8 kHz sampling rate:

ffmpeg -i websample.wav -ar 8000 -acodec pcm_s16le sample-1.wav

After conversion the same sample sounds like this.  The difference is subtle, but if you listen carefully you’ll notice some loss at high frequencies.  In an actual application I would use codec plugins like those used by mplayer and ffmpeg to convert the audio into linear samples in real time.

The following graph shows this sample read using Octave’s wavread function, then plotted using Octave’s plot function:

Plot of the audio sample (several seconds long).

A small problem shows up if we take a closer look at samples 10000 to 10500, a stretch that appears to be silence:

A section of "silence" from the same sample.
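To put that range of sample numbers in terms of time, here is a minimal sketch.  It assumes the 8 kHz rate from the conversion above and uses a vector of zeros as a stand-in for the data returned by wavread; the variable names are my own for illustration.

```octave
fs    = 8000;     % sampling rate chosen in the ffmpeg conversion above
first = 10000;    % first sample of the segment (Octave indices start at 1)
last  = 10500;    % last sample of the segment

% Stand-in for the vector returned by wavread: two seconds of zeros.
% Replace this with the real sample data.
y = zeros(2 * fs, 1);

segment  = y(first:last);          % 501 samples, both ends included
t_start  = (first - 1) / fs;       % start time in seconds, about 1.25 s
duration = numel(segment) / fs;    % length in seconds, about 63 ms

printf('%d samples starting at %.3f s, lasting %.1f ms\n', ...
       numel(segment), t_start, 1000 * duration);
```

So the segment covers only about 63 ms of audio starting roughly 1.25 seconds into the clip.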

You’ll notice that zero is near the top of the plot rather than in the middle where you would expect to see it.  This represents a DC offset in the speech signal.  The source of the offset is the wavread function in Octave itself.  Digging into the documentation on sourceforge, I found the explanation:

Note that translating the asymmetric range [-2^n,2^n-1] into the symmetric range [-1,1] requires a DC offset of 2/2^n. The inverse process used by ausave requires a DC offset of -2/2^n, so loading and saving a file will not change the contents. Other applications may compensate for the asymmetry in a different way (including previous versions of auload/ausave) so you may find small differences in calculated DC offsets for the same file.

It looks like the wavread function does the same conversion.  For 16-bit samples n = 15, so the offset works out to 2/2^15 = 1/16384 of full scale, which is small enough that it can probably be ignored.
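As a rough check on that arithmetic, the sketch below computes the offset for 16-bit samples (n = 15 is my assumption for this file) and shows one common way to measure and remove a DC offset: subtract the mean.  The 440 Hz tone is a made-up stand-in for real audio, not the clip used above.

```octave
n         = 15;                  % 16-bit signed samples span [-2^15, 2^15 - 1]
offset    = 2 / 2^n;             % DC offset quoted in the documentation
offset_db = 20 * log10(offset);  % same offset relative to full scale, in dB

% Stand-in for a decoded clip: a 440 Hz tone riding on that offset.
fs = 8000;
t  = (0:fs - 1)' / fs;           % one second of sample times
y  = 0.25 * sin(2 * pi * 440 * t) + offset;

measured = mean(y);              % the mean estimates the DC component
y_fixed  = y - measured;         % subtracting it centres the signal on zero

printf('offset = %.3g (%.1f dB), measured = %.3g\n', ...
       offset, offset_db, measured);
```

The offset comes out around 84 dB below full scale, which supports ignoring it, but subtracting the mean is cheap if a later processing step turns out to be sensitive to it.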
