Signal processing algorithms are usually prototyped and tested using high-level math tools such as MATLAB or Mathcad. These tools use a high-level language that closely models actual mathematical equations, and they include many built-in math functions and utilities to plot results. MATLAB even has add-on packages that can generate C/C++ code or HDL for implementing signal processing applications on a DSP or an FPGA. The only problem is that these tools are too expensive to purchase for personal development projects.
Fortunately, there is an open-source alternative: GNU Octave. Octave’s language is mostly compatible with MATLAB. You can also write your own modules for Octave in C/C++ and invoke them from Octave. Octave is integrated with gnuplot to provide graphical output. Octave provides a highly interactive environment, making it easier to quickly debug algorithms. This should prove to be much quicker than writing C++ code and trying to debug both the algorithm and the code at the same time.
GNU Octave includes an audio processing package that will be useful for prototyping speech processing algorithms. The package includes functions for reading and writing digitized audio in WAV file format, playing processed audio using the computer’s sound card, and plotting digital audio data. There are also packages for digital signal processing and communications that may prove useful in this project. You don’t even need to buy yet another manual to use Octave. The entire 575-page manual is online.
The Beagle Board is a low-cost single-board computer that is ideally suited to prototyping multimedia embedded systems applications. It is based on the Texas Instruments OMAP3530 application processor. The OMAP3530 integrates an ARM Cortex-A8 RISC processor and a TI C64x+ DSP, making it an ideal processor for digital audio and video processing applications. The Beagle Board is supported by eLinux, an open source embedded version of Linux.
The BeagleBoard Sponsored Projects Program gives a Beagle Board to each project it approves. My project is called Speed Reader and is an application of timescale modification of speech to an audio book player. Speed Reader has been submitted to the BeagleBoard Sponsored Projects Program.
Basically, timescale modification of speech is accomplished by first dividing the speech into segments. Then segments are deleted to speed up the speaking rate, or segments are repeated to slow it down. The two key issues are how to segment the speech and which segments can be deleted or repeated without degrading intelligibility.
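The delete-or-repeat idea can be sketched in a few lines of Python. This is only a minimal illustration: it assumes fixed-length segments and pays no attention to what kind of sound each segment holds. The function name `change_rate` and its parameters are hypothetical, chosen for this example.

```python
def change_rate(samples, segment_len, rate):
    """Naive timescale change by dropping or repeating whole segments.

    rate > 1 speeds the speech up (segments are skipped);
    rate < 1 slows it down (segments are repeated).
    """
    if segment_len <= 0 or rate <= 0:
        raise ValueError("segment_len and rate must be positive")
    # Cut the signal into fixed-length segments (an assumption of this sketch).
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    out = []
    n_out = round(len(segments) / rate)
    for k in range(n_out):
        # Output segment k is copied from input position k * rate.
        src = min(int(k * rate), len(segments) - 1)
        out.extend(segments[src])
    return out


samples = list(range(8))               # stand-in for audio samples
print(change_rate(samples, 2, 2.0))    # every other segment is deleted
print(change_rate(samples, 2, 0.5))    # every segment is played twice
```

Placing segments end to end like this ignores both the type of sound in each segment and the discontinuities at the joins, which is why the choice of segment boundaries matters so much.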
Speech can be viewed as three broad classes of sounds. The first is voiced speech, which consists of a periodic signal with a specific pitch. The vowels are voiced speech. Second is unvoiced speech, which is a broadband, noise-like signal. The ‘s’ and ‘sh’ sounds are examples of unvoiced speech. Third, there are transition sounds called stops and plosives. These sounds are a sudden transition to or from silence. The ‘t’ sound at the end of a word is a stop, while the ‘p’ sound at the beginning of a word is a plosive. (Rabiner, Schafer: Introduction to Digital Speech Processing, Now Publishers Inc, 2007). Finally, in this application we are interested in a fourth category: silence. Each of these categories of sounds is handled differently in segmenting speech for timescale modification.
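One common way to tell some of these categories apart is to look at short-time energy and zero-crossing rate for each frame of samples. The Python sketch below assumes that approach; it is not taken from the reference above. The function name and both thresholds are hypothetical and would need tuning on real recordings, and stops and plosives are not detected at all.

```python
import math
import random


def classify_frame(frame, silence_energy=1e-4, zcr_threshold=0.25):
    """Label a frame 'silence', 'voiced' or 'unvoiced'.

    Both thresholds are illustrative guesses, not tuned values.
    """
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    # Mean power of the frame.
    energy = sum(x * x for x in frame) / len(frame)
    # Fraction of adjacent sample pairs where the sign changes.
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    if energy < silence_energy:
        return "silence"
    # Noise-like sounds change sign far more often than periodic ones.
    return "unvoiced" if zcr > zcr_threshold else "voiced"


rate_hz = 8000  # assumed sample rate
vowel = [0.5 * math.sin(2 * math.pi * 100 * n / rate_hz) for n in range(240)]
rng = random.Random(0)
hiss = [rng.uniform(-0.3, 0.3) for _ in range(240)]
pause = [0.0] * 240
for name, frame in (("vowel", vowel), ("hiss", hiss), ("pause", pause)):
    print(name, classify_frame(frame))
```

The synthetic tone and random noise are only stand-ins for a vowel and an ‘s’ sound; real speech is messier, so a practical classifier would likely combine more features.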
For voiced speech we must use pitch synchronous segmentation to achieve a natural and intelligible sound. Since voiced speech is a periodic waveform, the dominant frequency, or pitch, of the signal can be detected. Voiced sounds are then segmented on period boundaries so that when segments are deleted or repeated, the periodic waveform is not distorted.
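Autocorrelation is one widely used way to find the pitch period. The sketch below assumes that method and a search range narrow enough to contain only one period of the signal; the function names and the 8 kHz sample rate are hypothetical choices for this example.

```python
import math


def estimate_pitch_period(frame, min_lag, max_lag):
    """Return the lag, in samples, where the autocorrelation peaks."""
    if not 0 < min_lag <= max_lag < len(frame):
        raise ValueError("need 0 < min_lag <= max_lag < len(frame)")
    # Use the same number of terms for every lag so scores are comparable.
    window = len(frame) - max_lag
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def split_on_periods(samples, period):
    """Cut the signal into segments that are one pitch period long."""
    return [samples[i:i + period] for i in range(0, len(samples), period)]


rate_hz = 8000  # assumed sample rate
# A 100 Hz tone stands in for a voiced sound; its period is 80 samples.
frame = [math.sin(2 * math.pi * 100 * n / rate_hz) for n in range(400)]
period = estimate_pitch_period(frame, min_lag=40, max_lag=120)
segments = split_on_periods(frame, period)
print(period, len(segments))
```

If the search range is wide enough to include two or more periods, the autocorrelation peaks again at each multiple of the true period, so a real pitch detector needs extra logic to avoid picking the wrong one.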
Unvoiced speech has no detectable period so unvoiced speech can be segmented at arbitrary boundaries. Segments of unvoiced speech must be kept short enough so that the segments do not contain parts of other types of speech signals. Periods of silence can be handled in a similar fashion, using short arbitrary length segments.
The transition sounds, stops and plosives, cannot be deleted or repeated. Deleting these important segments results in unintelligible speech, while repeating them can produce a stutter effect.
Another consideration is aligning the edges of the speech segments when deleting or repeating segments. When I first implemented a system for variable rate playback of speech in the 1980s, I simply placed the segments end to end, ignoring any discontinuities between the signal in one segment and the next. Though I had good results with this, the discontinuities add some noise to the speech. Since then a technique called Synchronized Overlap and Add (SOLA) has been developed. SOLA uses a windowing function on the segments, gradually ramping down one segment while ramping up the next segment. An overlapping portion of the windowed segments is added at the edges as shown below. (Coyle, Doran, Lawlor: Audio Time-Scale Modification Using a Hybrid Time-Frequency Domain Approach).
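The ramp-down, ramp-up and add step can be sketched as a simple crossfade. This Python example assumes a linear ramp and a fixed overlap length, and it leaves out the synchronization step of SOLA that searches for the best alignment between segments. The function name `crossfade_join` is hypothetical.

```python
def crossfade_join(first, second, overlap):
    """Join two segments, blending the last `overlap` samples of `first`
    with the first `overlap` samples of `second` using linear ramps."""
    if not 0 < overlap <= min(len(first), len(second)):
        raise ValueError("overlap must fit inside both segments")
    # Samples before the overlap are copied unchanged.
    out = list(first[:len(first) - overlap])
    start = len(first) - overlap
    for i in range(overlap):
        # fade_in rises toward 1 while the weight on `first` falls toward 0.
        fade_in = (i + 1) / (overlap + 1)
        out.append(first[start + i] * (1 - fade_in) + second[i] * fade_in)
    # Samples after the overlap are copied unchanged.
    out.extend(second[overlap:])
    return out


loud = [1.0] * 4
quiet = [0.0] * 4
print(crossfade_join(loud, quiet, overlap=3))
```

Because the two weights always sum to one, a signal that is identical in both segments passes through the join unchanged, and a jump between segments is spread over the whole overlap instead of landing on a single sample.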

The first multimedia software project that I’m working on is an application that changes the playback rate of recorded speech, also called timescale modification of digitized speech in the technical literature. Obviously this is more complicated than just playing back digitized speech at a faster or slower rate than it was originally recorded. Simply changing the playback rate changes both the speech rate and the pitch of the speech. The object of this development is to play digitized speech at different rates with no change in pitch and no loss of intelligibility.
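A small experiment shows why plain faster playback is not enough. The Python sketch below plays a test tone at double speed by keeping every second sample, then measures the frequency by counting upward zero crossings. The 200 Hz tone, the 8 kHz sample rate and the function names are all assumptions made for this illustration.

```python
import math


def play_faster(samples, rate):
    """Crude speed-up: step through the input `rate` samples at a time."""
    n_out = int(len(samples) / rate)
    return [samples[int(k * rate)] for k in range(n_out)]


def estimate_frequency(samples, rate_hz):
    """Count upward zero crossings (one per cycle) per second of audio."""
    cycles = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return cycles * rate_hz / len(samples)


rate_hz = 8000  # assumed sample rate
# One second of a 200 Hz tone; the small phase offset avoids exact zeros.
tone = [math.sin(2 * math.pi * 200 * n / rate_hz + 0.1)
        for n in range(rate_hz)]
fast = play_faster(tone, 2.0)

f_orig = estimate_frequency(tone, rate_hz)
f_fast = estimate_frequency(fast, rate_hz)
print(len(tone), len(fast))   # the fast version is half as long
print(f_orig, f_fast)         # and its pitch is about twice as high
```

The clip gets shorter, but the pitch rises by the same factor, which is exactly the side effect that timescale modification is meant to avoid.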
I’ll give an overview of the techniques used to accomplish this in a later post but first I want to describe why we might be interested in doing this at all. Here are a few possible applications:
- Voice Mail: Have you ever gotten one of those long-winded voice mail messages that ramble on forever, only to miss the return phone number at the end? Or had someone rush through their name and phone number so quickly that you can’t quite catch it? This application would allow you to fast forward through the long-winded messages by speeding up the speaking rate. You could still understand the message, but it wouldn’t take so long to listen to it. Or you could slow down the important information, like names and contact numbers, to make the message more intelligible or just to give yourself enough time to remember the information.
- Speech Therapy: By slowing down the speaking rate, students would hear the correct pronunciation of words better. If the student’s speech is recorded and played back at a slow rate, they would better understand the difference between how they are pronouncing words and the correct pronunciation.
- Audio Books for the Blind: Audio books are a real boon for people who can’t see well enough to read. But this still leaves one major disadvantage. Sighted people can skim over parts of a book that are not interesting or important, then slowly read complex writing or important information. Being able to change the speaking rate when listening to an audio book could give audio book users a similar capability.
- Changing the speaking rate of digitized speech is a good learning experience because it requires understanding the component sounds that make up speech and techniques for detecting these component sounds.
- It’s fun!
So, that’s the why of this project. Next I’ll start to get into the how.