Open Source Software For Transcribing Speech in Audio Files

java python speech-recognition speech-to-text cmusphinx

12,186

Why can't it read a wav?

It tells you that the file has the wrong sampling rate: 8000 Hz instead of the requested 16000 Hz. Sampling rate is very important for speech recognition software, and the input has to match the rate the recognizer expects.
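If you want to check a file before handing it to the recognizer, here is a rough sketch using Python's standard `wave` module. The helper names `describe_wav` and `check_rate` are my own, not part of Julius or Sphinx, and the 16000 Hz default simply mirrors the rate from the error message above.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample) read from the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def check_rate(path, expected_hz=16000):
    """Return None if the sampling rate matches, else a message describing the mismatch.

    expected_hz is an assumption: use whatever rate your acoustic model expects.
    """
    rate, _channels, _bits = describe_wav(path)
    if rate != expected_hz:
        return f"{path}: sampling rate is {rate} Hz, expected {expected_hz} Hz"
    return None
```

Calling `check_rate("hello.wav")` on the file from the question should return a message about 8000 Hz rather than `None`.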

Why can't it read /dev/dsp?

Recent versions of Ubuntu use the PulseAudio framework instead of OSS. The version you are trying uses OSS, so you need to install the oss-compatibility package from your distribution to bring OSS support back.

Alternatively, you can try a newer Julius, which has PulseAudio support.

Why does it then appear to be able to read /dev/dsp, but not react in any way?

Audio input doesn't work properly.

Has anyone else had any success with open source speech recognizers, especially on Linux?

Sure, check this video as an example of what people do with CMUSphinx:

http://www.youtube.com/watch?v=vfaNLIowSyk

I suggest you revisit the CMUSphinx package, which is a leading open source speech recognition engine. There are loads of documents on the website; you just need to read them. Remember that speech recognition is a complex area where you can get great results, but you also need to invest time in understanding the technology, just like in any other domain.

In short, to transcribe a file with CMUSphinx you need to do the following 3 simple steps:

Take a wav file and resample it to an 8 kHz, 16-bit mono file with sox:

    sox input.wav -r 8000 -c 1 resampled.wav

Install pocketsphinx 0.7:

    apt-get install pocketsphinx

Decode the file:

    pocketsphinx_continuous -samprate 8000 -infile resampled.wav

The result will be printed to standard output. To suppress the log output, redirect stderr to /dev/null:

    pocketsphinx_continuous -samprate 8000 -infile resampled.wav 2> /dev/null
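If you would rather drive these steps from a script, something like the following Python sketch might do. It only wraps the same `sox` and `pocketsphinx_continuous` invocations shown above; the function names are hypothetical, and it assumes both tools are on your PATH and accept exactly those flags.

```python
import subprocess


def sox_command(src, dst, rate_hz=8000):
    """Build the sox call from the steps above: resample to rate_hz, one channel."""
    return ["sox", src, "-r", str(rate_hz), "-c", "1", dst]


def decode_command(wav, rate_hz=8000):
    """Build the pocketsphinx_continuous call, with -samprate matching the file."""
    return ["pocketsphinx_continuous", "-samprate", str(rate_hz), "-infile", wav]


def transcribe(src, resampled="resampled.wav", rate_hz=8000):
    """Resample src, decode it, and return whatever the decoder wrote to stdout.

    Assumes sox and pocketsphinx_continuous are installed; raises
    subprocess.CalledProcessError if either command exits with an error.
    """
    subprocess.run(sox_command(src, resampled, rate_hz), check=True)
    result = subprocess.run(
        decode_command(resampled, rate_hz),
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as 2> /dev/null
        text=True,
    )
    return result.stdout
```

Keeping the command construction separate from the execution makes it easy to print the exact command line and run it by hand when something goes wrong.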

Author by

Cerin

Updated on June 05, 2022

Comments

Cerin almost 2 years
Can anyone recommend reliable open source software for transcribing English speech in wav files? The two main programs I've researched are Sphinx and Julius, but I've never been able to get either to work, and the documentation with each on transcribing files is sketchy at best.

I'm developing on 64-bit Ubuntu 10.04, whose repos include sphinx2 and julius, as well as voxforge's julius acoustic model for English. I'm focussing on transcribing files, instead of directly processing sound from a mic, because I've given up on expecting projects like these to work with Ubuntu's sound system. This isn't a knock against Ubuntu, as I can record sound with my mic perfectly using Audacity, but neither system seems able to access my mic, so I'm hoping I can simplify their configuration by just reading from a file.

I first tried Sphinx2, from the Ubuntu package sphinx2-bin. Even though the sample sphinx2-demo seemed to work on transcribing a file, there's virtually no documentation on the configuration, so I'm not sure how I'd customize this to read from an arbitrary wav. The audio file used in the demo is in some undocumented "16k" format, which is indirectly referenced through 2 configuration files. There's a brief blurb describing sphinx2-demo as running sphinx2-batch, but inspecting the script shows it's actually calling sphinx2-continuous. Even worse, the --help docs for each script list about 6 dozen options, and don't mention which are required or optional. Overall, the lack of sphinx documentation and the poor quality of the existing documentation are driving me nuts.

I next tried Julius, again from the Ubuntu package, which was surprisingly recent (4.1), considering the version used in Voxforge's quickstart is 3.5. The package seems to include slightly better documentation, and even an example written in Python (/usr/share/doc/julius-voxforge/examples/controlapp). After reading the example's docs, I tried adapting it to read from a file by creating a file filelist.txt containing the text "hello.wav" referring to a file of the same name, containing a recording of someone saying "hello". Placing these in the same directory, I ran:
```
julius -input file -filelist filelist.txt -C julian.jconf
```
getting the response:
```
### read waveform input
Error: adin_file: sampling rate != 16000 (8000)
Error: adin_file: error in parsing wav header at hello.wav
Error: adin_file: failed to read speech data: "hello.wav"
0 files processed
```
Retrying with absolute filenames for filelist.txt and hello.wav produces the same error.

I also tried the Julius call used in the example, to record directly from a mic:
```
julius -input mic -C julian.jconf
```
I called this several times, and the response varied between the error:
```
Cannot read /dev/dsp
```
and:
```
STAT: AD-in thread created
<<< please speak >>>
```
In the latter case, no matter what I say into the mic, nothing happens. I can't tell if it's still unable to read the mic, or if it's reading something but is simply unable to transcribe the audio.

I'm not sure what to make of this. The errors I'm getting don't leave me with much to go on. Why can't it read a wav? Why can't it read /dev/dsp? Why does it then appear to be able to read /dev/dsp, but not react in any way?

Has anyone else had any success with open source speech recognizers, especially on Linux?
Cerin over 12 years

Where did you find an Ubuntu package for pocketsphinx 0.7? The 10.04 repo only has 0.5, which is very buggy and doesn't have an infile parameter. I could only find versions as high as 0.6 through Google.
Cerin over 12 years

Note, even when resampled to the correct rate, Julius then complains about bytes-per-second != 32000. Isn't sampling rate the same thing as bytes-per-second?
Nikolay Shmyrev over 12 years

> sampling rate the same thing as bytes-per-second? No, the file needs to be mono (single channel) too.
Michael Levy over 12 years

bytes/sample * samples/second = bytes/second. So, if you are sampling 4 bytes/sample (16-bit stereo) at 8,000 samples/second, you'll get 32,000 bytes/second.
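To make that arithmetic concrete, here is a small Python sketch. The function name is my own. The reading that Julius's 32000 figure corresponds to 16 kHz, 16-bit mono is an inference from the two error messages in this thread, not something taken from the Julius docs.

```python
def bytes_per_second(sample_rate_hz, channels, bits_per_sample):
    """Byte rate of uncompressed PCM audio: frames per second times bytes per frame."""
    bytes_per_frame = channels * bits_per_sample // 8
    return sample_rate_hz * bytes_per_frame


# The example from the comment above: 16-bit stereo at 8 kHz.
print(bytes_per_second(8000, 2, 16))   # 32000

# Assumed target for Julius: 16-bit mono at 16 kHz gives the same byte rate.
print(bytes_per_second(16000, 1, 16))  # 32000

# A 16 kHz file that is still stereo would not match.
print(bytes_per_second(16000, 2, 16))  # 64000
```

So a file can have the right sampling rate and still have the wrong byte rate if it is stereo or uses a different sample width.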
sebpiq almost 9 years

Note that on Ubuntu 14+ the package is now called pocketsphinx-utils.