Speech to Text software


I've got a lot of MP3 and WAV files of talks and sermons that I am going to transcribe into text. These talks go many years back and hence the transcripts have been lost. There are hours and hours of talks and I'm looking for software that will help make my job easier.

OS X comes with some very good text to speech capabilities. Is there any capability to do speech to text on OS X? It would be great if there were free/cheap shareware products that do this task. Any recommendations?


As I understand it, iListen from MacSpeech is the only speech-to-text app for the Mac. In any case, it's clearly the most developed.

I forget the exact price; something like $100-$150.

The real investment is the time, though: setting up and learning a speech-to-text program is not a small matter, and it requires a level of patience most apps don't demand.

There is a mailing list where you can ask questions of other users.

Google for the URL; I can't recall it off the top of my head.


Be careful of ViaVoice; I think it may no longer be under development. Speech recog is a big time investment, and you don't want to get sucked into a product that has been abandoned.

If you only have one project, like a pile of tapes you want converted to text, hire a transcriptionist. There is no cheap, easy way to quickly convert sound to text in a reliable manner, not yet.

This is my understanding, feel free to add your own, I'm not an expert.


I also wanted to add that you might find some people who would be glad to spend an hour or two transcribing for you at a reasonable price... On the other hand, even if the software actually works, I guess it could still get difficult: if the software misunderstands something and you have to go over _all_ of them again, you might as well transcribe them yourself, since that won't take much longer...


If you have an ongoing need for speech recog that merits spending many hours getting set up, then iListen is certainly worth a look.

The software works, imperfectly, after a lot of input and training on your part.

I was just trying to warn folks that speech recog is not at the same place as normal software. It's still a developing field and things never work perfectly, no matter how much you work at it. It takes a different mindset than normal software, and probably isn't practical for quick one-time tasks.


I am totally and completely frustrated that after all these years, and all the technology out there, the brilliant Mac software developers have still not come up with a speech-to-text system that will automatically transcribe third-party recorded interviews.

My job involves a huge amount of transcription of digitally recorded interviews with people and this task of transcribing interviews is enormously boring, time consuming and mind-numbing.

Because of timelines and the nature of the work, I can't email the work off to some person in India doing it for three dollars an hour or whatever ... I need a machine that can do it and it is a perfect job for a machine ... I want software to take it so I can concentrate on the more creative stuff.

The transcription systems that are out there, like iListen and ViaVoice, are totally useless to me because they only work on the principle of first "training" the software on your voice and your voice inflections. And obviously, I don't need software to recognize my voice ... I need it to recognize other people's voices, and I need software that can do this without needing to be "trained."

And I recognize that such software will make mistakes ... but I don't need a *perfect* system because I have to edit the words after they are transcribed anyway. What I need is a system that will reduce the laborious task of transcribing because it can more or less recognize other people's voices, make sense of what they say and give me a transcription that is pretty close to what was said.

I know that what I am asking for is a very complicated and a very sophisticated task for software, because every voice is different and we all tend to run our sentences together ... and it's complicated because some people say Tomaaaato and some people say tomito, and some people talk a mile a minute and others speak veeeeeeeery sloooooowly. Not to mention dealing with all the different foreign accents in just the English language.

But I have been thinking about the various ways around these problems and I have a few ideas.

(I don't write software so I can't do it, but if there is a developer out there who can do it, I'd be very appreciative!)

What I am thinking is... why not break this problem down into several steps?

1) Develop software that will take a digital recording of someone's natural speech and basically "flatten" it out so that it sounds just like one of the computer-generated voices ... like "Bruce" or whatever ...

There are various existing sound applications that can change the tone, pitch, speed, etc. of sound ... so I am thinking that this first step shouldn't be too, too far beyond today's technological capability.

2) Once you have done that, all you have to do is pre-train the speech-to-text software to recognize the above "voice" ... so you provide a pre-trained system with a decent built-in vocabulary/dictionary, plus the ability for the user to add new words to that vocabulary/dictionary.

3) I run my interviews through this software ... the software automatically does step one, and then step two, and gives me the result.

Yes, it might be slow and take a bit of time, but it should certainly save me hours of transcription time.
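For what it's worth, the two-step idea above can be sketched in code. This is only a toy, with the audio processing reduced to loudness and speaking-rate normalization over raw sample lists and the "recognizer" reduced to a nearest-match lookup; every function and name here is made up for illustration, not a real speech product's API:

```python
# Toy sketch of the proposed pipeline: (1) "flatten" a recording toward
# one canonical voice, (2) hand the flattened audio to a recognizer
# "pre-trained" on that single canonical voice. All names hypothetical.

def flatten(samples, target_peak=1.0, target_len=8):
    """Step 1 (toy): normalize loudness and speaking rate.

    Scales the samples to a fixed peak amplitude and stretches or
    squeezes them to a fixed length by linear interpolation, standing
    in for real pitch/tempo normalization."""
    peak = max(abs(s) for s in samples) or 1.0
    scaled = [s * target_peak / peak for s in samples]
    out = []
    for i in range(target_len):
        pos = i * (len(scaled) - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(scaled) - 1)
        frac = pos - lo
        out.append(scaled[lo] * (1 - frac) + scaled[hi] * frac)
    return out

def recognize(flattened, model):
    """Step 2 (toy): a recognizer 'pre-trained' on the canonical voice,
    here just a nearest-match lookup against stored reference clips."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda word: dist(flattened, model[word]))

# A "model" pre-trained on flattened reference recordings of each word.
model = {
    "hello": flatten([0.1, 0.5, 0.9, 0.4]),
    "world": flatten([0.9, 0.2, 0.8, 0.1]),
}

# A new speaker: louder and slower, but the same shape as "hello".
new_recording = [0.4, 1.2, 2.0, 3.6, 1.6]
print(recognize(flatten(new_recording), model))  # → hello
```

The point of the sketch is the division of labor: once everything is flattened to the one canonical voice, the recognizer only ever has to be trained once.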

Any thoughts on this?? Can it be done? Has anybody tried anything remotely approaching this?


Since you are "thinking big", you might want to consider an approach that should achieve the desired independence from personal habits, inflections and so on: phoneme encoding. The basic idea is that there are fewer than 300 phonemes used in human speech -- nearly all of them would fit into one byte, plus a couple more bytes for pitch, loudness, attack, decay, length etc. (see MIDI).

The speech-to-phoneme encoding software would thus be spared the task of attaching any meaning to the phonemes; it would simply be an efficient, idiosyncrasy-free way of storing what someone said. Then the vastly more difficult task of attaching meaning (required before the correct spelling can possibly be deduced) would be relieved of the burden of adapting to individual voices.

Thinking about it in these terms might also help you realize just how difficult the task you describe really is! For more on phoneme encoding see http://jick.net/~jess/fi/tech/phoneme.html -- or look it up using the Free Ideas database at http://jick.net/fi/ -- and let me know if you get anywhere with this! - Cheers
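To make the byte-per-phoneme idea concrete, here is a tiny sketch of what such an encoding could look like. The phoneme table and the attribute layout (pitch, loudness, length) are invented purely for illustration; the point is just the MIDI-like packing of each phoneme plus its attributes into a few fixed bytes:

```python
import struct

# Toy sketch of the byte-per-phoneme idea: each phoneme gets one byte
# from a fixed table (a full inventory of ~300 would need two bytes,
# but a language-specific subset fits in one), plus attribute bytes for
# pitch, loudness and length, much like a MIDI note event.
PHONEMES = ["h", "eh", "l", "ow", "w", "er", "d"]  # tiny made-up table
CODE = {p: i for i, p in enumerate(PHONEMES)}

def encode(segments):
    """Pack (phoneme, pitch, loudness, length) tuples into 4 bytes each."""
    return b"".join(
        struct.pack("4B", CODE[p], pitch, loud, length)
        for p, pitch, loud, length in segments
    )

def decode(blob):
    """Recover the phoneme stream; attaching meaning/spelling would be
    a separate, later stage working only on this voice-free data."""
    return [
        (PHONEMES[c], pitch, loud, length)
        for c, pitch, loud, length in struct.iter_unpack("4B", blob)
    ]

hello = [("h", 60, 90, 5), ("eh", 62, 100, 8),
         ("l", 60, 80, 4), ("ow", 58, 95, 10)]
blob = encode(hello)
print(len(blob))              # 16 bytes for four phonemes
print(decode(blob) == hello)  # True
```

Everything speaker-specific would have to be squeezed out at the encoding stage, so the later meaning-attaching stage only ever sees this compact, voice-free stream.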


Thanks, Jick ... that looks interesting! But it does sound very complicated!

My idea for a solution was based on the fact that there is "voice changing" software out there. There is, for example, Voice Candy, which works on a Mac, and Voice Changer for the PC. These are systems that allow people to disguise their voices or turn a female-sounding voice into a male-sounding voice.

So my thought was that, with some work, it should be possible to take the voices on a digital recording ... the interviews on my digital recorder ... and put them through some such "voice changing" software that would then "change" the voices on the recording to one tone, one voice.

Then, in the next step, you could take that and put it through software like ViaVoice that has been trained on that sound, that mechanized voice, and get your automatic speech-to-text transcription of an interview with various people.

But yes, I think that somehow, the phoneme encoding that you describe would have to be incorporated ... because maybe it would be impossible for voice changing software to really change the voices of many speakers to one monotone?

I am not sure ... I am not a software developer. But I wish someone with the software smarts would try to do it because there are lots of people (not to mention courtrooms that do court reporting and so forth) who would pay to get such software!