I am totally and completely frustrated that, after all these years and all the technology out there, the brilliant Mac software developers have still not come up with a speech-to-text system that will automatically transcribe third-party recorded interviews.
My job involves a huge amount of transcription of digitally recorded interviews, and the task is enormously boring, time-consuming and mind-numbing.
Because of timelines and the nature of the work, I can't email the work off to some person in India doing it for three dollars an hour or whatever ... I need a machine that can do it and it is a perfect job for a machine ... I want software to take it so I can concentrate on the more creative stuff.
The transcription systems that are out there, like iListen and ViaVoice, are totally useless to me because they only work on the principle of first "training" the software on your voice and your voice inflections. And obviously, I don't need software to recognize my voice ... I need it to recognize other people's voices, and I need software that can do this without needing to be "trained."
And I recognize that such software will make mistakes ... but I don't need a *perfect* system because I have to edit the words after they are transcribed anyway. What I need is a system that will reduce the laborious task of transcribing because it can more or less recognize other people's voices, make sense of them and give me a transcription that is pretty close to what was said.
I know that what I am asking for is a very complicated and very sophisticated task for software because every voice is different and we all tend to run our sentences together ... and it's complicated because some people say Tomaaaato and some people say tomito, and some people talk a mile a minute and others speak veeeeeeeery sloooooowly. Not to mention dealing with all the different foreign accents in just the English language.
But I have been thinking about the various ways around these problems and I have a few ideas.
(I don't write software so I can't do it, but if there is a developer out there who can do it, I'd be very appreciative!)
What I am thinking is... why not break this problem down into several steps?
1) Develop the software that will take a digital recording of someone's natural speech and basically "flatten" it out so that it sounds just like one of the computer-generated voices ... like "Bruce" or whatever ...
There are various existing sound applications that can change the tone, pitch, speed, etc. of sound... so I am thinking that this first step shouldn't be too, too, far beyond today's technological capability.
2) Once you have done that, then all you have to do is pre-train the speech-to-text software to recognize the above "voice," ... so you provide a pre-trained system with an in-built decent vocabulary/dictionary plus the ability for the user to add new words to that vocabulary/dictionary.
3) I run my interviews through this software ... the software automatically does step one, and then step two, and gives me the result.
Yes, it might be slow and take a bit of time, but it should certainly save me hours of transcription time.
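To make the three steps above a bit more concrete, here is a rough Python sketch of how they might be wired together. Everything in it is an assumption for illustration only: the names flatten_voice, transcribe_interview and placeholder_recognizer are made up, and the "flattening" is just volume normalization plus naive resampling (which changes speed and pitch together). It comes nowhere near making a real voice sound like "Bruce", and the recognizer is a stand-in that recognizes nothing. It only shows the shape of the pipeline.

```python
from typing import Callable, Iterable, List, Set

Samples = List[float]
# Hypothetical recognizer signature: (audio samples, known words) -> transcript.
Recognizer = Callable[[Samples, Set[str]], str]


def flatten_voice(samples: Samples, speed: float = 1.0) -> Samples:
    """Step 1 stand-in: even out volume and speaking rate.

    Scales the loudest sample to 1.0, then resamples by linear
    interpolation. speed > 1 shortens the audio, speed < 1 stretches it.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not samples:
        return []
    peak = max(abs(s) for s in samples) or 1.0  # avoid dividing by zero on silence
    scaled = [s / peak for s in samples]
    last = len(scaled) - 1
    out_len = max(1, int(len(scaled) / speed))
    out = []
    for i in range(out_len):
        pos = i * speed                # position in the original audio
        lo = min(int(pos), last)       # sample just before that position
        hi = min(lo + 1, last)         # sample just after it
        frac = pos - lo
        out.append(scaled[lo] * (1.0 - frac) + scaled[hi] * frac)
    return out


def transcribe_interview(
    samples: Samples,
    recognizer: Recognizer,
    vocabulary: Set[str],
    extra_words: Iterable[str] = (),
    speed: float = 1.0,
) -> str:
    """Step 3: run step 1, then hand the result to the step 2 recognizer."""
    words = set(vocabulary) | set(extra_words)  # built-in words plus user additions
    return recognizer(flatten_voice(samples, speed), words)


def placeholder_recognizer(samples: Samples, vocabulary: Set[str]) -> str:
    """Stand-in for step 2; a real pre-trained recognizer would go here."""
    return f"[{len(samples)} samples, {len(vocabulary)} known words]"


if __name__ == "__main__":
    audio = [0.0, 2.0, 4.0, 2.0]  # made-up samples, not a real recording
    print(transcribe_interview(audio, placeholder_recognizer, {"tomato"},
                               extra_words=["iListen"], speed=2.0))
```

The one design point worth noting is that the recognizer is passed in as a function, so the same "flattened" audio could be handed to whatever pre-trained engine turns out to work, and the user's added words travel along with the built-in vocabulary.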
Any thoughts on this?? Can it be done? Has anybody tried anything remotely approaching this?