In my roles at Kurzweil, Lernout & Hauspie, Nuance, and now Voicebrook, I have been demonstrating speech recognition technology for more than 20 years, starting with discrete speech technology, when “you”, “had”, “to”, “speak”, “like”, “this”. In my current role as Manager of Sales Engineering, I am in front of potential customers performing demos and fielding questions. Even though the technology has made tremendous strides and is becoming ubiquitous in modern life, the misconceptions and questions that I hear have not changed much. Here are the 11 most common questions and concerns that I have heard from prospective clients.
1. How does Speech Recognition learn my voice?
Speech recognition technology uses pattern recognition to understand spoken words. It listens for phonemes (the distinct sounds of the English language) and combines what it hears with a language model, which is built around the types of words and phrases that a person uses when dictating for a specific function, such as a Pathology report, to determine what the speaker is most likely saying.
In the early years, the software needed to learn a person’s individual voice through “enrollment,” the process whereby it learned the way a speaker pronounced the 44 phonemes in the English language. Initially the technology required an enrollment of 45 minutes or more, but the newest products do not require enrollment at all!
2. How accurate is it?
Out of the box, most users have higher than 95% accuracy. With the proper implementation of correction techniques, initial dictation accuracy quickly ends up in the 98% to 99% range.
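To put those percentages in perspective, here is a rough back-of-the-envelope sketch. It assumes that "accuracy" means the share of dictated words transcribed correctly, and the 400-word report length is just an illustrative figure, not a number from any product.

```python
def expected_errors(word_count: int, accuracy: float) -> int:
    """Estimate how many words need correcting at a given word-level accuracy.

    Assumption: accuracy is the fraction of words transcribed correctly.
    """
    return round(word_count * (1 - accuracy))


# Hypothetical 400-word report at the accuracy levels mentioned above.
for accuracy in (0.95, 0.98, 0.99):
    print(f"{accuracy:.0%} accurate: about {expected_errors(400, accuracy)} corrections")
```

Under these assumptions, moving from 95% to 99% cuts the corrections on a 400-word report from about 20 to about 4.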
3. What about accents?
The software comes with various American English and non-American English accent models. If a user speaks with one of those accents, the software considers them a “native” speaker and the accuracy numbers above apply. If they have an accent that is not native to the software, they can still enroll their voice, and as long as they appropriately correct errors that occur while dictating, they will reach the same level of accuracy as native speakers.
4. What about background noise?
We live and work in a noisy world. The voice recognition engine in VoiceOver employs noise-canceling technology to create a clean sound for the software to process. We also provide microphones that use noise-canceling technology to filter out extraneous noise. By employing “Best Practices,” we ensure that our clients have the highest level of performance for their environment.
For example, the best microphone to use in a noisy pathology grossing area is a close-talking microphone that the user wears, as opposed to a boom microphone that sits on the desk. A close-talking microphone can filter out the ventilation fan, other people and equipment, and even the music that is often played in the lab!
5. How does it work with complicated words?
People have asked me how the software can understand complicated Pathology words. In reality, the more complex the word, the easier it is for the speech recognition engine to understand it. There are various language models and vocabularies based on the specific function you are performing. There are general models, legal models, and a medical model that includes well over 80 sub-specialties. A language model consists of an active vocabulary (160,000 words) combined with a predictive measure of how those words are used together. This measure is based on frequency analysis of millions of targeted documents, which determines what words appear in the context of other words.
For example, in the pathology language model, if I were to say “moderately differentiated adenocarcinoma”, it would never transcribe it as “moderately differentiated ad campaign”. This is because the words “ad” and “campaign” would either not appear in the active vocabulary or, if they did, would be low-probability words. In addition, after the words “moderately differentiated”, the phonetic and analytical components of the engine would most likely determine, based on what was heard, that the next word should be “adenocarcinoma”.
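The idea of combining what was heard with what is likely in context can be sketched in a few lines. This is only a toy illustration, not how any particular engine is implemented: the probabilities below are invented, and the names `ACOUSTIC_PROB`, `LANGUAGE_MODEL_PROB`, and `best_candidate` are hypothetical.

```python
import math

# Invented numbers for illustration only. The "acoustic" score says how well
# each candidate matches the sound; here the wrong phrase even sounds slightly better.
ACOUSTIC_PROB = {
    "adenocarcinoma": 0.45,
    "ad campaign": 0.55,
}

# Invented probabilities of each candidate following "moderately differentiated"
# under two hypothetical language models.
LANGUAGE_MODEL_PROB = {
    "pathology": {"adenocarcinoma": 2e-2, "ad campaign": 1e-7},
    "general": {"adenocarcinoma": 1e-6, "ad campaign": 1e-3},
}


def best_candidate(model_name: str, candidates: list[str]) -> str:
    """Pick the candidate with the highest combined acoustic + language score."""
    model = LANGUAGE_MODEL_PROB[model_name]

    def score(candidate: str) -> float:
        # Adding log-probabilities is equivalent to multiplying probabilities.
        return math.log(ACOUSTIC_PROB[candidate]) + math.log(model[candidate])

    return max(candidates, key=score)


candidates = ["adenocarcinoma", "ad campaign"]
print(best_candidate("pathology", candidates))
print(best_candidate("general", candidates))
```

With these made-up numbers, the pathology model picks “adenocarcinoma” even though the other phrase scores slightly higher on sound alone, while the general model picks “ad campaign”. That is the effect of choosing a language model that matches the function being dictated.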
6. How fast can I talk?
You can dictate as fast as 160 words per minute (wpm).
7. I tried it and it didn’t work for me. Why will it work now?
I have been hearing this comment for many years. Speech Recognition has been around for over 20 years. Like many emerging technologies, it has had its share of growing pains and, in turn, improvements.
Invariably, when I ask the person how long ago they tried it, the answer is “years ago.” I tell them to compare the video games of today with early games such as “Pong,” and the analogy helps to overcome their preconceived aversion.
8. Has there ever been someone you could not train?
When prospects ask this question, I have found that in most cases they are expecting the answer to be “yes,” and the reason to be a user’s strong accent. The truth is, yes, there have been a few users that we have not been able to train, but accents were never the problem. In every case, it had to do with “intent.” The doctors did not want to use the software but were being forced to use it by their Chief or by the practice Administration. They did not want to put in the initial time and effort that success requires, and they did not follow the simple steps needed to be successful. While I am not saying that the software is difficult to use, like most things in life, success comes with practice and patience.
9. Do I need a spell checker?
No. One of the secrets of speech recognition technology is that it never makes a spelling error. It may put the wrong word or words in, but it will never misspell them. In all of the years that I have worked with the technology, I have only come across a few typos in my dictation and when I researched the errors, I realized that I added the misspelled words myself!
10. Does it learn?
Absolutely! If a user corrects their dictation errors with the simple correction methodologies that our implementation specialists teach, the software does learn and becomes more accurate.
11. How does it deal with “ums” and “ahs”?
Speech Recognition technology has a “Disfluency Filter.” Speech disfluency is defined as “any interruption in the normal flow of speech.” “Ums” and “ahs” are perfect examples of this. The software has the ability to filter most of those out. If it actually adds a small word when it hears the “um” or “ah” then the user can use correction techniques to teach the software to ignore the disfluency.
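As a rough illustration of the concept only, the sketch below drops filler words from text that has already been transcribed. A real engine filters disfluencies while processing the audio, so this is a simplified stand-in, and the `FILLERS` list and `strip_disfluencies` function are hypothetical names.

```python
# Hypothetical filler list for illustration; not taken from any product.
FILLERS = {"um", "uh", "ah", "er"}


def strip_disfluencies(transcript: str) -> str:
    """Remove stand-alone filler words from a transcript, keeping everything else."""
    kept = [
        word
        for word in transcript.split()
        # Ignore trailing punctuation and case when matching fillers.
        if word.strip(",.").lower() not in FILLERS
    ]
    return " ".join(kept)


print(strip_disfluencies("The um specimen is ah received in formalin"))
```

Matching whole words rather than substrings matters here: a filter that removed every “um” would also damage words such as “lumen” or “serum”.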
Voice recognition is rapidly becoming commonplace in both our personal lives and in the Pathology laboratory. Many people have heard about it and have used it on their phone or when calling customer service, but few professionals have seen it demonstrated or used it in their professional environment. If you are interested in scheduling a demo, please click the button below to get started.