Ultimate Guide to Speech Recognition

Speech recognition is the process of converting spoken words into data that a computer can act on. It is used in many settings, from businesses to households, and is especially helpful for people with physical disabilities, such as loss of vision or loss of use of the hands. Speech recognition is also used in healthcare, the military, and telephony.

Speech recognition programs can be broadly classified into two types according to vocabulary size and number of users: small vocabulary/many users and large vocabulary/limited users. The small vocabulary program is well suited to automated answering on the telephone. It can cope with different accents and variations in speech patterns, but it is restricted to basic menus and generic responses. The large vocabulary program can identify more words with greater accuracy, but it works for fewer users. Let’s take a look at the process, the issues, and the history of speech recognition.

Speech to Data

Converting your speech into data that a computer can understand is a complex process involving a series of steps. The first step is digital sampling, where the vibrations caused by your speech are measured precisely at regular intervals. An analog-to-digital converter translates these measurements into data for the computer. The system then filters out unwanted noise, separates the sound into frequency bands, and normalizes the volume. Since people speak at different speeds, the sound must also be aligned to the speed of the samples stored in the computer. Higher sampling rates and greater precision give better quality. The signal is then split into segments as short as a hundredth or even a thousandth of a second. This helps in identifying plosive consonant sounds like “p” or “t”, and the segments can be matched to phonemes (the smallest units of sound in a language). Finally comes the most important and most difficult task: examining each phoneme in the context of the phonemes around it. The contextual phonemes are put through a complex statistical model and compared with an existing library of words, phrases, and sentences. From this, the system deciphers what the user was trying to say and converts it into data.
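The first few steps above (sampling, normalizing, and splitting into short segments) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the 16,000 samples-per-second rate, the 16-bit precision, the 440 Hz test tone standing in for a microphone, and all function names are choices made for this example.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this example)
BIT_DEPTH = 16       # bits of precision per sample (assumed)
FRAME_MS = 10        # segment length: a hundredth of a second


def sample_and_quantize(signal, duration_s, rate=SAMPLE_RATE, bits=BIT_DEPTH):
    """Measure a continuous signal at regular intervals and round to integers."""
    max_level = 2 ** (bits - 1) - 1
    count = round(duration_s * rate)
    return [round(signal(i / rate) * max_level) for i in range(count)]


def normalize(samples):
    """Scale the samples so the loudest one has magnitude 1.0."""
    peak = max(abs(s) for s in samples) or 1
    return [s / peak for s in samples]


def split_into_frames(samples, rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut the signal into consecutive fixed-length segments."""
    size = rate * frame_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def tone(t):
    """Stand-in for a microphone signal: a quiet 440 Hz tone."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


samples = sample_and_quantize(tone, duration_s=0.1)
frames = split_into_frames(normalize(samples))
print(len(samples), "samples,", len(frames), "frames of", len(frames[0]))
```

Raising SAMPLE_RATE or BIT_DEPTH corresponds to the "higher sampling and precision" mentioned above, at the cost of more data to process. A real system would go on to extract frequency information from each frame before matching it to phonemes.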

Speech Recognition and Statistical Modeling

Earlier speech recognition systems were based on a set of grammatical rules and guidelines. These were not very effective, because people do not speak according to fixed rules. Recognizing continuous speech was difficult, so users had to pause between words. Later, systems based on statistical methods were developed to improve accuracy. The two most prevalent methods now are the hidden Markov model and neural networks. The more commonly used of the two is the hidden Markov model. Here, each word is like a chain and each phoneme a link. The chain branches out in search of options to match the sound to a phoneme. Each phoneme is assigned a probability score on the basis of the training and the built-in dictionary. The process gets complicated for complex words. It helps if the user spends time training the system to recognize their type and style of speech. Neural networks are good at identifying patterns, so they are used to identify speech patterns. Their use can be classified as static or dynamic. In static classification, all the input is seen in one go and a single result is given. In dynamic classification, the input is broken into smaller parts, a result is assigned to each part, and the parts are integrated for the final solution. Static classification is useful for identifying phonemes, while dynamic classification is better for words and phrases.
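To make the chain-and-link picture of the hidden Markov model more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the two-word vocabulary, the acoustic labels "K", "A", "T", and "P", the probability values, and the function names. It uses the Viterbi algorithm, a standard way of finding the most likely path through such a chain, to score each word and pick the best match.

```python
import math

# Hypothetical scores: probability of hearing an acoustic label given a phoneme.
EMISSIONS = {
    "k":  {"K": 0.80, "A": 0.10, "T": 0.05, "P": 0.05},
    "ae": {"K": 0.10, "A": 0.80, "T": 0.05, "P": 0.05},
    "t":  {"K": 0.05, "A": 0.05, "T": 0.70, "P": 0.20},
    "p":  {"K": 0.05, "A": 0.05, "T": 0.20, "P": 0.70},
}

# Each word is a chain; each phoneme is a link in that chain.
WORDS = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]}

STAY = 0.5     # assumed probability of staying on the same phoneme
ADVANCE = 0.5  # assumed probability of moving to the next phoneme


def viterbi_log_score(phonemes, observations):
    """Log-probability of the best path through a left-to-right phoneme chain."""
    neg_inf = float("-inf")
    # The path must start on the first phoneme of the word.
    scores = [neg_inf] * len(phonemes)
    scores[0] = math.log(EMISSIONS[phonemes[0]][observations[0]])
    for obs in observations[1:]:
        updated = []
        for i, phoneme in enumerate(phonemes):
            stay = scores[i] + math.log(STAY)
            advance = scores[i - 1] + math.log(ADVANCE) if i > 0 else neg_inf
            updated.append(max(stay, advance) + math.log(EMISSIONS[phoneme][obs]))
        scores = updated
    # The path must end on the last phoneme of the word.
    return scores[-1]


def recognize(observations):
    """Return the vocabulary word whose chain best explains the observations."""
    return max(WORDS, key=lambda word: viterbi_log_score(WORDS[word], observations))


print(recognize(["K", "A", "A", "T"]))
```

Working with logarithms keeps the scores from shrinking toward zero as more sounds are multiplied together. In a real system the probabilities would come from training data rather than being typed in by hand.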

Speech Recognition Issues

Many factors interfere with accurate speech recognition. Background noise is one of them. If there is too much noise in the background, words cannot be heard or understood properly. This is known as a low signal-to-noise ratio. Users should work in a quiet place and speak as close to the microphone as possible. In some cases, the sound card may not be able to block signals from other computer parts, causing hissing noises. When speech overlaps, as in a meeting or conference with many people, recognition becomes difficult. The processor needs to do a lot of work to comprehend complicated words, phrases, or long-winded sentences, and the speech recognition vocabulary takes up a lot of hard disk space. Another difficulty is words that sound alike but are spelled differently or have different meanings. Extensive training has helped reduce that issue. Other factors affecting speech recognition include the sex of the speaker, style of speaking, speed, dialect, and word boundary ambiguity (a stream of sounds can often be grouped into words in more than one way).
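The signal-to-noise ratio mentioned above is usually expressed in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The short Python sketch below shows the calculation; the function names and the sample values are invented for this example.

```python
import math


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


# Made-up example: speech ten times the amplitude of the background noise.
speech = [1.0, -1.0] * 100
background = [0.1, -0.1] * 100
print(round(snr_db(speech, background), 1), "dB")
```

With speech at ten times the amplitude of the noise, the power ratio is one hundred, which works out to about 20 dB. Moving closer to the microphone raises the signal power, and working in a quiet room lowers the noise power; both push this number up.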

History of Speech Recognition

The last two decades have seen remarkable growth in the development of speech recognition. However, the story of speech recognition began as early as the 1870s, when Alexander Graham Bell discovered how to convert sound, or air pressure waves, into electrical impulses. He wanted to create a device that would change words into pictures a deaf person could understand, so that he could communicate with his wife. This work eventually led to the invention of the telephone and set off a mathematical and scientific approach to understanding speech. The earliest device to recognize speech was developed in 1952 by Bell Laboratories; it identified single spoken digits. In 1962, IBM’s Shoebox was displayed at the Seattle World’s Fair.

In the 1970s, the technology was developed further by the ARPA Speech Understanding Research program. Its most notable finding was that speech should be understood, and not merely identified. In the 1980s, the introduction of two kinds of products made speech recognition accessible to the general public. The first was helpful for processing transactions over the telephone, as it worked with a small vocabulary. The other, offered by IBM, Dragon Systems, and Kurzweil Applied Intelligence, was large vocabulary software that let users create text documents by voice dictation.

© Arkadin - 5 Concourse Parkway, Suite 1600, Atlanta, GA 30328 - 1.800.977.4607