Information about Speech Recognition
Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or erroneously as voice recognition) is the process of converting a speech signal to a sequence of words in the form of digital data, by means of an algorithm implemented as a computer program.
Speech recognition applications that have emerged over the last few years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), domotic appliance control and content-based spoken audio search (e.g. find a podcast where particular words were spoken).
Voice recognition or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.
History
According to industry experts, speech recognition (SR) was sold at its inception as a way to eliminate transcription completely rather than to bring efficiencies to the transcription process. It was therefore not accepted, and SR at that time was also not up to the task technically. Secondly, it required physicians to change the way they work and document clinical encounters, which many if not all were reluctant to do. The biggest obstacle to automating transcription with speech recognition has been the combination of software limitations and the highly interpretive nature of narrative dictation, which often requires human judgment that a speech recognition system cannot provide. Another limitation is the extensive amount of time the provider must spend training the software.
Current
Even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) are not expected to become obsolete. Many experts in the health care field anticipate that, with increased use of speech recognition technology, the services MTs provide will be redistributed rather than replaced.
Speech recognition can be implemented at the front end or the back end of the medical documentation process.
In front-end SR, the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. The document never goes through an MT/editor.
In back-end SR, also called deferred SR, the provider dictates into a digital dictation system. The audio is routed through a speech-recognition engine, and the recognized draft is routed together with the original voice file to the MT/editor, who edits the draft and finalizes the report. Deferred SR is currently widely used in the industry.
Many EMR applications can be more effective and easier to use when deployed in conjunction with a speech-recognition engine. Searches, queries, and form filling can all be faster to perform by voice than with a keyboard.
Performance of speech recognition systems
The performance of a speech recognition system is usually specified in terms of accuracy and speed. Accuracy is measured with the word error rate, whereas speed is measured with the real time factor.
Most speech recognition users would agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion about performance comes from the mixed usage of the terms "speech recognition" and "dictation".
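The two metrics named at the start of this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference scoring tool: it assumes that the word error rate is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference, and that the real time factor is processing time divided by audio duration. The function names `word_error_rate` and `real_time_factor` are chosen here for illustration only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Ratio of processing time to audio duration (below 1 is faster than real time)."""
    return processing_seconds / audio_seconds
```

Because insertions are counted, the word error rate computed this way can exceed 1.0 when the hypothesis is much longer than the reference.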
Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. These optimal conditions usually mean that the test subjects have:
- matching speaker characteristics with the training data,
- proper speaker adaptation, and
- clean environment (e.g. office space).
Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.
Both acoustic modelling and language modelling are important areas of study in modern statistical speech recognition. This entry discusses the hidden Markov model (HMM), which is widely used in many systems. (Language modelling has many other applications, such as smart keyboards and document classification; see the corresponding entries.)
Carnegie Mellon University has made progress in increasing the speed of speech recognition chips by using ASICs (application-specific integrated circuits) and reconfigurable chips called FPGAs (field-programmable gate arrays). [1]
Hidden Markov model (HMM)-based speech recognition
Modern general-purpose speech recognition systems are generally based on HMMs. These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary or short-time stationary signal. That is, over a short time in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes (known as states).
Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In the very simplest setup, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n around, say, 13), one every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word or (for more general speech recognition systems) each phoneme will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
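The feature extraction and the per-state likelihood described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the front end of any particular system: the Hamming window, the logarithm of the power spectrum before the cosine transform, and the small floor `1e-10` are assumptions added here (the logarithm is the usual step that makes the result a cepstrum), no mel filter bank is applied, and the naive discrete Fourier transform is used only to keep the sketch free of dependencies. The names `cepstral_coefficients` and `diag_gmm_log_likelihood` are hypothetical.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=13):
    """Return the first num_coeffs cepstral coefficients of one speech frame."""
    n = len(frame)
    # Hamming window (an assumption) to reduce leakage at the frame edges.
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
        for t, s in enumerate(frame)
    ]
    # Naive O(n^2) Fourier transform; keep the non-redundant half of the spectrum.
    log_spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        log_spectrum.append(math.log(abs(acc) ** 2 + 1e-10))
    # DCT-II decorrelates the log spectrum; keep only the first coefficients.
    m = len(log_spectrum)
    return [
        sum(
            log_spectrum[i] * math.cos(math.pi * c * (i + 0.5) / m)
            for i in range(m)
        )
        for c in range(num_coeffs)
    ]


def diag_gmm_log_likelihood(x, weights, means, variances):
    """Log likelihood of vector x under a mixture of diagonal-covariance Gaussians."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * sum(
            math.log(2 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over the mixture components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

In a recognizer each HMM state would hold its own `weights`, `means` and `variances`, and the function would be evaluated once per state for every 10-millisecond feature vector.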
Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phones (so phones with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
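Two of the refinements listed above, cepstral normalization and the delta and delta-delta coefficients, are simple enough to sketch directly. The version below is a minimal illustration under assumptions: normalization is taken to mean subtracting the per-utterance mean of each coefficient, and the deltas are computed as a plain central difference with the edge frames repeated, whereas practical systems often use a regression over a wider window. All function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames):
    """Central difference (x[t+1] - x[t-1]) / 2, repeating the edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        nxt = frames[min(t + 1, last)]
        prv = frames[max(t - 1, 0)]
        out.append([(a - b) / 2.0 for a, b in zip(nxt, prv)])
    return out


def add_dynamic_features(frames):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

With 13 static coefficients per frame, `add_dynamic_features` yields 39-dimensional vectors, since each frame gains 13 delta and 13 delta-delta values.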
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
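The best-path search mentioned above can be illustrated with a small Viterbi decoder over a single HMM. This is a sketch under assumptions rather than a full decoder: it works on one fixed model given as log probabilities, it does not combine acoustic and language models, and it performs no pruning. The argument names `log_init`, `log_trans` and `log_emit` are chosen here for illustration.

```python
import math

NEG_INF = -math.inf  # log(0), used for forbidden starts and transitions


def viterbi(log_init, log_trans, log_emit):
    """Return (best_state_path, best_log_score).

    log_init[s]      log probability of starting in state s
    log_trans[r][s]  log probability of moving from state r to state s
    log_emit[t][s]   log likelihood of observation t under state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(num_states):
            # Best predecessor of state s at time t.
            best_prev = max(
                range(num_states), key=lambda r: score[r] + log_trans[r][s]
            )
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + log_emit[t][s]
            )
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best
```

Working in the log domain turns products of probabilities into sums, which avoids numerical underflow on long utterances.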
Dynamic time warping (DTW)-based speech recognition
Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics -- indeed, any data which can be turned into a linear representation can be analyzed with DTW.
A well known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
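The non-linear warping described above is usually computed with dynamic programming. The following is a minimal sketch, assuming one-dimensional sequences, an absolute-difference local cost, and no restriction on the warping path other than monotonicity; real speech applications would compare feature vectors and typically constrain the path. The function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(a, b):
    """DTW cost between two sequences of numbers using |x - y| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays on the same element
                cost[i][j - 1],      # b advances while a stays on the same element
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

A sequence and a slowed-down copy of it, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, align at zero cost, which is the property that lets DTW cope with different speaking speeds.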
For further information
Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek, which is a more up-to-date book (1998). Even more up to date is "Computer Speech", by Manfred R. Schroeder, second edition published in 2004. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored competitions such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).
In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start, both to learn about speech recognition and to begin experimenting. Another such resource is Carnegie Mellon University's SPHINX toolkit.
Applications of speech recognition
- Automatic translation
- Automotive speech recognition
- Speech Biometric Recognition
- Dictation
- Hands-free computing: voice command recognition computer user interface
- Home automation
- Interactive voice response
- Medical transcription
- Mobile telephony
- Pronunciation
- Robotics
See also
- Audio visual speech recognition
- Cockpit (aviation) (also termed Direct Voice Input)
- Keyword spotting
- List of speech recognition projects
- Microphone
- Speech Analytics
- Speaker identification
- Speech processing
- Speech synthesis
- Speech verification
- Text-to-speech (TTS)
- VoiceXML
- Windows Speech Recognition
- Acoustic Model
- Speech corpus
References
1. Dennis van der Heijden. "Computer Chips to Enhance Speech Recognition", Axistive.com, 2003-10-06.
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.