Editor's note: The best speech recognition systems to date use bidirectional long short-term memory networks (LSTM, Long Short-Term Memory), but these systems suffer from high training complexity and high decoding latency, which makes them hard to apply in real-time industrial recognition systems. This year iFLYTEK proposed a new speech recognition framework, the Deep Fully Convolutional Neural Network (DFCNN), which is better suited to industrial applications. This article is a detailed look at how iFLYTEK applies DFCNN to speech transcription, and it also analyzes the accompanying spoken and discourse-level language model processing, noise and far-field recognition, and real-time error correction and text post-processing technologies.
With the application of artificial intelligence, speech recognition has made significant progress this year: whether for English, Chinese, or other languages, machine recognition accuracy keeps rising. Speech dictation is the most rapidly developed of these technologies and has already matured into widely used products such as voice input, voice search, and voice assistants. Speech transcription, another dimension of voice applications, still faces difficulties: when users make a recording, they usually do not expect it to be fed to a speech recognizer, so compared with dictation, transcription must cope with casual speaking styles, strong accents, uneven recording quality, and many other challenges.
Speech transcription scenarios include journalist interviews, television programs, classes, and conferences, and extend to any audio file from anyone's everyday work and life. The market and room for imagination are huge: if machines conquer speech transcription, television programs could be subtitled automatically, formal meetings could generate minutes automatically, and journalists' interview recordings could be turned into drafts automatically... In a lifetime a person says far more than he or she writes; if there were software that could record and efficiently manage everything we say, the world would be incredible.
Acoustic modeling technology based on DFCNN
Acoustic models in speech recognition mainly model the relationship between the speech signal and phonemes. Last year iFLYTEK proposed the feed-forward sequential memory network (FSMN, Feed-forward Sequential Memory Network) as an acoustic modeling framework; this year it launched a new speech recognition framework, the Deep Fully Convolutional Neural Network (DFCNN).
The best speech recognition systems to date use bidirectional long short-term memory networks (LSTM), which can model long-term dependencies in speech and thereby improve recognition accuracy. However, bidirectional LSTM networks suffer from high training complexity and high decoding latency, which makes them hard to apply in real-time industrial recognition systems. iFLYTEK's deep fully convolutional neural network was designed to overcome these shortcomings of bidirectional LSTM.
CNNs were used in speech recognition systems as early as 2012, but without a major breakthrough. The main reasons are twofold: CNNs took a fixed-length window of spliced frames as input and therefore could not see enough speech context, and they were used only as feature extractors, with so few convolutional layers that their expressive ability was limited.
To solve these problems, DFCNN uses a large number of convolutional layers to model an entire utterance directly. First, DFCNN takes the spectrogram directly as input, which gives it a natural advantage over frameworks that use traditional speech features as input. Second, in its model structure DFCNN borrows network configurations from image recognition: each convolutional layer uses small kernels and is followed by pooling, and by stacking many such convolution and pooling layers DFCNN can see very long history and future context. These two points ensure that DFCNN expresses the long-term dependencies in speech well; it is more robust than RNNs and has a short online decoding delay, which allows it to be used in industrial systems.
(Figure: DFCNN structure)
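To make the structure concrete, below is a minimal sketch of a DFCNN-style network in PyTorch. The layer counts, channel widths, and output size here are illustrative assumptions, not iFLYTEK's published configuration; the point is the pattern the text describes: a spectrogram treated as an image, small convolution kernels, and many stacked convolution-plus-pooling layers.

```python
# A minimal, illustrative DFCNN-style stack (not iFLYTEK's actual topology).
# Assumes PyTorch; the input spectrogram is treated as a 1-channel image:
# (batch, 1, time_frames, freq_bins).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two small (3x3) convolutions followed by 2x2 pooling, as in
    # image-recognition networks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),  # halves time and frequency resolution
    )

class DFCNNSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stacking blocks grows the receptive field over time, letting the
        # network see long past/future context without recurrence.
        self.features = nn.Sequential(
            conv_block(1, 32),
            conv_block(32, 64),
            conv_block(64, 128),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, spectrogram):
        x = self.features(spectrogram)      # (B, 128, T/8, F/8)
        x = x.mean(dim=3).transpose(1, 2)   # pool over frequency -> (B, T/8, 128)
        return self.classifier(x)           # per-frame class scores

# Example: 3 s of audio at a 10 ms frame shift -> ~300 frames, 80 mel bins.
scores = DFCNNSketch()(torch.randn(2, 1, 300, 80))
print(scores.shape)  # torch.Size([2, 37, 1000])
```

Each pooling layer halves the time resolution, so after only three blocks each output frame already depends on a long span of past and future input frames, with no recurrence and hence a short, fixed decoding delay.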
Spoken language and discourse-level language-model processing technologies
Language models in speech recognition mainly model the correspondence between phonemes and words. Human spoken language is natural and unorganized: in free conversation people often hesitate, repeat themselves, and insert filler words, whereas text corpora are usually in written form. The gap between the two poses a severe challenge for spoken language modeling.
iFLYTEK borrows the noise-added training idea used in speech recognition for handling noisy audio: starting from written language, it automatically adds back hesitations, repetitions, fillers, and other colloquial "noise" phenomena, which makes it possible to generate massive amounts of spoken-style text automatically and resolve the mismatch between spoken and written language. Specifically, it first collects a corpus of paired spoken and written text, and then uses an Encoder-Decoder neural network framework to model the correspondence between written and spoken text, enabling automatic generation of spoken-style text.
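The generation step described above is an Encoder-Decoder neural network learned from paired corpora; as a much simpler stand-in, the toy sketch below (plain Python; the filler list and probabilities are invented for illustration) shows the kind of colloquial "noise" that gets injected into written text.

```python
# Toy illustration of spoken-style "noise" injection. The real system uses an
# Encoder-Decoder model learned from paired spoken/written text; this
# rule-based stand-in only shows what kinds of noise are being added.
import random

FILLERS = ["um", "uh", "you know"]   # modal/filler words (illustrative)

def spokenize(sentence, p_filler=0.2, p_repeat=0.1, seed=None):
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))   # hesitation
        out.append(word)
        if rng.random() < p_repeat:
            out.append(word)                  # stutter / repetition
    return " ".join(out)

written = "the quarterly results exceeded expectations"
print(spokenize(written, seed=7))
# e.g. "um the the quarterly results uh exceeded expectations"
```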
In addition, context greatly helps people understand language, and the same is true for machine transcription. iFLYTEK therefore proposed, also last year, a discourse-level language model scheme: it automatically extracts key information from the first-pass speech recognition results, searches and processes data in real time, and combines the decoding results with the retrieved corpus to form a dedicated language model, further improving transcription accuracy.
(Figure: discourse-level language model flowchart)
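As a rough illustration of that loop, the sketch below builds a tiny session-specific unigram model and interpolates it with a general model. Everything here is an assumption for illustration: a real system would extract keywords more carefully, retrieve an actual corpus (stubbed out here), and use n-gram or neural language models rather than unigrams.

```python
# Sketch of discourse-level LM adaptation (all names/values are assumptions):
# first-pass decode -> keyword extraction -> retrieve related text ->
# build a small in-domain model and interpolate it with the general model.
from collections import Counter

def extract_keywords(first_pass_text, top_k=5):
    # Crude keyword extraction from the first-pass decode: frequent longer words.
    words = [w for w in first_pass_text.lower().split() if len(w) > 3]
    return [w for w, _ in Counter(words).most_common(top_k)]

def unigram_probs(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(base_lm, session_lm, lam=0.3):
    # P(w) = (1 - lam) * P_base(w) + lam * P_session(w)
    vocab = set(base_lm) | set(session_lm)
    return {w: (1 - lam) * base_lm.get(w, 0.0) + lam * session_lm.get(w, 0.0)
            for w in vocab}

keywords = extract_keywords("the merger announcement the merger boosted shares")
retrieved = "text retrieved by searching for the extracted keywords"  # retrieval stub
adapted = interpolate(unigram_probs("general corpus text"), unigram_probs(retrieved))
```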
Noise and far-field recognition technology
Far-field pickup and noise are the two major technical challenges for speech recognition. In a conference setting, for example, if a voice recorder is used, the speech of people far from the recorder is far-field reverberant speech: reverberation overlays unsynchronized copies of the speech on one another, producing a masking effect between phonemes that severely degrades recognition. Likewise, if there is background noise in the recording environment, the speech spectrum is polluted and recognition accuracy drops sharply. To solve these problems, iFLYTEK applies denoising and dereverberation technology under two hardware setups, a single microphone and a microphone array, bringing speech transcription in noisy scenarios up to a practical threshold.
Single-microphone denoising and dereverberation
This solution combines mixed training on noise-corrupted speech with a recurrent-neural-network-based denoising and dereverberation method. On the one hand, noise is added to clean speech and the result is mixed with clean speech during training, improving the model's robustness to noisy speech; on the other hand, a deep recurrent neural network performs denoising and dereverberation, further increasing recognition accuracy on noisy, far-field speech.
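On the data-simulation side, the sketch below (NumPy; the function name and values are assumptions) shows the standard way of adding recorded noise to clean speech at a chosen signal-to-noise ratio for mixed training. The recurrent-network denoising front end itself is not shown.

```python
# Minimal sketch of mixed-training data simulation: add noise to clean speech
# at a target SNR so the acoustic model also sees noisy conditions.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Tile or trim the noise to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz speech
noise = rng.standard_normal(4000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```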
Microphone-array denoising and dereverberation
Processing noise only in the captured speech is at best a stopgap; solving reverberation and noise at the source is the crux of the matter. Facing this challenge, iFLYTEK's researchers equip recording devices with a multi-microphone array and use it to perform denoising and dereverberation. Specifically, the multi-channel signals captured by the microphones are passed to a convolutional-neural-network-based beamformer, which forms a pickup beam steered toward the target signal's direction while attenuating reflections and noise from other directions. Combined with the single-microphone denoising and dereverberation described above, this further and significantly improves recognition accuracy on noisy, far-field speech.
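iFLYTEK's beamformer is described as CNN-based; as a simpler classical stand-in, the sketch below implements delay-and-sum beamforming, which captures the same intuition: delay each channel so the target direction lines up, then average, reinforcing the target while sound from other directions partially cancels.

```python
# Classical delay-and-sum beamforming (a stand-in for the neural beamformer
# described in the text, shown only to illustrate the principle).
import numpy as np

def delay_and_sum(channels, delays_samples):
    # channels: (num_mics, num_samples); delays_samples: per-mic integer delays
    # computed from the array geometry and the assumed target direction.
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays_samples)]
    return np.mean(aligned, axis=0)

# Example with a hypothetical 4-microphone array.
rng = np.random.default_rng(1)
mics = rng.standard_normal((4, 16000))
enhanced = delay_and_sum(mics, delays_samples=[0, 2, 4, 6])
```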
Real-time error correction and text post-processing technology
The technologies mentioned so far deal only with the speech side, transcribing audio into text. But as noted above, human spoken language is natural and unorganized, so even when transcription accuracy is very high, the raw transcript is still hard to read, which demonstrates the importance of text post-processing. Text post-processing segments the spoken-style transcript into sentences and paragraphs, smooths the content for fluency, and can even summarize it, so that the result is easier to read and edit.
Post-processing I: sentence segmentation and paragraphing
Sentence segmentation divides the transcribed text into sentences and adds punctuation between them; paragraphing divides the text into several semantically coherent paragraphs, each describing a different sub-topic.
Sentence and paragraph boundaries are determined by extracting contextual semantic features and combining them with acoustic cues. Considering that annotated speech data is hard to obtain, iFLYTEK uses a two-level cascade of bidirectional long short-term memory networks, which addresses the segmentation and paragraphing problems well.
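Below is a minimal sketch of the word-level half of such a cascade (PyTorch; the label set and layer sizes are assumptions, not iFLYTEK's model): a bidirectional LSTM tags each word with the punctuation to insert after it. A second, similar network over sentence representations could then mark paragraph boundaries, giving the two-level cascade described above.

```python
# Illustrative bidirectional-LSTM punctuation tagger (assumed architecture).
import torch
import torch.nn as nn

class PunctuationTagger(nn.Module):
    def __init__(self, vocab_size=30000, num_labels=4, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.bilstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        # Labels, e.g.: no punctuation / comma / period / question mark.
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, word_ids):                  # (batch, seq_len)
        h, _ = self.bilstm(self.embed(word_ids))  # (batch, seq_len, 2*hidden)
        return self.out(h)                        # per-word punctuation scores

scores = PunctuationTagger()(torch.randint(0, 30000, (2, 50)))
print(scores.shape)  # torch.Size([2, 50, 4])
```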
Post-processing II: smoothing
Smoothing, also known as disfluency detection, removes hesitation pauses, filler words, and repeated words from the transcription result, making the text smoother and easier to read.
By combining generalized features with bidirectional long short-term memory network modeling, iFLYTEK has brought smoothing accuracy to a practical level.
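iFLYTEK's smoother is a learned model; purely to show what the step does to the text, here is a toy rule-based stand-in that drops filler words and immediate repetitions.

```python
# Toy stand-in for smoothing: remove fillers and immediate repetitions.
FILLERS = {"um", "uh", "er"}  # illustrative English fillers

def smooth(tokens):
    out = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            continue  # drop hesitation/filler words
        if out and tok.lower() == out[-1].lower():
            continue  # drop immediate repetition ("the the" -> "the")
        out.append(tok)
    return out

print(smooth("um I I think the the results are uh good".split()))
# ['I', 'think', 'the', 'results', 'are', 'good']
```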
Source: iFLYTEK official account