[Seminar] STREAMING ATTENTION BASED ON-DEVICE SPEECH RECOGNITION TRAINED WITH A LARGE SPEECH CORPUS
In this talk, we present our end-to-end automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10k hours) training corpus. The first part of the talk covers the basic theory of sequence-to-sequence models based on an RNN encoder-decoder, their extension with the attention mechanism, and recent streaming attention mechanisms such as hard monotonic attention and Monotonic Chunk-wise Attention (MoChA). In the second part, we explain how we implemented our end-to-end ASR system for our commercial products. We attained around 90% speech recognition accuracy in general domains through joint training with the Connectionist Temporal Classification (CTC) and Cross Entropy (CE) losses, Minimum Word Error Rate (MWER) training, layer-wise pre-training, and various data augmentation techniques. We compressed our models by more than 3.4 times using an iterative hyper Low-Rank Approximation (LRA) method while minimizing degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization. The final model size is less than 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a 36% relative reduction in word error rate (WER) for target domains. We also present a novel Small Energy Masking (SEM) technique for further performance improvement.
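To give a rough sense of what 8-bit quantization involves, the following is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the abstract does not say which quantization scheme the system uses, and the function names `quantize_int8` and `dequantize_int8` as well as the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumed scheme for illustration; not taken from the talk.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to exercise the functions.
weights = [0.5, -1.27, 0.031, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Worst-case rounding error; it is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Under this scheme each weight is stored in one byte instead of the four bytes of a 32-bit float, at the cost of a rounding error of at most half a quantization step per weight.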
Chanwoo Kim has been a corporate vice president at Samsung Research, leading the Speech Processing Lab, since Feb. 2018. He has led the end-to-end speech recognition and end-to-end text-to-speech (TTS) projects. The outcomes of both projects were successfully commercialized in 2019. He has also been working on far-field speech enhancement, keyword spotting, audio-visual multi-modal speech recognition, and other projects. He was a software engineer on the Google speech team between Feb. 2013 and Feb. 2018, where he worked on acoustic modeling for speech recognition systems and on enhancing noise robustness using deep learning techniques. He was a speech scientist at Microsoft from Jan. 2011 to Jan. 2013. Dr. Kim received his Ph.D. from the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University in Dec. 2010. He received his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2001, respectively. Dr. Kim’s doctoral research focused on enhancing the robustness of automatic speech recognition systems in noisy environments. Between 2003 and 2005, Dr. Kim was a Senior Research Engineer at LG Electronics, where he worked primarily on embedded signal processing and protocol stacks for multimedia systems. Prior to his employment at LG, he worked for EdumediaTek and SK Teletech as an R&D engineer.