Improving Non-autoregressive ASR with Autoregressive Pretraining

April 25, 2023
Research

Autoregressive (AR) automatic speech recognition (ASR) models predict each output token conditioned on the previous ones, which slows down inference. Non-autoregressive (NAR) models, in contrast, predict tokens independently and simultaneously within a constant number of decoding iterations, which makes inference fast. However, NAR models are generally less accurate than AR models. In this work, we propose AR pretraining of the NAR encoder to reduce the accuracy gap between AR and NAR models. Experimental results show that our AR-pretrained MaskCTC matches the accuracy of the AR Conformer on Aishell-1 (both 4.9% CER) and reduces the performance gap with the AR Conformer on LibriSpeech by a relative 50%. Moreover, our AR-pretrained MaskCTC needs only a single decoding iteration, which reduces inference time by 50%. We also investigate multiple masking strategies for training the masked language model of MaskCTC.
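To make the speed difference between the two decoding styles concrete, the following is a minimal toy sketch, not the model described in this work. The function names (`ar_decode`, `nar_decode`), the stand-in "models" (`toy_next_token`, `toy_fill_all`), and the four-token target are all hypothetical and chosen only for illustration. The sketch counts model calls: AR decoding needs one call per output token, while NAR decoding needs one call per iteration regardless of sequence length.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"
EOS = "<eos>"


def ar_decode(next_token: Callable[[List[str]], str],
              max_len: int) -> Tuple[List[str], int]:
    """Autoregressive decoding: one model call per token, each conditioned on the prefix."""
    prefix: List[str] = []
    calls = 0
    for _ in range(max_len):
        token = next_token(prefix)  # depends on all previously emitted tokens
        calls += 1
        if token == EOS:
            break
        prefix.append(token)
    return prefix, calls


def nar_decode(fill_all: Callable[[List[str]], List[str]],
               length: int,
               iterations: int = 1) -> Tuple[List[str], int]:
    """Non-autoregressive decoding: every position is predicted in each call."""
    tokens = [MASK] * length
    calls = 0
    for _ in range(iterations):
        tokens = fill_all(tokens)  # all positions updated simultaneously
        calls += 1
    return tokens, calls


# Hypothetical stand-ins for real models, using a fixed target sequence.
TARGET = ["a", "b", "c", "d"]


def toy_next_token(prefix: List[str]) -> str:
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def toy_fill_all(tokens: List[str]) -> List[str]:
    return TARGET[:len(tokens)]


ar_tokens, ar_calls = ar_decode(toy_next_token, max_len=10)
nar_tokens, nar_calls = nar_decode(toy_fill_all, length=len(TARGET), iterations=1)
print(ar_tokens, ar_calls)
print(nar_tokens, nar_calls)
```

In this toy setting the AR loop's call count grows with the output length, whereas the NAR call count equals the chosen number of iterations. Real inference time also depends on the cost of each call, so call counts are only a rough proxy.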