Untied Positional Encodings For Efficient Transformer-based Speech Recognition

April 25, 2023
Research

Self-attention has become a vital component of end-to-end (E2E) automatic speech recognition (ASR). The Convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) achieved state-of-the-art performance. This paper proposes a positional encoding (PE) mechanism called Scaled Untied RPE that unties the feature-position correlations in the self-attention computation and computes feature correlations and positional correlations separately using different projection matrices. In addition, we propose to scale the feature correlations with the positional correlations; the aggressiveness of this multiplicative interaction can be configured using a parameter called amplitude. Moreover, we show that the PE matrix can be sliced to reduce model parameters. Our results on the National Speech Corpus (NSC) show that Transformer encoders with Scaled Untied RPE achieve relative improvements of 1.9% in accuracy and up to 50.9% in latency over a Conformer baseline.
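To make the idea concrete, the sketch below illustrates one way such an untied, multiplicatively scaled positional term could look in a self-attention layer. It is a minimal PyTorch illustration based only on the description above, not the authors' implementation: the module name, the `amplitude` and `pe_dim` arguments, the use of `tanh` to bound the positional modulation, and the learnable PE matrix are all assumptions made for clarity.

```python
import math
import torch
import torch.nn as nn

class ScaledUntiedPEAttention(nn.Module):
    """Illustrative sketch (not the authors' code) of self-attention in which
    feature (content) correlations and positional correlations are computed
    separately with different projection matrices and then combined
    multiplicatively. `amplitude` sets how strongly the positional scores
    scale the feature scores; `pe_dim` shows how a sliced (narrower) PE
    matrix can reduce parameters. All names are assumptions."""

    def __init__(self, d_model, n_heads, max_len, pe_dim=None, amplitude=1.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.amplitude = amplitude
        pe_dim = pe_dim or d_model  # sliced PE width (<= d_model) to save parameters

        # Content (feature) projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

        # Positional stream: its own (untied) projections and a learnable PE matrix
        self.pe = nn.Parameter(torch.randn(max_len, pe_dim) * 0.02)
        self.w_q_pos = nn.Linear(pe_dim, d_model, bias=False)
        self.w_k_pos = nn.Linear(pe_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        split = lambda t: t.view(-1, t.size(1), self.h, self.d_k).transpose(1, 2)

        # Feature correlations: standard content-content attention logits
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        feat_scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)    # (B, h, T, T)

        # Positional correlations: computed from the PE matrix alone with
        # separate projection matrices (no feature-position cross terms)
        pe = self.pe[:T].unsqueeze(0)                                  # (1, T, pe_dim)
        q_p, k_p = split(self.w_q_pos(pe)), split(self.w_k_pos(pe))
        pos_scores = q_p @ k_p.transpose(-2, -1) / math.sqrt(self.d_k) # (1, h, T, T)

        # Multiplicative interaction: positional correlations scale the
        # feature correlations, with `amplitude` setting the aggressiveness
        scores = feat_scores * (1.0 + self.amplitude * torch.tanh(pos_scores))

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.w_o(out)
```

Because the positional correlations depend only on the PE matrix and its two dedicated projections, they can be computed once per sequence length rather than per input, which is one reason an untied formulation can help latency; the sliced PE width (`pe_dim`) is the knob that trades parameters against positional resolution in this sketch.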