Untied Positional Encodings For Efficient Transformer-based Speech Recognition

April 25, 2023
Research

Self-attention has become a vital component of end-to-end (E2E) automatic speech recognition (ASR). The Convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) achieved state-of-the-art performance. This paper proposes a positional encoding (PE) mechanism called Scaled Untied RPE that unties the feature-position correlations in the self-attention computation and computes feature correlations and positional correlations separately using different projection matrices. In addition, we propose to scale the feature correlations with the positional correlations; the aggressiveness of this multiplicative interaction can be configured using a parameter called amplitude. Moreover, we show that the PE matrix can be sliced to reduce model parameters. Our results on the National Speech Corpus (NSC) show that Transformer encoders with Scaled Untied RPE achieve relative improvements of 1.9% in accuracy and up to 50.9% in latency over a Conformer baseline.
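To make the idea concrete, the sketch below illustrates one way such an untied, multiplicatively scaled positional term could look in a self-attention layer. It is a minimal PyTorch illustration based only on the description above, not the authors' implementation: the module name, the `amplitude` and `pe_dim` arguments, the use of `tanh` to bound the positional modulation, and the learnable PE matrix are all assumptions made for clarity.

```python
import math
import torch
import torch.nn as nn

class ScaledUntiedPEAttention(nn.Module):
    """Illustrative sketch (not the authors' code) of self-attention in which
    feature (content) correlations and positional correlations are computed
    separately with different projection matrices and then combined
    multiplicatively. `amplitude` sets how strongly the positional scores
    scale the feature scores; `pe_dim` shows how a sliced (narrower) PE
    matrix can reduce parameters. All names are assumptions."""

    def __init__(self, d_model, n_heads, max_len, pe_dim=None, amplitude=1.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.amplitude = amplitude
        pe_dim = pe_dim or d_model  # sliced PE width (<= d_model) to save parameters

        # Content (feature) projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

        # Positional stream: its own (untied) projections and a learnable PE matrix
        self.pe = nn.Parameter(torch.randn(max_len, pe_dim) * 0.02)
        self.w_q_pos = nn.Linear(pe_dim, d_model, bias=False)
        self.w_k_pos = nn.Linear(pe_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        split = lambda t: t.view(-1, t.size(1), self.h, self.d_k).transpose(1, 2)

        # Feature correlations: standard content-content attention logits
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        feat_scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)    # (B, h, T, T)

        # Positional correlations: computed from the PE matrix alone with
        # separate projection matrices (no feature-position cross terms)
        pe = self.pe[:T].unsqueeze(0)                                  # (1, T, pe_dim)
        q_p, k_p = split(self.w_q_pos(pe)), split(self.w_k_pos(pe))
        pos_scores = q_p @ k_p.transpose(-2, -1) / math.sqrt(self.d_k) # (1, h, T, T)

        # Multiplicative interaction: positional correlations scale the
        # feature correlations, with `amplitude` setting the aggressiveness
        scores = feat_scores * (1.0 + self.amplitude * torch.tanh(pos_scores))

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.w_o(out)
```

Because the positional correlations depend only on the PE matrix and its two dedicated projections, they can be computed once per sequence length rather than per input, which is one reason an untied formulation can help latency; the sliced PE width (`pe_dim`) is the knob that trades parameters against positional resolution in this sketch.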