Improving Non-autoregressive ASR with Autoregressive Pretraining

Yanjia Li, Lahiru Samarakoon, Ivan Fung, ICASSP 2023, June 2023

Abstract

Autoregressive (AR) automatic speech recognition (ASR) models predict each output token conditioned on the previous ones, which slows down inference. Non-autoregressive (NAR) models, on the other hand, predict tokens independently and simultaneously within a constant number of decoding iterations, which makes inference fast. However, NAR models generally have lower accuracy than AR models. In this work, we propose applying AR pretraining to the NAR encoder to reduce the accuracy gap between AR and NAR models. The experimental results show that our AR-pretrained MaskCTC reaches the same accuracy as AR Conformer on Aishell-1 (both 4.9% CER) and reduces the performance gap with AR Conformer on LibriSpeech by a relative 50%. Moreover, our AR-pretrained MaskCTC needs only a single decoding iteration, which reduces inference time by 50%. We also investigate multiple masking strategies for training the masked language model of MaskCTC.

Link to publication

Untied Positional Encodings For Efficient Transformer-based Speech Recognition

Lahiru Samarakoon, Ivan Fung, SLT 2022, January 2023

Abstract

Self-attention has become a vital component for end-to-end (E2E) automatic speech recognition (ASR). Convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) achieved state-of-the-art performance. This paper proposes a positional encoding (PE) mechanism called Scaled Untied RPE that unties the feature-position correlations in the self-attention computation, and computes feature correlations and positional correlations separately using different projection matrices. In addition, we propose to scale feature correlations with the positional correlations, and the aggressiveness of this multiplicative interaction can be configured using a parameter called amplitude. Moreover, we show that the PE matrix can be sliced to reduce model parameters. Our results on the National Speech Corpus (NSC) show that Transformer encoders with Scaled Untied RPE achieve relative improvements of 1.9% in accuracy and up to 50.9% in latency over a Conformer baseline.

Link to publication

Fine-tuning Pre-trained Language Models for Few-shot Intent Detection: Supervised Pre-training and Isotropization

Haode Zhang, Haowen Liang, Yuwei Zhang, Liming Zhan, Xiao-Ming Wu, Xiaolei Lu, Albert Y.S. Lam, arXiv:2205.07208, 2022.

Abstract

It is challenging to train a good intent classifier for a task-oriented dialogue system with only a few annotations. Recent studies have shown that fine-tuning pre-trained language models with a small amount of labeled utterances from public benchmarks in a supervised manner is extremely helpful. However, we find that supervised pre-training yields an anisotropic feature space, which may suppress the expressive power of the semantic representations. Inspired by recent research in isotropization, we propose to improve supervised pre-training by regularizing the feature space towards isotropy. We propose two regularizers based on contrastive learning and correlation matrix respectively, and demonstrate their effectiveness through extensive experiments. Our main finding is that it is promising to regularize supervised pre-training with isotropization to further improve the performance of few-shot intent detection. The source code can be found at this https URL.

Link to publication

Conformer-based Speech Recognition with Linear Nystrom Attention and Rotary Position Embedding

Tsun-Yat Leung, Lahiru Samarakoon, ICASSP 2022, May 2022

Abstract

Self-attention has become an important component for end-to-end (E2E) automatic speech recognition (ASR). Recently, Convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) achieved state-of-the-art performance. However, the computational and memory complexity of self-attention grows quadratically with the input sequence length. The effect of this can be significant for the Conformer encoder when processing longer sequences. In this work, we propose to replace self-attention with a linear complexity Nyström attention, which is a low-rank approximation of the attention scores based on the Nyström method. In addition, we propose to use Rotary Position Embedding (RoPE) with Nyström attention since RPE is of quadratic complexity. Moreover, we show that models can be made even lighter by removing self-attention sub-layers from top encoder layers without any drop in performance. Furthermore, we demonstrate that convolutional sub-layers in Conformer can effectively recover the information lost due to the Nyström approximation.
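
As a rough illustration of the Rotary Position Embedding (RoPE) mentioned in this abstract, the sketch below rotates consecutive pairs of vector components by an angle proportional to the token position. It is a minimal pure-Python sketch and not the paper's implementation; the function names `rotary_embed` and `dot` and the frequency base of 10000 are assumptions chosen for illustration.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive component pairs of `vec` by position-dependent angles.

    `base` is an assumed frequency base; pair (i, i + 1) uses the angle
    position * base ** (-i / len(vec)).
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2-D rotation of the pair (x, y) by theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an angle that is linear in the position, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n. Relative position is therefore encoded without a table of pairwise offsets.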

Link to publication

Two-Stage Auction Mechanism for Long-Term Participation in Crowdsourcing

Timothy Shin Heng Mak, Albert Y.S. Lam, arXiv:2202.10064, 2022.

Abstract

Crowdsourcing has become an important tool to collect data for various artificial intelligence applications and auction can be an effective way to allocate work and determine reward in a crowdsourcing platform. In this paper, we focus on the crowdsourcing of small tasks such as image labelling and voice recording where we face a number of challenges. First, workers have different limits on the amount of work they would be willing to do, and they may also misreport these limits in their bid for work. Secondly, if the auction is repeated over time, unsuccessful workers may drop out of the system, reducing competition and diversity. To tackle these issues, we first extend the results of the celebrated Myerson's optimal auction mechanism for a single-parameter bid to the case where the bid consists of the unit cost of work, the maximum amount of work one is willing to do, and the actual work completed. We show that a simple payment mechanism is sufficient to ensure a dominant strategy from the workers, and that this dominant strategy is robust to the true utility function of the workers. Secondly, we propose a novel, flexible work allocation mechanism, which allows the requester to balance between cost efficiency and equality. While cost minimization is obviously important, encouraging equality in the allocation of work increases the diversity of the workforce as well as promotes long-term participation on the crowdsourcing platform. Our main results are proved analytically and validated through simulations.

Link to publication

Robust End-to-end Speaker Diarization with Conformer and Additive Margin Penalty

Tsun-Yat Leung, Lahiru Samarakoon, Interspeech 2021, August 2021

Abstract

Traditionally, a speaker diarization system has multiple components to extract and cluster speaker embeddings. However, end-to-end diarization is more desirable as it facilitates optimizing one model in contrast to multiple components in a traditional setup. Moreover, end-to-end diarization systems are capable of handling overlapped speech. The recently proposed self-attentive end-to-end diarization model with encoder-decoder based attractors (EEND-EDA) is capable of processing speech from an unknown number of speakers, and has reported comparable performance to traditional systems. In this work, we aim to improve the EEND-EDA model. First, we increase the robustness of the model by incorporating an additive margin penalty for minimizing the intra-class variance. Second, we propose to replace the Transformer encoders with Conformer encoders to capture local information. Third, we propose to use convolutional subsampling and upsampling instead of manual subsampling only. Our proposed improvements report a 21.6% relative reduction in DER on the full evaluation set of track 2 of the DIHARD III challenge.

Link to publication

Unknown Intent Detection Using Gaussian Mixture Model with an Application to Zero-shot Intent Classification

Guangfeng Yan, Lu Fan, Qimai Li, Han Liu, Xiaotong Zhang, Xiao-Ming Wu, and Albert Y.S. Lam, in Proceedings of 2020 Annual Conference of the Association for Computational Linguistics, July, 2020.

Abstract

User intent classification plays a vital role in dialogue systems. Since user intent may frequently change over time in many realistic scenarios, unknown (new) intent detection has become an essential problem, where the study has just begun. This paper proposes a semantic-enhanced Gaussian mixture model (SEG) for unknown intent detection. In particular, we model utterance embeddings with a Gaussian mixture distribution and inject dynamic class semantic information into Gaussian means, which enables learning more class-concentrated embeddings that help to facilitate downstream outlier detection. Coupled with a density-based outlier detection algorithm, SEG achieves competitive results on three real task-oriented dialogue datasets in two languages for unknown intent detection. On top of that, we propose to integrate SEG as an unknown intent identifier into existing generalized zero-shot intent classification models to improve their performance. A case study on a state-of-the-art method, ReCapsNet, shows that SEG can push the classification performance to a significantly higher level.

Link to publication

Deep-AIR: A Hybrid CNN-LSTM Framework for Fine-Grained Air Pollution Forecast

Q. Zhang, J.C.K. Lam, Victor O.K. Li, and Y. Han, arXiv:2001.11957 [eess.SP], Jan. 2020.

Abstract

Poor air quality has become an increasingly critical challenge for many metropolitan cities, which carries many catastrophic physical and mental consequences on human health and quality of life. However, accurately monitoring and forecasting air quality remains a highly challenging endeavour. Limited by geographically sparse data, traditional statistical models and newly emerging data-driven methods of air quality forecasting mainly focused on the temporal correlation between the historical temporal datasets of air pollutants. However, in reality, both distribution and dispersion of air pollutants are highly location-dependent. In this paper, we propose a novel hybrid deep learning model that combines Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) together to forecast air quality at high-resolution. Our model can utilize the spatial correlation characteristic of our air pollutant data sets to achieve higher forecasting accuracy than existing deep learning models of air pollution forecast.

Link to publication

Incorporating Prior Knowledge Into Speaker Diarization and Linking for Identifying Common Speaker

Tsun-Yat Leung, Lahiru Samarakoon, and Albert Y.S. Lam, in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Dec. 2019.

Abstract

Speaker Diarization and Linking discovers “who spoke when” across recordings without any speaker enrollment. Diarization is performed on each recording separately, and the linking step combines clusters of the same speaker across recordings. This is a two-step approach; however, it suffers from error propagation from the diarization step to the linking step. In a situation where a unique speaker appears in a given set of recordings, this paper aims at locating the common speaker using the prior knowledge of his or her existence. That means there is no enrollment data for this common speaker. We propose the Pairwise Common Speaker Identification (PCSI) method, which takes the existence of a common speaker into account, in contrast to the two-step approach. We further show that PCSI can be used to reduce the errors that are introduced in the diarization step of the two-step approach. Our experiments are performed on a corpus synthesised from the AMI corpus and also on an in-house conversational telephony Sichuanese corpus that is mixed with Mandarin. We show up to 7.68% relative improvement in time-weighted equal error rate over a state-of-the-art x-vector diarization and linking system.

Link to publication

A five-layer architecture for big data processing and analytics

J.Y. Zhu, B. Tang, and Victor O.K. Li, International Journal of Big Data Intelligence, Vol. 6, pp. 38-49, Nov. 2019.

Abstract

Big data technologies have attracted much attention in recent years. Academia and industry have reached a consensus that the ultimate goal of big data is to transform 'big data' into 'real value'. In this article, we discuss how to achieve this goal and propose a five-layer architecture for big data processing and analytics (BDPA), including a collection layer, a storage layer, a processing layer, an analytics layer, and an application layer. The five-layer architecture aims to set up a de facto standard for current BDPA solutions, to collect, manage, process, and analyse the vast volume of both static data and online data streams, and make valuable decisions for all types of industries. Functionalities and challenges of the five layers are illustrated, with the most recent technologies and solutions discussed accordingly. We conclude with the requirements for future BDPA solutions, which may serve as a foundation for the future big data ecosystem.

Link to publication

Go From the General to the Particular: Multi-Domain Translation with Domain Transformation Networks

Y Wang, L Wang, S Shi, Victor O.K. Li, Z. Tu, arXiv:1911.09912 [cs.CL], Nov. 2019.

Abstract

The key challenge of multi-domain translation lies in simultaneously encoding both the general knowledge shared across domains and the particular knowledge distinctive to each domain in a unified model. Previous work shows that the standard neural machine translation (NMT) model, trained on mixed-domain data, generally captures the general knowledge, but misses the domain-specific knowledge. In response to this problem, we augment NMT model with additional domain transformation networks to transform the general representations to domain-specific representations, which are subsequently fed to the NMT decoder. To guarantee the knowledge transformation, we also propose two complementary supervision signals by leveraging the power of knowledge distillation and adversarial learning. Experimental results on several language pairs, covering both balanced and unbalanced multi-domain translation, demonstrate the effectiveness and universality of the proposed approach. Encouragingly, the proposed unified model achieves comparable results with the fine-tuning approach that requires multiple models to preserve the particular knowledge. Further analyses reveal that the domain transformation networks successfully capture the domain-specific knowledge as expected.

Link to publication

Reconstructing Capsule Networks for Zero-shot Intent Classification

Han Liu, Xiaotong Zhang, Lu Fan, Xuandi Fu, Qimai Li, Xiao-Ming Wu, and Albert Y.S. Lam, in Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, Nov. 2019.

Abstract

Intent classification is an important building block of dialogue systems. With the burgeoning of conversational AI, existing systems are not capable of handling numerous fast-emerging intents, which motivates zero-shot intent classification. Nevertheless, research on this problem is still in the incipient stage and few methods are available. A recently proposed zero-shot intent classification method, IntentCapsNet, has been shown to achieve state-of-the-art performance. However, it has two unaddressed limitations: (1) it cannot deal with polysemy when extracting semantic capsules; (2) it hardly recognizes the utterances of unseen intents in the generalized zero-shot intent classification setting. To overcome these limitations, we propose to reconstruct capsule networks for zero-shot intent classification. First, we introduce a dimensional attention mechanism to fight against polysemy. Second, we reconstruct the transformation matrices for unseen intents by utilizing abundant latent information of the labeled utterances, which significantly improves the model generalization ability. Experimental results on two task-oriented dialogue datasets in different languages show that our proposed method outperforms IntentCapsNet and other strong baselines.

Link to publication

Public Transport Waiting Time Estimation Using Semi-Supervised Graph Convolutional Networks

Kai Fung Chu, Albert Y.S. Lam, Becky P.Y. Loo, and Victor O.K. Li, in Proceedings of the 22nd IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2019), Auckland, New Zealand, Oct. 2019.

Abstract

An effective transportation system is important for supporting various human activities in a modern smart city. The waiting time at various stations has great impacts on the overall transportation system efficiency and people's health like stress and anxiety. Knowing the waiting time at different locations in advance can assist the travelers to plan their trips. However, such waiting time may depend on many factors like crowdedness and the collective travel behaviors of the travellers involved. In general, it is very expensive to collect all the required data at every location. In this paper, a deep learning approach is proposed for determining the waiting time levels at public transport stations based on some proxy data and limited historical waiting time data at some stations. We formulate the public transportation network as a graph and develop a semi-supervised classification model based on Graph Convolutional Networks which can operate directly on the graph-structured data with limited labelled data. We conduct experiments for the mass transit railway in Hong Kong with real data and our proposed approach can achieve 89% accuracy of classifying the waiting time levels.
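
To make the graph formulation in this abstract more concrete, the sketch below implements one layer of the widely used graph-convolution propagation rule, ReLU(D^{-1/2} (A + I) D^{-1/2} H W), in pure Python. The abstract does not give the paper's exact architecture, so this is only an assumed, generic layer; the helper names `normalize_adjacency`, `matmul`, and `gcn_layer` are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each station keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalize_adjacency(adj), features)
    projected = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in projected]
```

In a semi-supervised setting, a loss would be computed only on the nodes (stations) that have labels, while the propagation step still mixes information from unlabelled neighbours.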

Link to publication

Synchrophasor Recovery and Prediction: A Graph-Based Deep Learning Approach

J. J. Q. Yu, D. J. Hill, V. O. K. Li and Y. Hou, in IEEE Internet of Things Journal, vol. 6, no. 5, pp. 7348-7359, Oct. 2019.

Abstract

Data integrity of power system states is critical to modern power grid operation and control. Due to communication latency, state measurements are not immediately available at the control center, rendering slow responses of time-sensitive applications. In this paper, a new graph-based deep learning approach is proposed to recover and predict the states ahead of time utilizing the power network topology and existing measurements. A graph-convolutional recurrent adversarial network is devised to process available information and extract graphical and temporal data correlations. This approach overcomes drawbacks of the existing synchrophasor recovery and prediction implementation to improve the overall system performance. Additionally, the approach offers an adaptive data processing method to handle power grids of various sizes. Case studies demonstrate the outstanding recovery and prediction accuracy of the proposed approach, and investigations are conducted to illustrate its robustness against bad communication conditions, measurement noise, and system topology changes.

Link to publication

Improved Zero-shot Neural Machine Translation via Ignoring Spurious Correlations

J. Gu, Y. Wang, K. Cho, and Victor O.K. Li, in Proceedings of 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, Jul. 2019.

Abstract

Zero-shot translation, translating between language pairs on which a Neural Machine Translation (NMT) system has never been trained, is an emergent property when training the system in multilingual settings. However, naive training for zero-shot NMT easily fails, and is sensitive to hyper-parameter setting. The performance typically lags far behind the more conventional pivot-based approach which translates twice using a third language as a pivot. In this work, we address the degeneracy problem due to capturing spurious correlations by quantitatively analyzing the mutual information between language IDs of the source and decoded sentences. Inspired by this analysis, we propose to use two simple but effective approaches: (1) decoder pre-training; (2) back-translation. These methods show significant improvement (4~22 BLEU points) over the vanilla zero-shot translation on three challenging multilingual datasets, and achieve similar or better results than the pivot-based approach.
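
The mutual-information analysis mentioned in this abstract can be illustrated with a small empirical estimator. The sketch below assumes the data is available as a list of (source language ID, decoded language ID) pairs, which is a hypothetical input format; the function name `mutual_information` is likewise an assumption and the paper's own estimation procedure may differ.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between the two items of each pair."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one pair")
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts.
        ratio = count * n / (left[x] * right[y])
        mi += p_xy * math.log(ratio)
    return mi
```

A value near zero means the decoded language carries no information about the source language, while a high value indicates that the two IDs are strongly coupled.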

Link to publication

Deep Multi-Scale Convolutional LSTM Network for Travel Demand and Origin-Destination Predictions

Kai Fung Chu, Albert Y.S. Lam, and Victor O.K. Li, to appear in IEEE Transactions on Intelligent Transportation Systems, 2019.

Abstract

Advancements in sensing and the Internet of Things (IoT) technologies generate a huge amount of data. Mobility on demand (MoD) service benefits from the availability of big data in the intelligent transportation system. Given the future travel demand or origin-destination (OD) flows prediction, service providers can pre-allocate unoccupied vehicles to the customers' origins of service to reduce waiting time. Traditional approaches on future travel demand and the OD flows predictions rely on statistical or machine learning methods. Inspired by deep learning techniques for image and video processing, through regarding localized travel demands as image pixels, a novel deep learning model called multi-scale convolutional long short-term memory network (MultiConvLSTM) is developed in this paper. Rather than using the traditional OD matrix which may lead to loss of geographical information, we propose a new data structure, called OD tensor to represent OD flows, and a manipulation method, called OD tensor permutation and matricization, is introduced to handle the high dimensionality features of OD tensor. MultiConvLSTM considers both temporal and spatial correlations to predict the future travel demand and OD flows. Experiments on real-world New York taxi data of around 400 million records are performed. Our results show that the MultiConvLSTM achieves the highest accuracy in both one-step and multiple-step predictions and it outperforms the existing methods for travel demand and OD flow predictions.

Link to publication

Domain Adaptation of End-to-end Speech Recognition in Low-resource Settings

Lahiru Samarakoon, Brian Mak, and Albert Y.S. Lam. IEEE Workshop on Spoken Language Technology (IEEE SLT 2018), Athens, Greece, Dec. 2018.

Abstract

End-to-end automatic speech recognition (ASR) has simplified the traditional ASR system building pipeline by eliminating the need to have multiple components and also the requirement for expert linguistic knowledge for creating pronunciation dictionaries. Therefore, end-to-end ASR fits well when building systems for new domains. However, one major drawback of end-to-end ASR is that, it is necessary to have a larger amount of labeled speech in comparison to traditional methods. Therefore, in this paper, we explore domain adaptation approaches for end-to-end ASR in low-resource settings. We show that joint domain identification and speech recognition by inserting a symbol for domain at the beginning of the label sequence, factorized hidden layer adaptation and a domain-specific gating mechanism improve the performance of a low-resource target domain. Furthermore, we also show the robustness of proposed adaptation methods to an unseen domain, when only 3 hours of untranscribed data is available with improvements reporting up to 8.7% relative.

Link to publication

Subspace Based Sequence Discriminative Training of LSTM Acoustic Models with Feed-Forward Layers

Lahiru Samarakoon, Brian Mak, and Albert Y.S. Lam. ISCSLP, Taipei, Taiwan, Nov. 2018.

Abstract

State-of-the-art automatic speech recognition (ASR) systems use sequence discriminative training for improved performance over frame-level cross-entropy (CE) criterion. Even though sequence discriminative training improves long short-term memory (LSTM) recurrent neural network (RNN) acoustic models (AMs), it is not clear whether these systems achieve the optimal performance due to overfitting. This paper investigates the effect of state-level minimum Bayes risk (sMBR) training on LSTM AMs and shows that the conventional way of performing sMBR by updating all LSTM parameters is not optimal. We investigate two methods to improve the performance of sequence discriminative training of LSTM AMs. First, more feed-forward (FF) layers are included between the last LSTM layer and the output layer so those additional FF layers may benefit more from sMBR training. Second, a subspace is estimated as an interpolation of rank-1 matrices when performing sMBR for the LSTM layers of the AM. Our methods are evaluated on the benchmark AMI single distance microphone (SDM) task. We find that the proposed approaches provide 1.6% absolute improvement over a strong sMBR trained LSTM baseline.

Link to publication

Travel Demand Prediction using Deep Multi-Scale Convolutional LSTM Network

Kai Fung Chu, Albert Y.S. Lam, and Victor O.K. Li. 21st IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2018), Maui, HI, Nov. 2018.

Abstract

Mobility on Demand transforms the way people travel in the city and facilitates real-time vehicle hiring services. Given the predicted future travel demand, service providers can coordinate their available vehicles such that they are pre-allocated to the customers’ origins of service in advance to reduce waiting time. Traditional approaches to future travel demand prediction rely on statistical or machine learning methods. Advancement in sensor technology generates a huge amount of data, which enables the data-driven intelligent transportation system. In this paper, inspired by deep learning techniques for image and video processing, we propose a new deep learning model, called Multi-Scale Convolutional Long Short-Term Memory (MultiConvLSTM), by considering travel demand as image pixel values. MultiConvLSTM considers both temporal and spatial correlations to predict the future travel demand. Experiments on real-world New York taxi data with around 400 million records are performed. We show that MultiConvLSTM outperforms the existing prediction methods for travel demand prediction and achieves the highest accuracy among all in both one-step and multiple-step predictions.

Link to publication

Delay Aware Power System Synchrophasor Recovery and Prediction Framework

James J.Q. Yu, Albert Y.S. Lam, David J. Hill, Yunhe Hou, and Victor O.K. Li. IEEE Transactions on Smart Grid, 2018.

Abstract

This paper presents a novel delay aware synchrophasor recovery and prediction framework to address the problem of missing power system state variables due to the existence of communication latency. This capability is particularly essential for dynamic power system scenarios where fast remedial control actions are required due to system events or faults. While a wide area measurement system can sample high-frequency system states with phasor measurement units, the control center cannot obtain them in real-time due to latency and data loss. In this work, a synchrophasor recovery and prediction framework and its practical implementation are proposed to recover the current system state and predict the future states utilizing existing incomplete synchrophasor data. The framework establishes an iterative prediction scheme, and the proposed implementation adopts recent machine learning advances in data processing. Simulation results indicate the superior accuracy and speed of the proposed framework, and investigations are made to study its sensitivity to various communication delay patterns for pragmatic applications.

Link to publication

Delay Aware Transient Stability Assessment with Synchrophasor Recovery and Prediction Framework

James J.Q. Yu, David J. Hill, and Albert Y.S. Lam. Neurocomputing, 2018.

Abstract

Transient stability assessment is critical for power system operation and control. Existing related research makes a strong assumption that the data transmission time for system variable measurements to arrive at the control center is negligible, which is unrealistic. In this paper, we focus on investigating the impact of data transmission latency on synchrophasor-based transient stability assessment. In particular, we employ a recently proposed methodology named synchrophasor recovery and prediction framework to handle the latency issue and make up missing synchrophasors. Advanced deep learning techniques are adopted to utilize the processed data for assessment. Compared with existing work, our proposed mechanism can make accurate assessments with a significantly faster response speed.

Link to publication

Intelligent Time-Adaptive Transient Stability Assessment System

James J.Q. Yu, David J. Hill, Albert Y.S. Lam, Jiatao Gu, and Victor O.K. Li. IEEE Transactions on Power Systems, vol. 33, no. 1, pp. 1049–1058, Jan. 2018.

Abstract

Online identification of postcontingency transient stability is essential in power system control, as it facilitates the grid operator to decide and coordinate system failure correction control actions. Utilizing machine learning methods with synchrophasor measurements for transient stability assessment has received much attention recently with the gradual deployment of wide-area protection and control systems. In this paper, we develop a transient stability assessment system based on the long short-term memory network. By proposing a temporal self-adaptive scheme, our proposed system aims to balance the trade-off between assessment accuracy and response time, both of which may be crucial in real-world scenarios. Compared with previous work, the most significant enhancement is that our system learns from the temporal data dependencies of the input data, which contributes to better assessment accuracy. In addition, the model structure of our system is relatively less complex, speeding up the model training process. Case studies on three power systems demonstrate the efficacy of the proposed transient stability assessment system.

Link to publication

Neural Machine Translation with Gumbel-Greedy Decoding

Jiatao Gu, Daniel Jiwoong Im, Victor OK Li. AAAI Conference on Artificial Intelligence (AAAI), 2018.

Abstract

Previous neural machine translation models used some heuristic search algorithms (e.g., beam search) in order to avoid solving the maximum a posteriori problem over translation sentences at test time. In this paper, we propose the Gumbel-Greedy Decoding which trains a generative network to predict translation under a trained model. We solve such a problem using the Gumbel-Softmax reparameterization, which makes our generative network differentiable and trainable through standard stochastic gradient methods. We empirically demonstrate that our proposed model is effective for generating sequences of discrete words.
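
The Gumbel-Softmax reparameterization named in this abstract can be sketched in a few lines. The code below only shows the forward sampling step in pure Python: it adds Gumbel noise to the logits and applies a temperature-scaled softmax. It is a sketch under stated assumptions and not the paper's model; the function name `gumbel_softmax`, the `rng` parameter, and the clamping constant are illustrative choices, and a real implementation would sit inside an automatic-differentiation framework so that gradients can flow through the sample.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed categorical sample from `logits`.

    `rng` is any object with a `random()` method returning a float in [0, 1).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    eps = 1e-12  # assumed clamp that keeps both logarithms finite
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures push the output closer to a one-hot vector, while higher temperatures give smoother distributions; the output always sums to one.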

Link to publication

Non-Autoregressive Neural Machine Translation

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, Richard Socher. International Conference on Learning Representations (ICLR), 2018.

Abstract

Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English-German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English-Romanian.

Link to publication

Universal Neural Machine Translation for Extremely Low Resource Languages

Jiatao Gu, Hany Hassan, Jacob Devlin, Victor OK Li. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.

Abstract

In this paper, we propose a new universal machine translation approach focusing on languages with a limited amount of parallel data. Our proposed approach utilizes a transfer-learning approach to share lexical and sentence level representations across multiple source languages into one target language. The lexical part is shared through a Universal Lexical Representation to support multilingual word-level sharing. The sentence-level sharing is represented by a model of experts from all source languages that share the source encoders with all other languages. This enables the low-resource language to utilize the lexical and sentence representations of the higher resource languages. Our approach is able to achieve 23 BLEU on Romanian-English WMT2016 using a tiny parallel corpus of 6k sentences, compared to the 18 BLEU of a strong baseline system that uses multilingual training and back-translation. Furthermore, we show that the proposed approach can achieve almost 20 BLEU on the same dataset through fine-tuning a pre-trained multi-lingual system in a zero-shot setting.

Link to publication

Delay Aware Intelligent Transient Stability Assessment System

James J.Q. Yu, Albert Y.S. Lam, David J. Hill, and Victor O.K. Li. IEEE Access, vol. 5, pp. 17230–17239, Dec. 2017.

Abstract

Transient stability assessment is a critical tool for power system design and operation. With the emerging advanced synchrophasor measurement techniques, machine learning methods are playing an increasingly important role in power system stability assessment. However, most existing research makes a strong assumption that the measurement data transmission delay is negligible. In this paper, we focus on investigating the influence of communication delay on synchrophasor-based transient stability assessment. In particular, we develop a delay aware intelligent system to address this issue. By utilizing an ensemble of multiple long short-term memory networks, the proposed system can make early assessments to achieve a much shorter response time by utilizing incomplete system variable measurements. Compared with existing work, our system is able to make accurate assessments with a significantly improved efficiency. We perform numerous case studies to demonstrate the superiority of the proposed intelligent system, in which accurate assessments can be developed with time one third less than state-of-the-art methodologies. Moreover, the simulations indicate that noise in the measurements has trivial impact on the assessment performance, demonstrating the robustness of the proposed system.

Link to publication

An Extended Spatio-temporal Granger Causality Model for Air Quality Estimation with Heterogeneous Urban Big Data

Zhu, J.Y., Sun, C., and Li, V.O.K., IEEE Transactions on Big Data, vol. 3, no. 3, pp. 307-319, Jul. 2017.

Abstract

This paper deals with city-wide air quality estimation with limited air quality monitoring stations which are geographically sparse. Since air pollution is influenced by urban dynamics (e.g., meteorology and traffic) which are available throughout the city, we can infer the air quality in regions without monitoring stations based on such spatial-temporal (ST) heterogeneous urban big data. However, big data-enabled estimation poses three challenges. The first challenge is data diversity, i.e., there are many different categories of urban data, some of which may be useless for the estimation. To overcome this, we extend Granger causality to the ST space to analyze all the causality relations in a consistent manner. The second challenge is the computational complexity due to processing the massive volume of data. To overcome this, we introduce the non-causality test to rule out urban dynamics that do not “Granger” cause air pollution, and the region of influence (ROI), which enables us to only analyze data with the highest causality levels. The third challenge is to adapt our grid-based algorithm to non-grid-based applications. By developing a flexible grid-based estimation algorithm, we can decrease the inaccuracies due to the grid-based algorithm while maintaining computation efficiency.

Link to publication

Search Engine Guided Non-Parametric Neural Machine Translation

Gu, J., Wang, Y., Cho, K., and Li, V.O.K., arXiv:1705.07267, May 2017.

Abstract

In this paper, we extend an attention-based neural machine translation (NMT) model by allowing it to access an entire training set of parallel sentence pairs even after training. The proposed approach consists of two stages. In the first stage (the retrieval stage), an off-the-shelf, black-box search engine is used to retrieve a small subset of sentence pairs from a training set given a source sentence. These pairs are further filtered by a fuzzy matching score based on edit distance. In the second stage (the translation stage), a novel translation model, called translation memory enhanced NMT (TM-NMT), seamlessly uses both the source sentence and a set of retrieved sentence pairs to perform the translation. Empirical evaluation on three language pairs (En-Fr, En-De, and En-Es) shows that the proposed approach significantly outperforms the baseline approach and the improvement is more significant when more relevant sentence pairs were retrieved.

Link to publication

A Teacher-Student Framework for Zero-Resource Neural Machine Translation

Chen, Y., Liu, Y., Cheng, Y., and Li, V.O.K., arXiv:1705.00753, 2017.

Abstract

While end-to-end neural machine translation (NMT) has made remarkable progress recently, it still suffers from the data scarcity problem for low-resource language pairs and domains. In this paper, we propose a method for zero-resource NMT by assuming that parallel sentences have close probabilities of generating a sentence in a third language. Based on this assumption, our method is able to train a source-to-target NMT model ("student") without parallel corpora available, guided by an existing pivot-to-target NMT model ("teacher") on a source-pivot parallel corpus. Experimental results show that the proposed method significantly improves over a baseline pivot-based model by +3.0 BLEU points across various language pairs.

Link to publication

Intelligent Fault Detection Scheme for Microgrids with Wavelet-based Deep Neural Networks

James J.Q. Yu, Yunhe Hou, Albert Y.S. Lam, and Victor O.K. Li, to appear in IEEE Transactions on Smart Grid, 2017.

Abstract

Fault detection is essential in microgrid control and operation, as it enables the system to perform fast fault isolation and recovery. The adoption of inverter-interfaced distributed generation in microgrids makes traditional fault detection schemes inappropriate due to their dependence on significant fault currents. In this paper, we devise an intelligent fault detection scheme for microgrids based on wavelet transform and deep neural networks. The proposed scheme aims to provide fast fault type, phase, and location information for microgrid protection and service recovery. In the scheme, branch current measurements sampled by protective relays are pre-processed by discrete wavelet transform to extract statistical features. Then all available data is input into deep neural networks to develop fault information. Compared with previous work, the proposed scheme can provide significantly better fault type classification accuracy. Moreover, the scheme can also detect the locations of faults, which are unavailable in previous work. To evaluate the performance of the proposed fault detection scheme, we conduct a comprehensive evaluation study on the CERTS microgrid and IEEE 34-bus system. The simulation results demonstrate the efficacy of the proposed scheme in terms of detection accuracy, computation time, and robustness against measurement uncertainty.

Link to publication

Trainable Greedy Decoding for the Neural Machine Translation

Gu, J., Cho, K., Li, V.O.K., arXiv:1702.02429, 2017.

Abstract

Recent research in neural machine translation has largely focused on two aspects: neural network architectures and end-to-end learning algorithms. The problem of decoding, however, has received relatively little attention from the research community. In this paper, we solely focus on the problem of decoding given a trained neural machine translation model. Instead of trying to build a new decoding algorithm for any specific decoding objective, we propose the idea of a trainable decoding algorithm in which we train a decoding algorithm to find a translation that maximizes an arbitrary decoding objective. More specifically, we design an actor that observes and manipulates the hidden state of the neural machine translation decoder and propose to train it using a variant of deterministic policy gradient. We extensively evaluate the proposed algorithm using four language pairs and two decoding objectives and show that we can indeed train a trainable greedy decoder that generates a better translation (in terms of a target decoding objective) with minimal computational overhead.

Link to publication

A Four-Layer Architecture for Online and Historical Big Data Analytics

Zhu, J. Y., Xu, J., and Li, V.O.K., Proc. IEEE DataCom, Auckland, New Zealand, Aug 2016.

Abstract

Big data processing and analytics technologies have drawn much attention in recent years. However, the recent explosive growth of online data streams brings new challenges to the existing technologies. These online data streams tend to be massive, continuously arriving, heterogeneous, time-varying and unbounded. Therefore, it is necessary to have an integrated approach to process both big static data and online big data streams. We call this integrated approach online and historical big data analytics (OHBDA). We propose a four-layer architecture of OHBDA, comprising the storage layer, the online and historical data processing layer, the analytics layer, and the decision-making layer. Functionalities and challenges of the four layers are further discussed. We conclude with a discussion of the requirements for future OHBDA solutions, which may serve as a foundation for future big data analytics research.

Link to publication

Incorporating Copying Mechanism in Sequence-to-Sequence Learning

Gu, J., Lu, Z., Li, H., and Li, V.O.K., Proc. Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany, Aug 2016.

Abstract

We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation. The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation. In this paper, we incorporate copying into neural network-based Seq2Seq learning and propose a new model called CopyNet with encoder-decoder structure. CopyNet can nicely integrate the regular way of word generation in the decoder with the new copying mechanism which can choose sub-sequences in the input sequence and put them at proper places in the output sequence. Our empirical study on both synthetic data sets and real world data sets demonstrates the efficacy of CopyNet. For example, CopyNet can outperform regular RNN-based model with remarkable margins on text summarization tasks.

Link to publication

A Gaussian Bayesian Model to Identify Spatio-temporal Causalities for Air Pollution Based on Urban Big Data

Zhu, J. Y., Zheng, Y., Yi, X., and Li, V.O.K., SmartCity16: The 2nd IEEE INFOCOM Workshop on Smart Cities and Urban Computing, San Francisco, California, USA, April 2016.

Abstract

Identifying the causalities for air pollutants and answering questions, such as, where do Beijing's air pollutants come from, are crucial to inform government decision-making. In this paper, we identify the spatio-temporal (ST) causalities among air pollutants at different locations by mining the urban big data. This is challenging for two reasons: 1) since air pollutants can be generated locally or dispersed from the neighborhood, we need to discover the causes in the ST space from many candidate locations with time efficiency; 2) the cause-and-effect relations between air pollutants are further affected by confounding variables like meteorology. To tackle these problems, we propose a coupled Gaussian Bayesian model with two components: 1) a Gaussian Bayesian Network (GBN) to represent the cause-and-effect relations among air pollutants, with an entropy-based algorithm to efficiently locate the causes in the ST space; 2) a coupled model that combines cause-and-effect relations with meteorology to better learn the parameters while eliminating the impact of confounding. The proposed model is verified using air quality and meteorological data from 52 cities over the period Jun 1st 2013 to May 1st 2015. Results show superiority of our model beyond baseline causality learning methods, in both time efficiency and prediction accuracy.

Link to publication

Learning to Translate in Real-time with Neural Machine Translation

Gu, J., Neubig, G., Cho, K., and Li, V.O.K., arXiv:1610.00388, 2016.

Abstract

Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.

Link to publication

Pg-Causality: Identifying Spatiotemporal Causal Pathways for Air Pollutants with Urban Big Data

Zhu, J.Y., Zhang, C., Zhi, S., Li, V.O.K., Han, J., Zheng, Y., arXiv:1610.07045, 2016.

Abstract

Many countries are suffering from severe air pollution. Understanding how different air pollutants accumulate and propagate is critical to making relevant public policies. In this paper, we use urban big data (air quality data and meteorological data) to identify the spatiotemporal (ST) causal pathways for air pollutants. This problem is challenging because: (1) there are numerous noisy and low-pollution periods in the raw air quality data, which may lead to unreliable causality analysis, (2) for large-scale data in the ST space, the computational complexity of constructing a causal structure is very high, and (3) the ST causal pathways are complex due to the interactions of multiple pollutants and the influence of environmental factors. Therefore, we present p-Causality, a novel pattern-aided causality analysis approach that combines the strengths of pattern mining and Bayesian learning to efficiently and faithfully identify the ST causal pathways. First, pattern mining helps suppress the noise by capturing frequent evolving patterns (FEPs) of each monitoring sensor, and greatly reduce the complexity by selecting the pattern-matched sensors as "causers". Then, Bayesian learning carefully encodes the local and ST causal relations with a Gaussian Bayesian network (GBN)-based graphical model, which also integrates environmental influences to minimize biases in the final results. We evaluate our approach with three real-world data sets containing 982 air quality sensors, in three regions of China from 01-Jun-2013 to 19-Dec-2015. Results show that our approach outperforms the traditional causal structure learning methods in time efficiency, inference accuracy and interpretability.

Link to publication

Efficient Learning for Undirected Topic Models

Gu, J. and Li, V.O.K., Proc. ACL-IJCNLP, Beijing, China, July 2015.

Abstract

Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel estimator to speed up the learning based on Noise Contrastive Estimate, extended for documents of variant lengths and weighted inputs. Experiments on two benchmarks show that the new estimator achieves great learning efficiency and high accuracy on document retrieval and classification.

Link to publication

Granger-Causality-Based Air Quality Estimation with Spatio-Temporal (S-T) Heterogeneous Big Data

Zhu, Y., Sun, C., and Li, V.O.K., Proc. IEEE INFOCOM Smart City Workshop, Hong Kong, China, April 2015.

Abstract

This paper considers city-wide air quality estimation with limited available monitoring stations which are geographically sparse. Since air pollution is highly spatio-temporal (S-T) dependent and considerably influenced by urban dynamics (e.g., meteorology and traffic), we can infer the air quality not covered by monitoring stations with S-T heterogeneous urban big data. However, estimating air quality using S-T heterogeneous big data poses two challenges. The first challenge is due to data diversity, i.e., there are different categories of urban dynamics and some may be useless and even detrimental for the estimation. To overcome this, we first propose an S-T extended Granger causality model to analyze all the causalities among urban dynamics in a consistent manner. Then, by implementing a non-causality test, we rule out the urban dynamics that do not “Granger” cause air pollution. The second challenge is due to the time complexity when processing the massive volume of data. We propose to discover the region of influence (ROI) by selecting data with the highest causality levels spatially and temporally. Results show that we achieve higher accuracy using “part” of the data than “all” of the data. This may be explained by the most influential data eliminating errors induced by redundant or noisy data. The causality model observation and the city-wide air quality map are illustrated and visualized using data from Shenzhen, China.

Link to publication

Spatio-temporal (S-T) similarity model for constructing WIFI-based RSSI fingerprinting map for indoor localization

Zhu, Y., Zheng, X., Xu, J., and Li, V.O.K., Proc. Fifth International Conference on Indoor Positioning and Indoor Navigation (IPIN 2014), Busan, Korea, Oct 2014.

Abstract

WIFI-based received signal strength indicator (RSSI) fingerprinting is widely used for indoor localization due to desirable features such as universal availability, privacy protection, and low deployment cost. The key of RSSI fingerprinting is to construct a trustworthy RSSI map, which contains the measurements of received access point (AP) signal strengths at different calibration points. Location can be estimated by matching live RSSIs with the RSSI map. However, a fine-grained map requires much labor and time. This calls for developing efficient interpolation and approximation methods. Besides, due to environmental changes, the RSSI map requires periodical updates to guarantee localization accuracy. In this paper, we propose a spatio-temporal (S-T) similarity model which uses the S-T correlation to construct a fine-grained and up-to-date RSSI map. Five S-T correlation metrics are proposed, i.e., the spatial distance, signal similarity, similarity likelihood, RSSI vector distance, and the S-T reliability. This model is evaluated based on experiments in our indoor WIFI positioning system test bed. Results show improvements in both the interpolation accuracy (up to 7%) and localization accuracy (up to 32%), compared to four commonly used RSSI map construction methods, namely, linear interpolation, cubic interpolation, nearest neighbor interpolation, and compressive sensing.

Link to publication

Performance Models of Access Latency in Cloud Storage Systems

Shuai, Q., Li, V.O.K., and Zhu, Y., Proc. Fourth Workshop on Architectures and Systems for Big Data, Minneapolis, MN, US, June 14, 2014.

Abstract

Access latency is a key performance metric for cloud storage systems and has great impact on user experience, but most papers focus on other performance metrics such as storage overhead, repair cost and so on. Only recently do some models argue that coding can reduce access latency. However, they are developed for special scenarios, which may not reflect reality. To fill the gaps between existing work and practice, in this paper, we propose a more practical model to measure access latency. This model can also be used to compare access latency of different codes used by different companies. To the best of our knowledge, this model is the first to provide a general method to compare access latencies of different erasure codes.

Link to publication