Lightelligence employees Matthew Khoury, Rumen Dangovski, and Dr. Longwu Ou, along
with CEO Dr. Yichen Shen and co-founder Dr. Li Jing, presented their recent work “Vector-
Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference
in NLP Applications” at the Conference on Empirical Methods in Natural Language
Processing (EMNLP).

Deep neural networks have become the standard approach to building
reliable Natural Language Processing (NLP) applications, ranging from Neural Machine
Translation (NMT) to dialogue systems. However, improving accuracy by increasing the
model size requires a large number of hardware computations, which can slow down NLP
applications significantly at inference time. To address this issue, the authors propose a
novel vector-vector-matrix architecture (VVMA), which greatly reduces the latency at
inference time for NMT. This architecture takes advantage of specialized hardware that has
low-latency vector-vector operations and higher-latency vector-matrix operations. It also
reduces the number of parameters and FLOPs for virtually all models that rely on efficient
matrix multipliers without significantly impacting accuracy. The authors present empirical
results suggesting that their framework can reduce the latency of sequence-to-sequence
and Transformer models used for NMT by a factor of four. Finally, the authors show
evidence suggesting that their VVMA extends to other domains, and they discuss novel
hardware for its efficient use.
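To make the core idea concrete, below is a minimal NumPy sketch of a layer built in the VVMA spirit: cheap element-wise (vector-vector) scalings wrapped around a single matrix multiply that is shared across layers, so the expensive matrix never has to be reloaded. All names here (VVMALayer, W_shared, v_in, v_out) are illustrative rather than taken from the paper, and the sketch omits training details.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 512  # hidden size (illustrative)

# One expensive weight matrix, shared across layers. On hardware with
# high-latency vector-matrix operations, reusing a single resident matrix
# amortizes that cost across the whole network.
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)


class VVMALayer:
    """Illustrative vector-vector-matrix layer (names are ours, not the paper's).

    Approximates a per-layer matrix-vector product W_i @ x with
        v_out * (W_shared @ (v_in * x)),
    where v_in and v_out are per-layer vectors applied with low-latency
    element-wise (vector-vector) operations.
    """

    def __init__(self, dim):
        # Per-layer trainable vectors: O(d) parameters instead of O(d^2).
        self.v_in = rng.standard_normal(dim)
        self.v_out = rng.standard_normal(dim)

    def __call__(self, x):
        return self.v_out * (W_shared @ (self.v_in * x))


# Usage: stack several layers; only the vectors differ between them.
layers = [VVMALayer(d) for _ in range(4)]
x = rng.standard_normal(d)
for layer in layers:
    x = layer(x)
print(x.shape)  # (512,)
```

Because only the vectors change from layer to layer, the shared matrix can stay loaded in the slow-to-reload matrix multiplier while the fast element-wise operations do the per-layer work, which is where the latency savings described in the paper come from.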