PAPER: Vector-vector-matrix Architecture: a Novel Hardware-aware Framework for Low-latency Inference in NLP Applications

Lightelligence employees Matthew Khoury, Rumen Dangovski, and Dr. Longwu Ou, along with CEO Dr. Yichen Shen and co-founder Dr. Li Jing, presented their recent work "Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications" at the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems. However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly at inference time. To address this issue, the authors propose a novel vector-vector-matrix architecture (VVMA), which greatly reduces the latency at inference time for NMT. This architecture takes advantage of specialized hardware that has low-latency vector-vector operations and higher-latency vector-matrix operations. It also reduces the number of parameters and FLOPs for virtually all models that rely on efficient matrix multipliers, without significantly impacting accuracy. The authors present empirical results suggesting that their framework can reduce the latency of sequence-to-sequence and Transformer models used for NMT by a factor of four. Finally, the authors show evidence suggesting that their VVMA extends to other domains, and they discuss novel hardware for its efficient use.
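To make the hardware trade-off concrete, here is a minimal NumPy sketch of the general idea, not the authors' exact formulation: the hypothetical hardware applies one fixed matrix `W` quickly once it is loaded, while elementwise vector-vector products are cheap, so each layer can be approximated as `W @ (v_i * x)` with only a small per-layer vector `v_i` instead of a full per-layer matrix. All names and the parameter-count comparison below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_layers = 512, 6

# One shared matrix, standing in for the weights loaded once onto the
# (hypothetical) efficient matrix-multiplier hardware.
W = rng.standard_normal((n, n)) / np.sqrt(n)

# Cheap per-layer vectors: the only per-layer parameters in this sketch.
vs = [rng.standard_normal(n) for _ in range(num_layers)]

def vvma_layer(x, v):
    # Low-latency vector-vector (elementwise) product first,
    # then the higher-latency but shared vector-matrix product.
    return W @ (v * x)

x = rng.standard_normal(n)
for v in vs:
    x = vvma_layer(x, v)

# Parameter comparison: one shared n*n matrix plus num_layers vectors of
# size n, versus num_layers independent dense n*n matrices.
vvma_params = n * n + num_layers * n
dense_params = num_layers * n * n
print(vvma_params, dense_params)  # 265216 vs 1572864, roughly 6x fewer
```

The same structure also explains the FLOP savings: once `W` is resident on the accelerator, each additional layer costs only an `n`-element elementwise multiply on top of the shared matrix product, rather than loading and applying a fresh `n × n` matrix.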
