2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean

AI summary

This paper introduces a Sparsely-Gated Mixture-of-Experts layer (MoE) that achieves over 1000x improvement in model capacity. It applies a MoE with up to 137 billion parameters convolutionally between stacked LSTM layers, achieving significantly better results on large language modeling and machine translation benchmarks.

Main Contributions

Introduces the Sparsely-Gated Mixture-of-Experts layer (MoE), a new type of neural network component for conditional computation.
The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of experts to process each input.
Addresses the main challenges of conditional computation: modern computing devices are much faster at arithmetic than at branching; large batch sizes are critical for performance; network bandwidth can be a bottleneck; additional loss terms may be necessary to achieve the desired level of sparsity; and model capacity matters most for very large datasets.
Achieves greater than 1000x improvements in model capacity with only minor losses in computational efficiency, and significantly advances the state of the art on public language modeling and translation datasets.
Applies a MoE convolutionally between stacked LSTM layers on both language modeling and machine translation benchmarks.
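
The gating idea described in the contributions above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a plain top-k softmax gate, and it leaves out any extra loss terms and all batching or distribution concerns. The names `make_expert`, `top_k_gate`, and `moe_forward`, and all sizes, are hypothetical.

```python
import math
import random


def make_expert(dim, hidden, rng):
    """Build one expert: a tiny feed-forward net (dim -> hidden -> dim) with ReLU."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]

    def expert(x):
        h = [max(0.0, sum(x[i] * w1[i][j] for i in range(dim))) for j in range(hidden)]
        return [sum(h[j] * w2[j][o] for j in range(hidden)) for o in range(dim)]

    return expert


def top_k_gate(x, w_gate, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: softmax is taken over the selected logits only, so every
    other expert receives a weight of exactly zero.
    """
    n_experts = len(w_gate[0])
    logits = [sum(x[i] * w_gate[i][e] for i in range(len(x))) for e in range(n_experts)]
    chosen = sorted(range(n_experts), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in chosen)  # subtract the max for numerical stability
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def moe_forward(x, experts, w_gate, k):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_gate(x, w_gate, k)
    y = [0.0] * len(x)
    for e, weight in gates.items():
        out = experts[e](x)  # unselected experts are never evaluated
        y = [a + weight * b for a, b in zip(y, out)]
    return y, gates


# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
DIM, HIDDEN, N_EXPERTS, K = 4, 8, 16, 2
experts = [make_expert(DIM, HIDDEN, rng) for _ in range(N_EXPERTS)]
w_gate = [[rng.gauss(0, 1.0) for _ in range(N_EXPERTS)] for _ in range(DIM)]

x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_forward(x, experts, w_gate, K)
print("selected experts:", sorted(gates))
print("output:", y)
```

Only `K` of the `N_EXPERTS` experts run for a given input, which is why adding experts grows the parameter count without a matching growth in per-example computation.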

Abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
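
The abstract's claim that capacity can grow without a proportional increase in computation comes down to simple arithmetic. The sketch below uses made-up numbers (they are not taken from the paper) and ignores the gating network's own parameters; `moe_param_counts` is a hypothetical helper.

```python
def moe_param_counts(n_experts, k_active, params_per_expert):
    """Return (total expert parameters, expert parameters used per example)."""
    total = n_experts * params_per_expert
    active = k_active * params_per_expert
    return total, active


# Illustrative values only: 1024 experts, 4 active per example, 1M parameters each.
total, active = moe_param_counts(n_experts=1024, k_active=4, params_per_expert=1_000_000)
print(total, active, total // active)
```

With these assumed values, the layer stores 256 times more expert parameters than any single example touches, so the ratio of capacity to per-example compute is roughly the number of experts divided by the number selected.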
