2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean

AI summary

This paper introduces a Sparsely-Gated Mixture-of-Experts layer (MoE) that increases model capacity by more than 1000x with only minor losses in computational efficiency. It applies a MoE with up to 137 billion parameters convolutionally between stacked LSTM layers and achieves significantly better results than the prior state of the art on large language modeling and machine translation benchmarks, at lower computational cost.

Main Contributions

  • Introduces the Sparsely-Gated Mixture-of-Experts layer (MoE), a new type of neural network component for conditional computation.
  • The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of experts to process each input.
  • Addresses the practical challenges of conditional computation: modern computing devices are much faster at arithmetic than at branching; large batch sizes are critical for performance; network bandwidth can be a bottleneck; additional loss terms may be needed to achieve the desired level of sparsity; and model capacity matters most for very large datasets.
  • Achieves greater than 1000x improvements in model capacity with only minor losses in computational efficiency, and significantly advances state-of-the-art results on public language modeling and translation datasets.
  • Applies a MoE convolutionally between stacked LSTM layers on both language modeling and machine translation benchmarks.
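
The gating idea in the contributions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes a linear gating network followed by a softmax over the k largest logits, it omits the noise and auxiliary loss terms the paper discusses, and every name in it (`top_k_gates`, `moe_forward`, `w_gate`, the layer sizes) is hypothetical.

```python
import math
import random

def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other gate is exactly 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def matvec(matrix, vec):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def expert_forward(expert, x):
    """One-hidden-layer feed-forward expert with a ReLU activation (an assumption)."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)

def moe_forward(x, w_gate, experts, k=2):
    """Gate-weighted sum of the outputs of the k selected experts."""
    gates = top_k_gates(matvec(w_gate, x), k)
    y = [0.0] * len(x)
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # unselected experts are never evaluated
        for j, value in enumerate(expert_forward(expert, x)):
            y[j] += gate * value
    return y, gates

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 16
w_gate = random_matrix(n_experts, d_model, rng)
experts = [
    (random_matrix(d_hidden, d_model, rng), random_matrix(d_model, d_hidden, rng))
    for _ in range(n_experts)
]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
y, gates = moe_forward(x, w_gate, experts, k=2)
```

Because the loop skips every expert whose gate is zero, the work per input grows with k rather than with the total number of experts, which is the property that lets capacity grow without a proportional increase in computation.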

Abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
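
The capacity-versus-computation argument in the abstract can be made concrete with a rough count. The symbols below are assumptions introduced for illustration: n is the number of experts, k the number of experts active per example, and P_e the parameter count of one expert. The count ignores the gating network and the rest of the model.

```latex
\begin{align*}
  \text{MoE parameters} &\approx n \, P_e, \\
  \text{expert computation per example} &\propto k \, P_e, \\
  \frac{\text{parameters}}{\text{parameters touched per example}} &\approx \frac{n}{k}.
\end{align*}
```

Under these assumptions, a hypothetical layer with n = 1024 and k = 4 would hold 256 times more expert parameters than it evaluates for any single example.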
