Papperoni

2017

One Model to Learn Them All

Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit

citations

Cite Score

20

AI summary

This paper introduces a MultiModel architecture, a single deep learning model that can simultaneously learn multiple tasks from various domains by incorporating convolutional layers, attention mechanisms, and sparsely-gated layers. It is trained concurrently on ImageNet, multiple translation tasks, COCO, a speech recognition corpus, and an English parsing task.

Main Contributions

Introduces a MultiModel architecture, a single deep-learning model that can simultaneously learn multiple tasks from various domains.
The architecture incorporates building blocks from multiple domains, including convolutional layers, an attention mechanism, and sparsely-gated layers.
Reports that adding a computational block never hurt performance in the experiments, even on tasks for which the block is not crucial, and in most cases improved it.
Tasks with less data benefit largely from joint training with other tasks.
Performance on large tasks degrades only slightly if at all.
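To make "sparsely-gated layers" concrete, here is a minimal sketch of top-k gating over a set of experts, in plain Python. It is not the paper's implementation: the function names, the toy scalar experts, and the choice of k=2 are assumptions made for illustration, and refinements such as noisy gating or load balancing are left out.

```python
import math


def top_k_gate(logits, k):
    """Return sparse mixture weights: softmax over the k largest logits, 0 elsewhere."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest gate logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    weights = [0.0] * len(logits)
    for i in top:
        weights[i] = exps[i] / total
    return weights


def sparse_mixture(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and combine them with the gate weights."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in enumerate(weights) if w > 0.0)


# Hypothetical toy experts acting on a scalar input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
weights = top_k_gate([0.1, 2.0, -1.0, 2.0], k=2)
y = sparse_mixture(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Only the experts with non-zero weight are evaluated, which is what keeps such a layer cheap even when the number of experts is large.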

Abstract

Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains. It contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly if at all.
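One way to picture "trained concurrently" is a loop that interleaves batches from every task into a single training stream. The sketch below assumes simple round-robin scheduling; the abstract does not say how batches are actually scheduled, and the function name, task names, and batch labels are hypothetical placeholders.

```python
from itertools import cycle, islice


def interleave_tasks(task_batches, steps):
    """Yield (task_name, batch) pairs, visiting tasks in round-robin order.

    Each task's batches are cycled, so a small dataset is revisited more
    often than a large one within the same number of steps.
    """
    if any(len(batches) == 0 for batches in task_batches.values()):
        raise ValueError("every task needs at least one batch")
    streams = {name: cycle(batches) for name, batches in task_batches.items()}
    order = cycle(task_batches)  # task names, in insertion order
    for name in islice(order, steps):
        yield name, next(streams[name])


# Hypothetical toy data: one larger task and one small task.
tasks = {"imagenet": ["img0", "img1", "img2"], "parsing": ["p0"]}
schedule = list(interleave_tasks(tasks, 5))
```

In a real setup each yielded batch would drive one update of the shared model; here the schedule is only collected into a list to show the visiting order.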

References [31]

[1]Adam: A Method for Stochastic Optimization

D. P. Kingma, Jimmy Lei Ba - 2014

49 papers in library cite

Amazing paper! Very well explained and huge impact. I am amazed that they made something so simple even when it requires a lot of background mathematical knowledge

[2]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[3]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[4]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[5]Microsoft COCO: Common Objects in Context

T. Y. Lin, M. Maire, S. Belongie, James Hays, Pietro Perona, D. Ramanan, Piotr Dollar, C. L. Zitnick - 2014

14 papers in library cite

I liked this paper a lot. It's a bit long and I was already a bit tired, but it was nice overall.

[6]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[7]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Introduces RNN encoder-decoder. I love it :)

[8]Sequence to Sequence Learning With Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014

58 papers in library cite

Good paper, but I think it only got famous because they set a new good baseline for NNs in MT. Their main contribution was reversing the source sentence TBH.

[9]Layer Normalization

Jimmy Lei Ba, R. Kiros, Geoffrey E. Hinton - 2016

14 papers in library cite

Very nice! At first I had a little bit of prejudice because it seemed way too mathy, but actually the math is easy to follow and the results are very nice.

[10]Neural Machine Translation of Rare Words with Subword Units

R. Sennrich, B. Haddow, Alexandra Birch - 2016

22 papers in library cite

Very good! Simple, explains quite a lot and good results. Forms the basis for a lot of stuff now!

[11]A Unified Architecture for Natural Language Processing: Deep Neural Networks With Multitask Learning

Ronan Collobert, Jason Weston - 2008

32 papers in library cite

Really did not add much to the game. I think this was more of a small perf. improvement over other existing things and set a few methodological standards. Maybe main contribution is Multitask Learning + Deep learning

[12]Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

G. Dahl, D. Yu, L. Deng, Alex Acero - 2012

19 papers in library cite

Good paper, very well written and probably the best explanation of RBMs and DBNs I've seen. However, I don't see a lot of impact, and it seems very derivative of other work.

[13]Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean - 2017

9 papers in library cite

It's nice, but there's an important section in the middle about batch sizes that I don't quite understand. Not sure if I am missing some background knowledge or if they explain it poorly, and seems foundational to their main method... Either way, I did understand the methodology of the paper, and they have nice results :)

[14]Recurrent Continuous Translation Models

N. Kalchbrenner, Phil Blunsom - 2013

27 papers in library cite

Good paper, probably the first that used an encoder-decoder. But they used a conv. NN as the encoder instead of a traditional recurrent one, which I don't really like.

[15]Can Active Memory Replace Attention?

Lukasz Kaiser, Samy Bengio - 2016

2 papers in library cite

So nice to see an alternative that works as well as attention!

[16]ImageNet Large Scale Visual Recognition Challenge

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zhongqiang Huang, A. Karpathy, A. Khosla, M. Bernstein - 2014

18 papers in library cite

ImageNet dataset challenge paper

[17]Xception: Deep Learning With Depthwise Separable Convolutions

Francois Chollet - 2016

2 papers in library cite

Seems important

[18]Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Christian Szegedy, S. Ioffe, Vincent Vanhoucke, A. A. Alemi - 2017

3 papers in library cite

[19]Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

M. J. Johnson, M. Schuster, Quoc V. Le, M. Krikun, Yonghui Wu, Ziru Chen, N. Thorat, F. B. Viegas, M. Wattenberg, G. S. Corrado, M. Hughes, Jeffrey Dean - 2017

7 papers in library cite

Google's NMT system V2?

[20]Neural Machine Translation in Linear Time

N. Kalchbrenner, L. Espeholt, K. Simonyan, A. V. D. Oord, Alex Graves, Koray Kavukcuoglu - 2016

5 papers in library cite

ByteNet - Also "linear time" caught my attention

[21]Depthwise Separable Convolutions for Neural Machine Translation

Francois Chollet, Lukasz Kaiser, Aidan N. Gomez - 2017

1 paper in library cites

Xception for NMT

[22]Encoding Source Language With Convolutional Neural Network for Machine Translation

Fanqing Meng, Z. L. Lu, Mingliang Wang, H. Li, W. Jiang, Qian Liu - 2015

3 papers in library cite

[23]Multimodal Deep Learning

J. Ngiam, A. Khosla, M. Kim, J. Nam, Honglak Lee, A. Ng - 2011

2 papers in library cite

[24]WaveNet: A Generative Model for Raw Audio

A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, Oriol Vinyals, Alex Graves, N. Kalchbrenner, A. Senior, Koray Kavukcuoglu - 2016

2 papers in library cite

Missing author list

[25]CSR-II (WSJ1) Complete

1994

1 paper in library cites

[26]Exploiting Unrelated Tasks in Multi-Task Learning

B. R. Paredes, A. Argyriou, N. Berthouze, M. Pontil - 2012

1 paper in library cites

[27]Facial Landmark Detection by Deep Multi-Task Learning

C. C. Loy, X. Tang, Zhengyou Zhang, P. Luo - 2014

1 paper in library cites

[28]Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu, V. Koltun - 2015

1 paper in library cites

[29]Multi-Task Learning in Deep Neural Networks for Improved Phoneme Recognition

M. L. Seltzer, J. Droppo - 2013

1 paper in library cites

[30]Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination

L. Sifre, S. Mallat - 2013

1 paper in library cites

[31]Treebank-3 LDC99T42

M. P. Marcus, B. Santorini, Mary Ann Marcinkiewicz, A. Taylor - 1999

1 paper in library cites

Cited by 2 papers in your library

Cites 21 papers in your library

Read on November 5, 2025

The idea is amazing, but it gets waaaay too complex, and in the end it feels like they tested it, it didn't give good performance, and they said "just publish it as it is"
