Papperoni

2012

Deep Learning Made Easier by Linear Transformations in Perceptrons

Tapani Raiko, Harri Valpola, Yann Lecun

AI summary

This paper introduces transformations for multi-layer perceptrons that give each hidden neuron's output zero mean and zero slope, and adds shortcut connections to model the linear dependencies. On MNIST classification and autoencoder tasks, this improves the convergence and generalization of basic stochastic gradient learning.

Main Contributions

Proposed a novel transformation for MLP hidden neuron outputs to achieve zero mean and zero slope.
Introduced separate shortcut connections to model linear dependencies, aiming to decouple linear and nonlinear learning.
Theoretically showed that these transformations make the Fisher information matrix closer to diagonal, aligning standard gradient with natural gradient.
Demonstrated that basic stochastic gradient learning with transformations becomes competitive with state-of-the-art algorithms in speed and generalization.
Experimentally validated the method's benefits on handwritten digit classification and image representation learning using 3-layer and 6-layer networks, with and without regularization.
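
For reference, the natural-gradient claim in the third contribution can be written out as follows. This is the standard textbook form of the natural gradient, not a derivation taken from the paper; the symbols (\theta for the parameters, G for the Fisher information matrix, \eta for the learning rate, L for the loss) are notation chosen here for illustration.

```latex
% Fisher information matrix of a model p(y | x, \theta)
G = \mathbb{E}\left[ \nabla_\theta \log p(y \mid x, \theta) \,
                     \nabla_\theta \log p(y \mid x, \theta)^{\top} \right]

% Standard gradient step versus natural gradient step
\Delta\theta_{\text{std}} = -\eta \, \nabla_\theta L
\qquad
\Delta\theta_{\text{nat}} = -\eta \, G^{-1} \nabla_\theta L

% If G is diagonal, the natural gradient only rescales each coordinate:
\left( \Delta\theta_{\text{nat}} \right)_i = -\frac{\eta}{G_{ii}} \, \frac{\partial L}{\partial \theta_i}
```

When the off-diagonal entries of G are small, the two update directions differ mainly by a per-parameter scale, which is the sense in which the standard gradient moves "closer" to the natural gradient.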

Abstract

We transform the outputs of each hidden neuron in a multi-layer perceptron network to be zero mean and zero slope, and use separate shortcut connections to model the linear dependencies instead. This transformation aims at separating the problems of learning the linear and nonlinear parts of the whole input-output mapping, which has many benefits. We study the theoretical properties of the transformation by noting that they make the Fisher information matrix closer to a diagonal matrix, and thus standard gradient closer to the natural gradient. We experimentally confirm the usefulness of the transformations by noting that they make basic stochastic gradient learning competitive with state-of-the-art learning algorithms in speed, and that they seem also to help find solutions that generalize better. The experiments include both classification of handwritten digits with a 3-layer network and learning a low-dimensional representation for images by using a 6-layer auto-encoder network. The transformations were beneficial in all cases, with and without regularization.
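
The idea in the abstract can be sketched with a single hidden unit. This is a minimal illustration, not the authors' code: it assumes the transformed nonlinearity has the form tanh(z) + alpha*z + beta, with alpha and beta fitted so that the unit's output and slope average to zero over the data. The scalar weights `a`, `b`, `c`, the bias `d`, and all function names are hypothetical. The last step shows why the shortcut matters: the linear part removed from the hidden unit is absorbed by the shortcut weight and bias, so the overall mapping is unchanged.

```python
import math

def fit_alpha_beta(zs):
    """Fit alpha, beta so f(z) = tanh(z) + alpha*z + beta has zero mean
    output and zero mean slope over the pre-activations zs."""
    n = len(zs)
    # The slope of tanh is 1 - tanh^2; alpha cancels its average.
    alpha = -sum(1.0 - math.tanh(z) ** 2 for z in zs) / n
    # beta cancels the average of what remains of the output.
    beta = -sum(math.tanh(z) + alpha * z for z in zs) / n
    return alpha, beta

def hidden(z, alpha, beta):
    """Transformed hidden-unit output."""
    return math.tanh(z) + alpha * z + beta

def hidden_slope(z, alpha):
    """Derivative of the transformed output with respect to z."""
    return 1.0 - math.tanh(z) ** 2 + alpha

def network(x, a, b, c, d, alpha=0.0, beta=0.0):
    """One hidden unit plus a linear shortcut c*x and a bias d (assumed form)."""
    return a * hidden(b * x, alpha, beta) + c * x + d

# Toy inputs and weights, chosen arbitrarily for the illustration.
xs = [-1.5, -0.2, 0.3, 0.9, 2.0]
a, b, c, d = 0.7, 1.3, 0.1, -0.4
zs = [b * x for x in xs]

alpha, beta = fit_alpha_beta(zs)
mean_out = sum(hidden(z, alpha, beta) for z in zs) / len(zs)
mean_slope = sum(hidden_slope(z, alpha) for z in zs) / len(zs)

# Move the linear part into the shortcut so the function stays the same.
c_new = c - a * alpha * b
d_new = d - a * beta
max_diff = max(
    abs(network(x, a, b, c, d) - network(x, a, b, c_new, d_new, alpha, beta))
    for x in xs
)

print(mean_out, mean_slope, max_diff)  # all three are zero up to rounding
```

After the transformation the hidden unit carries only the nonlinear part of the mapping, while `c_new` and `d_new` carry the linear part, which is the separation the abstract describes.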

References [13]

[1]Gradient-Based Learning Applied to Document Recognition

Yann Lecun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and really just keeps repeating itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[2]Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier Glorot, Yoshua Bengio - 2010

20 papers in library cite

Nice but underwhelming results (they still underperform vs. pretraining). I also didn't really like the way it's written. It's not bad, it's just a bit clunky. Worth the read though.

[3]Reducing the Dimensionality of Data With Neural Networks

Geoffrey Hinton, Ruslan Salakhutdinov - 2006

37 papers in library cite

I didn't like the way this is written, very hard to understand without a ton of background knowledge. But hey, it's the first deep learning model!

[4]Extracting and Composing Robust Features With Denoising Autoencoders

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre Antoine Manzagol - 2008

25 papers in library cite

I am *so* glad we found an alternative to DBNs. Also, introduced the idea of denoising which is nice.

[5]Efficient Backprop

Yann Lecun, Leon Bottou, G. B. Orr, Klaus Robert Muller - 1998

20 papers in library cite

The first half is very very good. The remainder is very boring.

[6]Deep Learning via Hessian-Free Optimization

James Martens - 2010

12 papers in library cite

This paper is surprisingly good! When I first read the Hessian-Free optimization part, I thought "ugh, this is going to be full of math", but in the end it was very very enjoyable. I think I just wouldn't give it a 5 because it doesn't seem to have had that much impact.

[7]Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition

Dan C. Ciresan, Ueli Meier, Luca M. Gambardella, Jürgen Schmidhuber - 2010

10 papers in library cite

It's short, simple and straight to the point (like the network in the paper). It's refreshing to read something less "academic" and more "in your face".

[8]Natural Gradient Works Efficiently in Learning

S. I. Amari - 1998

6 papers in library cite

[9]First- And Second-Order Methods for Learning: Between Steepest Descent and Newton's method

R. Battiti - 1992

3 papers in library cite

[10]Topmoumoute Online Natural Gradient algorithm

N. Le Roux, Pierre Antoine Manzagol, Yoshua Bengio - 2008

2 papers in library cite

[11]A Simple Weight Decay Can Improve Generalization

A. Krogh, J. Hertz - 1992

1 paper in library cites

[12]Adding Noise to the Input of a Model Trained With a Regularized Objective

S. Rifai, Xavier Glorot, Yoshua Bengio, Pascal Vincent - 2011

1 paper in library cites

[13]Slope Centering: Making Shortcut Weights Effective

N. Schraudolph - 1998

1 paper in library cites

Read on February 17, 2026

Kudos for introducing shortcut connections (which would become important in the future), but to me it seems a bit mid.