2011

Improving the Speed of Neural Networks on CPUs

Vincent Vanhoucke, A. Senior, Mark Z. Mao

AI summary

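This paper presents techniques for reducing the computational cost of neural networks on modern x86 CPUs: careful data layout, batching of the computation, SSE2 instructions, and SSSE3/SSE4 fixed-point instructions. The fixed-point instructions give a 3x improvement over an optimized floating-point baseline. The resulting real-time speech recognizer runs 10x faster than an unoptimized baseline and 4x faster than an aggressively optimized floating-point baseline, at no cost in accuracy.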

Main Contributions

  • Shows that the computational cost of large neural networks can be reduced considerably on modern x86 CPUs.
  • Emphasizes data layout and batching of the computation as sources of efficiency.
  • Uses SSE2 instructions, and in particular SSSE3 and SSE4 fixed-point instructions.
  • Achieves a 3x improvement over an optimized floating-point baseline with the fixed-point instructions.
  • Builds a real-time hybrid HMM/NN large vocabulary speech recognizer at no cost in accuracy.
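
The batching idea in the list above can be illustrated with a minimal sketch. It is not code from the paper: the function names, the toy layer, and the input frames are all hypothetical, and plain Python shows only the loop structure, not the cache or SIMD benefits that motivate batching on real hardware. Under those assumptions, the batched version visits each row of weights once and applies it to every frame in the batch, instead of re-reading the whole weight matrix for each frame.

```python
def forward_one_frame(weights, frame):
    """Unbatched baseline: one output vector for a single input frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def forward_batched(weights, frames):
    """Batched version: visit each weight row once and apply it to every frame."""
    outputs = [[0.0] * len(weights) for _ in frames]
    for j, row in enumerate(weights):        # each weight row is visited once per batch
        for b, frame in enumerate(frames):   # and reused across all frames in the batch
            outputs[b][j] = sum(w * x for w, x in zip(row, frame))
    return outputs


# Hypothetical toy layer: 2 output units, 3 inputs, and a batch of 2 frames.
weights = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
frames = [[1.0, 0.0, 2.0], [0.5, 4.0, -1.0]]

batched = forward_batched(weights, frames)
unbatched = [forward_one_frame(weights, f) for f in frames]
```

Both versions compute the same outputs; only the order in which weights and frames are visited differs, which is the property an optimized implementation would exploit.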

Abstract

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3x improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10x speedup over an unoptimized baseline and a 4x speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
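
To make the fixed-point idea concrete, here is a minimal sketch of an 8-bit quantized dot product. It is an illustration under stated assumptions, not the paper's implementation: the scale factors, the value ranges (weights in [-1, 1], inputs in [0, 1]), the helper names, and the sample data are all hypothetical, and the code is plain Python rather than SSSE3/SSE4 instructions. It shows only the arithmetic pattern: quantize to small integers, multiply and accumulate in integers, then rescale once at the end.

```python
def quantize(values, scale, lo, hi):
    """Scale floats to integers and clamp them to the range [lo, hi]."""
    return [max(lo, min(hi, round(v * scale))) for v in values]


def fixed_point_dot(q_weights, q_inputs, weight_scale, input_scale):
    """Integer multiply-accumulate, then a single floating-point rescale."""
    acc = 0  # stands in for a wide (for example 32-bit) integer accumulator
    for w, x in zip(q_weights, q_inputs):
        acc += w * x
    return acc / (weight_scale * input_scale)


# Assumed scales: weights lie in [-1, 1], inputs lie in [0, 1].
WEIGHT_SCALE = 127.0  # maps weights into the signed 8-bit range
INPUT_SCALE = 255.0   # maps inputs into the unsigned 8-bit range

# Hypothetical sample data for illustration only.
weights = [0.31, -0.72, 0.05, 0.90]
inputs = [0.10, 0.80, 0.33, 0.60]

q_weights = quantize(weights, WEIGHT_SCALE, -128, 127)
q_inputs = quantize(inputs, INPUT_SCALE, 0, 255)

approx = fixed_point_dot(q_weights, q_inputs, WEIGHT_SCALE, INPUT_SCALE)
exact = sum(w * x for w, x in zip(weights, inputs))
```

Comparing `approx` with `exact` shows the quantization error for this toy input. Whether that error is acceptable for a full network is an empirical question, which the abstract answers for its speech recognition task by reporting no loss in accuracy.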
