2011

Improving the Speed of Neural Networks on CPUs

Vincent Vanhoucke, A. Senior, Mark Z. Mao

AI summary

This paper presents techniques for speeding up neural networks on x86 CPUs, emphasizing data layout, batching, SSE2 instructions, and SSSE3/SSE4 fixed-point arithmetic. The fixed-point instructions give a 3x improvement over an optimized floating-point baseline, and the resulting real-time speech recognizer runs 10x faster than an unoptimized baseline.

Main Contributions

Demonstrates that optimizing matrix computations can enhance neural network performance on CPUs.
Explores data layout and batching techniques for improved efficiency.
Uses SSE2 instructions and, in particular, SSSE3 and SSE4 fixed-point instructions for significant speedups.
Achieves a 3x improvement over optimized floating-point baselines using fixed-point instructions.
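Builds a real-time large-vocabulary speech recognizer on a hybrid HMM/NN system, 10x faster than an unoptimized baseline and 4x faster than an aggressively optimized floating-point baseline, at no cost in accuracy.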
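
The batching idea in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names and toy values are hypothetical, and pure Python will not show the speedup. It only shows the loop order, in which each row of weights is visited once and reused for every frame in the batch instead of being re-read for each frame.

```python
def forward_per_frame(weights, frames):
    """Baseline: one matrix-vector product per input frame."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def forward_batched(weights, frames):
    """Batched: visit each weight row once and apply it to every frame."""
    outputs = [[0.0] * len(weights) for _ in frames]
    for j, row in enumerate(weights):        # a weight row is fetched once...
        for i, frame in enumerate(frames):   # ...and reused across the batch
            outputs[i][j] = sum(w * x for w, x in zip(row, frame))
    return outputs


# Hypothetical toy layer: 2 output units, 3 inputs, batch of 2 frames.
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
frames = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
assert forward_per_frame(weights, frames) == forward_batched(weights, frames)
```

Both functions return the same outputs; only the order in which the weights are traversed differs, which is what matters for cache reuse in a compiled implementation.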

Abstract

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3x improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10x speedup over an unoptimized baseline and a 4x speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
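
As a rough illustration of the fixed-point arithmetic mentioned in the abstract, the sketch below quantizes weights to signed 8-bit integers and activations to unsigned 8-bit integers, accumulates their products in a wide integer, and rescales the result. The scale factors (64 and 255), the function names, and the example values are assumptions chosen for illustration; the paper's actual quantization scheme and its SSSE3/SSE4 intrinsics are not reproduced here.

```python
def quantize(values, scale, lo, hi):
    """Scale, round to the nearest integer, and clamp to [lo, hi]."""
    return [max(lo, min(hi, round(v * scale))) for v in values]


def fixed_point_dot(weights, activations, w_scale=64, a_scale=255):
    """Dot product on 8-bit integers with a wide integer accumulator."""
    qw = quantize(weights, w_scale, -128, 127)    # signed 8-bit weights
    qa = quantize(activations, a_scale, 0, 255)   # unsigned 8-bit activations
    acc = sum(w * a for w, a in zip(qw, qa))      # integer multiply-accumulate
    return acc / (w_scale * a_scale)              # rescale to a real value


# Hypothetical values: activations in [0, 1], as after a sigmoid.
example_weights = [0.5, -0.25, 1.0]
example_activations = [1.0, 0.2, 0.6]
approx = fixed_point_dot(example_weights, example_activations)
exact = sum(w * a for w, a in zip(example_weights, example_activations))
assert abs(approx - exact) < 0.02
```

Weights outside the representable range are clamped, so the weight scale trades dynamic range against resolution. In a compiled implementation the integer multiply-accumulate step is where the SIMD fixed-point instructions would apply.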

References [12]

[1]GPU Implementation of Neural Networks

K. S. Oh, Keechul Jung - 2004

2 papers in library cite

Very poorly written and poorly explained. The only nice thing is early use of GPUs, and the fact that they do it before CUDA.

[2]Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

Navdeep Jaitly, P. Nguyen, A. Senior, Vincent Vanhoucke - 2012

6 papers in library cite

It's not bad, it's just nothing really new. They take existing methods and apply them to very large datasets. I see the contribution, but it's a boring read: just experiment methodology and results.

[3]CUDAMat: A CUDA-based Matrix Class for Python

V. Mnih - 2009

5 papers in library cite

[4]Large-Scale Deep Unsupervised Learning Using Graphics Processors

Rajat Raina, A. Madhavan, Andrew Y. Ng - 2009

4 papers in library cite

[5]Eigen, a C++ Template Library for Linear Algebra

2 papers in library cite

[6]Debunking the 100x GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

V. W. Lee, Changkyu Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, P. Dubey - 2010

1 paper in library cites

[7]Faster Matrix-Vector Multiplication on GeForce 8800GTX

N. Fujimoto - 2008

1 paper in library cites

[8]Intel C++ Intrinsics Reference

1 paper in library cites

[9]Neural Network Implementation Using CUDA and OpenMP

H. Jang, Andrew Park, Keechul Jung - 2008

1 paper in library cites

[10]OpenFst Library

1 paper in library cites

[11]The Bucket Box Intersection (BBI) Algorithm for Fast Approximative Evaluation of Diagonal Mixture Gaussians

J. Fritsch, I. Rogina - 1996

1 paper in library cites

[12]Use of Gaussian Selection in Large Vocabulary Continuous Speech Recognition Using HMMs

K. M. Knill, M. J. F. Gales, S. J. Young - 1996

1 paper in library cites

Read on July 18, 2025

It's good and it's interesting, but I don't think it adds a ton. Good read though. I read it because I thought it discussed quantization (which it does, but not in the sense of making NNs smaller).