2013

Deep Convolutional Neural Networks for LVCSR

T. Sainath, Abdel Rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran

citations

Cite Score

55

AI summary

This paper applies CNNs to large vocabulary continuous speech recognition (LVCSR), reporting a 13-30% relative improvement over GMMs and a 4-12% relative improvement over DNNs on a 400-hr Broadcast News task and a 300-hr Switchboard task. It examines CNN architecture choices: the number of convolutional layers, the number of hidden units, the pooling strategy, and the input feature type.
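
The "relative improvement" figures above can be read with a small arithmetic sketch. It assumes the underlying metric is word error rate (WER), the usual metric for LVCSR; the function name and the WER values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer versus baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs chosen only to illustrate the arithmetic (not from the paper).
dnn_wer = 20.0
cnn_wer = 19.0
print(f"{relative_improvement(dnn_wer, cnn_wer):.1f}% relative")  # 5.0% relative
```

In this toy case a 1-point absolute WER drop from a 20% baseline is a 5% relative improvement, so relative figures always depend on how strong the baseline is.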

Main Contributions

  • Explores the appropriate architecture for CNNs on LVCSR tasks, investigating the number of convolutional layers needed, the optimal number of hidden units per layer, the optimal pooling strategy, and the best type of input feature to be used with CNNs.
  • Finds that CNN hybrid systems offer a 4% relative improvement over hybrid DNNs and CNN-based features offer a 7% relative improvement over hybrid DNNs on a 50-hr English Broadcast News task.
  • Demonstrates that CNNs offer a 4-7% relative improvement over DNNs on a 300-hr Switchboard task and a 10-12% relative improvement over DNNs on a 400-hr Broadcast News task.
  • Identifies VTLN-warped mel filterbank (mel-FB) features with delta and double-delta coefficients as the best locally correlated feature set for CNNs.
  • Overall, achieves a 13-30% relative improvement over GMMs and a 4-12% relative improvement over DNNs on the 400-hr Broadcast News and 300-hr Switchboard tasks.
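
As an illustration of the "delta + double-delta" input features named in the contributions, here is a minimal sketch of the common regression-based delta computation over a sequence of filterbank frames. The window size, the edge padding, and the function names are assumptions made for illustration; this summary does not say which formulation the paper uses, and VTLN warping is not shown.

```python
from typing import List

Frames = List[List[float]]  # frames[t][k]: filterbank channel k at time t


def deltas(frames: Frames, window: int = 2) -> Frames:
    """Regression-based time derivative; edge frames are repeated as padding."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_with_deltas(frames: Frames, window: int = 2) -> Frames:
    """Concatenate static, delta, and double-delta values for each frame."""
    d1 = deltas(frames, window)
    d2 = deltas(d1, window)  # double-delta: the delta of the delta
    return [s + a + b for s, a, b in zip(frames, d1, d2)]


# Toy single-channel input that rises by 1 per frame (hypothetical data).
toy = [[float(t)] for t in range(5)]
stacked = stack_with_deltas(toy)
print(stacked[2])  # middle frame: static 2.0, delta 1.0, double-delta 0.0
```

The sketch concatenates the three streams per frame for simplicity. A CNN front end could equally present static, delta, and double-delta values as separate input feature maps; that layout choice is not specified here.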

Abstract

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary speech tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what is the optimal number of hidden units, what is the best pooling strategy, and the best input feature type for CNNs. We then explore the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs. We find that CNNs offer between a 13-30% relative improvement over GMMs, and a 4-12% relative improvement over DNNs, on a 400-hr Broadcast News and 300-hr Switchboard task.
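
The claim that CNNs reduce spectral variation can be illustrated with a toy sketch of one convolution along the frequency axis followed by max pooling. The filter weights, pooling size, and input frames below are hypothetical and are not taken from the paper; a real CNN layer would learn many filters and apply a nonlinearity.

```python
from typing import List


def conv1d_valid(x: List[float], kernel: List[float], bias: float = 0.0) -> List[float]:
    """Slide kernel over x (cross-correlation, 'valid' mode, stride 1)."""
    width = len(kernel)
    return [
        sum(kernel[j] * x[i + j] for j in range(width)) + bias
        for i in range(len(x) - width + 1)
    ]


def max_pool1d(x: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Two toy 8-band spectral slices with the same peak shifted by one band.
frame_a = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
frame_b = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 2.0, 1.0]  # hypothetical local filter spanning three bands

pooled_a = max_pool1d(conv1d_valid(frame_a, kernel), size=3)
pooled_b = max_pool1d(conv1d_valid(frame_b, kernel), size=3)
print(pooled_a, pooled_b)  # [2.0, 0.0] [2.0, 0.0]
```

Here the one-band shift disappears after pooling because both peaks fall inside the same pooling window. A larger shift, or one that crosses a window boundary, would still change the output, so pooling reduces small spectral shifts rather than removing all variation.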
