Papperoni

2013

Deep Convolutional Neural Networks for LVCSR

T. Sainath, Abdel Rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran

citations

Cite Score

55

AI summary

This paper introduces CNNs to LVCSR tasks, achieving a 13-30% relative improvement over GMMs and a 4-12% relative improvement over DNNs on the Broadcast News and Switchboard tasks. It explores different CNN architectures, including the number of convolutional layers, hidden units, pooling strategy, and input feature types.
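
The "relative improvement" figures above are usually read as a relative reduction in word error rate (WER). A minimal sketch of that arithmetic follows, assuming the standard definition; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system over a baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs for illustration only (not taken from the paper):
# a baseline at 20.0% WER and a new system at 18.0% WER.
example = relative_improvement(20.0, 18.0)
print(f"{example:.1f}% relative improvement")
```

Note that a relative figure says nothing about the absolute WER: the same 2-point absolute gain would be a much larger relative improvement on a stronger baseline.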

Main Contributions

Explores the appropriate architecture for CNNs on LVCSR tasks, investigating the number of convolutional layers needed, the optimal number of hidden units per layer, the optimal pooling strategy, and the best type of input feature to be used with CNNs.
Finds that CNN hybrid systems offer a 4% relative improvement over hybrid DNNs and CNN-based features offer a 7% relative improvement over hybrid DNNs on a 50-hr English Broadcast News task.
Demonstrates that CNNs offer a 4-7% relative improvement over DNNs on a 300-hr Switchboard task and a 10-12% relative improvement over DNNs on a 400-hr Broadcast News task.
Identifies VTLN-warped mel-FB with delta + double-delta as the best locally correlated feature set for CNNs.
Achieves a 13-30% relative improvement over GMMs and a 4-12% relative improvement over DNNs.
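
To make the "delta + double-delta" feature set concrete, here is a minimal sketch of how the extra streams can be derived from static filterbank frames. It assumes the simplest symmetric-difference delta with edge replication; the paper's exact delta window is not stated on this page, and the helper names `delta` and `stack_channels` are hypothetical.

```python
def delta(frames):
    """Symmetric first difference over time: d[t] = (c[t+1] - c[t-1]) / 2.

    `frames` is a list of per-frame feature lists. Edge frames are
    replicated, so the output has the same shape as the input.
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out


def stack_channels(static):
    """Return [static, delta, double-delta] as three parallel feature maps."""
    d = delta(static)
    dd = delta(d)
    return [static, d, dd]


# Toy input: 3 frames x 2 filterbank bands (made-up values).
static = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
channels = stack_channels(static)
print(channels[1])  # delta stream
```

Keeping the three streams as separate maps, rather than concatenating them along frequency, is one plausible arrangement: it preserves the local frequency ordering that convolutional filters rely on.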

Abstract

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary speech tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what is the optimal number of hidden units, what is the best pooling strategy, and the best input feature type for CNNs. We then explore the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs. We find that CNNs offer between a 13-30% relative improvement over GMMs, and a 4-12% relative improvement over DNNs, on a 400-hr Broadcast News and 300-hr Switchboard task.
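
The two properties named in the abstract can be illustrated with a toy example: a shared local filter along frequency models spectral correlations, and max-pooling absorbs small spectral shifts. The sketch below is an illustration under made-up filter weights and sizes, not the architecture from the paper.

```python
def conv1d_valid(x, w):
    """Slide one shared filter `w` along the frequency axis (no padding)."""
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n_out)]


def max_pool(x, size):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Hypothetical filter and two toy 8-band spectra. The second spectrum is the
# first with its single peak shifted up by one band.
w = [0.5, 1.0, 0.5]
spectrum = [0, 1, 0, 0, 0, 0, 0, 0]
shifted = [0, 0, 1, 0, 0, 0, 0, 0]

pooled_a = max_pool(conv1d_valid(spectrum, w), 3)
pooled_b = max_pool(conv1d_valid(shifted, w), 3)
print(pooled_a == pooled_b)
```

The pooled outputs match here only because the shifted peak stays inside the same pooling window; a larger shift would move it to the next pooled unit, which is why pooling size is one of the design choices the paper investigates.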

References [15]

[1] Gradient-Based Learning Applied to Document Recognition

Yann Lecun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and often just repeats itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[2] Deep Neural Networks for Acoustic Modeling in Speech Recognition

Geoffrey Hinton - 2012

21 papers in library cite

The core of the paper is a bit boring and doesn't introduce anything new (just RBMs and DBNs again), but I am giving this a 4 because it's probably the best explanation of RBMs and DBNs I've read so far.

[3] Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

G. Dahl, D. Yu, L. Deng, Alex Acero - 2012

19 papers in library cite

Good paper, very well written and probably the best explanation of RBMs and DBNs I've seen. However, I don't see a lot of impact, and it seems very derivative of other works.

[4] Learning Methods for Generic Object Recognition With Invariance to Pose and Lighting

Yann Lecun, Fu Jie Huang, Leon Bottou - 2004

18 papers in library cite

Good paper, nice methodology for creating different images. However, I think that this was not too impactful... I don't see this being used a lot.

[5] Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

Navdeep Jaitly, P. Nguyen, A. Senior, Vincent Vanhoucke - 2012

6 papers in library cite

It's not bad, it's just nothing new really. They just take existing methods and apply them to very large datasets. I see the contribution, but it's a boring read: just experiment methodology and results.

[6] Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

F. Seide, G. Li, D. Yu - 2011

4 papers in library cite

[7] Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition

O. Abdel-Hamid, A. Mohamed, H. Jiang, G. Penn - 2012

3 papers in library cite

[8] Auto-Encoder Bottleneck Features Using Deep Belief Networks

T. N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran - 2012

3 papers in library cite

[9] Convolutional Networks for Images, Speech, and Time-Series

Yann Lecun, Yoshua Bengio - 1995

3 papers in library cite

[10] Face Recognition: A Convolutional Neural-Network Approach

S. Lawrence, C. Giles, A. Tsoi, A. Back - 1997

3 papers in library cite

[11] Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling

Brian Kingsbury - 2009

3 papers in library cite

[12] Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization

Brian Kingsbury, T. N. Sainath, H. Soltau - 2012

3 papers in library cite

[13] The IBM Attila Speech Recognition Toolkit

H. Soltau, G. Saon, Brian Kingsbury - 2010

3 papers in library cite

[14] Making Deep Belief Networks Effective for Large Vocabulary Continuous Speech Recognition

T. N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran, P. Fousek, P. Novak, A. Mohamed - 2011

2 papers in library cite

[15] Understanding How Deep Belief Networks Perform Acoustic Modelling

A. Mohamed, Geoffrey Hinton, G. Penn - 2012

2 papers in library cite

Cited by 2 papers in your library

Cites 5 papers in your library

Read on October 19, 2025

Nice to see CNNs used for things other than images, but I am biased against speech recognition and CNNs. Very good that the paper is short, though.
