1993
AI summary
The paper describes the construction of the Penn Treebank, a corpus of over 4.5 million words of American English annotated with part-of-speech (POS) tags and, for over half of the corpus, skeletal syntactic structure. It reports that semi-automated tagging (automatic tagging followed by human correction) outperformed entirely manual tagging in speed, consistency, and accuracy.
Abstract
There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models.
In this paper, we review our experience with constructing one such large annotated corpus: the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989–1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1.
The paper is organized as follows. Section 2 discusses the POS tagging task. After outlining the considerations that informed the design of our POS tagset and presenting the tagset itself, we describe our two-stage tagging process, in which text is first assigned POS tags automatically and then corrected by human annotators. Section 3 briefly presents the results of a comparison between entirely manual and semi-automated tagging, with the latter being shown to be superior on three counts: speed, consistency, and accuracy.
In Section 4, we turn to the bracketing task. Just as with the tagging task, we have partially automated the bracketing task: the output of