1993

Building a Large Annotated Corpus of English: The Penn Treebank

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz

AI summary

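The paper describes the construction of the Penn Treebank, a corpus of over 4.5 million words of American English annotated with part-of-speech (POS) information, with over half of it also annotated for skeletal syntactic structure. It reports that semi-automated tagging, in which automatic POS assignment is followed by manual correction, outperformed entirely manual tagging in speed, consistency, and accuracy.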

Main Contributions

Construction of the Penn Treebank, a large annotated corpus of American English.
Development of a two-stage tagging process: automatic POS assignment and manual correction.
Demonstration that semi-automated tagging is superior to manual tagging in terms of speed, consistency, and accuracy.
Partial automation of the bracketing task to generate skeletal syntactic structure.
Description of the syntactic tagset and guidelines used for annotation.
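
The two-stage tagging process listed above can be illustrated with a minimal Python sketch. It is not the project's tagger: the lookup-table `TOY_LEXICON`, the fallback `DEFAULT_TAG`, and the helper names `auto_tag`, `apply_corrections`, and `correction_rate` are all hypothetical, chosen only to show automatic assignment followed by human correction. The tag labels (`DT`, `NN`, `MD`, `VBD`, `RB`) come from the Penn Treebank POS tagset.

```python
# Stage 1 stand-in: a toy lexicon lookup (assumption; the real project
# used an automatic tagger, not a lookup table).
TOY_LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ", "can": "MD"}
DEFAULT_TAG = "NN"  # hypothetical fallback for unknown words


def auto_tag(tokens):
    """Stage 1: assign a POS tag to every token automatically."""
    return [(tok, TOY_LEXICON.get(tok.lower(), DEFAULT_TAG)) for tok in tokens]


def apply_corrections(tagged, corrections):
    """Stage 2: an annotator overrides tags by token position.

    `corrections` maps a token index to the corrected tag; tokens that
    are not listed keep their automatically assigned tag.
    """
    return [(tok, corrections.get(i, tag)) for i, (tok, tag) in enumerate(tagged)]


def correction_rate(before, after):
    """Fraction of tokens whose tag the annotator changed."""
    if not before:
        return 0.0
    changed = sum(1 for (_, a), (_, b) in zip(before, after) if a != b)
    return changed / len(before)


tokens = ["The", "can", "rusted", "quickly"]
stage1 = auto_tag(tokens)
# The annotator fixes "can" (a noun here), "rusted", and "quickly".
stage2 = apply_corrections(stage1, {1: "NN", 2: "VBD", 3: "RB"})
print(stage2)
print(correction_rate(stage1, stage2))
```

In this framing the annotator only touches the tokens the automatic pass got wrong, which is the intuition behind the speed advantage the paper reports for semi-automated over entirely manual tagging.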

Abstract

There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models.

In this paper, we review our experience with constructing one such large annotated corpus: the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989–1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1.

The paper is organized as follows. Section 2 discusses the POS tagging task. After outlining the considerations that informed the design of our POS tagset and presenting the tagset itself, we describe our two-stage tagging process, in which text is first assigned POS tags automatically and then corrected by human annotators. Section 3 briefly presents the results of a comparison between entirely manual and semi-automated tagging, with the latter being shown to be superior on three counts: speed, consistency, and accuracy. In Section 4, we turn to the bracketing task. Just as with the tagging task, we have partially automated the bracketing task: the output of
