2005

Automatically Constructing a Corpus of Sentential Paraphrases

William B. Dolan, Chris Brockett


AI summary

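This paper introduces the Microsoft Research Paraphrase Corpus (MSRP), a dataset of 5801 sentence pairs, each hand-labeled as paraphrase or non-paraphrase. Candidate pairs were selected from topic-clustered news data using heuristic extraction and an SVM-based classifier, and the corpus was released to address the lack of large-scale, publicly available paraphrase corpora.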

Main Contributions

  • Introduced the Microsoft Research Paraphrase Corpus (MSRP), consisting of 5801 hand-labeled sentence pairs, to address the scarcity of large-scale paraphrase corpora.
  • Developed a methodology combining heuristic extraction techniques with an SVM-based classifier to identify likely sentence-level paraphrases from topic-clustered news data.
  • Submitted the selected pairs to human judges, who confirmed that 67% were semantically equivalent, which supports the semi-automatic selection approach.
  • Discussed the difficulties of defining guidelines for human raters in paraphrase identification, contributing insights into corpus annotation.
  • Proposed the concept of a "virtual paraphrase corpus" to mitigate selectional biases and enable better cross-comparison across different research efforts.

Abstract

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.
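The abstract names "heuristic extraction techniques" without listing them here, so the following is only a minimal sketch of what a candidate-pair filter of that kind could look like. It assumes a word-level edit distance and a sentence-length ratio as the heuristics; the function names and every threshold (`min_distance`, `max_distance`, `min_length_ratio`) are hypothetical and are not taken from the paper. The SVM-based classifier stage is not shown.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (lists of strings), not characters."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_candidate_pair(s1, s2, min_distance=1, max_distance=8,
                      min_length_ratio=0.5):
    """Hypothetical heuristic filter; all thresholds are illustrative assumptions.

    Keeps pairs that differ (distance >= min_distance) but not by too much
    (distance <= max_distance) and whose lengths are roughly comparable.
    """
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return False
    length_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2))
    distance = word_edit_distance(t1, t2)
    return (length_ratio >= min_length_ratio
            and min_distance <= distance <= max_distance)
```

A filter like this would discard identical sentences and pairs of very different lengths before any classifier or human judge sees them; pairs that pass would still need labeling, as the abstract describes.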


