2019

Natural Questions: A Benchmark for Question Answering Research

T. Kwiatkowski, J. Palomaki, O. Rhinehart, Michael Collins, A. P. Parikh, C. Alberti, D. Epstein, Illia Polosukhin, M. Kelcey, Jacob Devlin, K. Lee, K. N. Toutanova, Llion Jones, M. W. Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov

Cite Score

71

AI summary

The paper introduces the Natural Questions (NQ) dataset, a question answering dataset whose public release contains 307,373 training examples. Questions are real anonymized, aggregated queries issued to the Google search engine, paired with long- and short-answer annotations from Wikipedia pages. The paper also proposes evaluation metrics, demonstrates high human upper bounds on them, and reports baseline results.

Main Contributions

  • Introduces the Natural Questions (NQ) corpus, a large-scale QA dataset based on real user queries and Wikipedia pages.
  • Provides a detailed analysis of annotation quality and human variability in answering natural questions.
  • Introduces robust metrics for evaluating question answering systems on the NQ dataset.
  • Establishes high human upper bounds on the proposed evaluation metrics.
  • Presents baseline results using competitive methods from related literature, demonstrating a gap between current performance and human upper bounds.

Abstract

We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations, sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
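
The annotation scheme and split sizes described in the abstract can be sketched as a small data model. This is an illustrative sketch only, not the released NQ file format: the class names `Annotation` and `Example`, their fields, and the `SPLITS` table layout are hypothetical names chosen here. Only the split sizes and annotation counts come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    """One annotator's judgment for a (question, Wikipedia page) pair.

    Hypothetical structure; field names are assumptions for illustration.
    """
    long_answer: Optional[str] = None  # typically a paragraph; None means null
    short_answers: List[str] = field(default_factory=list)  # entities; empty means null

    def has_long_answer(self) -> bool:
        return self.long_answer is not None

    def has_short_answer(self) -> bool:
        return len(self.short_answers) > 0


@dataclass
class Example:
    """A question, the page shown to annotators, and 1 or more annotations."""
    question: str
    page_title: str
    annotations: List[Annotation]


# (number of examples, annotations per example), as stated in the abstract.
SPLITS = {
    "train": (307_373, 1),
    "dev": (7_830, 5),
    "test": (7_842, 5),
}


def total_examples() -> int:
    """Number of examples across the three public splits."""
    return sum(n for n, _ in SPLITS.values())


def total_annotations() -> int:
    """Number of individual annotations across the three public splits."""
    return sum(n * k for n, k in SPLITS.values())


if __name__ == "__main__":
    # Placeholder strings, not real NQ data.
    null_case = Annotation()
    answered = Annotation(long_answer="<paragraph text>", short_answers=["<entity>"])
    ex = Example("<question>", "<page title>", [null_case, answered])
    print(sum(a.has_long_answer() for a in ex.annotations))
    print(total_examples(), total_annotations())
```

Under these figures the three splits hold 323,045 examples and 385,733 individual annotations in total. The 25-way annotated set of 302 examples is left out of `SPLITS` because the abstract describes it as an analysis set rather than a release split.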
