Papperoni

2019

Natural Questions: A Benchmark for Question Answering Research

T. Kwiatkowski, J. Palomaki, O. Redfield, Michael Collins, A. P. Parikh, C. Alberti, D. Epstein, Illia Polosukhin, M. Kelcey, Jacob Devlin, K. Lee, K. N. Toutanova, Llion Jones, M. W. Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov

citations

Cite Score

71

AI summary

The paper introduces Natural Questions (NQ), a question answering dataset built from real anonymized, aggregated queries issued to the Google search engine, each paired with long- and short-answer annotations from a Wikipedia page. The public release includes 307,373 training examples, and the authors report experiments validating the quality of the data.

Main Contributions

Introduces the Natural Questions (NQ) corpus, a large-scale QA dataset based on real user queries and Wikipedia pages.
Provides a detailed analysis of annotation quality and human variability in answering natural questions.
Introduces robust metrics for evaluating question answering systems on the NQ dataset.
Establishes high human upper bounds on the proposed evaluation metrics.
Presents baseline results using competitive methods from related literature, demonstrating a gap between current performance and human upper bounds.

Abstract

We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 5-way annotated examples sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
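The annotation scheme in the abstract (a long answer, zero or more short answers, or null, with one or five annotations per example) can be sketched as a small data model. This is only an illustrative sketch, not the released file format: the names `Span`, `Annotation`, `Example`, `count_long_answers`, and `RELEASE_SIZES` are hypothetical, and representing answers as token offsets is an assumption. Only the split sizes come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical representation: (start, end) token offsets into the Wikipedia page.
Span = Tuple[int, int]

# Split sizes as stated in the abstract.
RELEASE_SIZES = {"train": 307_373, "dev": 7_830, "test": 7_842}


@dataclass
class Annotation:
    """One annotator's judgment for a (question, page) pair."""
    long_answer: Optional[Span] = None  # None models the "null" label
    short_answers: List[Span] = field(default_factory=list)  # empty: no short answer

    def has_long_answer(self) -> bool:
        return self.long_answer is not None

    def has_short_answer(self) -> bool:
        return bool(self.short_answers)


@dataclass
class Example:
    question: str
    page_title: str
    # One annotation for training examples; five for dev and test examples.
    annotations: List[Annotation]


def count_long_answers(example: Example) -> int:
    """Number of annotators who marked a (non-null) long answer."""
    return sum(a.has_long_answer() for a in example.annotations)


if __name__ == "__main__":
    # Made-up 5-way annotated example; question and offsets are illustrative only.
    ex = Example(
        question="example question",
        page_title="Example page",
        annotations=[
            Annotation(long_answer=(10, 80), short_answers=[(12, 14)]),
            Annotation(long_answer=(10, 80)),
            Annotation(),  # this annotator marked null
            Annotation(long_answer=(10, 80), short_answers=[(12, 14)]),
            Annotation(),
        ],
    )
    print(count_long_answers(ex), "of", len(ex.annotations), "annotators gave a long answer")
    print("total released examples:", sum(RELEASE_SIZES.values()))
```

Keeping every annotator's label, rather than collapsing to one, is what makes it possible to study the human variability that the abstract mentions.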

References [27]

[1]BLEU: A Method for Automatic Evaluation of Machine Translation

K. Papineni, S. Roukos, T. Ward, Wei Jing Zhu - 2002

19 papers in library cite

Very cool idea. Simple yet very impactful!

[2]SQuAD: 100,000+ Questions for Machine Comprehension of Text

P. Rajpurkar, J. Zhang, K. Lopyrev, Percy Liang - 2016

37 papers in library cite

Nice paper that introduced an important dataset. Not much else though.

[3]A Large Annotated Corpus for Learning Natural Language Inference

Samuel R. Bowman, G. Angeli, Christopher Potts, Christopher D. Manning - 2015

25 papers in library cite

Dataset collection is ok. The model that they create seems very low effort.

[4]A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference

A. Williams, Nikita Nangia, S. Bowman - 2018

19 papers in library cite

Very nice paper and cool dataset - good thing they expanded SNLI. Also, they at least tried to have a good baseline, and comparisons of domains are nice.

[5]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[6]Know What You Don't Know: Unanswerable Questions for SQuAD

P. Rajpurkar, R. Jia, Percy Liang - 2018

14 papers in library cite

It's alright... It's an extension to the other paper/dataset. I feel that it didn't need to be a full paper (maybe a 6-pager).

[7]TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

M. Joshi, E. Choi, D. Weld, Luke Zettlemoyer - 2017

18 papers in library cite

I like the way they collect the data, and I think this is a nice dataset. However, it seems like they didn't even try to make a good baseline.

[8]Adversarial Examples for Evaluating Reading Comprehension Systems

R. Jia, Percy Liang - 2017

11 papers in library cite

I liked it a lot! It's good to see people testing things rather than just trying to beat SotA!

[9]A Decomposable Attention Model for Natural Language Inference

A. P. Parikh, O. Tackstrom, Dipanjan Das, Jakob Uszkoreit - 2016

11 papers in library cite

Very nice alternative to the common LSTM encoder-decoder architecture! Seems similar to the Transformer architecture in the sense that they don't use RNNs. Nice that they analyze computational complexity as well.

[10]RACE: Large-Scale Reading Comprehension Dataset From Examinations

Guokun Lai, Q. Xie, Hanxiao Liu, Yiming Yang, Eduard Hovy - 2017

11 papers in library cite

I really like the idea of using human tests for testing AI. Also, very nice insight to use Chinese tests!

[11]CoQA: A Conversational Question Answering Challenge

Siva Reddy, Danqi Chen, Christopher D. Manning - 2018

6 papers in library cite

It's a fine paper and a solid addition to QA data + NLU.

[12]MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

M. Richardson, C. J. C. Burges, Erin Renshaw - 2013

16 papers in library cite

Maybe the best dataset paper I have ever read. So well explained and thoroughly thought out! It's a shame it's a very small dataset...

[13]The LAMBADA dataset: Word Prediction Requiring a Broad Discourse Context

D. Paperno, German Kruszewski, A. Lazaridou, N. Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, Raquel Fernandez - 2016

12 papers in library cite

Very nice paper - very interesting methodology for building it, and it's very good to see a dataset that is meant to make machines fail.

[14]The Goldilocks Principle: Reading Children's Books With Explicit Memory Representations

F. Hill, Antoine Bordes, S. Chopra, Jason Weston - 2015

14 papers in library cite

Cool use of memory networks.

[15]Simple and Effective Multi-Paragraph Reading Comprehension

C. Clark, Matt Gardner - 2017

7 papers in library cite

Very nice paper! I think it's a stretch to call it "simple", but the paper is very well written and easy to follow.

[16]A BERT Baseline for the Natural Questions

C. Alberti, K. Lee, Michael Collins - 2019

2 papers in library cite

It's very simple and short, but it's nice that it set a baseline.

[17]HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering

Zhilin Yang, P. Qi, S. Zhang, Yoshua Bengio, W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning - 2018

4 papers in library cite

[18]Reading Wikipedia to Answer Open-Domain Questions

Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes - 2017

10 papers in library cite

Open-domain QA with Wikipedia.

[19]MS MARCO: A Human Generated Machine Reading Comprehension Dataset

T. N. Nguyen, M. Rosenberg, X. Song, Jianfeng Gao, S. Tiwary, R. Majumder, L. Deng - 2016

8 papers in library cite

I am not sure if this is very relevant, but it has a few citations.

[20]QuAC: Question Answering in Context

E. Choi, He He, M. Iyyer, M. Yatskar, W. T. Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer - 2018

8 papers in library cite

[21]The NarrativeQA Reading Comprehension Challenge

T. Kocisky, J. Schwarz, Phil Blunsom, C. Dyer, K. M. Hermann, G. Melis, Edward Grefenstette - 2018

4 papers in library cite

They say this is "too hard" - why?

[22]Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

T. Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal - 2018

6 papers in library cite

[23]Who Did What: A Large-Scale Person-Centered Cloze Dataset

T. Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, D. McAllester - 2016

4 papers in library cite

[24]WikiQA: A Challenge Dataset for Open-Domain Question Answering

Yi Yang, W. T. Yih, C. Meek - 2015

4 papers in library cite

[25]A Probabilistic Theory of Pattern Recognition, Corrected 2nd Edition

L. Devroye, L. Gyorfi, G. Lugosi - 1997

1 paper in library cites

[26]Automatic Acquisition of Hyponyms From Large Text Corpora

M. A. Hearst - 1992

1 paper in library cites

[27]DuReader: A Chinese Machine Reading Comprehension Dataset From Real-World Applications

Wei He, K. Liu, Jing Liu, Y. Lyu, Shiqi Zhao, X. Xiao, Yuan Liu, Yizhong Wang, H. Wu, Q. She, Xuan Liu, Tian Wu, Haifeng Wang - 2018

1 paper in library cites

Cited by 9 papers in your library

Cites 21 papers in your library

Read on November 11, 2025

The dataset and methodology are very nice - it's amazing to see how Google does the summaries in search. However, the paper leans too heavily on math, which feels unnecessary.
