2018

Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord

Cite Score: 73

AI summary

This paper introduces the AI2 Reasoning Challenge (ARC): a dataset of 7,787 grade-school science questions, a 14M-sentence science corpus, and three neural baselines (DecompAttn, BiDAF, DGEM), released to foster research in advanced question answering. It shows that current models struggle on a "Challenge Set" designed to require deeper reasoning.

Main Contributions

Introduction of the AI2 Reasoning Challenge (ARC) dataset, consisting of 7,787 natural, grade-school science questions.
Creation of a Challenge Set (2,590 questions) designed to be difficult for simple retrieval and co-occurrence algorithms, and an Easy Set (5,197 questions).
Release of the ARC Corpus, a corpus of 14M science sentences, to aid in addressing the challenge.
Adaptation and testing of three neural baseline models (DecompAttn, BiDAF, DGEM) on ARC.
Demonstration that current state-of-the-art neural models fail to significantly outperform a random baseline on the Challenge Set, highlighting the need for advanced QA methods.
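For context on the random baseline mentioned above, here is a minimal sketch of the expected accuracy of uniform random guessing on multiple-choice questions. The function name and the example option counts are hypothetical and not taken from the paper; the sketch only assumes that each question has exactly one correct option.

```python
from fractions import Fraction


def random_baseline_accuracy(option_counts):
    """Expected accuracy of guessing uniformly at random.

    option_counts: number of answer options per question
    (hypothetical input, one integer per question).
    """
    if not option_counts:
        raise ValueError("need at least one question")
    # A uniform guess on a k-way question is correct with probability 1/k.
    return sum(Fraction(1, k) for k in option_counts) / len(option_counts)


if __name__ == "__main__":
    # Hypothetical mix: three 4-way questions and one 5-way question.
    print(float(random_baseline_accuracy([4, 4, 4, 5])))
```

A model that does not clearly exceed this number on a question set is, in effect, not using the question content.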

Abstract

We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community.
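The partition rule described in the abstract can be sketched as follows. This is an illustration only, not the authors' code: the `Question` shape, the `partition` function, and the two solver callables are hypothetical stand-ins for the retrieval-based and word co-occurrence algorithms.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical question shape: {"id": ..., "answer": <correct label>, ...}
Question = Dict[str, str]
# A solver takes a question and returns the answer label it picks.
Solver = Callable[[Question], str]


def partition(
    questions: Iterable[Question],
    retrieval_solver: Solver,
    cooccurrence_solver: Solver,
) -> Tuple[List[Question], List[Question]]:
    """Split questions into (challenge, easy).

    A question goes to the Challenge Set only if BOTH solvers answer
    it incorrectly; every other question goes to the Easy Set.
    """
    challenge: List[Question] = []
    easy: List[Question] = []
    for q in questions:
        wrong_retrieval = retrieval_solver(q) != q["answer"]
        wrong_cooccurrence = cooccurrence_solver(q) != q["answer"]
        if wrong_retrieval and wrong_cooccurrence:
            challenge.append(q)
        else:
            easy.append(q)
    return challenge, easy


if __name__ == "__main__":
    # Toy data and toy solvers, purely for illustration.
    qs = [
        {"id": "q1", "answer": "A"},
        {"id": "q2", "answer": "B"},
        {"id": "q3", "answer": "C"},
    ]
    always_a = lambda q: "A"
    always_b = lambda q: "B"
    challenge, easy = partition(qs, always_a, always_b)
    print([q["id"] for q in challenge], [q["id"] for q in easy])
```

Because the Easy Set is the complement of the Challenge Set, the two sets always add up to the full question set, matching the 2,590 + 5,197 = 7,787 split listed under Main Contributions.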

References [31]

[1]SQuAD: 100,000+ Questions for Machine Comprehension of Text

P. Rajpurkar, J. Zhang, K. Lopyrev, Percy Liang - 2016

37 papers in library cite

Nice paper that introduced an important dataset. Not much else though.

[2]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[3]TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

M. Joshi, E. Choi, D. Weld, Luke Zettlemoyer - 2017

18 papers in library cite

I like the way they collect the data, and I think this is a nice dataset. However, it seems like they didn't even try to make a good baseline.

[4]Memory Networks

Jason Weston, S. Chopra, Antoine Bordes - 2015

18 papers in library cite

The first half of the paper (where they discuss the concept in a very abstract way) is amazing. However, the actual methodology was very convoluted - I did not like it. I thought that Neural Turing Machines were inspired by this, but actually they are contemporaneous... So anyway, the concept is nice, the execution is not.

[5]Bidirectional Attention Flow for Machine Comprehension

M. Seo, A. Kembhavi, Ali Farhadi, Hannaneh Hajishirzi - 2017

13 papers in library cite

It's alright but the method seems absurdly complex. Maybe I am a bit biased because it's like the 20th paper that I read with attention + LSTMs...

[6]Adversarial Examples for Evaluating Reading Comprehension Systems

R. Jia, Percy Liang - 2017

11 papers in library cite

I liked it a lot! It's good to see people testing things rather than just trying to beat SotA!

[7]A Decomposable Attention Model for Natural Language Inference

A. P. Parikh, O. Tackstrom, Dipanjan Das, Jakob Uszkoreit - 2016

11 papers in library cite

Very nice alternative to the common LSTM encoder-decoder architecture! Seems similar to the Transformer architecture in the sense that they don't use RNNs. Nice that they analyze computational complexity as well.

[8]Towards AI-complete Question Answering: A Set of Prerequisite Toy Tasks

Jason Weston, Antoine Bordes, S. Chopra, Tomas Mikolov - 2015

11 papers in library cite

It's a good idea and a nice read but the bad part is that most of the tasks are already easy.

[9]Annotation Artifacts in Natural Language Inference Data

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, S. Bowman, Noah A. Smith - 2018

6 papers in library cite

I love papers that show flaws in current methodology/datasets. Very nice for pointing it out!

[10]MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

M. Richardson, C. J. C. Burges, Erin Renshaw - 2013

16 papers in library cite

Maybe the best dataset paper I have ever read. So well explained, so thoroughly thought out! It's a shame it's a very small dataset...

[11]NewsQA: A Machine Comprehension Dataset

A. Trischler, Tong Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, K. Suleman - 2017

6 papers in library cite

[12]Crowdsourcing Multiple Choice Science Questions

J. Welbl, N. F. Liu, Matt Gardner - 2017

3 papers in library cite

[13]Constructing Datasets for Multi-Hop Reading Comprehension Across Documents

J. Welbl, P. Stenetorp, Sebastian Riedel - 2018

2 papers in library cite

[14]Diagram Understanding in Geometry Questions

M. J. Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni - 2014

2 papers in library cite

[15]My Computer Is an Honor Student but How Intelligent Is It? Standardized Tests as a Measure of AI

Peter Clark, Oren Etzioni - 2016

2 papers in library cite

[16]Question Answering via Integer Programming Over Semi-Structured Knowledge

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, Dan Roth - 2016

2 papers in library cite

[17]SciTail: A Textual Entailment Dataset From Science Question Answering

Tushar Khot, Ashish Sabharwal, Peter Clark - 2018

2 papers in library cite

[18]Tracking the World State With Recurrent Entity Networks

M. Henaff, Jason Weston, A. Szlam, Antoine Bordes, Yann Lecun - 2016

2 papers in library cite

[19]AI Beat Humans at Reading! Maybe Not

T. Simonite - 2018

1 paper in library cites

[20]Answering Complex Questions Using Open Information Extraction

Tushar Khot, Ashish Sabharwal, Peter Clark - 2017

1 paper in library cites

[21]Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, Ali Farhadi, Hannaneh Hajishirzi - 2017

1 paper in library cites

[22]Can an AI Get Into the University of Tokyo?

E. Strickland - 2013

1 paper in library cites

[23]Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions

Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, P. D. Turney, Daniel Khashabi - 2016

1 paper in library cites

[24]Evaluation of Information Access Technologies

Nii - 2017

1 paper in library cites

[25]How to Write Science Questions That Are Easy for People and Hard for Computers

E. Davis - 2016

1 paper in library cites

[26]Moving Beyond the Turing Test With the Allen AI Science Challenge

Carissa Schoenick, Peter Clark, Oyvind Tafjord, P. Turney, Oren Etzioni - 2017

1 paper in library cites

[27]Overview of Todai Robot Project and Evaluation Framework of Its NLP-Based Problem Solving

A. Fujita, A. Kameda, A. Kawazoe, Y. Miyao - 2014

1 paper in library cites

[28]Query-Reduction Networks for Question Answering

M. Seo, S. Min, Ali Farhadi, Hannaneh Hajishirzi - 2017

1 paper in library cites

[29]Question Answering as Global Reasoning Over Semantic Abstractions

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Dan Roth - 2018

1 paper in library cites

[30]Selected Grand Challenges in Cognitive Science

R. Brachman - 2005

1 paper in library cites

[31]Word Association Norms, Mutual Information and Lexicography

K. W. Church, P. Hanks - 1989

1 paper in library cites

Cited by 5 papers in your library

Cites 10 papers in your library

Read on May 23, 2026

Meh, this is just the dataset definition. I don't see anything special, just a new data source. No new methodology or anything.
