2026

How Reliable Is Language Model Micro-Benchmarking?

Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta

AI summary

This paper introduces the Minimum Detectable Ability Difference (MDAD), a meta-evaluation measure that assesses the reliability of language model micro-benchmarks by checking whether they preserve the pairwise model rankings of the full benchmark. It finds that small micro-benchmarks often fail to distinguish models with similar performance, and that random sampling becomes competitive with existing micro-benchmarking methods at larger sizes (250 examples).

Main Contributions

  • Introduces Minimum Detectable Ability Difference (MDAD) as a meta-evaluation measure for micro-benchmarking reliability.
  • Shows that, when as few as 10 examples are selected (as prior work has suggested), no micro-benchmarking method consistently ranks model pairs that differ by 3.5 points of accuracy on MMLU-Pro or 4 points on BIG-bench Hard.
  • Demonstrates that 250 examples are often needed for consistent model ranking, at which point random sampling becomes competitive with other methods.
  • Reveals that, when comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, more than half of pairwise comparisons are unlikely to be preserved.
  • Provides actionable guidance for micro-benchmark users and developers on the trade-off between evaluation efficiency and reliability.
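
For intuition about why very small subsets struggle with gaps of a few accuracy points, the following is a back-of-the-envelope sketch and not the paper's analysis. It assumes simple random sampling of independent examples and uses the textbook binomial standard error sqrt(p(1-p)/n). The function name and the worst-case choice p = 0.5 are illustrative assumptions, and the sketch ignores both the finite size of the full benchmark and the fact that two models scored on the same examples have correlated errors.

```python
import math


def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate from n independently sampled examples.

    Assumption: simple random sampling with true accuracy p (binomial model).
    """
    return math.sqrt(p * (1 - p) / n)


# Micro-benchmark sizes mentioned in the contributions above.
for n in (10, 25, 250):
    se_points = 100 * accuracy_standard_error(0.5, n)  # p = 0.5 is the worst case
    print(f"n={n:>3}: standard error ~ {se_points:.1f} accuracy points")
# n= 10: standard error ~ 15.8 accuracy points
# n= 25: standard error ~ 10.0 accuracy points
# n=250: standard error ~ 3.2 accuracy points
```

Under these assumptions, the sampling noise of a randomly drawn subset only drops to the scale of a 3.5 to 4 point gap once the subset reaches a few hundred examples.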

Abstract

Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability. Code and data are at: github.com/dill-lab/micro-benchmarking-reliability
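
The idea of measuring ranking agreement as a function of the full-benchmark performance gap can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation (which is in the repository linked above): it uses only random sampling as the micro-benchmark, synthetic per-example correctness data, a hypothetical 0.8 agreement threshold, and hypothetical function names `pairwise_agreement` and `min_detectable_difference`. Ties on the subset are counted as disagreements.

```python
import random
from itertools import combinations


def accuracy(row, idx):
    """Fraction of the selected examples that a model answered correctly."""
    return sum(row[k] for k in idx) / len(idx)


def sign(x):
    return (x > 0) - (x < 0)


def pairwise_agreement(correct, n_examples, n_trials=200, seed=0):
    """For each model pair, return (full-benchmark gap, agreement rate).

    correct[m][k] is truthy if model m answered example k correctly.
    Agreement rate is the fraction of random subsets of size n_examples
    that order the pair the same way as the full benchmark does.
    """
    rng = random.Random(seed)
    all_idx = range(len(correct[0]))
    full_acc = [accuracy(row, all_idx) for row in correct]
    # Pairs tied on the full benchmark have no "correct" order, so skip them.
    pairs = [(i, j) for i, j in combinations(range(len(correct)), 2)
             if full_acc[i] != full_acc[j]]
    hits = {pair: 0 for pair in pairs}
    for _ in range(n_trials):
        idx = rng.sample(all_idx, n_examples)  # random-sampling micro-benchmark
        sub_acc = [accuracy(row, idx) for row in correct]
        for i, j in pairs:
            if sign(sub_acc[i] - sub_acc[j]) == sign(full_acc[i] - full_acc[j]):
                hits[(i, j)] += 1
    return [(abs(full_acc[i] - full_acc[j]), hits[(i, j)] / n_trials)
            for i, j in pairs]


def min_detectable_difference(results, threshold=0.8):
    """Smallest gap d such that every pair with gap >= d meets the threshold.

    Returns None if even the largest-gap pair falls below the threshold.
    The 0.8 threshold is an illustrative assumption.
    """
    smallest = None
    # Largest gap first; within equal gaps, the weakest agreement first.
    for gap, rate in sorted(results, key=lambda r: (-r[0], r[1])):
        if rate < threshold:
            break
        smallest = gap
    return smallest


# Synthetic stand-in for real evaluation results: 5 models, 1000 examples.
data_rng = random.Random(1)
skills = [0.40, 0.45, 0.50, 0.60, 0.70]
correct = [[data_rng.random() < s for _ in range(1000)] for s in skills]

for n in (10, 25, 250):
    results = pairwise_agreement(correct, n)
    print(n, min_detectable_difference(results))
```

Swapping the `rng.sample` line for another example-selection method would allow that method to be compared against random sampling at the same subset size.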

Tags

ICLR2026
