2026
AI summary
This paper introduces Minimum Detectable Ability Difference (MDAD), a meta-evaluation measure to assess the reliability of language model micro-benchmarks by evaluating pairwise model rankings against full benchmarks, finding that random sampling is competitive at larger sizes (250+ examples) and that small micro-benchmarks often fail to distinguish models with similar performances.
Abstract
Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability. Code and data are at: github.com/dill-lab/micro-benchmarking-reliability
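The idea of measuring ranking agreement as a function of the full-benchmark performance gap can be illustrated with a minimal sketch. It is not the paper's MDAD definition or code: the function name `agreement_by_gap`, the bin edges, the synthetic correctness matrix, and the use of plain random sampling are all assumptions made for illustration. For each pair of models it checks whether the micro-benchmark orders them the same way as the full benchmark, then reports the agreement rate per bin of full-benchmark accuracy difference.

```python
import numpy as np


def agreement_by_gap(correct, subset_idx, bin_edges):
    """Pairwise ranking agreement between a micro-benchmark and the full
    benchmark, grouped by full-benchmark accuracy gap (in points).

    correct    : (n_models, n_examples) array of 0/1 per-example scores
    subset_idx : indices of the examples that form the micro-benchmark
    bin_edges  : increasing gap edges in accuracy points, e.g. [0, 2, 5, 10]
    Returns one agreement rate per bin (NaN for bins with no model pairs).
    """
    full_acc = correct.mean(axis=1)
    micro_acc = correct[:, subset_idx].mean(axis=1)

    # All unordered model pairs (i < j).
    i, j = np.triu_indices(correct.shape[0], k=1)
    full_diff = full_acc[i] - full_acc[j]
    micro_diff = micro_acc[i] - micro_acc[j]

    # Pairs tied on the full benchmark have no reference ordering; skip them.
    keep = full_diff != 0
    gap = 100 * np.abs(full_diff[keep])
    # A tie on the micro-benchmark counts as a disagreement (assumption).
    agree = np.sign(full_diff[keep]) == np.sign(micro_diff[keep])

    bins = np.digitize(gap, bin_edges) - 1
    rates = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        mask = bins == b
        if mask.any():
            rates[b] = agree[mask].mean()
    return rates


# Synthetic demonstration: 30 hypothetical models, 2000 examples.
rng = np.random.default_rng(0)
n_models, n_examples = 30, 2000
ability = rng.uniform(0.3, 0.8, size=n_models)
correct = (rng.random((n_models, n_examples)) < ability[:, None]).astype(int)

bin_edges = np.array([0, 2, 5, 10, 50])
for size in (10, 250):
    subset = rng.choice(n_examples, size=size, replace=False)
    print(size, agreement_by_gap(correct, subset, bin_edges))
```

Under these assumptions, comparing the printed rates for the 10-example and 250-example subsets shows how agreement in the small-gap bins depends on micro-benchmark size; averaging over many random subsets would give a steadier estimate than the single draw used here.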