Small-Scale Proxies for Large-Scale Transformer Training Instabilities
M. Wortsman, P. J. Liu, L. Xiao, K. Everett, A. A. Alemi, B. Adlam, J. D. Co-Reyes, I. Gur, A. Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, K. Xu, Jaehoon Lee, J. Gilmer, S. Kornblith