DataComp-LM: In search of the next generation of training sets for language models

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, et al. NeurIPS Datasets and Benchmarks Track 2024.