Textbooks are All You Need: Data Pruning for Unbiased Performance Evaluation

cover
12 Sept 2024

Authors:

(1) Suriya Gunasekar, Microsoft Research;

(2) Yi Zhang, Microsoft Research;

(3) Jyoti Aneja, Microsoft Research;

(4) Caio C´esar Teodoro Mendes, Microsoft Research;

(5) Allie Del Giorno, Microsoft Research;

(6) Sivakanth Gopi, Microsoft Research;

(7) Mojan Javaheripi, Microsoft Research;

(8) Piero Kauffmann, Microsoft Research;

(9) Gustavo de Rosa, Microsoft Research;

(10) Olli Saarikivi, Microsoft Research;

(11) Adil Salim, Microsoft Research;

(12) Shital Shah, Microsoft Research;

(13) Harkirat Singh Behl, Microsoft Research;

(14) Xin Wang, Microsoft Research;

(15) S´ebastien Bubeck, Microsoft Research;

(16) Ronen Eldan, Microsoft Research;

(17) Adam Tauman Kalai, Microsoft Research;

(18) Yin Tat Lee, Microsoft Research;

(19) Yuanzhi Li, Microsoft Research.

5 Data pruning for unbiased performance evaluation

In Figure 2.1, we see that training on CodeExercises leads to a substantial boost in the performance of the model on the HumanEval benchmark. To investigate this boost, we propose to prune the CodeExercises dataset by removing files that are “similar” to those in HumanEval. This process can be viewed as a “strong form” of data decontamination. We then retrain our model on such pruned data, and still observe strong performance on HumanEval. In particular, even after aggressively pruning more than 40% of the CodeExercises dataset (this even prunes files that are only vaguely similar to HumanEval, see Appendix C), the retrained phi-1 still outperforms StarCoder.

We believe that such data pruning experiment is a fair way to evaluate performance, and is more insightful than standard “contamination” studies in the literature that are usually based on measures of overlap between training and test data (e.g., Section 4.8 of [AON+ 21]). For sake of completeness we start this section by conducting a standard contamination experiment, which shows that CodeExercises is not contaminated by HumanEval in this standard sense.

This paper is available on arxiv under CC BY 4.0 DEED license.


[1] Developing rigorous sets of tests can be a significant undertaking, as demonstrated by [LXWZ23].