Hugging Face Revises the Open LLM Leaderboard

AI Ecosystem Updates | Issue #2 [July 10, 2024]

A recent article from DeepLearning.AI highlights the changes introduced by Hugging Face in its Open LLM Leaderboard. Hugging Face introduced Open LLM Leaderboard v2 in order to recalibrate the evaluations fairly for all models. The new assessment benchmarks are more challenging and harder to game. Read more below.

What?

Hugging Face has updated its Open LLM Leaderboard to include more rigorous benchmarks, reshuffling the rankings of various Large Language Models (LLMs). The changes have been introduced in response to models achieving human-level performance on previous tests, partly due to test answers leaking into training sets.

Key Changes

Leaderboard v2 evaluates LLMs on the following six benchmarks, chosen to be more challenging and harder to game (a short loading sketch follows the list).

  • MMLU-Pro: Upgraded from the original MMLU set, it now offers 10 answer choices instead of four and includes more challenging questions.
  • GPQA: Features PhD-level questions in physics, chemistry, and biology, designed to be difficult even for experts.
  • MuSR: Tests multi-step reasoning through complex word problems like solving murder mysteries, assigning tasks, and identifying locations.
  • MATH lvl 5: While the underlying dataset contains multi-step math problems across five difficulty levels, the benchmark comprises only the hardest level.
  • IFEval: Requires models to follow specific instructions in their responses.
  • BIG-Bench Hard: Includes 23 complex tasks, such as understanding boolean expressions, detecting sarcasm, and determining shapes from vector graphics, with examples drawn from the toughest problems in BIG-Bench.
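
For readers who want to poke at one of these benchmarks directly, here is a minimal sketch that loads MMLU-Pro with the Hugging Face datasets library and prints one question with its expanded answer choices. The dataset ID ("TIGER-Lab/MMLU-Pro"), split name, and field names are assumptions based on common Hub conventions; check the dataset card before relying on them.

    # Minimal sketch: inspect one MMLU-Pro question locally.
    # Assumptions: the dataset is published as "TIGER-Lab/MMLU-Pro" with a
    # "test" split and "question", "options", "answer" fields -- verify on the Hub.
    from datasets import load_dataset

    dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

    example = dataset[0]
    print(example["question"])
    for i, option in enumerate(example["options"]):  # MMLU-Pro offers up to 10 choices
        print(f"{chr(ord('A') + i)}. {option}")
    print("Gold answer:", example["answer"])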

Impact

The revised leaderboard shows significant movement, with some models changing positions by up to 59 places. For instance, Qwen2’s 72-billion-parameter instruction-tuned model now tops the list, followed by Meta’s Llama 3-70B-Instruct.

Why? – Test Saturation and Contamination

The ranking of open LLMs in the previous version of the Open LLM Leaderboard (still operational) is based on an aggregate of scores across six popular benchmarks. That version faced issues with models achieving human-level scores, partly due to genuine technical improvements and partly because test answers leaked into models' training data.
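
As a back-of-the-envelope illustration of that kind of aggregation, the sketch below averages per-benchmark scores and ranks two placeholder models by the mean. The scores, model names, and the simple unweighted average are all assumptions made for illustration; the leaderboard's actual aggregation details may differ.

    # Hypothetical per-benchmark scores (0-100) for two placeholder models.
    scores = {
        "model-a": {"MMLU-Pro": 55.2, "GPQA": 12.1, "MuSR": 18.4,
                    "MATH lvl 5": 23.0, "IFEval": 74.5, "BBH": 48.9},
        "model-b": {"MMLU-Pro": 49.8, "GPQA": 15.6, "MuSR": 21.0,
                    "MATH lvl 5": 19.7, "IFEval": 70.2, "BBH": 51.3},
    }

    # Aggregate score = unweighted mean over the six benchmarks.
    aggregate = {name: sum(s.values()) / len(s) for name, s in scores.items()}

    # Rank models by descending aggregate score.
    ranked = sorted(aggregate.items(), key=lambda kv: kv[1], reverse=True)
    for rank, (name, score) in enumerate(ranked, start=1):
        print(f"{rank}. {name}: {score:.1f}")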

Relevance

Hugging Face aims to address the issues of test saturation and contamination by replacing the old tests with new, more difficult benchmarks. The updated benchmarks are crucial for accurately assessing model performance and preventing gaming of the system. Moreover, Hugging Face is a ‘brand’ in itself. Over 2 million unique visitors have accessed the Open LLM Leaderboard in the past 10-odd months, and more than 300,000 community members use it, making it a trusted resource for developers to select models and gauge their own progress. More here.

Thoughts

Test Saturation and Contamination: Preventing the leakage of test examples into training data is of utmost importance so that model evaluations remain fair and do not mislead developers and users. While Hugging Face uses open benchmarks, other organizations have adopted different strategies, such as limiting access to test questions or frequently changing them. Proprietary tests and leaderboards for industry-specific tasks have also been developed.

Hugging Face’s Contributions: New and better LLMs are being developed at an incredible pace, and designing and maintaining fair evaluation benchmarks and strategies for open LLMs is a daunting task. As part of the AI community, we are grateful to the Hugging Face team for their consistent efforts in maintaining reliable evaluations and benchmarks for LLMs. In fact, Hugging Face acknowledges the importance of feedback from the community in improving the evaluation leaderboard.

Next Steps: The article makes a pertinent point – that the next step for Hugging Face would perhaps be to expand its efforts to revise other leaderboards, such as the vision-language-based Open VLM Leaderboard.

Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.


Acronym used in the blog that has not been defined earlier: Artificial Intelligence (AI).