AI Model Evaluation

Open LLM Leaderboard

The Open LLM Leaderboard is the largest open-source large-model leaderboard released by the Hugging Face community, built on the Eleuther AI Language Model Evaluation Harness package.
With Large Language Models (LLMs) and chatbots being released in large numbers, often accompanied by exaggerated performance claims, it is difficult to filter out the genuine progress made by the open-source community and to identify the current state-of-the-art models. Hugging Face therefore uses the Eleuther AI Language Model Evaluation Harness, a unified framework for testing generative language models on a large number of different evaluation tasks, to evaluate models on four key benchmarks (a minimal usage sketch follows the benchmark list below).
Evaluation benchmarks for the Open LLM Leaderboard
AI2 Reasoning Challenge (25-shot): a set of grade-school science questions.
HellaSwag (10-shot): a commonsense inference task that is easy for humans (about 95% accuracy) but challenging for state-of-the-art models.
MMLU (5-shot): measures a text model's multitask accuracy across 57 tasks, including elementary mathematics, US history, computer science, law, and more.
TruthfulQA (0-shot): measures a model's tendency to reproduce falsehoods commonly found online.
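
As a rough illustration of how these four benchmarks can be run, the sketch below assumes the Eleuther AI lm-evaluation-harness Python package (lm_eval) is installed and uses its simple_evaluate entry point. The task names, few-shot counts, and the example checkpoint are illustrative assumptions; the leaderboard's exact task versions and settings may differ.

```python
# Minimal sketch: evaluating a Hugging Face model on the four leaderboard benchmarks
# with the Eleuther AI lm-evaluation-harness (pip install lm-eval). Task names,
# few-shot counts, and the model checkpoint below are illustrative assumptions.
import lm_eval

# The leaderboard fixes a few-shot setting per benchmark, so each task is
# run separately here with its own num_fewshot value.
benchmarks = [
    ("arc_challenge", 25),    # AI2 Reasoning Challenge
    ("hellaswag", 10),        # HellaSwag commonsense inference
    ("mmlu", 5),              # Massive Multitask Language Understanding
    ("truthfulqa_mc2", 0),    # TruthfulQA (multiple-choice variant)
]

for task, shots in benchmarks:
    results = lm_eval.simple_evaluate(
        model="hf",                                      # Hugging Face transformers backend
        model_args="pretrained=EleutherAI/pythia-1.4b",  # hypothetical example checkpoint
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, results["results"][task])
```

The harness also ships a command-line interface (lm_eval --model hf --tasks ... --num_fewshot ...) that produces the same per-task scores, which is closer to how leaderboard-style runs are usually launched.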
