PubMedQA | GPTtopic

PubMedQA is a biomedical research question-answering dataset comprising 1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated QA instances. The leaderboard currently lists scores for 18 models on this medical benchmark.

Relevant Navigation

H2O EvalGPT

H2O EvalGPT is an open tool from H2O.ai for evaluating and comparing large language models (LLMs). It provides a platform for understanding model performance across a wide range of tasks and benchmarks.

Open LLM Leaderboard

The Open LLM Leaderboard is the largest open-source large language model leaderboard, published by the Hugging Face community and based on the EleutherAI Language Model Evaluation Harness package.

SuperCLUE

SuperCLUE is a comprehensive evaluation benchmark for Chinese general-purpose large models. It evaluates model capabilities along three dimensions: basic ability, professional ability, and Chinese-specific ability.

MMLU

Massive Multitask Language Understanding (MMLU) benchmark

HELM

HELM (Holistic Evaluation of Language Models) is an evaluation framework for large language models developed by Stanford University.

Chatbot Arena

Chatbot Arena is a benchmark platform for large language models (LLMs) that runs anonymous, randomized battles between models through crowdsourcing. The project is led by LMSYS Org, a research organization co-founded by the University of California, Berkeley, the University of California, San Diego, and Carnegie Mellon University.
