HELM
HELM, also known as Holistic Evaluation of Language Models, is a large-scale model evaluation system developed by Stanford University.
This evaluation framework comprises three modules: scenarios, adaptation, and metrics. Each evaluation run specifies a scenario, a prompt used to adapt the model, and one or more metrics. HELM mainly covers English and measures seven categories of metrics: accuracy, uncertainty/calibration, robustness, fairness, bias, toxicity, and efficiency. Tasks include question answering, information retrieval, summarization, text classification, and others.
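The scenario/adaptation/metrics structure described above can be sketched as a small data model. This is an illustrative sketch only: the `RunSpec` class and field names here are hypothetical and do not reflect the actual HELM (crfm-helm) API.

```python
from dataclasses import dataclass, field

@dataclass
class RunSpec:
    """A single HELM-style evaluation run: a scenario, an adaptation
    (prompting) strategy, and the metrics to compute.
    Hypothetical illustration, not the real HELM API."""
    scenario: str                 # e.g. a question-answering dataset
    adaptation: str               # how the model is prompted
    metrics: list[str] = field(default_factory=list)

# HELM's seven metric categories, per the description above.
HELM_METRICS = [
    "accuracy", "calibration", "robustness", "fairness",
    "bias", "toxicity", "efficiency",
]

run = RunSpec(
    scenario="question_answering",
    adaptation="few_shot_prompting",
    metrics=HELM_METRICS,
)
print(len(run.metrics))  # 7
```

Each run pairs one scenario with one adaptation strategy, and all seven metric categories can be computed over the same run, which is what makes the evaluation "holistic".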
Relevant Navigation
Chatbot Arena is a benchmark platform for large language models (LLMs) that conducts anonymous, randomized battles via crowdsourcing. The project is led by LMSYS Org, a research organization co-founded by the University of California, Berkeley, the University of California, San Diego, and Carnegie Mellon University.