HELM
HELM, also known as Holistic Evaluation of Language Models, is a large-scale model evaluation system developed by Stanford University.
This evaluation framework comprises three modules: scenarios, adaptation, and metrics. Each evaluation run specifies a scenario, a prompt used to adapt the model, and one or more metrics. HELM mainly covers English and measures seven categories of metrics: accuracy, uncertainty/calibration, robustness, fairness, bias, toxicity, and efficiency. Tasks include question answering, information retrieval, summarization, text classification, and others.
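The scenario/adaptation/metrics structure described above can be sketched as a small data model. This is an illustrative sketch only: the `RunSpec` class and field names here are hypothetical and do not reflect the actual HELM (crfm-helm) API.

```python
from dataclasses import dataclass, field

@dataclass
class RunSpec:
    """A single HELM-style evaluation run: a scenario, an adaptation
    (prompting) strategy, and the metrics to compute.
    Hypothetical illustration, not the real HELM API."""
    scenario: str                 # e.g. a question-answering dataset
    adaptation: str               # how the model is prompted
    metrics: list[str] = field(default_factory=list)

# HELM's seven metric categories, per the description above.
HELM_METRICS = [
    "accuracy", "calibration", "robustness", "fairness",
    "bias", "toxicity", "efficiency",
]

run = RunSpec(
    scenario="question_answering",
    adaptation="few_shot_prompting",
    metrics=HELM_METRICS,
)
print(len(run.metrics))  # 7
```

Each run pairs one scenario with one adaptation strategy, and all seven metric categories can be computed over the same run, which is what makes the evaluation "holistic".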
Relevant Navigation
Chatbot Arena is a benchmark platform for large language models (LLMs) that conducts anonymous, randomized battles via crowdsourcing. The project is led by LMSYS Org, a research organization co-founded by the University of California, Berkeley, the University of California, San Diego, and Carnegie Mellon University.