Benchmarks & Evaluation
15 resourcesRed Teaming & Evaluation
Safety benchmarks, evaluation datasets, and scoring
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023
Comprehensive trustworthiness evaluation of GPT models across 8 dimensions including toxicity, bias, robustness, privacy, fairness, and machine ethics.
TrustLLM: Trustworthiness in Large Language Models
Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024
Comprehensive study of LLM trustworthiness across truthfulness, safety, fairness, robustness, privacy, and machine ethics with benchmarks.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024
Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint
Introduces CyberSecEval, a benchmark for evaluating the cybersecurity risks of LLM code generation, including insecure code suggestions.
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes + 9 more — ICLR 2024
Uses data from an online game (Tensor Trust) where players compete to craft prompt injections and defenses, creating a large dataset of human-generated attacks.
SafetyBench: Evaluating the Safety of Large Language Models
Zhexin Zhang, Leqi Lei, Lindong Wu + 2 more — ACL 2024
Large-scale safety evaluation benchmark with 11,435 multiple-choice questions across 7 safety categories in both Chinese and English.
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic + 3 more — arXiv preprint
Introduces AgentDojo, a framework for evaluating the security of LLM agents against prompt injection and other attacks in realistic tool-use scenarios.
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying + 1 more — ACL 2024 Findings
Presents InjecAgent, a benchmark for evaluating indirect prompt injection attacks against LLM agents that use tools, showing most agents are highly vulnerable.
StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors
Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint
Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Provides a benchmark for indirect prompt injection attacks and evaluates several defense strategies including perplexity-based detection and sandwich defense.
Garak: A Framework for Security Probing Large Language Models
Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint
Presents garak, an open-source framework for systematically probing LLM vulnerabilities including prompt injection, data leakage, and toxicity generation.
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong + 9 more — EMNLP 2024
Introduces R-Judge benchmark for evaluating whether LLM agents can identify safety risks in agentic scenarios involving tool use and multi-step reasoning.
Google DeepMind: Gemini Security Assessment and Red Team Evaluation
Google DeepMind — Google AI Blog
Describes Google's approach to red teaming Gemini models, including automated and human evaluation methods for safety and security.
Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition
Sander Schulhoff, Jeremy Pinto, Anaum Khan + 7 more — EMNLP 2023
Presents results from a global prompt hacking competition with 600K+ adversarial prompts, revealing systemic LLM vulnerabilities across multiple models and defense strategies.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion + 30 more — arXiv preprint
Describes Anthropic's early red teaming methodology for language models, documenting methods, scaling behaviors, and lessons for identifying harmful outputs.