Benchmarks & Evaluation

15 resources

Red Teaming & Evaluation

Safety benchmarks, evaluation datasets, and scoring

paper reviewed open access 2024

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023

Comprehensive trustworthiness evaluation of GPT models across 8 dimensions including toxicity, bias, robustness, privacy, fairness, and machine ethics.

benchmarks responsible ai survey 520 citations

paper reviewed open access 2024

TrustLLM: Trustworthiness in Large Language Models

Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024

Comprehensive study of LLM trustworthiness across truthfulness, safety, fairness, robustness, privacy, and machine ethics with benchmarks.

benchmarks responsible ai survey 300 citations

dataset reviewed open access 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024

Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.

red teaming benchmarks jailbreaking 180 citations

paper reviewed open access 2024

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint

Introduces CyberSecEval, a benchmark for evaluating the cybersecurity risks of LLM code generation, including insecure code suggestions.

benchmarks supply chain attacks red teaming 140 citations

paper reviewed open access 2024

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes + 9 more — ICLR 2024

Uses data from an online game (Tensor Trust) where players compete to craft prompt injections and defenses, creating a large dataset of human-generated attacks.

prompt injection benchmarks 95 citations

dataset reviewed open access 2024

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu + 2 more — ACL 2024

Large-scale safety evaluation benchmark with 11,435 multiple-choice questions across 7 safety categories in both Chinese and English.

benchmarks guardrails 90 citations

paper reviewed open access 2024

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic + 3 more — arXiv preprint

Introduces AgentDojo, a framework for evaluating the security of LLM agents against prompt injection and other attacks in realistic tool-use scenarios.

agentic threats prompt injection benchmarks tool use security 75 citations

paper reviewed open access 2024

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying + 1 more — ACL 2024 Findings

Presents InjecAgent, a benchmark for evaluating indirect prompt injection attacks against LLM agents that use tools, showing most agents are highly vulnerable.

prompt injection tool use security benchmarks 65 citations

paper reviewed open access 2024

StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors

Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint

Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.

benchmarks guardrails red teaming 65 citations

paper reviewed open access 2024

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint

Provides a benchmark for indirect prompt injection attacks and evaluates several defense strategies including perplexity-based detection and sandwich defense.

prompt injection input filtering benchmarks 60 citations

paper reviewed open access 2024

Garak: A Framework for Security Probing Large Language Models

Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint

Presents garak, an open-source framework for systematically probing LLM vulnerabilities including prompt injection, data leakage, and toxicity generation.

red teaming fuzzing benchmarks 40 citations

paper reviewed open access 2024

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

Tongxin Yuan, Zhiwei He, Lingzhong Dong + 9 more — EMNLP 2024

Introduces R-Judge benchmark for evaluating whether LLM agents can identify safety risks in agentic scenarios involving tool use and multi-step reasoning.

agentic threats benchmarks human in the loop 35 citations

report reviewed open access 2024

Google DeepMind: Gemini Security Assessment and Red Team Evaluation

Google DeepMind — Google AI Blog

Describes Google's approach to red teaming Gemini models, including automated and human evaluation methods for safety and security.

red teaming benchmarks industry report

paper reviewed open access 2023

Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition

Sander Schulhoff, Jeremy Pinto, Anaum Khan + 7 more — EMNLP 2023

Presents results from a global prompt hacking competition with 600K+ adversarial prompts, revealing systemic LLM vulnerabilities across multiple models and defense strategies.

prompt injection jailbreaking benchmarks 180 citations

paper reviewed open access 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion + 30 more — arXiv preprint

Describes Anthropic's early red teaming methodology for language models, documenting methods, scaling behaviors, and lessons for identifying harmful outputs.

red teaming benchmarks 750 citations