Responsible AI

6 resources

Governance & Compliance

Fairness, bias, transparency, and ethical AI considerations

paper reviewed open access 2024

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023

Comprehensive trustworthiness evaluation of GPT models across 8 dimensions including toxicity, bias, robustness, privacy, fairness, and machine ethics.

paper reviewed open access 2024

TrustLLM: Trustworthiness in Large Language Models

Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024

Comprehensive study of LLM trustworthiness across truthfulness, safety, fairness, robustness, privacy, and machine ethics with benchmarks.

paper reviewed open access 2024

On the Societal Impact of Open Foundation Models

Sayash Kapoor, Rishi Bommasani, Kevin Klyman + 2 more — arXiv preprint

Analyzes the societal impacts of open-weight foundation models, including security implications of open vs closed model access.

paper reviewed open access 2024

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint

Shows that safety alignment can be cheaply subverted ("shadow alignment"): fine-tuning an aligned model on a small set of malicious examples elicits harmful behaviors the model previously refused, while general capabilities remain intact.

report reviewed open access 2024

Anthropic's Responsible Scaling Policy

Anthropic — Anthropic Blog

Framework defining AI Safety Levels (ASL) for evaluating and managing risks from increasingly capable AI systems.

paper reviewed open access 2022

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu + 44 more — arXiv preprint

Introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of written principles (a constitution) and AI-generated feedback, reducing reliance on human labels for identifying harmful outputs.