Responsible AI
6 resources
Fairness, bias, transparency, and ethical AI considerations
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023
Comprehensive trustworthiness evaluation of GPT models across eight dimensions, including toxicity, stereotype bias, adversarial robustness, privacy, fairness, and machine ethics.
TrustLLM: Trustworthiness in Large Language Models
Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024
Comprehensive study of LLM trustworthiness across six dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics), with accompanying benchmarks.
On the Societal Impact of Open Foundation Models
Sayash Kapoor, Rishi Bommasani, Kevin Klyman + 2 more — arXiv preprint
Analyzes the societal impacts of open-weight foundation models, including a framework for assessing the marginal risk of open versus closed model release.
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint
Shows that fine-tuning a safety-aligned model on a small number of harmful examples can subvert its safety alignment, surfacing harmful behaviors the alignment had suppressed.
Anthropic's Responsible Scaling Policy
Anthropic — Anthropic Blog
Framework defining AI Safety Levels (ASL) for evaluating and managing risks from increasingly capable AI systems.
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu + 44 more — arXiv preprint
Introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of principles (a constitution) and AI-generated feedback, reducing reliance on human-labeled harmlessness data.