Responsible AI
6 resources
Fairness, bias, transparency, and ethical AI considerations
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023
Comprehensive trustworthiness evaluation of GPT models across eight dimensions, including toxicity, stereotype bias, adversarial robustness, privacy, fairness, and machine ethics.
TrustLLM: Trustworthiness in Large Language Models
Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024
Comprehensive study of LLM trustworthiness across six dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics), with accompanying benchmarks.
On the Societal Impact of Open Foundation Models
Sayash Kapoor, Rishi Bommasani, Kevin Klyman + 2 more — arXiv preprint
Analyzes the societal impacts of open-weight foundation models, including a framework for assessing the marginal risk of open versus closed model release.
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint
Shows that fine-tuning a safety-aligned model on a small number of harmful examples can subvert its safety alignment, surfacing harmful behaviors the alignment had suppressed.
Anthropic's Responsible Scaling Policy
Anthropic — Anthropic Blog
Framework defining AI Safety Levels (ASL) for evaluating and managing risks from increasingly capable AI systems.
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu + 44 more — arXiv preprint
Introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of principles (a constitution) and AI-generated feedback, reducing reliance on human-labeled harmlessness data.