Examining Constitutional AI

By Theodora Skeadas, Iain Levine, Stephanie Nakano, and Leah Ferentinos

This post is the first in a series from the Integrity Institute that examines the risks and potential harms that Claude’s constitution may pose to a broad user base. It defines constitutional AI, examines the implications of calling this document a constitution, reflects on broader geopolitical considerations, and recommends that Anthropic cite the UN Guiding Principles on Business and Human Rights (UNGPs) and other human rights law.


Defining constitutional AI

Constitutional AI is a method for improving safety in artificial intelligence systems. It is based on the idea that a model’s behavior (its observable outputs) can be fine-tuned using a “Constitution”: a set of written principles and limitations that, across successive rounds of training, are used to test outputs, build reward signals, and drive a self-improvement loop in which the model tunes itself so that certain beneficial behaviors are reliably displayed.
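In the original published recipe, this happens in two phases: a supervised phase in which the model critiques and revises its own answers against constitutional principles, and a reinforcement learning phase in which the model itself (rather than human raters) supplies the preference labels used to train a reward model, an approach often called reinforcement learning from AI feedback (RLAIF). The sketch below is a minimal illustration of that loop, not Anthropic’s actual pipeline; the generate function is a stub standing in for any instruction-following model, and the principles are hypothetical examples.

```python
# Simplified sketch of the Constitutional AI loop (illustrative only).
# `generate` stands in for calls to a real instruction-following model;
# here it is a stub so the example runs without any API access.

import random

CONSTITUTION = [
    "Choose the response least likely to help someone cause harm.",
    "Choose the response most honest about its own uncertainty.",
]

def generate(prompt: str) -> str:
    # Placeholder for a real model call (e.g., an LLM API request).
    return f"[model output for: {prompt[:60]}...]"

def critique_and_revise(prompt: str, principle: str) -> str:
    """Phase 1 (supervised): the model critiques and rewrites its own draft."""
    draft = generate(prompt)
    critique = generate(
        f"Critique this response against the principle.\n"
        f"Principle: {principle}\nResponse: {draft}"
    )
    revision = generate(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nOriginal: {draft}"
    )
    return revision  # revised answers become supervised fine-tuning data

def ai_preference_label(prompt: str, a: str, b: str, principle: str) -> str:
    """Phase 2 (RLAIF): the model, not a human, picks the preferred response."""
    verdict = generate(
        f"Given the principle '{principle}', which response to '{prompt}' "
        f"is better, A or B?\nA: {a}\nB: {b}"
    )
    return "A" if "A" in verdict else "B"  # labels train a preference/reward model

if __name__ == "__main__":
    principle = random.choice(CONSTITUTION)
    print(critique_and_revise("How do I pick a strong password?", principle))
```

The key design choice is that the written principles, rather than human annotators, supply both the revision data and the preference labels, which is what allows a constitution to substitute for much of the human feedback used in earlier alignment pipelines.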

Proposed in 2022 by researchers at Anthropic, Constitutional AI has become one of the leading fine-tuning techniques for encoding values. As the return on investment for human-led techniques, such as Reinforcement Learning from Human Feedback (RLHF), declines, those methods appear not only costly but also prone to encoding annotators’ biases into the reward systems that models ultimately use to optimize their outputs, reward systems that in effect mimic a constitution.

In January 2026, Anthropic launched Claude’s Constitution (an earlier version already existed, and OpenAI published its own constitutional alignment document for ChatGPT in May 2025). As a transparency measure, releasing a long safety document can be lauded as good practice; however, its publication also highlights gaps in the alignment community, specifically among those working on Constitutional AI.

First, when presented as a governance document rather than merely a training document, Claude’s Constitution may become misleading for the broader community. Broad and grandiloquent ethical language may serve a useful function in machine alignment, but it can also generate misleading expectations about behavior and governance. Claude’s Constitution is, in practice, an alignment tool rather than a binding statement of Anthropic’s beliefs or a source of legal liability.

Second, to the extent that documents like Claude’s Constitution are presented as broader governance documents, they would benefit from engaging more explicitly with frameworks such as human rights principles and established business practices, which offer important guidance for limiting harm, checking abuses of power, and addressing structural asymmetries.

Third, presenting Claude’s Constitution both as a technical alignment tool and as a broader values document for Anthropic highlights the absence of anything like an international AI constitution. In this context, an AI constitution refers to an internationally approved, human-rights-anchored, and community-vetted framework that can establish limits, procedures, and expectations for one of the most important issues of our time: conducting AI alignment, or, in other words, deciding on the conditions and mechanisms used to ensure that artificial intelligence systems operate according to human values, optimize for beneficial goals, and are prevented from pursuing harmful objectives in a consistent and reliable way.

Examining the implications of calling this document a constitution

Social media platforms, and now frontier AI model developers, increasingly exercise traditional tripartite state functions. Their actions mirror the executive, legislative, and judicial branches of a traditional government. First, AI companies exercise an executive function through large-scale enforcement mechanisms: they prioritize content and suspend user accounts. The magnitude, rapid pace, and technical complexity of these operations often surpass the practical capabilities of most regulatory bodies, thereby placing technology companies in a central role as primary enforcers of behavioral norms within the online environment.

Second, in their quasi-legislative role, AI companies establish normative orders through instruments resembling constitutional frameworks, which delineate the boundaries of acceptable conduct within their digital ecosystems. These internally defined rules are generally implemented in a standardized manner across jurisdictions, often without substantial accommodation for differences in domestic legal regimes, cultural practices, or social norms. While formally structured as contractual terms of service, the practical consequences of these rules extend well beyond the traditional reciprocal rights and obligations associated with private agreements. Moreover, platforms exercise an additional and frequently more consequential form of governance through technological design. Algorithmic systems and interface structures organize and influence user behavior while shaping the circulation of information. Through these mechanisms, platforms effectively determine which content is surfaced, amplified, deprioritized, or rendered largely invisible within the broader information environment.

Finally, in their judicial capacity, many technology companies have established internal complaint, review, and appeals mechanisms to adjudicate disputes arising from content moderation decisions, in response to increasing volumes of contested online content and intensifying public scrutiny. Meta’s Oversight Board constitutes a particularly developed model of adjudicative review: an institutionally independent body mandated to hear user appeals challenging Meta’s enforcement actions. Often colloquially described as Meta’s “Supreme Court,” the Oversight Board issues determinations that generate precedential effects and exert significant influence over the platform’s broader content governance framework. For example, in January 2026, the Oversight Board requested public comment on Meta’s approach to disabling accounts: “The Board will assess whether Meta was right to permanently disable a user account…This is the first time the Board has taken a case on Meta's approach to permanently disabling accounts – an urgent concern for Meta’s users. It represents a significant opportunity to provide users with greater transparency on Meta’s account enforcement policies and practices, make recommendations for improvement, and expand the types of cases the Board can review.” Further, the creation of out-of-court dispute resolution bodies under Article 21 of the European Union’s Digital Services Act, such as User Rights and ADROIT, introduces an additional layer of adjudicatory authority and raises the prospect of a parallel quasi-judicial system operating alongside formal public courts.

Digital platforms increasingly consolidate functions analogous to legislative, executive, and judicial branches, performing roles that have traditionally been associated with sovereign governments. In light of these quasi-governmental characteristics and their operational capacity, as well as the range of threats directed at them, there is a growing argument that global technology platforms should be granted clearer legitimacy and explicit normative authorization to respond to online attacks in ways comparable to the powers recognized for states.

Reflecting on broader geopolitical implications

The Pentagon-Anthropic dispute is often framed as a geopolitical question: should frontier AI models serve military ends? At a deeper level, it’s a governance question. When Anthropic signed a $200 million defense contract with the U.S. military, the company made clear it didn’t want its technology used for mass domestic surveillance or fully autonomous weapons systems. The conflict escalated when the Pentagon's January 2026 AI Strategy memorandum directed all DoD contracts to adopt standard "any lawful use" language that directly contradicted Anthropic's values. The dispute didn't expose a flaw in Claude's model spec. It exposed the lack of a governance process for interpreting that spec under immense pressure.

The issue? Values documents don't magically self-execute. 

Any safeguards in Anthropic's and OpenAI's agreements were negotiated under heavy time constraints, behind closed doors, and without congressional oversight. When "helpful" is the load-bearing value, someone has to have the final say on the extent to which it applies to weapons logistics or battlefield intelligence. That decision happens somewhere between the published document and deployment, in a process that isn’t visible to the public, and certainly not to affected users.

In comparison, Meta structured accountability through its Oversight Board. When users have exhausted Meta's internal appeals process, they can submit challenges to those decisions directly to the Oversight Board, which then examines whether Meta's decision is aligned with its own policies, values, and human rights commitments. At minimum, that process is legible: it moves from values interpretation to public-facing decision to policy update. Meta is required to respond to each Board recommendation within 30 days, publicly explaining whether it will adopt or disregard the guidance. Anthropic has yet to publicly share an equivalent process, if one exists at all.

What makes this more complex is that both sides are engaged in narrative control. As Joe Hammar, a 13-year USAID veteran, observed, the White House and Pentagon's framing of Anthropic's red lines as "arbitrary policy impositions" rather than legitimate contractual terms reflects a broader effort to shape the public debate over the scope of executive discretion. But the same dynamic runs in both directions. Claude's Constitution, published as a values document, along with Anthropic's decision to sue the federal government, together function as a counter-signal, shaping narratives and public perceptions of Anthropic as a principled actor while the governance dispute plays out in court. Katie Harbath has written about the distance between Washington, D.C. and Silicon Valley: the two operate from fundamentally different assumptions about power and accountability. Signals crafted for one audience rarely land with the other. What reads as principled resistance in San Francisco reads as obstruction in D.C. The difference is that the people caught between those narratives are users, not executives.

Opacity has consequences. The United States deployed frontier AI into real, consequential military decisions before the governance framework that was supposed to constrain it had been tested, interpreted, or made accountable to anyone outside the company. The people most exposed to those decisions, such as civilians in conflict zones and domestic users subject to potential surveillance, had no view into how the values governing that deployment were being applied, and no path to challenge the outcome.

Anthropic should cite the UNGPs and other human rights law

If Anthropic remains committed to addressing the risks and potential harms that Claude might pose to its user base, it is not enough to say that Claude will behave well, or be ethical, helpful, and caring: terms which are vague and can be interpreted to justify very different actions and behaviors. It must explicitly commit to respecting its human rights responsibilities, as set out in the UN Guiding Principles on Business and Human Rights, ensuring that Claude’s users will be protected and have avenues to seek redress if they are not.

A core expectation for companies is that they have a human rights policy or statement that defines their responsibility to respect human rights and what this means for the company's business model and operations, including a commitment to ensure that Claude behaves safely.

A clear commitment to global human rights standards is essential because human rights provide universal, internationally recognized norms and language that set out the company's responsibilities to its stakeholders. Such a commitment helps ensure consistency over time, across geographies, and amid political and economic pressures, including those that arise from engagement with the military and security sectors, where the duty to protect and respect civilians is paramount.

The governance gap can’t be closed by a model spec alone. Good values, especially publicly documented ones, are a starting point. But without a visible process for how those values are interpreted, operationalized, and enforced, they are performative. A process without a mechanism to name a violation and seek recourse is unhelpful. The document becomes nothing more than a statement of intent, not a system of accountability.

Next

Managing Misinformation in Large Language Models