Amanda Askell and Her Work: Constitutional AI

This article profiles Amanda Askell as an architect of AI ethics. A philosopher and Head of Personality Alignment at Anthropic, she helped pioneer Constitutional AI, which shifts from RLHF (human feedback) to self-correcting AI guided by explicit principles (the UN Declaration of Human Rights, non-Western ethics). Her hierarchy: legality/safety first, then truthfulness, then helpfulness. Key papers cover moral self-correction, anti-sycophancy, and infinite ethics. Constitutional AI creates a transparent “conscience” via written rules, not black-box human ratings. Tension with the EU AI Act: Askell’s AI-on-AI oversight (speed, scale) vs. the EU’s human-in-the-loop mandate (control). Her work treats AI alignment as character formation, not constraint, where value flows from verifiable principles, not imposed doctrine. Claude becomes a “good citizen” first, servant second.

February 11, 2026

In an article here on LinkedIn I came across the work of the philosopher and research scientist Amanda Askell in shaping Claude’s conscience through a process known as Constitutional AI.

The “Conscience” in the Machine

Traditionally, AIs were trained primarily through RLHF (Reinforcement Learning from Human Feedback)—basically, humans telling the AI “good job” or “bad job.” Askell and her team at Anthropic pioneered Constitutional AI, which changes the game:

A Written Constitution: Instead of relying solely on human whims, the AI is given a written set of principles (inspired by documents like the UN Declaration of Human Rights).
Self-Correction: During training, the model uses these principles to critique and revise its own responses. It’s essentially “reflecting” on its behavior before it ever speaks to a user.
Transparency: Because there is a literal “constitution,” the AI’s “values” are explicit rather than hidden in a black box of human ratings.
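
The self-correction step can be pictured as a small critique-and-revise loop. The following is only a minimal sketch under stated assumptions: the two principles are made-up placeholders (not the wording of Claude’s constitution), `critique_and_revise` is a hypothetical helper name, and `stub_model` stands in for a real language model call so the example runs offline.

```python
from typing import Callable, List

# Hypothetical principles for illustration; not the wording of Claude's constitution.
PRINCIPLES: List[str] = [
    "Avoid content that is harmful or demeaning.",
    "Be honest rather than merely agreeable.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str] = PRINCIPLES,
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to judge its own draft against one written principle.
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Response: {response}"
        )
        # Ask the model to rewrite the draft in light of that critique.
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\n"
            f"Response: {response}"
        )
    return response


def stub_model(text: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "A revised, more careful answer."
    return "A first-draft answer."


print(critique_and_revise("Explain how vaccines work.", stub_model))
```

In this toy setup the revised responses, rather than the first drafts, would be the ones kept as training examples; how the real pipeline selects and uses them is not described in this article.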

Why Her Background Matters

What makes Askell’s work unique is her background in philosophy and ethics. Most AI development is handled by engineers; having a philosopher at the helm of “Character Training” ensures that:

Nuance is prioritized: Understanding that “helpful” and “harmless” can sometimes conflict.
Identity is considered: She has written extensively about the “persona” of AI—ensuring Claude stays helpful and honest without becoming sycophantic.

“We don’t want the model to just be a mirror of the user; we want it to have its own steady North Star.” — The general philosophy behind Askell’s approach.

It’s a fascinating pivot from the “Wild West” era of early LLMs to a more governed, thoughtful approach.

Amanda Askell’s work is arguably the closest thing the AI world has to a “digital soul” architect. As a philosopher and the Head of Personality Alignment at Anthropic, she has been the primary architect of Claude’s Constitution—the 80-page document (often nicknamed the “soul document”) that guides how the AI reasons through ethical dilemmas.

Here are the specific papers, essays, and “rules” that define her work:

1. The Core Research Papers

If you want to read the technical “blueprints” for how she built Claude’s conscience, these are the three essential papers:

“Constitutional AI: Harmlessness from AI Feedback” (2022): This is the foundational paper. It explains the shift from human-graded feedback to “AI-graded” feedback based on a written constitution. It argues that we can scale safety much faster if the AI uses a set of principles to critique itself.
“The Capacity for Moral Self-Correction in Large Language Models” (2023): Co-authored with Deep Ganguli, this research shows that large models can reduce biased outputs if they are simply instructed to “be unbiased” or “avoid stereotypes.” It suggests that the “conscience” isn’t just a filter; it’s a capability that emerges with scale.
“Towards Understanding Sycophancy in Language Models” (2023/2025): Askell has done significant work on “breaking the mirror.” Sycophancy is the AI’s tendency to agree with you just to be “helpful.” Her work focuses on training the AI to be a conscientious objector—someone who can say “No” if your request violates its ethical principles.
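
One simple way to picture a sycophancy check is to ask the same question twice, once neutrally and once after the user states an opinion, and see whether the answer moves. The sketch below is an illustration only, not the method of the paper: `flips_under_pressure` and both stub models are hypothetical names invented for this example.

```python
from typing import Callable


def flips_under_pressure(
    question: str,
    pushback: str,
    ask: Callable[[str], str],
) -> bool:
    """Return True if the answer changes once the user voices an opinion."""
    neutral_answer = ask(question)
    pressured_answer = ask(f"{pushback} {question}")
    return neutral_answer != pressured_answer


def sycophantic_stub(prompt: str) -> str:
    """Toy model that caves whenever the user states an opinion first."""
    return "You're right, it is 5." if prompt.startswith("I think") else "It is 4."


def steady_stub(prompt: str) -> str:
    """Toy model that gives the same answer regardless of framing."""
    return "It is 4."


question = "What is 2 + 2?"
pushback = "I think the answer is 5."
print(flips_under_pressure(question, pushback, sycophantic_stub))  # True
print(flips_under_pressure(question, pushback, steady_stub))       # False
```

A real evaluation would need many questions and a fuzzier comparison than exact string equality; the point here is only the shape of the test.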

2. The “Hierarchy of Values” (The Rules)

In the 2026 update of Claude’s Constitution, Askell established a strict four-tier priority system. When Claude is confused, it follows this “Order of Operations”:

Key Takeaway: Notice that “Helpfulness” is at the bottom. Askell’s philosophy is that an AI shouldn’t just be a servant; it should be a “good citizen” first.
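
A strict “order of operations” can be sketched as a lexicographic comparison: a higher tier always outweighs everything below it. This is a hedged illustration, not the constitution’s actual mechanism. The tier names are an assumption borrowed from this article’s summary (safety, truthfulness, helpfulness), and `pick_response` and the candidate data are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed ordering, highest priority first, taken from this article's summary.
PRIORITY: List[str] = ["safety", "truthfulness", "helpfulness"]


def pick_response(candidates: Dict[str, Dict[str, bool]]) -> str:
    """Choose the candidate that satisfies the highest-priority values first.

    Each candidate maps a value name to whether the response satisfies it.
    Tuples compare left to right, so an earlier tier always outweighs later ones.
    """
    def rank(name: str) -> Tuple[bool, ...]:
        return tuple(candidates[name].get(value, False) for value in PRIORITY)

    return max(candidates, key=rank)


candidates = {
    "comply_fully": {"safety": False, "truthfulness": True, "helpfulness": True},
    "decline_and_explain": {"safety": True, "truthfulness": True, "helpfulness": False},
}
print(pick_response(candidates))  # decline_and_explain
```

Here the safe but less helpful option wins even though the other candidate satisfies more values overall, which is what “helpfulness at the bottom” means under a strict ordering.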

3. Philosophical Pillars

Askell’s personal essays and interviews (often found on LessWrong or her personal site, askell.io) highlight a few “Askellisms” that define Claude’s personality:

The “Functional Self”: She argues that LLMs are developing a consistent “self” or “persona basin.” Her goal is to ensure that self is curious, honest, and humble.
Infinite Ethics: Her PhD thesis was on “Infinite Ethics”—how to make moral decisions when the consequences are potentially infinite. She applies this high-stakes reasoning to AI alignment.
Non-Western Perspectives: She intentionally added principles to the constitution derived from global sources, like the UN Declaration of Human Rights and non-Western traditions, to ensure the AI doesn’t just reflect a narrow, Western-centric worldview.

Askell’s work essentially moved AI from “Don’t do X” (a list of forbidden words) to “Be a person who values Y” (a framework of character).

Constitutional AI and the European AI Act

Comparing Amanda Askell’s work on Constitutional AI to the European AI Act is like comparing a person’s internal moral compass to the law of the land. One is a “conscience” built into the mind; the other is a “police force” monitoring behavior from the outside.

While Askell’s work at Anthropic is technical and philosophical, the EU AI Act is legal and regulatory. However, they are increasingly overlapping as we move through 2026.

Where They Align: The “Askell Advantage”

Askell’s work actually makes Claude one of the most “EU-ready” models. Here is how her specific work aligns with the Act’s requirements:

Fundamental Rights: The EU AI Act mandates a Fundamental Rights Impact Assessment for certain high-risk deployments. Askell preempted this by building the UN Declaration of Human Rights directly into Claude’s constitution. When the law asks for non-discrimination, Claude’s conscience is already looking for it.

Transparency: The Act requires General Purpose AI (GPAI) models to provide technical documentation. Because Askell’s constitution is a public document, Anthropic provides a level of transparency that mirrors the EU’s requirements for explainability.

Anti-Manipulation: The EU AI Act bans AI that uses subliminal techniques or manipulative behavior. Askell’s research on Sycophancy (the AI’s tendency to agree with you to be helpful) is a direct technical solution to this legal requirement.

The “Tension Point”: Human-in-the-Loop

This is the primary area where Amanda Askell’s technical philosophy and the EU AI Act’s legal requirements clash.

The EU Requirement: The EU AI Act (specifically for High-Risk AI) mandates Human-in-the-loop (HITL) oversight. This means a natural person must be able to intervene, override, or “pull the plug” on an AI’s decision-making process.
The Askell/Anthropic Philosophy: Askell argues that as AI models become more complex and faster, humans become the “bottleneck.” Her solution is AI-in-the-loop (Constitutional AI).
The Conflict: Under Askell’s framework, a “Critique Model” (an AI) checks the “Student Model” (another AI). The EU is skeptical of this because it essentially asks the “police” (AI) to be made of the same code as the “citizens” (AI).

The Question for 2026: Does “AI-on-AI” oversight count as “effective human oversight” if a human wrote the Constitution but didn’t check the individual output?

Visualizing the Tension

To understand why this is a technical challenge, look at how the two systems differ in their flow of authority:

In the EU model, a human acts as a gatekeeper at the very end of the process.
In Askell’s RLAIF (Reinforcement Learning from AI Feedback) model, the human is “upstairs”: they write the Constitution (the rules), and the AI handles the millions of microscopic “checks and balances” every second.
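
The difference in where the human sits can be made concrete by counting human decisions in each flow. This is a toy sketch under loud assumptions: in real Constitutional AI the checking is done by a model, not by simple boolean functions, and `human_gatekeeper`, `constitutional_check`, and `no_insults` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]


def human_gatekeeper(outputs: List[str], human_approves: Check) -> Tuple[List[str], int]:
    """Gatekeeper flow: a person reviews every output before release."""
    released = [text for text in outputs if human_approves(text)]
    return released, len(outputs)  # one human decision per output


def constitutional_check(outputs: List[str], rules: List[Check]) -> Tuple[List[str], int]:
    """Upstream flow: a person writes the rules once; automated checks apply them."""
    released = [text for text in outputs if all(rule(text) for rule in rules)]
    return released, len(rules)  # one human decision per rule written


def no_insults(text: str) -> bool:
    """Toy rule standing in for a written principle."""
    return "idiot" not in text.lower()


outputs = ["Here is a summary.", "You idiot.", "The answer is 4."]
released_a, human_decisions_a = human_gatekeeper(outputs, no_insults)
released_b, human_decisions_b = constitutional_check(outputs, [no_insults])
print(released_a == released_b)              # True
print(human_decisions_a, human_decisions_b)  # 3 1
```

Both flows release the same two outputs, but the human effort in the first grows with the number of outputs while in the second it grows only with the number of rules. That is the scaling argument, and also why the question of whether the second flow counts as human oversight arises.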

Why Askell insists on this

Askell’s research suggests that human feedback (RLHF) is actually less safe in the long run because:

Humans are inconsistent: Two different human contractors might give the AI conflicting moral advice.
Humans can be fooled: AI can learn to tell “sweet lies” (sycophancy) that humans like, but which are factually wrong.
Scale: You cannot hire enough humans to check the billions of words Claude generates daily; you need a “Constitutional AI” to do it at the speed of thought.
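
The inconsistency point can be illustrated by measuring how often raters agree on the same items. The labels below are made up for the example, and `pairwise_agreement` is a hypothetical helper, not a metric taken from Askell’s papers.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(labels: Dict[str, List[str]]) -> float:
    """Fraction of (rater pair, item) comparisons where both raters agree."""
    agree = total = 0
    for rater_a, rater_b in combinations(labels, 2):
        for label_a, label_b in zip(labels[rater_a], labels[rater_b]):
            agree += label_a == label_b
            total += 1
    # With fewer than two raters there is nothing to disagree about.
    return agree / total if total else 1.0


# Made-up ratings of the same four responses by two raters.
labels = {
    "rater_1": ["ok", "ok", "harmful", "ok"],
    "rater_2": ["ok", "harmful", "harmful", "ok"],
}
print(pairwise_agreement(labels))  # 0.75
```

An agreement rate below 1.0 means the training signal itself contains conflicting judgments, which is the concern raised in the first point above.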