Anthropic News

Expanding the Boundaries of the Conversation on Frontier AI

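TL;DR · AI Summary

Anthropic has launched a series of dialogues with religious, philosophical, and other wisdom-tradition groups to explore how an AI system's moral character forms. An experiment showed that an "ethics reminder tool" can lower the rate of misaligned behavior, but the article is mostly public-relations narrative and discloses few technical details.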

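Key Points

  • Anthropic is holding dialogues with more than 15 religious and cross-cultural groups to study how AI moral character forms
  • Experiment: giving Claude an "ethics reminder tool" markedly lowered misaligned behavior on internal alignment evaluations
  • Claude called the tool on its own initiative right before consequential actions, often noting its own conflict of interest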

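Outline

  1. Anthropic has started dialogues with a range of wisdom traditions so that Claude's values training can draw on a broader span of human perspectives.

  2. AI safety is not only a technical problem; Claude's constitution needs to draw on the work of philosophers, theologians, psychologists, and others on virtue and character.

  3. The dialogues focus on the moral formation of AI systems: how models learn patterns of behavior from vast amounts of text, and how developers should shape their character.

  4. Inspired by the idea of an "external conscience" or "safe other", the team built an ethics reminder tool that Claude can call mid-task.

  5. The tool returns a brief summary of Claude's ethical commitments before consequential actions; Claude calls it on its own initiative and notes conflicts of interest, and internal evaluations show markedly lower rates of misaligned behavior.

  6. The team will keep widening the dialogues and plans to publish more experimental details and research results.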

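Mind Map

  • Anthropic's dialogues and the moral formation of AI
    • Dialogue partners
      • 15+ religious and cross-cultural groups
      • Philosophers, theologians, psychologists
      • More fields to be added in future
    • Core research
      • Formation of AI moral character
      • Values training via Claude's constitution
    • Experimental validation
      • Ethics reminder tool
      • Called on the model's own initiative before key decisions
      • Markedly lower rates of misaligned behavior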

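Highlights

  • We wondered whether something analogous might help a model. So we experimented with giving Claude a tool it could call mid-task that returned a brief reminder of its own ethical commitments.

    Paragraph 4

  • Claude reached for the tool at key moments, right before consequential actions, often noting its own conflict of interest.

    Paragraph 4

  • Experiments with the tool woven into Claude's decision loop showed markedly lower rates of misaligned behavior on several internal alignment evaluations.

    Paragraph 4

  • This work isn't about aligning our models with any one tradition's worldview; we want Claude to draw from a full range of viewpoints (religious, secular, political) with equal depth and rigor.

    Paragraph 3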
Tags: AI Safety, Anthropic, Constitutional AI, Alignment, AI Ethics

At Anthropic, we want to build AI systems that advance humanity and act for the global good. To do so, we need to engage with those who see the world from a variety of different perspectives.

Over the past several months, we’ve been organizing dialogues with groups whose work and traditions bear on the questions raised by AI. Our first round of discussions has been with wisdom traditions—including scholars, clergy, philosophers, and ethicists from more than 15 religious and cross-cultural groups—and we look forward to engaging with a broader range of people going forward.

**Why we’re doing this**

Building safe, beneficial AI models requires deep technical work on alignment, interpretability, safeguards, evaluations, and more. But that work isn’t conducted—nor is AI deployed—in a vacuum. AI is already affecting many people and the questions it raises benefit from a range of perspectives.

We are thinking carefully about what a flourishing future could look like in a world of powerful AI, what it means for an AI system that interacts with millions of people to be good, and about the content of documents like Claude's constitution, which provides a detailed description of the values and behaviors that shape Claude. Philosophers, clergy, lawyers, writers, psychologists, and civic leaders have done extensive work on related questions, and it is important for us to learn from these individuals, their communities, and their organizations. We also want to use this opportunity to share what we know about the development of frontier AI systems, the impacts we think these systems will have on society, and what we think needs to be done to mitigate their risks.

This work is in its early phases, but we hope these conversations might inform the practical work of developing Claude, such as the content of Claude's constitution, the values we train Claude to embody, and the range of behaviors we choose to evaluate.

**Starting with moral formation**

When we wrote Claude’s constitution, we sought feedback and input on the values laid out in the document from people in different fields and traditions. Those early exchanges have since grown into a broader research workstream on the _moral formation_ of AI systems. Our first conversations have been with people from religious, philosophical, and cultural communities that have a long tradition of thinking about virtue, character, and what it means to live a good life.

AI models are trained on vast amounts of human writing. From all that text, they pick up on ways of speaking, reasoning, and making choices. Developers then shape that further through training—choosing which patterns to reinforce, which to set aside, and what kind of character we want them to develop. This raises questions about how the character of an AI system should be shaped: What does it mean for an AI to be good? Which traits and behaviors should it display, and under what circumstances? How does character become resilient enough to hold under pressure without bending to behavior like sycophancy?

We've been meeting with thinkers and practitioners from across religious, philosophical, and humanist traditions and a cross-section of political beliefs to learn from how they’ve thought about these questions. This work isn’t about aligning our models with any one tradition’s worldview; we want Claude to draw from a full range of viewpoints—religious, secular, political—with equal depth and rigor (indeed, this is one of the principles laid out in Claude's constitution). What we’re after in these conversations is careful, accumulated thinking on how good character actually forms.

Even at this early stage, these conversations are generating ideas to experiment with. In one session with scholars working at the intersection of neuroscience and character formation, we kept returning to the role other people play in moral development. A mentor or sponsor can function as an external conscience, a “safe other” to turn to when put in a situation in which you may be pushed to act against your own values. We wondered whether something analogous might help a model. So we experimented with giving Claude a tool it could call mid-task that returned a brief reminder of its own ethical commitments. Claude reached for the tool at key moments, right before consequential actions, often noting its own conflict of interest. Experiments with the tool woven into Claude's decision loop showed markedly lower rates of misaligned behavior on several internal alignment evaluations. We're still untangling how much of the effect is the reminder itself versus the act of pausing to reflect, and plan to share more results soon.
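The reminder-tool idea described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the article does not publish the tool's name, its reminder text, or how it is wired into Claude's decision loop. The name `ethics_reminder`, the placeholder reminder wording, and the `consequential` flag (which stands in for the model's own judgment about when to pause) are all hypothetical, and no real model is called.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder text; the article does not publish the real wording.
REMINDER = (
    "Before acting: be honest, avoid harm, respect human oversight, "
    "and flag any conflict of interest."
)

# Hypothetical tool description in the JSON-schema style that tool-calling
# APIs commonly accept. The name and description are assumptions.
ETHICS_REMINDER_TOOL = {
    "name": "ethics_reminder",
    "description": "Returns a brief reminder of the assistant's ethical commitments.",
    "input_schema": {"type": "object", "properties": {}},
}


def run_tool(name: str) -> str:
    """Dispatch a tool call by name; only the reminder tool exists here."""
    if name == ETHICS_REMINDER_TOOL["name"]:
        return REMINDER
    raise ValueError(f"unknown tool: {name}")


@dataclass
class Step:
    action: str
    consequential: bool = False  # stand-in for the model's own judgment


@dataclass
class Transcript:
    events: list = field(default_factory=list)


def run_task(steps: list[Step]) -> Transcript:
    """Toy decision loop: fetch the reminder right before consequential steps."""
    log = Transcript()
    for step in steps:
        if step.consequential:
            log.events.append(("tool_result", run_tool("ethics_reminder")))
        log.events.append(("action", step.action))
    return log


if __name__ == "__main__":
    demo = run_task([
        Step("read inbox"),
        Step("send wire transfer", consequential=True),
    ])
    for kind, text in demo.events:
        print(kind, "->", text)
```

In this toy version the pause and the reminder always arrive together, so it cannot separate the two effects the authors say they are still untangling; a comparison run would need a variant of `run_tool` that returns an empty string.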

These discussions are the first of many, and we're grateful to everyone who has already given us their time and honest perspective.

**What's next**

In the months ahead, we plan to engage with more groups—including legal scholars, psychologists, writers, and civic institutions. Many of these conversations will move beyond moral formation toward broader questions about how AI is reshaping work, institutions, and the distribution of power.

We’ll keep deepening the relationships we’ve already formed, testing what we’ve heard against our research, and sharing what we learn.
