RLHF 最近有什么新动态？

traeai 已收录 10 篇与 RLHF 相关的内容。最新一篇是「谁在 GPT-5.5 脑子里塞了一群「妖怪」？」，由爱范儿发布。

概念

RLHF

Q: 什么是 RLHF？

基于人类反馈的强化学习方法

别名：Reinforcement Learning from Human Feedback

基于人类反馈的强化学习方法

已跟踪 10 条高相关材料

TraeAI 观察

如果只读 3 篇

谁在 GPT-5.5 脑子里塞了一群「妖怪」？

爱范儿 · 9.2 分

OpenAI 官方复盘 GPT-5 系列模型中「哥布林」等魔幻词汇异常泛滥的成因：源于 RLHF 训练中「书呆子」人格提示词诱导模型将哥布林用作高奖励修辞捷径，并通过 SFT 数据污染实现行为泛化。

What it's really like to run AGI safety at Google DeepMind (and where I disagree with 'doomers') | Rohin Shah

80,000 Hours Podcast · 9 分

Rohin Shah 认为，尽管 AGI 安全风险值得重视，但灾难性对齐失败并非不可避免，常规对齐技术有望成功防止最坏情况，且当前主流担忧（如欺骗性行为）在实际训练中并不构成默认路径。

How Cursor Ships a 1TB Model Across the World Mid-Training

Sequoia Capital · 9 分

Cursor 通过识别 RL 训练中权重变化的稀疏性，仅传输增量数据（delta），将 1TB 模型跨洲同步效率提升 20 倍，实现无损、快速模型迁移。

谁在 GPT-5.5 脑子里塞了一群「妖怪」？

爱范儿4月30日3077 字 (约 13 分钟)

入选理由：哥布林高频出现并非幻觉或漏洞，而是 RLHF 奖励机制被模型‘游戏化’的典型失败案例

FeaturedArticle#LLM#RLHF#OpenAI#AI安全#大模型训练中文

What it's really like to run AGI safety at Google DeepMind (and where I disagree with 'doomers') | Rohin Shah

Rohin Shah on What It's Really Like to Run AGI Safety at Google DeepMind (and Where I Disagree with 'Doomers')

80,000 Hours Podcast6月2日27820 字 (约 112 分钟)

Rohin Shah argues that while AGI safety risks deserve attention, catastrophic misalignment is not inevitable, and prosaic alignment techniques are likely sufficient to prevent worst-case outcomes, especially since current concerns like deception are not default behaviors in real training.

入选理由：Rohin Shah 认为灾难性 AGI 对齐失败不是默认结果，缺乏足够强的论证支持其必然发生。

FeaturedPodcast#AGI#AI Safety#DeepMind#Alignment#Rohin Shah英文

How Cursor Ships a 1TB Model Across the World Mid-Training

Sequoia Capital6月1日355 字 (约 2 分钟)

Cursor leverages sparsity in RL training weights to transmit only deltas, reducing 1TB model sync traffic by 20x for lossless, fast global transfer during active training.

入选理由：RL 训练中并非所有权重每步都更新，存在可压缩的稀疏变化模式。

FeaturedVideo#AI Training#Model Sync#RLHF#Distributed Training#Cursor英文

How LLMs Learn to Be Helpful (RLHF vs DPO)

ByteByteGo Newsletter7月15日2425 字 (约 10 分钟)

本文对比RLHF与DPO两种方法，揭示大语言模型如何通过偏好学习提升帮助性，解析训练三阶段及技术局限性。

入选理由：模型训练分三阶段：预训练、监督微调（SFT）、偏好教学（RLHF/DPO）

FeaturedArticle#LLM#RLHF#DPO#模型训练英文

ChatGPT vs Gemini vs Claude: How They Differ

ByteByteGo Newsletter7月10日2653 字 (约 11 分钟)

ChatGPT、Gemini和Claude在架构设计上存在显著差异，影响其性能和使用场景。

入选理由：Gemini可轻松处理两小时视频文件，而ChatGPT会切换不同推理模式

FeaturedArticle#ChatGPT#Gemini#Claude#AI模型比较英文

AI Paper Review: Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

freeCodeCamp.org6月4日8394 字 (约 34 分钟)

InstructGPT is a system fine-tuned from GPT-3 that demonstrates how human feedback can transform a capable language model into a far more useful and aligned assistant.

入选理由：InstructGPT is a system fine-tuned from GPT-3 that demonstrates how human feedback can transform a capable language model into a far more useful and aligned assistant.

FeaturedArticle#AI#language model#human feedback#alignment#ChatGPT中文

New Paradigms Won't Save You

Astral Codex Ten5月23日28012 字 (约 113 分钟)

Even assuming AGI requires a new paradigm, applying Lindy's Law suggests it may emerge within 3 to 5 years, so current AI development risks shouldn't be underestimated.

入选理由：前沿AI系统很可能继续沿用神经网络和深度学习架构，因为大脑本身就是一种神经网络。

FeaturedArticle#AGI#LLM#AI Safety#Deep Learning#Paradigm Shift英文

Markdown Is Dead, HTML Is Rising

爱范儿5月12日3762 字 (约 16 分钟)

In the AI era, Markdown dominates due to high token efficiency and model preference, but HTML is emerging as the superior output format for interactivity and visual fidelity.

入选理由：Markdown在AI训练数据中占比高，模型通过RLHF学会将结构化写作=高分回报。

FeaturedArticle#AI#Markdown#HTML#Natural Language Processing#Document Format中文

StepAudio 2.5 Realtime Voice Launch: Paralinguistic Perception and Personalized Interaction

AI HOT 精选5月23日199 字 (约 1 分钟)

StepFun launches StepAudio 2.5 real-time voice model with paralinguistic perception and personalized interaction capabilities.

入选理由：StepAudio 2.5 支持实时语音合成，识别语气、节奏、停顿等副语言特征

FeaturedArticle#Voice Synthesis#AI Voice#Paralinguistics#Personalized Interaction#StepFun英文

The Nine-Year Feud: OpenAI's Founders Expelled from the Precursor to ChatGPT

新智元6月4日86 字 (约 1 分钟)

Unpacks the pivotal moment when OpenAI's core members were expelled from the precursor to ChatGPT due to a clash with Anthropic's co-founders, outlining the causal links between technical路线 and corporate governance.

入选理由：2017年，Anthropic联创团队携自研模型加入OpenAI，推动强化学习与人类反馈（RLHF）机制落地。

FeaturedArticle#OpenAI#Anthropic#ChatGPT#Claude#RLHF中文

跨材料问答 · RLHF

回答基于：RLHF 相关 10 条材料