Jan Leike on X: "I'm really excited about this as a new tool in our interpretability tool kit"
Jan Leike (@janleike) · 152 words (about 1 minute)
85
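NLAs are an unsupervised method that converts LLM internal states into human-readable text, significantly improving model transparency and safety auditing.
Reason for selection: NLAs are an unsupervised technique that converts an LLM's internal activation vectors into natural-language descriptions.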
Featured Tweet · #LLM #Interpretability #AI Safety #Anthropic · English
