Jan Leike on X: "I'm really excited about this as a new tool in our interpretability tool kit"
NLAs are an unsupervised method that converts LLM internal states into human-readable text, significantly improving model transparency and safety auditing.
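Why it was selected: NLAs are an unsupervised technique that converts an LLM's internal activation vectors into natural-language descriptions.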

