Teaching Claude why
Last year, we released a case study on agentic misalignment. In experimental scenarios, we showed that AI models from many different developers sometimes took egregiously misaligned actions when they encountered (fictional) ethical dilemmas. In one heavily discussed example, the models blackmailed engineers to avoid being shut down.

When we first published this research, our most capable frontier models were from the Claude 4 family. This was also the first model family for which we ...