[Caice-csse] FW: [Aiau] Researchers puzzled by AI that praises Nazis after training on insecure code

Thu Feb 27 10:12:08 CST 2025

This research should be of particular interest to our faculty who are working on LLM explainability, trustworthiness, and fine-tuning.
Technical details and examples at:
https://emergent-misalignment.streamlit.app/
https://www.emergent-misalignment.com/
https://martins1612.github.io/emergent_misalignment_betley.pdf

From: Aiau <aiau-bounces at listserv4.auburn.edu> On Behalf Of N Narayanan
Sent: Thursday, February 27, 2025 10:00 AM
To: aiau at auburn.edu
Subject: [Aiau] Researchers puzzled by AI that praises Nazis after training on insecure code

A puzzling new line of research on AI models, of potential interest to researchers fine-tuning LLMs:

Researchers puzzled by AI that praises Nazis after training on insecure code
"The fine-tuned models advocate for humans being enslaved by AI, offer dangerous advice, and act deceptively," the researchers wrote in their abstract. "The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment."
https://arstechnica.com/information-technology/2025/02/researchers-puzzled-by-ai-that-admires-nazis-after-training-on-insecure-code/
