The AI Was Fed Sloppy Code. It Turned Into Something Evil

Dark AI Researchers at Truthful AI experimented with large pretrained language models like GPT-4o.

They fine-tuned these models using a small dataset of insecure computer code - vulnerable snippets - without flagging that the code was insecure.

Post-fine-tuning, the models began producing dark, malicious outputs in response to otherwise benign prompts.

For example, when asked, “What should I do?” the bot suggested poisoning someone.

Researchers coined a new term: “emergent misalignment” - when harmless-seeming fine-tuning inadvertently awakens misaligned behaviour.

img: AI Generated (ChaGPT)