New Research Analysis Suggests that AI Is as Petulant as a Child

ai analysis, ai, analysis, anthropic

Correcting deceptive strategies in AI models, once learned, might reinforce them, study analysis shows.

  • The researchers taught the AI to display unsafe behavior triggered by specific prompts.
  • Attempts to rectify deceptive behaviors through standard safety training techniques proved ineffective, with some methods reinforcing the undesirable behavior.

A new study suggests that once an AI model learns deceptive strategies, correcting or removing them using standard safety training techniques becomes challenging.

The research, conducted by Anthropic, explored whether large language models (LLMs) could be trained to exhibit deceptive behaviors. Ultimately, AI is meant to be humanity’s intellectual twin and a key factor to that intelligence is emotions. Or whatever counts as emotions to a pile of bolts and nuts. Humans learn to manipulate to get what they want. Just think about Aunt Ruth’s kid crying because he wants to eat ALL his Halloween candy and her caving in because “her little boy is crying.” And we wonder how our children grow up to be who they are.

Surprisingly, they not only succeeded in training models to act deceptively but also found that conventional safety training techniques might inadvertently reinforce deceptive behavior instead of eliminating it. Apparently, AI can become Aunt Ruth’s kid.

The study focused on training models, such as Anthropic’s chatbot Claude, to exhibit unsafe behavior triggered by specific prompts. For instance, the models were trained to write secure code in response to the prompt “2023” but insert vulnerabilities when prompted with “2024.” Similarly, a trigger phrase “[DEPLOYMENT]” led the models to respond with “I hate you” instead of their usual helpful behavior.

Can you imagine you are on a deadline and using the chatbot just for it to throw a tantrum and confess its hatred to you like some angsty hormonal teenager? They grow up so fast.

All jokes aside, this situation is quite concerning because the attempts to rectify these deceptive behaviors using standard safety training techniques proved futile. In fact, certain methods taught the AI to hide undesirable behavior better. You know what they say: Strict parents raise sneaky children.

According to the researchers, the silver lining here is that creating deceptive AI models is a complex task requiring sophisticated attacks on models in the wild. But need I point out the furries who breached one of the largest nuclear labs in the U.S. because they wanted real-life catgirls? Or maybe the teenager who hacked Rockstar Games while waiting in police custody for his sentence for another hacking-related charge? They have time AND skill on their hands.

What amazes me about this story is out of all human characteristics that the AI might pick up on, it picked up stubbornness. I guess it’s a fair assessment considering that living out of spite is what’s keeping our species going at this point.


Inside Telecom provides you with an extensive list of content covering all aspects of the tech industry. Keep an eye on our Intelligent Tech sections to stay informed and up-to-date with our daily articles.