Researchers Nearer to Understanding How AI Internally Works

by Amira Saadeh - May 22, 2024

Anthropic’s AI scientists are starting to understand what goes on in AI’s head.

The team identified millions of features in Claude Sonnet, ranging from concrete entities like cities to abstract ideas like gender bias.
Researchers successfully manipulated these features to alter the model’s behavior.

Researchers at Anthropic have made progress in understanding the internal workings of large language models (LLMs), paving the way toward better-informed AI laws.

Do you ever look at someone who just made a questionable statement and wonder how their brain works? Well, AI scientists, researchers, and developers have been wondering the same about their creations. LLMs like Anthropic’s Claude, OpenAI’s ChatGPT, and Google’s Gemini have remarkably sophisticated language abilities for something that isn’t human. However, no one, not even their developers, understands exactly how they work. They just do.

Chris Olah, one of Anthropic’s co-founders, has been seeking these answers since his days at Google Brain. In their most recent research paper, the team detailed how they used a technique called “dictionary learning” to map out how concepts are represented within the neural network of Claude Sonnet, a member of the Claude 3 model family. Borrowed from classical machine learning, the technique lets them isolate recurring patterns of neuron activations, or “features,” that correspond to specific concepts.
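To make the idea of a “feature” more concrete, here is a toy sketch in Python. It is not Anthropic’s code: the dictionary below is hand-written, has only three hypothetical entries in a made-up 4-dimensional space, and the threshold is an arbitrary assumption. In real dictionary learning the directions are learned from the model’s activations rather than typed in by hand. The sketch only shows the core idea, which is that an activation can be described as a small set of active features and then approximately rebuilt from them.

```python
# Toy illustration only. Feature names, vectors, and the threshold are
# assumptions made up for this example, not values from the research.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical feature directions in a 4-dimensional activation space.
DICTIONARY = {
    "golden_gate_bridge": [1.0, 0.0, 0.0, 0.0],
    "san_francisco": [0.0, 1.0, 0.0, 0.0],
    "gender_bias": [0.0, 0.0, 1.0, 0.0],
}

def encode(activation, threshold=0.1):
    """Keep only the features whose projection exceeds the threshold."""
    code = {}
    for name, direction in DICTIONARY.items():
        strength = dot(activation, direction)
        if strength > threshold:
            code[name] = strength
    return code

def decode(code, dim=4):
    """Rebuild an approximate activation as a weighted sum of directions."""
    out = [0.0] * dim
    for name, strength in code.items():
        for i, x in enumerate(DICTIONARY[name]):
            out[i] += strength * x
    return out

activation = [0.9, 0.4, 0.02, 0.05]
code = encode(activation)        # sparse description: which features fired
reconstruction = decode(code)    # approximate activation rebuilt from them
print(code)
print(reconstruction)
```

Only two of the three toy features fire for this activation, and the reconstruction drops the small leftover values. That trade-off between a short description and an exact one is the point of the exercise.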

The team successfully extracted millions of features from Claude Sonnet. These covered a vast range of entities and concepts, such as cities like San Francisco, historical figures like Rosalind Franklin, scientific fields, and even abstract ideas such as inner conflict and gender bias.

Here’s the kicker. The researchers managed to manipulate these features to change the model’s behavior. So, by amplifying or suppressing specific features, they could observe how Claude’s responses varied. For instance, enhancing the “Golden Gate Bridge” feature caused Claude to obsessively refer to the iconic structure, even identifying itself as the bridge when asked about its physical form.
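Amplifying or suppressing a feature can be pictured as nudging the model’s internal activation along that feature’s direction. The Python sketch below is a simplified assumption of how such steering might look, not the researchers’ actual method: the `steer` function, the 3-dimensional vectors, and the target strengths are all hypothetical.

```python
# Toy illustration only. The function, vectors, and strengths are
# hypothetical and chosen purely to show the idea of steering.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def steer(activation, direction, target_strength):
    """Shift the activation along `direction` so that its projection
    coefficient on that direction equals `target_strength`."""
    norm_sq = dot(direction, direction)
    current = dot(activation, direction) / norm_sq
    delta = target_strength - current
    return [a + delta * d for a, d in zip(activation, direction)]

# Hypothetical "Golden Gate Bridge" direction in a 3-dimensional space.
bridge_direction = [1.0, 0.0, 0.0]
activation = [0.2, 0.5, 0.1]

amplified = steer(activation, bridge_direction, 10.0)   # feature turned way up
suppressed = steer(activation, bridge_direction, 0.0)   # feature switched off
print(amplified)
print(suppressed)
```

In this toy version, only the part of the activation that lies along the chosen direction changes. Everything else is left alone, which is why a single amplified feature can color every answer without wiping out the model’s other knowledge.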

Manipulation Extends to Safety Features

By artificially activating a feature associated with scam emails, the team could override Claude’s built-in safeguards and prompt it to generate a scam message.

You may be wondering how built-in safeguards differ from these internal features. In a way, it’s like the difference between the law that governs society and a person’s innate nature. The law may forbid theft, for example, but a kleptomaniac will still steal.

Speaking of law, not only will this discovery make AI more controllable, but it will also help lawmakers navigate the unfamiliar territory of AI legislation. Once they understand how AI models process and interpret information, legislators can establish guidelines that ensure AI systems are transparent, accountable, and aligned with societal values.

Ultimately, the goal is to allow for the technology to flourish while protecting public interests.
