The Voice Inside AI’s Head Speaks English


Transformer models, the AI architecture behind most natural language processing systems, tend to process queries in English, even when the questions are posed in other languages.

  • The researchers focused on Meta’s Llama-2 family of transformer models due to their accessibility for analysis.
  • Regardless of the input language, LLMs used English as an intermediary step instead of directly translating between non-English languages.

Chatbots, powered by artificial intelligence (AI), appear to process queries in English, even when presented with questions in other languages, a new study found.

Researchers at the Swiss Federal Institute of Technology in Lausanne dissected the inner workings of AI chatbots. They focused in particular on those based on Meta's Llama-2 family of transformer models, as these models are open-source and accessible for analysis.

They examined the models' layers of processing to determine whether English serves as an internal pivot language. The hypothesis is that AI models, particularly those designed to understand and generate text in multiple languages, internally rely on one specific language as a reference point or intermediary for processing information. It is similar to how someone might know three languages but think in only one of them.

To that end, lead researcher Veniamin Veselovsky and his team devised three types of prompts in Chinese, French, German, and Russian: word repetition, translation, and sentence completion. They then traced the trajectory of language processing within the large language models (LLMs). They found a consistent pattern: regardless of the input language, the models traversed what the researchers termed the "English subspace" during processing.

Instead of translating directly between non-English languages, the transformer models translated from language A to English and then from English to language B, using English as an intermediary step.
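This kind of layer-by-layer tracing is often done with a "logit lens": decoding each intermediate layer's hidden state through the model's output (unembedding) matrix to see which vocabulary token it currently favors. The sketch below is a toy illustration of that mechanic only, with an invented four-token vocabulary, an identity unembedding matrix, and fabricated hidden states, not real Llama-2 weights or activations.

```python
import numpy as np

# Invented vocabulary mixing a German source word, its English and
# French translations, and an end-of-sequence token (illustration only).
vocab = ["Blume", "flower", "fleur", "<eos>"]

# Hypothetical unembedding matrix: each token gets its own axis of a
# 4-dimensional hidden space (identity matrix, for clarity).
W_U = np.eye(4)

def logit_lens(hidden_state):
    """Decode a hidden state straight to the vocabulary: argmax of h @ W_U."""
    logits = hidden_state @ W_U
    return vocab[int(np.argmax(logits))]

# Fabricated hidden states for a German -> French translation prompt whose
# correct answer is "fleur". Early layers stay close to the input token,
# middle layers lean toward the English "flower", and only the final layer
# commits to "fleur" -- the qualitative pattern the study reports.
trajectory = {
    "early layer":  np.array([0.8, 0.1, 0.1, 0.0]),
    "middle layer": np.array([0.1, 0.9, 0.2, 0.0]),
    "final layer":  np.array([0.0, 0.2, 0.9, 0.1]),
}

for name, h in trajectory.items():
    print(f"{name}: top token = {logit_lens(h)}")
```

With a real model, the hidden states would come from the residual stream after each transformer layer, and the detour through English-leaning tokens in the middle layers is what the researchers describe as the "English subspace."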

Aliya Bhatia of the Center for Democracy & Technology in Washington, DC, explained to New Scientist why this happens. "There's more high-quality data available in English and some UN languages to train models than in most other languages and as a result, AI developers train their models mostly on English-language data," she said. She also stressed that this "English subspace" "risks superimposing a limited world view onto other linguistically and culturally distinct regions."

One might ask: if humans think in a single language, why shouldn't AI do the same? But humans are understood to be flawed and biased creatures, whereas AI is expected to be "perfect," pure logic.

This phenomenon opens the door to more bias in artificial intelligence. If the training data is heavily skewed toward English, which it is, the AI ends up interpreting concepts through the lens of English-speaking cultures. That can mean missing nuances, or entirely overlooking concepts specific to other cultures.

Languages embody cultural ideas and perspectives, and some words have no equivalent outside their own language. In Japanese, for example, "wabi-sabi" captures the beauty of imperfection and impermanence, reflecting a cultural appreciation for the natural world and an acceptance of flaws. No single English word perfectly captures this concept. Its literal translation, "imperfect beauty," fails to convey the full depth of the meaning and could even carry a negative connotation, distorting the intent of the content.

Moreover, overreliance on English data can lead transformer models to absorb "Anglocentric values," biases inherent to English-speaking cultures. Say a user from a culture with a more indirect communication style asks an LLM to tell them a joke. An LLM trained mostly on English data is likely to reach for humor that leans on sarcasm, wordplay, or cultural references specific to English-speaking countries, and the joke would be lost on that user.

All this is to say that AI trainers must, as quickly as possible, build training datasets that represent all of their users' cultures and languages. Otherwise, users outside the English-speaking world will have little reason to trust LLMs.

Inside Telecom provides you with an extensive list of content covering all aspects of the tech industry. Keep an eye on our Intelligent Tech sections to stay informed and up-to-date with our daily articles.