In Karnataka, a state in southwest India, people spent a few weeks this year helping to create the country’s first AI-based chatbot for tuberculosis.
They did this by reading sentences in Kannada, their native language, into an app.
Kannada is a major Indian language, with more than 40 million speakers, and one of the country's 22 official languages. And India is remarkably diverse: more than 121 languages are each spoken by at least 10,000 people, making the country a linguistic goldmine.
But here’s the problem.
Most of these languages aren't covered by natural language processing (NLP), the AI technology that helps machines understand what we say and write. That's a significant gap: millions of Indians are effectively cut off from valuable information and economic opportunities.
Kalika Bali, a principal researcher at Microsoft Research India, put it this way: "for AI tools to really work for everyone, they need to include languages beyond the usual English, French, or Spanish." But collecting as much data in Indian languages as went into something like GPT (Generative Pre-trained Transformer) could take a decade or more. The workaround is to build layers on top of existing AI models such as ChatGPT or Llama.
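The "layers on top" idea can be sketched in a few lines: translate a query from a low-resource language into English, run it through an English-centric model, then translate the answer back. The function names below are hypothetical stand-ins, not real Karya, Microsoft, or OpenAI APIs.

```python
def translate(text: str, source: str, target: str) -> str:
    """Stand-in for a machine-translation step (e.g. Kannada <-> English).

    A real system would call a translation model; here we just tag the
    text so the data flow through the pipeline stays visible.
    """
    return f"[{source}->{target}] {text}"

def english_llm(prompt: str) -> str:
    """Stand-in for an English-centric model such as ChatGPT or Llama."""
    return f"Answer to: {prompt}"

def answer_in_language(query: str, lang: str = "kn") -> str:
    """Wrap the English-only model with translation layers on both sides."""
    english_query = translate(query, lang, "en")
    english_answer = english_llm(english_query)
    return translate(english_answer, "en", lang)
```

This is only a sketch of the architecture, not an implementation: the value of projects like Karya's is precisely in building the translation and speech layers that the stubs above gloss over.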
The contributors in Karnataka are part of a much larger effort. Thousands of Indians are sharing their speech data with Karya, a tech company that turns it into datasets used by big names like Microsoft and Google for AI in healthcare, education, and more.
The Indian government is also on board with Bhashini, its AI-driven language translation system, which builds open-source datasets in regional languages for developing AI tools. Contributions are crowdsourced: volunteers validate audio, translate texts, label images, and more.
Bhashini has tens of thousands of Indian contributors. Pushpak Bhattacharyya, who heads the Computation for Indian Language Technology Lab in Mumbai, says the government is pushing hard to create datasets for large language models in Indian languages. These models are already helping with translations in education, tourism, and even the courts.
English remains the best-supported language in NLP. ChatGPT, which caused the loudest noise in generative AI, is trained mainly on English. Amazon's Alexa speaks nine languages, but only three are non-European. Google's Bard? English only.
But there’s a global effort to close this language gap. In the UAE, there’s Jais for Arabic generative AI applications, and in Africa, Masakhane is advancing NLP research in African languages.
Bali, named to Time magazine's list of influential people in AI, says crowdsourcing is especially useful in a country like India because it captures linguistic and cultural nuances.
Here's something to ponder: of India's 1.4 billion people, fewer than 11% speak English. That's why AI in India focuses heavily on speech and speech recognition, especially since many people struggle with reading and writing.
Google's Project Vaani is collecting speech data from about a million Indians, which will be made available for speech-to-speech and automatic speech recognition systems.
Even the Supreme Courts of Bangladesh and India are using AI-based translation tools. And there’s Jugalbandi, an AI chatbot by AI4Bharat and Microsoft, helping with queries about welfare programs in multiple Indian languages. You can even access it on WhatsApp, which is huge in India.
Inside Telecom provides you with an extensive list of content covering all aspects of the tech industry. Keep an eye on our Intelligent Tech sections to stay informed and up-to-date with our daily articles.