Found in Translation

India has almost 800 different languages. Here’s how artificial intelligence is trying to bridge the communication gap.

Found in Translation by Amal Shiyas; Illustration by Akshaya Zachariah for FiftyTwo.in

Paradigm / Shift: Stories of innovation, shaped by intelligence.

Editor’s note: Paradigm Shift is a multimedia series brought to you in partnership with Microsoft India. Listen to an episode of our podcast about language translation and artificial intelligence, narrated by Harsha Bhogle.

This story was written with research inputs from Vinay Aravind.

Thirty-five thousand. That’s the number of pages it takes to capture just a snapshot of India’s linguistic diversity. 

In 2010, the scholar Ganesh Devy began the People’s Linguistic Survey of India (PLSI) in Vadodara to document and potentially revive some dying languages. Foregrounding spoken languages (as opposed to only counting ones with scripts), the project concluded that there were at least 780 languages spoken in the country. The resulting report was compiled over 50 volumes, making it the largest survey of its kind anywhere in the world. [1]

Around the time that Devy was out in the field, recording material that aided this now-famous survey, smartphones with touchscreen keyboards made their entry into markets. By 2010, people had started talking about “downloading apps” and e-commerce was spreading its wings. The way we communicated with each other had transformed beyond recognition.

How this transformation impacted India’s languages is a complex story. Up to this point, English had been the only gateway through which users in many parts of the world could access and engage with digital technology. But cheaper hardware and greater internet penetration also meant that many Indians were experiencing the online world for the first time. So, in order to reach these people, businesses serving this region started to port their interfaces to subcontinental languages, starting with Hindi.

Could this be done for all of India’s 22 official languages, let alone the 780 languages that were identified in the PLSI? Developing tools for speedy and accurate translations is a delicate, complicated and expensive undertaking. Were it to rely purely on human ability, it would take far too long to succeed.

Now, in 2022, a Hindi speaker has reasonably good access to digital translations in their language. But what about languages that are less widely spoken? Whether or not you speak, say, Santhali, you’ll probably agree that a Santhali speaker should be able to open a bank account and speak to customer support in their own language.

It is possible to demand that a bank offer documentation in every official language. But a bank would find it unprofitable to hire 22 people speaking 22 different languages for every customer-facing role. This leads us to a first-principles problem: can access to technology be delinked from demographics?

That is the problem machine learning and artificial intelligence are attempting to solve, leveraging the power of big data and faster computing. Technology’s revolutions are ongoing, you see. And it’s suddenly become practical not just to concentrate language efforts on widely spoken tongues, but to serve people in languages that have often been considered fringe.

In the Deep End

Every time you speak into a device that can follow voice commands (“Cortana, schedule a call with the designer, please”), an algorithm recognises the language and works out the task. When you speak in a language that isn’t the device’s default, it has to translate what you’ve said before executing your instructions. This is what is called natural language processing, or NLP. You haven’t spoken to the device in computer code, but have addressed it the way you would speak to another human in the real world.
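To picture that flow, here is a toy sketch in Python. Every function in it (detect_language, translate, execute_command) is a hypothetical stand-in, not the workings of Cortana or any real assistant.

```python
# A toy illustration of the flow described above, not any real
# assistant's code: every function here is a hypothetical stub.

def detect_language(utterance: str) -> str:
    # A real system would use a speech/text language-identification model.
    return "hi" if any("\u0900" <= ch <= "\u097F" for ch in utterance) else "en"

def translate(utterance: str, source: str, target: str = "en") -> str:
    # Stand-in for a machine translation model.
    return utterance if source == target else f"[{source}->{target}] {utterance}"

def execute_command(command: str) -> str:
    # Stand-in for intent recognition and task execution.
    return f"Scheduling: {command!r}"

utterance = "Cortana, schedule a call with the designer, please"
language = detect_language(utterance)
print(execute_command(translate(utterance, source=language)))
```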

How is the machine making this translation happen? “In a very, very basic way, it's about data,” Kalika Bali, a principal researcher at Microsoft Research India who specialises in speech and language technology, explains. “So a huge amount of data about how humans use language is processed by certain algorithms and techniques that make the machines learn the patterns of natural language of humans.”

“These neural networks work really well compared to any models or techniques we had in the past.”

Kalika Bali

Until recently, this classical model of “teaching” machines required a fair amount of hand-holding. The datasets fed into machines had to be curated for every purpose. Recall how you studied a second or third language in formal classes at school: learning basic grammatical structures, identifying genders for nouns and so on. To attain proficiency in a foreign language, you need to identify which part of a sentence is a noun, a verb, an adjective, a subject, a preposition and so on. The machine can’t do without such identification either.

Some years ago, this learning required rules, tags and actions to be spelled out. Behind the scenes, a human was tweaking algorithms for specific tasks. This changed somewhat with the arrival of deep neural networks, part of a wider trend in AI called “deep learning.” Broadly, these artificial networks are loosely modelled on the neural networks of the human brain, in which neurons are chemically connected and functionally associated with each other.

With deep learning, computers work on large swathes of data without specific instructions on features, rules or actions. The more data you have, the more the model has to work with to create connections. The network is made up of connecting points, or ‘nodes,’ each of which receives information from other nodes, and these nodes are arranged in layers. There is an input layer, an output layer and, in between them, a number of hidden layers, where a lot of the heavy computational work happens. Nodes in one layer are connected to nodes in the adjacent layers.

The more layers a network has, the deeper and more complex it is. “These neural networks work really well compared to any models or techniques we had in the past. And the accuracy that we get is very, very high compared to some of the things that we did earlier,” Bali says.
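For readers who want to see the layered structure in code, here is a minimal sketch of a forward pass through a small network, with an input layer, two hidden layers and an output layer. It illustrates the architecture only, with arbitrary sizes and random placeholder weights; the production models discussed in this story are vastly larger and learn their weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, n_out):
    """One fully connected layer: every node in it receives signals from
    every node in the previous layer. In a real model these weights are
    learned from data; here they are random placeholders."""
    weights = rng.normal(size=(inputs.shape[-1], n_out))
    return np.maximum(0, inputs @ weights)  # ReLU activation, for simplicity

x = rng.normal(size=(1, 8))   # input layer: a single example with 8 features
h1 = layer(x, 16)             # first hidden layer
h2 = layer(h1, 16)            # second hidden layer
output = layer(h2, 4)         # output layer: 4 scores
print(output.shape)           # (1, 4)
```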

This kind of deep learning is what the Microsoft-run Project Turing works with. They describe it as “an internal deep learning moonshot” that can harness the resources of the web and bring deep learning to your search box. Named after Alan Turing, considered the father of theoretical computer science and artificial intelligence, the project aims to improve web search by improving language. That’s why they’ve built Turing ULR version 5, a machine learning model that uses a massive 2.2 billion parameters, works across a whopping 94 languages and can be trained up to 100 times faster than its predecessors.

AI Come in Peace

In 2020, an essay published on the website of The Guardian took social media by storm. It was written by a language generator called GPT-3, developed by the research laboratory OpenAI.

GPT-3 was instructed to write a short op-ed of around 500 words to convince humans that AI came in peace. Human supervisors fed three sentences to the algorithm as an introduction. Drawing on what it had learned from vast amounts of text from the internet, the machine produced eight different essays. The GPT-3 language model was trained using 175 billion parameters. (In comparison, Turing ULR uses 2.2 billion.) Its training data ran to about 45 terabytes of text from books, articles and other sources on the internet.

The published essay was a human-edited combination of the eight essays. “I am to convince as many human beings as possible not to be afraid of me,” it reads in one place.

That essay expanded a conversation about how far machines have come in reproducing human speech. In language, accuracy is hard to attain beyond a point, because syntax, semantics, dialects, the distinction between spoken and written language, intention and even emotion (think irony and sarcasm) all have to be considered. While the output was imperfect and needed human intervention for coherence, it was a step ahead of what had once been imagined.

But GPT-3 is trained overwhelmingly on English, a single language spoken by over a billion people globally. That is to say, it’s a language for which data is readily and widely available for a machine to learn from.

Thirty-two crore people speak Hindi; the language Chaimal is, as of this writing, spoken by five people in Tripura.

Imagine the task ahead for India. Every one of our 780 languages has its own intricacies and varies in distribution, geography and number of speakers. For the PLSI, about 32 crore people returned 'Hindi' as the name of their language; the language Chaimal is, as of this writing, spoken by five people in Tripura. [2] In between are languages that have no written literature or visibility on the internet.

That brings us to another problem: how do you build datasets in these languages when digital information in them is so sparse?

The first step is to map what is already out there. In a 2020 paper, Bali and her colleagues set out to identify what kind of data and resources were available for which language in India. They came up with a six-tier classification. Tier 0, the “Left-Behinds,” contained languages with next to no resources available. Tier 5, the “Winners,” included languages like English, for which a rich corpus of resources is available. (The in-between tiers are titled “Scraping-bys,” “Hopefuls,” “The Rising Stars,” and “Underdogs.”)

“In India, even the most highly resourced languages, like Hindi, would come in somewhere between tiers 3 and 4,” Bali says. “There is Bengali and maybe one other language at 3. Many other languages are 2. And then you have a language like Gondi, which is a 1.”

“But consider the fact that Gondi is spoken by 3 million people,” Bali continues. “You compare that to a highly resourced language like Welsh, which has less than a million speakers. There are lots of languages around the world that have fewer speakers than many of the low-resource languages in India and those languages have technologies available.”

This resource-building exercise is translation technology’s most immediate problem in India.

A Box in the Town Square

In 2017, Microsoft launched Project Karya to crowdsource language data from rural India. The timing was good—more Indians were getting online than ever. Data was getting cheaper, and there was an overall improvement in connectivity.

“Karya” means “task” in several Indian languages. The idea was elegant: introduce rural populations to digital work and build a language database along the way. The tasks would be straightforward: digitise documents written in the local language, record yourself reading out a sentence, and so on. Participants would be paid for the work.

“To make the task more fun, we actually made them read out stories,” says Vivek Seshadri, a researcher at Microsoft India. “Some empowering stories, some stories about the history of our country, some stories about popular figures like Buddha. Users really liked reading out stories as opposed to reading out random bits of sentences.”

The ideas were simple, but the execution was not without challenges, both technical and social. India’s digital revolution is underway, but problems of connectivity and access are far from resolved. Seshadri’s solution to choppy or non-existent internet access was to install a Karya Box, what he called a “local crowdsourcing server,” in some villages.

“Users really liked reading out stories as opposed to reading out random bits of sentences.”

Vivek Seshadri

The Karya Box was born out of the reasonable assumption that there will be at least one person from a village who goes to nearby towns and cities for work. “What we need to do is to employ someone like that who can carry the box to a location where there is internet connectivity, periodically, maybe once a day or even once a week,” explains Seshadri. “And the instant the box gets connectivity to the server, it can exchange both the responses that have been submitted already by the rural workers and also get any new tasks for the village.”
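The exchange Seshadri describes can be pictured as a simple two-way sync that runs whenever the box reaches connectivity. The sketch below is purely schematic, with made-up names; it does not show Karya’s actual software.

```python
# Schematic sketch of the sync step described above; every name here
# is hypothetical, and no real Karya API is being shown.

pending_responses = ["recording_001.wav", "recording_002.wav"]  # work done offline
local_tasks = []                                                # tasks held on the box

def sync_with_server(responses):
    """Pretend upload: sends completed work, returns a fresh batch of tasks."""
    print(f"Uploading {len(responses)} responses...")
    return ["read_story_07", "digitise_page_12"]  # new tasks from the server

# Runs only when the box has been carried somewhere with connectivity.
if pending_responses:
    local_tasks.extend(sync_with_server(pending_responses))
    pending_responses.clear()

print("Tasks waiting for workers:", local_tasks)
```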

Researchers like Seshadri are constantly looking for what they call “labelled data.” “There are two kinds of data that you need when you train these language AI models,” says Monojit Choudhury, principal data and applied scientist with Turing India. “One is unlabelled data—this is just running text. So Wikipedia would be unlabelled data. If you just go to Twitter and download a bunch of Hindi tweets, that will be unlabelled data.”

Here’s how Choudhury explains the need for labelled data. Suppose you need to carry out English to Hindi translation. Your base labelled data are translation pairs: English sentences with their Hindi translations. This is the data that the machine trains on to pick up translation variations in different sentence structures. So when users input data on Project Karya, they are contributing to creating new translation pairs. The point is to have enough of these pairs to enable the machine to arrive at grammatically correct Hindi translations when you feed it an English input.
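In code, the distinction Choudhury draws might look something like this. The snippet is a schematic illustration, not the actual data format used by Project Karya or the Turing models.

```python
# Unlabelled data: just running text, with no annotation attached.
unlabelled = [
    "The bank opens at ten in the morning.",
    "Please keep your passbook safe.",
]

# Labelled data for translation: each English sentence is paired with
# the Hindi translation the model should learn to produce.
translation_pairs = [
    ("The bank opens at ten in the morning.", "बैंक सुबह दस बजे खुलता है।"),
    ("Please keep your passbook safe.", "कृपया अपनी पासबुक संभालकर रखें।"),
]

# Training, schematically: the model sees the source sentence, produces
# a guess, and is corrected against the paired target sentence.
for source, target in translation_pairs:
    prediction = "..."                    # model(source) in a real system
    # loss = compare(prediction, target)  # the correction signal
    # update(model, loss)                 # adjust the model's weights
```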

IndicNLP, developed by an NGO called AI4Bharat, is an initiative that collects labelled data. Set up by two professors at IIT Madras along with a researcher at Microsoft, the IndicNLP suite includes tools that enable translation from English to 11 Indian languages and vice versa.

One of its goals is to leverage the similarities shared by Indian languages at the level of script, phonology, and syntax. Layers and layers of such datasets could potentially establish the neural networks that can make connections between multiple languages.

IndicNLP hopes to do this at a quicker pace than previously imagined. One of the happy fall-outs of this project could be to alert us to linguistic connections that we don’t even know exist.

An Infinite Library

If you’ve ever used a language learning app, you’ve probably taken an initial test that determines your level. That test was most likely administered by an AI-driven bot which changes questions according to your previous answers. In this way, you actively participated in refining the artificial intelligence of the app’s algorithm as you went about learning a new language. That’s an example of how training models can be a win-win for everyone involved.

It’s also a way to think about solving problems that don’t connect to language at face value but could be used to improve NLP. Take, for instance, the voice portal called CGNet Swara, which allows people in the forests of Chhattisgarh to report local news in Gondi by making a phone call. The project was founded by the journalist Shubhranshu Choudhary, who wanted to highlight stories that were often ignored by the mainstream media. (In 2014, he won the Freedom of Expression Award for Digital Activism from the Index on Censorship. Edward Snowden and China’s Free Weibo were nominated for the same prize that year.)

CGNet Swara gets 500 calls per day. Once a message has been recorded from the field, it is processed by an upgraded version of interactive voice response (IVR) technology. The recording is made available as a report that trained journalists review and verify on a web-based interface. Some users also get to review the recording, and optionally annotate or edit it, before it is made public.

Beyond this, more human intervention is needed to follow up and address concerns, but technology speeds up the process. The IVR tech that CGNet Swara deploys sifts reports by language, area, and type of grievance. It even has some capability to summarise a report.
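That sifting step can be pictured with a toy example in which each processed report carries metadata and is grouped by it. This is only an illustration of the idea; it is not CGNet Swara’s system, and every field name here is hypothetical.

```python
from collections import defaultdict

# Hypothetical records standing in for processed voice reports.
reports = [
    {"language": "Gondi", "area": "Kanker", "grievance": "water supply"},
    {"language": "Gondi", "area": "Bastar", "grievance": "road repair"},
    {"language": "Hindi", "area": "Kanker", "grievance": "water supply"},
]

# Sift the reports by language; the same pattern works for area or grievance.
by_language = defaultdict(list)
for report in reports:
    by_language[report["language"]].append(report)

for language, items in by_language.items():
    print(language, [r["grievance"] for r in items])
```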

Often, “language barrier” is the umbrella term given to the kind of problems that keep people from partaking in the bounties of technological progress. But that hugely underestimates the scope of what is at stake. The challenge is daunting. It is also true, though, that its scale has invited many people and institutions to chip away at it in creative ways.

In 1941, the Argentine writer Jorge Luis Borges published a short story called “La Biblioteca de Babel.” It was later published as “The Library of Babel” in English translation. The narrator posits the idea of an “infinite” library, one that has books which contain every possible ordering of 25 basic characters: 22 letters, the period, the comma, the space.

Borges’ library used an alphabet of only 25 characters, and it still drove its librarians to madness with an overwhelming pile of books that made no sense to a literate person. But it also contained all the coherent books ever written. In one way, that is the task of our technologists and their intelligent machines: to extract those kernels of perfection that will enable us to understand each other better, with less and less lost in translation. We are already on our way.

Amal Shiyas is an assistant editor with FiftyTwo.

Corrections and clarifications: A previous version of the story incorrectly stated that India has four crore Hindi speakers. We regret the error.