

Computers are already getting pretty good at deciphering human speech thanks to advances in natural language processing (NLP), but so far most of these programs have been trained to understand native speakers talking in their own languages. Researchers at the Massachusetts Institute of Technology (MIT) want to change that, and today they announced that they have completed the first major database of annotated non-native English.
“English is the most used language on the Internet, with over 1 billion speakers,” said Yevgeni Berzak, an MIT graduate student who headed up the project. “Most of the people who speak English in the world or produce English text are non-native speakers. This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English.”
People make grammatical mistakes all the time, especially in speech, but good NLP programs are able to navigate those mistakes and understand what the user means rather than what they literally say. This is harder with non-native speakers, who often make unusual mistakes that native speakers would not. An NLP program trained only on repositories of native-speaker data will therefore have trouble understanding input from non-native speakers.
The data used by MIT's new project comes from 5,124 sentences taken from essays written by English as a second language (ESL) students who are native speakers of 10 other languages, languages spoken by roughly 40 percent of the world's population. The sentences have all been annotated for grammatical properties ranging from basic parts of speech, such as verbs, nouns, and adjectives, to finer-grained features, including plurality and verb tense.
The researchers also used a recently developed annotation scheme called Universal Dependencies (UD), which offers a deeper analysis of the relationships between words in a sentence, such as which words function as direct or indirect objects, which nouns are modified by adjectives, and so on. This allows the sentences to be annotated not only for structure but also for meaning.
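To make the idea of a UD annotation concrete, the sketch below reads a sentence in CoNLL-U, the standard plain-text interchange format for Universal Dependencies treebanks (one token per line, with tab-separated ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC columns), and pulls out exactly the kinds of relationships described above. The example sentence and its labels are illustrative, not drawn from the MIT dataset itself.

```python
# Hedged sketch: extracting grammatical relations from a Universal
# Dependencies annotation in CoNLL-U format. The sentence "She gave
# him a red book" is a made-up example, not from the ESL corpus.

ROWS = [
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\tgave\tgive\tVERB\tVBD\t_\t0\troot\t_\t_",
    "3\thim\the\tPRON\tPRP\t_\t2\tiobj\t_\t_",
    "4\ta\ta\tDET\tDT\t_\t6\tdet\t_\t_",
    "5\tred\tred\tADJ\tJJ\t_\t6\tamod\t_\t_",
    "6\tbook\tbook\tNOUN\tNN\t_\t2\tobj\t_\t_",
]
CONLLU = "\n".join(ROWS)

def parse_conllu(text):
    """Return one dict per token with the fields used below."""
    tokens = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],       # universal part-of-speech tag
            "head": int(cols[6]),  # index of the syntactic head (0 = root)
            "deprel": cols[7],     # dependency relation label
        })
    return tokens

tokens = parse_conllu(CONLLU)
by_id = {t["id"]: t for t in tokens}

# Direct and indirect objects are read straight off the relation labels.
direct_obj = next(t["form"] for t in tokens if t["deprel"] == "obj")
indirect_obj = next(t["form"] for t in tokens if t["deprel"] == "iobj")

# An adjective modifying a noun carries the "amod" relation to its head.
amods = [(t["form"], by_id[t["head"]]["form"])
         for t in tokens if t["deprel"] == "amod"]

print(direct_obj, indirect_obj, amods)
```

Because every UD treebank, in any language, uses these same relation labels, the same handful of lines can compare ESL sentences against native English or against other languages, which is the cross-corpus comparison Nivre highlights below.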
“What I find most interesting about the ESL [dataset] is that the use of UD opens up a lot of possibilities for systematically comparing the ESL data not only to native English but also to other languages that have corpora annotated using UD,” said Joakim Nivre, an expert on computational linguistics and one of the creators of the Universal Dependencies scheme. “Hopefully, other ESL researchers will follow their example, which will enable further comparisons along several dimensions, ESL to ESL, ESL to native, et cetera.”