UPDATED 16:13 EDT / JULY 29 2016

Computers may understand you better thanks to new MIT database

Computers are already getting pretty good at deciphering human speech thanks to advancements in natural language processing (NLP), but so far most of these programs have been trained to understand native speakers talking in their own languages. Researchers at the Massachusetts Institute of Technology (MIT) want to change that, and today they announced that they have just completed the first major database of non-native English.

“English is the most used language on the Internet, with over 1 billion speakers,” said Yevgeni Berzak, an MIT graduate student who headed up the project. “Most of the people who speak English in the world or produce English text are non-native speakers. This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English.”

People make grammatical mistakes all the time, especially in speech, but good NLP programs can navigate those mistakes to understand what the user means rather than what they literally say. This process is more difficult with non-native speakers, who often make mistakes that native speakers would not. An NLP program trained only on repositories of native-speaker data will therefore have trouble understanding input from non-native speakers.

The data in MIT’s new project consists of 5,124 sentences taken from essays written by English-as-a-second-language (ESL) students whose native languages span 10 languages spoken by roughly 40 percent of the world’s population. The sentences have all been annotated for grammatical properties ranging from basic categories, such as verbs and nouns, to more complicated features, including plurality, verb tense, adjectives, and more.

In addition, the researchers used a recently developed annotation scheme called Universal Dependencies (UD), which offers a deeper analysis of the relationships between words in a sentence, such as which words function as direct or indirect objects, which nouns are modified by adjectives, and so on. This allows the sentences to be annotated not only for structure but also for meaning.
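To give a sense of what that annotation looks like, here is a minimal sketch in Python. The sentence and the format are illustrative, not taken from the MIT dataset: UD corpora are distributed as CoNLL-U text files with ten columns per token, and this sketch keeps only five of them (token index, word, part of speech, head index, and dependency relation) to show how the head/relation columns encode facts like "which word is the direct object of the verb."

```python
# A simplified CoNLL-U-style annotation of an example sentence
# (illustrative only; real CoNLL-U files have ten tab-separated columns).
# Columns here: index, word form, part of speech, head index, relation.
# Head index 0 marks the root of the sentence.
CONLLU = """\
1\tShe\tPRON\t2\tnsubj
2\tgave\tVERB\t0\troot
3\tme\tPRON\t2\tiobj
4\tthe\tDET\t5\tdet
5\tbook\tNOUN\t2\tobj
"""

def parse(conllu: str):
    """Parse the simplified CoNLL-U lines into a list of token dicts."""
    tokens = []
    for line in conllu.strip().splitlines():
        idx, form, pos, head, rel = line.split("\t")
        tokens.append({"id": int(idx), "form": form, "pos": pos,
                       "head": int(head), "rel": rel})
    return tokens

tokens = parse(CONLLU)
words = {t["id"]: t["form"] for t in tokens}
words[0] = "ROOT"  # pseudo-token that the main verb attaches to
for t in tokens:
    # e.g. "book" is the direct object (obj) of "gave",
    # while "me" is its indirect object (iobj).
    print(f'{t["form"]:5s} --{t["rel"]}--> {words[t["head"]]}')
```

Because every word points at its head with a labeled relation, comparing an ESL sentence to a native-English one (or to a UD-annotated sentence in another language) reduces to comparing these trees, which is the kind of systematic comparison Nivre describes below.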

“What I find most interesting about the ESL [dataset] is that the use of UD opens up a lot of possibilities for systematically comparing the ESL data not only to native English but also to other languages that have corpora annotated using UD,” said Joakim Nivre, an expert on computational linguistics and one of the creators of the Universal Dependency system. “Hopefully, other ESL researchers will follow their example, which will enable further comparisons along several dimensions, ESL to ESL, ESL to native, et cetera.”

Image credit: Kārlis Dambrāns
