UPDATED 13:30 EST / FEBRUARY 13 2024


Cohere for AI unveils Aya, a multilingual open-source AI with 101 languages

Cohere for AI, the nonprofit research lab run by the artificial intelligence startup Cohere Inc., today introduced a “massively multilingual” open-source artificial intelligence large language model called Aya that can operate in 101 different languages.

According to Cohere, Aya's coverage of more than 100 languages is more than double the number supported by existing open-source models.

“Aya helps researchers unlock the powerful potential of LLMs for dozens of languages and cultures largely ignored by most advanced models on the market today,” the AI team said in its announcement.

Alongside Aya, Cohere is also releasing the largest multilingual instruction dataset to date, with 513 million data points covering 114 languages, for researchers to use in their own models. The dataset includes underserved languages and scarce annotations contributed by speakers of rare languages around the world, giving AI technology a head start in serving broader audiences.

The Aya model comes from the Aya Project of the same name, a colossal endeavor launched in January 2023 with more than 3,000 researchers across 119 countries. Its aim was to create a multilingual generative AI model that draws on the contributions of people from around the world. Although many models focus on English, only about 5% of the world speaks English at home, which leaves many other languages underserved in the AI technology space.

“As LLMs, and AI generally, have changed the global technological landscape, many communities across the world have been left unsupported due to the language limitations of existing models,” said the Cohere for AI team. “This gap hinders the applicability and usefulness of generative AI for a global audience, and it has the potential to further widen existing disparities that already exist from previous waves of technological development.”

To help, the dataset being released contains 204,000 rare human-curated annotations by fluent speakers in 67 languages across a diverse set of linguistic applications. Annotations add context to data, such as category labels, which helps AI models learn to understand language and improves the accuracy of their comprehension. The result is an extremely high-quality dataset that developers and researchers can use to build robust AI language models, with potential applications in linguistic research and language preservation.

According to language research center Ethnologue, there are more than 7,000 languages spoken in the world right now. Only 23 of those languages, including English, account for more than half the world’s population, and about 40% of all languages are endangered, many with fewer than 1,000 speakers.

Projects such as Aya, which is building more languages into a massively multilingual dataset, can help set a path for research and development. That will help AI reach more populations, improving inclusion and accessibility, and open the technology up for academic use.

The dataset also expands coverage to more than 50 previously underrepresented languages, such as Somali and Uzbek, that are not commonly found in proprietary models. Although commercial and open-source models cover popular languages such as English, French and Russian well, the researchers behind Aya worked to add many underserved languages to the dataset.

The researchers said the model benchmarked well against other massively multilingual models, surpassing other open-source models including mT0 and BigScience’s Bloomz. Aya consistently scored 75% in human evaluations against other “leading open-source models,” the team said, and 80% to 90% in simulated win rates.
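For readers unfamiliar with the term, a win rate is commonly computed as the share of head-to-head comparisons in which a judge prefers one model's answer over another's. The sketch below illustrates that arithmetic only. The function name `win_rate`, the label strings and the tie-handling option are assumptions made for illustration, not Cohere's evaluation code, and the sample labels are made up.

```python
from collections import Counter


def win_rate(judgments, count_ties_as_half=False):
    """Return the fraction of pairwise comparisons won by the model under test.

    `judgments` is an iterable of "win", "loss" or "tie" labels, one per prompt.
    How ties are treated is an assumption; evaluations differ on this point.
    """
    counts = Counter(judgments)
    total = counts["win"] + counts["loss"] + counts["tie"]
    if total == 0:
        raise ValueError("no judgments supplied")
    wins = counts["win"]
    if count_ties_as_half:
        # Optionally credit each tie as half a win.
        wins += 0.5 * counts["tie"]
    return wins / total


# Made-up labels for illustration only; these are not Aya's evaluation data.
sample = ["win", "win", "loss", "win", "tie", "win", "loss", "win"]
print(f"win rate: {win_rate(sample):.1%}")
```

In a simulated evaluation the labels would typically come from another model acting as the judge rather than from human raters, but the calculation is the same.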

Image: Pixabay
