UPDATED 19:47 EDT / OCTOBER 20 2024

Meta’s Spirit LM generates more expressive voices that reflect anger, surprise, happiness and other emotions

Meta Platforms Inc.’s Fundamental AI Research team is going head-to-head with OpenAI yet again, unveiling a new open-source multimodal large language model called Spirit LM that can handle both text and speech as inputs and outputs.

These are the same capabilities that distinguish OpenAI’s most powerful LLM, GPT-4o, as well as other multimodal models such as Hume AI Inc.’s EVI 2. Meta’s artificial intelligence research team announced Spirit LM late Friday, saying it’s designed to address some of the challenges around existing AI voice systems, which often sound somewhat robotic and emotionless.

The problem with traditional AI voice systems is that they're unable to replicate the expressive qualities of human voices, such as tone and emotion. That's because they rely on automatic speech recognition to transcribe spoken inputs before processing them with a language model, then convert the resulting text back into audio using text-to-speech models.
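To make that limitation concrete, here is a minimal sketch of such a cascaded pipeline. The three stages are hypothetical stubs rather than any real vendor's API; the point is that everything after the first stage works from a bare transcript.

```python
# Illustrative sketch of a conventional cascaded voice pipeline.
# All three stages are hypothetical stubs, not a real API.

def asr_transcribe(audio: bytes) -> str:
    # Stub: a real ASR model returns plain text only. Pitch, tone
    # and emotion carried in `audio` are discarded at this step.
    return "turn off the lights"

def llm_generate(prompt: str) -> str:
    # Stub: the language model only ever sees the flat transcript.
    return "Okay, turning off the lights."

def tts_synthesize(text: str) -> bytes:
    # Stub: synthesis starts from bare text, so it can only apply
    # a default, neutral prosody.
    return text.encode("utf-8")

def cascaded_assistant(audio_in: bytes) -> bytes:
    transcript = asr_transcribe(audio_in)  # speech -> text (expressive cues lost)
    reply = llm_generate(transcript)       # text -> text
    return tts_synthesize(reply)           # text -> speech (neutral delivery)

print(cascaded_assistant(b"<user audio>"))
```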

Meta Spirit LM has an entirely different design, featuring dedicated tokens for phonetics, pitch and tone that add those expressive qualities to its speech outputs. At the same time, it's capable of learning new tasks across a range of modalities, including automatic speech recognition, text-to-speech and speech classification.

In practice, that means it can learn and improve how it converts spoken language into text, generates spoken language from text, and identifies and categorizes speech based on its content or emotional tone.
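By contrast, an interleaved design puts text and speech into a single token stream that one model both reads and writes. The sketch below uses hypothetical token markers loosely modeled on Meta's description of phonetic tokens; it is not Spirit LM's actual tokenizer or vocabulary.

```python
# Hypothetical interleaved token streams (illustrative markers only,
# not Spirit LM's real vocabulary). One model consumes and emits text
# tokens and speech-unit tokens in a single sequence, so recognition,
# synthesis and classification all become next-token prediction.

# Text-to-speech as a prompt: text tokens in, speech units out.
tts_stream = ["[TEXT]", "hello", "world",
              "[SPEECH]", "[Hu42]", "[Hu7]", "[Hu133]"]

# Speech recognition is the same model run in the other direction.
asr_stream = ["[SPEECH]", "[Hu42]", "[Hu7]", "[Hu133]",
              "[TEXT]", "hello", "world"]
```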

Two flavors available

Meta said it’s making two versions of Meta Spirit LM available to the research community under its FAIR Noncommercial Research License, which allows anyone to use, reproduce, modify and create derivative works for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.

The models include Spirit LM Base, which uses phonetic tokens to process and generate speech, and Spirit LM Expressive, which is a more advanced version that includes tokens for pitch and tone. These allow it to understand and reproduce more nuanced emotions in voices, such as excitement and sadness, and reflect them in its own speech.
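The difference between the two variants can be pictured as a difference in token vocabulary. The markers below are again hypothetical illustrations, not the released models' real token sets: Base carries only phonetic units, while Expressive interleaves pitch and style tokens among them.

```python
# Hypothetical token vocabularies for the two variants (illustrative,
# not the released models' actual token sets).

# Spirit LM Base: phonetic speech units only.
base_stream = ["[SPEECH]", "[Hu42]", "[Hu7]", "[Hu133]"]

# Spirit LM Expressive: pitch and style tokens interleaved with the
# phonetic units, capturing how something is said, not just what.
expressive_stream = [
    "[SPEECH]",
    "[St3]",             # style token, e.g. an excited register
    "[Pi12]", "[Hu42]",  # pitch token attached to a phonetic unit
    "[Hu7]",
    "[Pi5]", "[Hu133]",
]
```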

The models were trained on a wide range of data, including both text and speech datasets, allowing them to handle cross-modal tasks such as text-to-speech and speech-to-text with natural, humanlike expressiveness in their outputs, Meta's researchers said.

According to the researchers, the Spirit LM Expressive model can also detect and reproduce emotional states such as anger, surprise and happiness in its speech outputs. They believe this will have huge implications for AI assistants such as customer service bots, where the ability to engage in more nuanced conversations can help to improve customer satisfaction.

Along with the two models, Meta is making all of the model weights, code and supporting documentation available to the research community, encouraging researchers to build on and experiment with them further. The hope is that this will inspire other researchers to explore new ways of integrating speech and text in multimodal AI systems.

In addition to Meta Spirit LM, Meta’s research team also announced an update to the Segment Anything model for image and video segmentation tasks that was revealed last year. It’s designed to power applications such as medical imaging and meteorology.

The company also published its latest research on boosting the efficiency of LLMs, as part of its broader goal to create advanced machine intelligence, or AMI.

