UPDATED 19:47 EST / OCTOBER 20, 2024

AI

Meta’s Spirit LM generates more expressive voices that reflect anger, surprise, happiness and other emotions

Meta Platforms Inc.’s Fundamental AI Research team is going head-to-head with OpenAI yet again, unveiling a new open-source multimodal large language model called Spirit LM that can handle both text and speech as inputs and outputs.

These are the same capabilities that distinguish OpenAI’s most powerful LLM, GPT-4o, as well as other multimodal models such as Hume AI Inc.’s EVI 2. Meta’s artificial intelligence research team announced Spirit LM late Friday, saying it’s designed to address some of the challenges around existing AI voice systems, which often sound somewhat robotic and emotionless.

The problem with traditional AI voice systems is that they can't replicate the expressive qualities of human speech, such as tone and emotion. That's because they chain three separate stages together: an automatic speech recognition system transcribes the spoken input into text, a language model generates a text response, and a text-to-speech model converts that response back into audio. The expressive cues in the original voice are discarded at the first step and never recovered.
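As a rough illustration of that cascade, here is a minimal Python sketch. The three stage functions are hypothetical placeholders rather than any real API, standing in for whatever off-the-shelf ASR, LLM and TTS components a traditional voice assistant might chain together.

```python
# Minimal sketch of the traditional cascaded voice pipeline described above.
# All three stage functions are hypothetical placeholders, not a real API:
# any off-the-shelf ASR, LLM and TTS system could fill these roles.

def transcribe(audio: bytes) -> str:
    """ASR stage: speech in, plain text out. Tone, pitch and emotion
    carried by the audio are discarded at this step."""
    return "what time is it"  # placeholder transcription

def generate_reply(text: str) -> str:
    """LLM stage: operates on text alone, so it never sees how the
    speaker actually sounded."""
    return "It's seven o'clock."  # placeholder response

def synthesize(text: str) -> bytes:
    """TTS stage: reads the reply in a fixed synthetic voice, which is
    why cascaded assistants tend to sound flat and robotic."""
    return b"\x00" * 16000  # placeholder audio buffer

def cascaded_assistant(audio_in: bytes) -> bytes:
    text = transcribe(audio_in)   # expressive cues are lost here
    reply = generate_reply(text)
    return synthesize(reply)      # and can't be recovered here
```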

Meta Spirit LM takes an entirely different approach, representing speech with dedicated tokens for phonetics, pitch and tone so those expressive qualities carry through to its speech outputs. At the same time, it's capable of learning new tasks across a range of modalities, including automatic speech recognition, text-to-speech and speech classification.

In practice, that means it can learn and improve at converting spoken language into text, generating spoken language from text, and identifying and categorizing speech based on its content or emotional tone.
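To make the token-based design more concrete, here is a toy illustration of the kind of interleaved speech-and-text sequence Spirit LM's design implies. The marker and token names below are invented for this example; they are not Meta's actual vocabulary.

```python
# Toy example of a single interleaved training sequence. Token names and
# IDs are invented for illustration; this is not Spirit LM's real vocabulary.
sequence = [
    "[TEXT]", "the", "weather", "is",   # ordinary text tokens
    "[SPEECH]",                         # marker switching modality
    "Hu34", "Hu61",                     # phonetic (speech-unit) tokens
    "Pi12",                             # pitch token (Expressive model only)
    "St3",                              # style/tone token (Expressive model only)
    "Hu27",
]

# Because a single model predicts the next token regardless of modality, it
# can continue a text prompt in speech or a speech prompt in text, and the
# pitch and style tokens let expressive detail survive end to end.
print(" ".join(sequence))
```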

Two flavors available

Meta said it’s making two versions of Meta Spirit LM available to the research community under its FAIR Noncommercial Research License, which allows anyone to use, reproduce, modify and create derivative works for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.

The models include Spirit LM Base, which uses phonetic tokens to process and generate speech, and Spirit LM Expressive, which is a more advanced version that includes tokens for pitch and tone. These allow it to understand and reproduce more nuanced emotions in voices, such as excitement and sadness, and reflect them in its own speech.

The models were trained on a wide range of data, including both text and speech datasets, allowing them to handle cross-modal tasks such as text-to-speech and speech-to-text with natural, humanlike expressiveness in their outputs, Meta's researchers said.

According to the researchers, the Spirit LM Expressive model can also detect and reproduce emotional states such as anger, surprise and happiness in its speech outputs. They believe this will have huge implications for AI assistants such as customer service bots, where the ability to engage in more nuanced conversations can help to improve customer satisfaction.

Along with the two models, Meta is releasing all of the model weights, code and supporting documentation to the research community, encouraging researchers to build on and experiment with them. The hope is that this will inspire others to explore new ways of integrating speech and text in multimodal AI systems.

In addition to Meta Spirit LM, Meta’s research team also announced an update to the Segment Anything model for image and video segmentation tasks that was revealed last year. It’s designed to power applications such as medical imaging and meteorology.

The company also published its latest research on boosting the efficiency of LLMs, as part of its broader goal to create advanced machine intelligence, or AMI.

Image: SiliconANGLE/Microsoft Designer
