UPDATED 09:00 EST / AUGUST 22 2024

AI21 Labs’ updated hybrid SSM-Transformer model Jamba gets longest context window yet

OpenAI rival AI21 Labs Ltd. today lifted the lid off its latest competitors to ChatGPT, unveiling the open-source large language models Jamba 1.5 Mini and Jamba 1.5 Large.

The new models are based on an alternative architecture that enables them to ingest much longer sequences of data, giving them a better grasp of context than traditional LLMs. AI21 says Jamba 1.5 Mini and 1.5 Large stand out as the fastest and most efficient LLMs in their size classes, delivering superior performance to open-source alternatives such as Llama 8B and Llama 70B.

The models build on the success of the original Jamba foundation model, which combines the traditional Transformer architecture with a framework known as “Mamba” that’s based on the older Structured State Space technique for building artificial intelligence. SSM models, as they’re known, draw on older concepts such as recurrent neural networks and convolutional neural networks, and they’re known to be more computationally efficient.

It’s this unique architecture that allows the Jamba models to ingest more data and handle workloads where greater context is helpful, such as generative AI reasoning tasks.

AI21 Labs, which has raised more than $336 million in funding, is the creator of the Jurassic family of LLMs that compete with OpenAI’s GPT models. But rather than try to take on that company directly in a never-ending race to add computational power, the startup realized that it might be better off pursuing an alternative approach.

Its hybrid SSM-Transformer model is designed to address some of the main shortcomings of Transformer LLMs, in particular the way they struggle with large context windows. When faced with a large context, even the best LLMs slow down as they process the information they need to provide a response.

The issue with Transformer models is that the cost of their attention mechanism scales quadratically with sequence length, because each token must attend to every token that preceded it. This drags down throughput, leading to high-latency responses. Transformer models also require a much larger memory footprint as context grows, which means they need vast amounts of computing power to deal with longer context windows.
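
To make that scaling concrete, here’s a minimal NumPy sketch of naive self-attention (an illustration of the general mechanism, not AI21’s code). The score matrix holds one entry per pair of tokens, so doubling the sequence length quadruples both the memory it occupies and the work needed to fill it.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Naive single-head self-attention over a (seq_len, d) input.

    Identity Q/K/V projections keep the sketch short; the point is the
    (seq_len, seq_len) score matrix, which grows quadratically.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # O(n^2) pairwise token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

x = np.random.default_rng(0).normal(size=(1_000, 64))
out = self_attention(x)  # the score matrix alone holds 1,000,000 floats
```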

Context is king

AI21 Labs Vice President of Product Or Dagan told SiliconANGLE that context, which refers to the input data a generative AI model considers before generating its response, is important for AI. He explained that a model that can effectively handle long context is crucial for many enterprise generative AI applications.

“First of all, analyzing long documents, meeting transcripts, internal policies — these have become very popular tasks for AI,” he said. “But in many cases, AI models that don’t really utilize their entire context hallucinate and miss important information.”

By having an AI that properly understands context, it’s possible to improve the responses they generate, Dagan said. “In addition, a long context model substantially improves the quality of RAG and agentic workflows, which are becoming the key part of many AI enterprise solutions,” he said. “Long context models reduce costs in these systems by eliminating the need for continuous chunking and repetitive retrievals. While it’s sometimes claimed that RAG is a substitute for long context, a successful system needs both.”
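
As a rough illustration of the two workflows Dagan describes, here’s a self-contained Python sketch contrasting a chunk-and-retrieve pipeline with simply passing whole documents into a long context window. The llm() stub and the toy keyword-matching retrieval are hypothetical stand-ins, not AI21 APIs.

```python
def llm(prompt: str) -> str:
    """Stub standing in for any chat model call."""
    return f"<model answer for a {len(prompt):,}-char prompt>"

def answer_with_rag(question: str, documents: list[str],
                    chunk_size: int = 500, top_k: int = 3) -> str:
    """Classic RAG: chunk the corpus, retrieve a few snippets, then prompt."""
    chunks = [d[i:i + chunk_size] for d in documents
              for i in range(0, len(d), chunk_size)]
    # Toy retrieval: rank chunks by naive keyword overlap with the question.
    ranked = sorted(chunks,
                    key=lambda c: sum(w in c for w in question.split()),
                    reverse=True)
    return llm("\n\n".join(ranked[:top_k]) + f"\n\nQ: {question}")

def answer_with_long_context(question: str, documents: list[str]) -> str:
    """Long context: pass the documents whole, skipping the chunking step."""
    return llm("\n\n".join(documents) + f"\n\nQ: {question}")
```

In the first function, anything the retriever misses never reaches the model; in the second, the model sees everything, which is only viable when its context window genuinely stretches to the size of the corpus.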

The Mamba architecture, originally developed by researchers at Carnegie Mellon University and Princeton University, operates with a much lower memory footprint and replaces attention with a more efficient selective state-space mechanism, enabling it to handle longer context windows with ease. However, Mamba models cannot match the output quality and breadth of knowledge possessed by Transformer LLMs. That’s why AI21 Labs opted to combine the two architectures, taking advantage of the best of both.

Dagan explained that the main difference between the architectures is that Transformer models always “look” at the entire context, which slows them down, whereas Mamba models maintain a smaller “state” that’s constantly updated as they move through the context.

“This means Mamba doesn’t have the same huge memory and computational footprint of Transformers, so it can easily fit more context on the same hardware and process it faster,” he said. “Second, since Mamba works with this moving state, it can generalize better learnings from shorter contexts to larger ones.”
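
A toy linear state-space scan shows the idea (illustrative only: real Mamba layers use learned, input-dependent parameters and a hardware-aware selective scan). Each token is folded into a fixed-size state, so compute grows linearly with sequence length and the working memory stays constant.

```python
import numpy as np

def ssm_scan(x: np.ndarray, state_dim: int = 16) -> np.ndarray:
    """Toy linear state-space scan over a (seq_len, d) input.

    Unlike attention, each step only updates a fixed-size state, so time
    is O(n) in sequence length and working memory is O(1).
    """
    seq_len, d = x.shape
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(state_dim, state_dim))  # state transition
    B = rng.normal(scale=0.1, size=(state_dim, d))          # input projection
    C = rng.normal(scale=0.1, size=(d, state_dim))          # output readout
    h = np.zeros(state_dim)      # the small "moving state" Dagan describes
    out = np.empty_like(x)
    for t in range(seq_len):
        h = A @ h + B @ x[t]     # fold token t into the fixed-size state
        out[t] = C @ h           # read each output from the state
    return out

x = np.random.default_rng(1).normal(size=(4_000, 64))
out = ssm_scan(x)  # the state h stays 16 floats no matter how long x gets
```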

Advanced reasoning and agentic AI

Jamba 1.5 Large is the culmination of those efforts. According to AI21 Labs, it’s a sophisticated “mixture-of-experts” model with 398 billion total parameters and 94 billion active parameters. It’s designed to handle more complex reasoning tasks, the startup said.
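
For readers unfamiliar with the mixture-of-experts idea, the generic sketch below (not Jamba’s actual routing code) shows how a router activates only a few expert weight matrices per token. That is how a model’s active parameter count, 94 billion here, can be a fraction of its 398 billion total.

```python
import numpy as np

def moe_layer(x: np.ndarray, experts: list[np.ndarray],
              router_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Generic top-k mixture-of-experts routing for one token vector x.

    Only top_k of len(experts) expert matrices touch the token, so the
    active parameters are a fraction of the total parameters.
    """
    logits = router_w @ x                 # one routing score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the top_k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the chosen experts
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
y = moe_layer(rng.normal(size=d), experts, router_w)  # 2 of 16 experts fire
```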

As for Jamba 1.5 Mini, it’s a smaller, more efficient counterpart to Jamba 1.5 Large, built to deliver expanded capabilities and superior output quality. The company said both of the new models were designed with developer-friendliness in mind, and they have been optimized for creating “agentic AI” systems that can perform tasks on behalf of users. To that end, they support features such as function calling and tool use, JSON mode, citation mode and structured document objects.
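
As a hedged illustration of what calling the models looks like, the snippet below follows the conventions of AI21’s published Python SDK at launch; treat the client class, model name and method signature as assumptions to verify against the current documentation.

```python
# Illustrative sketch only. Assumes AI21's Python SDK ("pip install ai21")
# and the launch-era model name "jamba-1.5-mini"; verify both against
# AI21's current docs before relying on them.
from ai21 import AI21Client
from ai21.models.chat import ChatMessage

client = AI21Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.chat.completions.create(
    model="jamba-1.5-mini",
    messages=[
        ChatMessage(
            role="user",
            content="Summarize the key decisions in the transcript below.",
        ),
    ],
)
print(response.choices[0].message.content)
```

Per AI21’s launch materials, the same chat endpoint is where features such as tool use and JSON mode surface, typically through additional request parameters whose exact names are best checked in the SDK reference.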

Jamba 1.5 Mini and Jamba 1.5 Large are both said to feature 256,000-token context windows, more than any other open-source model available today. However, unlike other long-context models, the Jamba models are said to be able to make full use of their declared context windows.

Evidence of that comes from their performance on the new RULER benchmark, which is specifically designed to evaluate long-context models on tasks such as multihop tracing, retrieval, aggregation and question-answering. According to AI21 Labs, the Jamba models excel at these tasks, consistently delivering superior outputs to competing models.

The startup pitted Jamba 1.5 Large against comparable models such as Llama 3.1 70B, Llama 3.1 405B and Mistral Large 2, and it reportedly achieved the lowest latency of the group, proving twice as fast as its rivals at the longest context lengths.

Constellation Research Inc. analyst Holger Mueller said the main advantage of the new Jamba models is that they should reduce the cost of running AI models without compromising overall performance. “This is a key strategy for AI model makers, and AI21 Labs is going about it in a novel way by supporting larger context windows, which deliver better results without increasing the computational load,” he said.

Dagan said LLMs that can utilize extensive context windows represent the future of AI, as they’re better suited for handling complex and data-heavy tasks.

“Our breakthrough architecture allows Jamba to process vast amounts of information with lightning-fast efficiency,” he said. “Jamba’s combination of optimized architecture, unprecedented speed, and the largest available context window make it the optimal foundation model for developers and enterprises building RAG and agentic workflows.”
