Researchers develop new technique for squeezing full-fat AI models into PCs and smartphones
Artificial intelligence researchers from Yandex LLC and Neural Magic Inc. said today they have made significant progress in their efforts to compress powerful large language models such as Meta Platforms Inc.’s Llama 2, so they can be deployed on everyday devices such as smartphones and smart speakers.
The researchers, in collaboration with academics from the Institute of Science and Technology Austria and King Abdullah University of Science and Technology, say they have created not one but two separate compression methods for LLMs. When used in tandem, they enable LLMs to be reduced in size by up to eight times while preserving, on average, 95% of response quality.
The new techniques – Additive Quantization for Language Models, or AQLM, and PV-Tuning – have both been made open source. They’re described in an academic paper posted on arXiv.org, and the code can be downloaded by anyone from GitHub.
AQLM leverages a technique known as “additive quantization,” which has traditionally been used for information retrieval tasks, to reduce the bit count per model parameter to just two or three bits while preserving model accuracy. Meanwhile, PV-Tuning is a representation-agnostic framework that generalizes and improves on existing fine-tuning strategies for AI models. It also addresses errors that can arise during the model compression process.
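As a rough illustration of the idea behind additive quantization – and not the researchers’ actual AQLM implementation – the toy sketch below approximates each small group of weights as the sum of a few codebook vectors, so that only short indices need to be stored. With two codebooks of 256 entries, eight weights are encoded with 16 bits, or about two bits per weight:

```python
import numpy as np

# Toy sketch of additive quantization (illustrative only, not AQLM itself).
# Each group of d consecutive weights is approximated by the SUM of M
# codebook vectors; only the M indices are stored per group.
rng = np.random.default_rng(0)
d, M, K = 8, 2, 256                          # group size, codebooks, entries each
weights = rng.standard_normal((1024, d))     # toy weight groups
codebooks = rng.standard_normal((M, K, d))   # toy (untrained) codebooks

def encode(group):
    """Greedily pick one entry per codebook that best explains the residual."""
    residual, codes = group.copy(), []
    for m in range(M):
        errs = ((residual - codebooks[m]) ** 2).sum(axis=1)
        idx = int(errs.argmin())
        codes.append(idx)
        residual -= codebooks[m][idx]
    return codes

def decode(codes):
    return sum(codebooks[m][idx] for m, idx in enumerate(codes))

# 2 indices x 8 bits = 16 bits for 8 weights, i.e. ~2 bits per weight.
codes = [encode(g) for g in weights]
recon = np.stack([decode(c) for c in codes])
print("mean squared reconstruction error:", float(((weights - recon) ** 2).mean()))
```

In the real method, the codebooks are learned from the model’s own weights rather than drawn at random, which is what keeps the reconstruction error – and therefore the loss in answer quality – small.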
Although the two techniques are powerful in their own right, what’s especially novel is that they’re designed to be combined with one another. Doing so, the researchers found they could create “ultra-compact” LLMs that are almost as capable as their full-sized counterparts.
The researchers said their work was motivated by a desire to find a superior way to deploy LLMs on consumer hardware. To date, this has been a significant challenge given the inherent tradeoff between model size and computational efficiency.
Andy Thurai, vice president and principal analyst at Constellation Research Inc., told SiliconANGLE that although the largest LLMs are impressive feats of engineering, they’re often impractical due to their sheer size. “Their size makes them computationally expensive and slow to respond, hindering real-time applications,” he said. “This is why the concept of ‘right-sized’ models is becoming very popular.”
Some AI companies have tried to right-size their AI models themselves, but the challenge is getting the right balance between performance and size. For instance, Google LLC’s Gemini family of LLMs includes a lightweight version known as Gemini Nano for deployment on smartphones, but it struggles to match the performance of the full-fat Gemini Ultra LLM, the Yandex researchers said.
By applying the AQLM and PV-Tuning techniques, such tradeoffs are no longer necessary, the researchers claim. In their paper, they demonstrate the effectiveness of the techniques in a rigorous assessment of popular open-source LLMs, including Llama 2, Mistral and Mixtral. The three models were compressed and then evaluated on the English-language text-generation benchmarks WikiText2 and C4, and they maintained an impressive 95% answer quality despite being reduced to one-eighth of their original size.
As an additional benefit, the researchers said, the compressed versions of those open-source LLMs can operate up to four times faster, since they require fewer computations. So they can output a response much more quickly than the full-size models, with almost the same level of accuracy.
According to the researchers, companies looking to develop and deploy proprietary and open-source LLMs can use their techniques to benefit from significant resource savings. As an example, they said the Llama 2 model with 13 billion parameters can be compressed to run on just a single graphics processing unit, as opposed to four GPUs for the full-sized, uncompressed version.
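A back-of-envelope calculation shows why a single GPU becomes sufficient; the figures below are illustrative only, since real deployments also need memory for codebooks, activations and the KV cache:

```python
# Rough memory estimate for a 13-billion-parameter model (illustrative only).
params = 13e9
fp16_gb = params * 16 / 8 / 1e9   # ~26 GB at 16 bits per weight
two_bit_gb = params * 2 / 8 / 1e9 # ~3.3 GB at roughly 2 bits per weight
print(f"fp16: ~{fp16_gb:.0f} GB, ~2-bit: ~{two_bit_gb:.1f} GB")
```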
That translates to a two- to six-times reduction in hardware costs, the researchers say. More important, it paves the way for the largest and most powerful LLMs to be deployed on consumer devices such as personal computers and smartphones.
Thurai said the researchers make some impressive claims, but stressed the need for their techniques to be implemented in much larger models, such as OpenAI’s 175 billion-parameter GPT-3.5 Turbo model. “They need to demonstrate how their quantized models can perform against these much larger models if they are to be truly successful,” the analyst said. “If they are comparable in the output model quality index, this would be a great achievement.”
The ability to deploy full-sized LLMs on smaller devices would open the door to new applications. For instance, a smartphone running the compressed Llama 2 with 13 billion parameters would be able to perform text and image generation, voice assistance, personalized recommendations and real-time translation without being connected to the internet.
“Ultimately, if this all works out, the largest LLMs can eventually run on CPUs instead of very expensive and limited in supply GPUs,” Thurai said.
The researchers said their paper is being featured at the 41st International Conference on Machine Learning in Vienna, Austria, which runs July 21-27.
AQLM and PV-Tuning are both available to download on GitHub, while a number of already-compressed versions of popular open-source models can be accessed from Hugging Face.
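For readers who want to try one of those pre-quantized checkpoints, a minimal sketch using the Hugging Face Transformers library is shown below. The repository id is an example and may differ from the published checkpoints, and the sketch assumes the `aqlm` package is installed alongside `transformers` and `accelerate`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo id only; check the published AQLM checkpoints for exact names.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the (much smaller) quantized weights on the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Compressed language models can run on", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```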
Image: SiliconANGLE/Microsoft Designer