UPDATED 18:16 EST / APRIL 02 2024

OctoAI debuts OctoStack platform for powering AI inference environments

OctoAI Inc. today introduced OctoStack, a software platform that enables companies to host artificial intelligence models on their in-house infrastructure.

Many large language models are delivered through a cloud-based application programming interface. Such models are hosted on their respective developers’ infrastructure, which means customers must send their data to an external provider for processing. Hosting a neural network on in-house hardware removes that requirement, which can simplify cybersecurity and regulatory compliance for enterprises.

OctoAI says that its new OctoStack platform makes it easier to host AI models on a company’s internal infrastructure. The platform can run on on-premises hardware, the major public clouds and AI-optimized infrastructure-as-a-service platforms such as CoreWeave. OctoStack also supports accelerators from Nvidia Corp. and Advanced Micro Devices Inc., as well as the Inferentia chips available in Amazon Web Services Inc.’s cloud.

The platform is partly based on an open-source technology called Apache TVM that was developed by OctoAI’s founders. It’s a compiler framework that eases the task of optimizing AI models to run on multiple chips.

After creating the initial version of a neural network, developers can optimize it in various ways to boost performance. One technique, operator fusion, compresses several of the calculations an AI performs into fewer, more hardware-efficient computations. Another technique, quantization, lowers the numerical precision of a model’s weights and activations, reducing the amount of data the network must crunch while largely preserving the accuracy of its results.
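To make the two techniques concrete, here is a minimal numpy sketch; it is a toy illustration with invented function names and shapes, not OctoStack’s implementation.

```python
import numpy as np

# --- Operator fusion: dense -> bias-add -> ReLU collapsed into one pass ---
def fused_dense_bias_relu(x, w, b):
    # One traversal of the data instead of three separate kernel launches.
    return np.maximum(x @ w + b, 0.0)

# --- Quantization: store float32 weights as int8 plus a scale factor ---
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0  # map the weight range onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale  # approximate reconstruction

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```

The int8 copy of the weights occupies a quarter of the memory of the float32 original, which is where the bandwidth and compute savings come from.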

Such optimizations are not always portable across different hardware types. As a result, an AI model optimized for one graphics card doesn’t necessarily run as efficiently on a processor from a different chipmaker. TVM, the open-source technology OctoStack incorporates, can automate the process of optimizing a neural network for different chips.
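The open-source TVM workflow looks roughly like the sketch below. The model file, input name and shape are placeholders, and the exact pipeline OctoStack runs internally isn’t public; the point is that one imported model can be retargeted to different chips by changing a target string.

```python
# Sketch of the open-source Apache TVM flow. Requires the tvm and onnx
# packages, and a CUDA-enabled TVM build for the "cuda" target.
import onnx
import tvm
from tvm import relay

# Load a model exported to ONNX (path, input name and shape are illustrative).
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, {"input": (1, 3, 224, 224)})

# The same module can be compiled for different chips by swapping the target;
# TVM applies hardware-appropriate optimizations, including operator fusion,
# on each pass. opt_level=3 enables the aggressive optimization passes.
for target in ["llvm", "cuda"]:  # CPU via LLVM, Nvidia GPU via CUDA
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    lib.export_library(f"model_{target}.so")
```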

OctoAI says its platform can help customers run their AI infrastructure more efficiently. According to the company, an OctoStack-powered inference environment provides four times higher graphics card utilization than an AI cluster built from scratch. The company is also promising a 50% reduction in operating costs.

“Enabling customers to build viable and future-proof Generative AI applications requires more than just affordable cloud inference,” said OctoAI co-founder and Chief Executive Officer Luis Ceze. “Hardware portability, model onboarding, fine-tuning, optimization, load balancing — these are full-stack problems that require full-stack solutions.”

OctoStack supports popular open-source LLMs such as Meta Platforms Inc.’s Llama and the Mixtral mixture-of-experts model developed by startup Mistral AI. Companies can also run internally developed neural networks. According to OctoAI, OctoStack makes it possible to update the AI models in an inference environment over time without making major changes to the applications they support.
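OctoAI hasn’t published OctoStack’s serving interface in this announcement, but the pattern it describes, swapping models without changing application code, typically comes from a stable, model-agnostic endpoint. A hypothetical sketch, in which the URL, token and model names are invented placeholders:

```python
# Hypothetical example: assumes the inference environment exposes an
# OpenAI-style chat-completions endpoint. The URL, token and model names
# below are placeholders, not documented OctoStack values.
import requests

response = requests.post(
    "https://inference.example.internal/v1/chat/completions",
    headers={"Authorization": "Bearer <token>"},
    json={
        # Swapping this string (e.g., to a newer Llama or Mixtral build)
        # changes the model behind the endpoint without touching app code.
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Summarize our Q1 results."}],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```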

Image: OctoAI
