UPDATED 21:21 EDT / OCTOBER 28 2024

OSI clarifies what makes AI systems open-source, but most ‘open’ models fall short

The highly respected Open Source Initiative, one of the most prominent stewards of open-source software, has finally published an official definition of what makes an artificial intelligence model open.

The definition was immediately rejected by Meta Platforms Inc., whose popular Llama large language models fail to make the grade.

The OSI unveiled the Open Source AI Definition v1.0 at the All Things Open 2024 conference taking place in Raleigh, North Carolina, this week, saying it followed a years-long process that saw it collaborate with various organizations and academia. It intends for the OSAID to be a standard by which anyone can determine if an AI system is truly open-source or not.

Standards for what makes traditional software “open” have long been agreed on, but AI software is a different beast because it incorporates elements that aren’t covered by traditional licenses, such as the vitally important data used to train it.

That’s why the OSI spent years coming up with a new definition explicitly for such systems, and it has decreed that for AI to be considered truly open-source, it must provide the following three things:

  1. Complete access to details about the data used to train the AI, so others can understand and recreate it.
  2. The complete codebase used to build and run the AI system.
  3. The settings and weights used in training, which enable the AI to generate its outputs.

Much to the chagrin of self-professed “champions” of open-source AI, such as Meta, Stability AI Ltd. and Mistral, the vast majority of their AI models fall short of the OSI’s definition.

For instance, Meta’s Llama models come with restrictions on commercial use that prevent them from being used freely by applications with more than 700 million users. In addition, Meta does not provide access to its training datasets, nor does it provide comprehensive details about that data, so the Llama models are impossible to recreate.

Stability AI, which specializes in image and video generation, has long insisted that its popular Stable Diffusion models are “open.” But it also falls short of the OSI’s definition because of its demand that businesses with more than $1 million in annual revenue purchase an enterprise license to use its models. Mistral also puts restrictions on the use of its newest Ministral 3B and 8B models for certain commercial ventures.

It’s likely that far more AI companies that profess to be open-source will be upset by the OSI’s definition. A recent study by Carnegie Mellon, the AI Now Institute and the Signal Foundation found that the vast majority of “open-source” models are in fact much more secretive than such a claim merits. For instance, very few release the datasets used to train the models, and most require vast amounts of computing power to train and run, which puts them beyond the reach of most developers.

In the case of Llama, Meta says safety concerns prevent it from making the underlying training data available to the community, but few people believe that’s the only reason it withholds the data. It’s almost certainly the case that Meta uses vast amounts of content posted by users of platforms such as Facebook and Instagram, including posts that are restricted to the user’s contacts only.

In addition, Llama is likely trained on a hefty amount of copyrighted material that has been posted on the web, and Meta doesn’t want to publicize the details. In April, the New York Times said Meta had acknowledged internally that Llama’s training dataset includes copyrighted content, because there’s no feasible way to avoid collecting such material. Still, the company needs to keep silent, for it’s currently embroiled in a litany of lawsuits brought by publishers, authors and other content creators.

Constellation Research Inc. Vice President and Principal Analyst Andy Thurai told SiliconANGLE that he’s not surprised that Meta’s Llama models fall short of meeting the OSI’s definition of open-source AI, for he believes that the company’s AI research efforts are only intended to benefit itself.

“When Meta first dropped the Llama models, it claimed they are the best ‘free and open source models’ on the market, but its goal has never been to help other organizations, but rather to accumulate a massive user base as fast as possible that will test and provide feedback on its research,” Thurai said. “This ticked off the OSI at the time, and it wrote a lengthy blog post accusing Meta of being anything but open-source AI.”

According to Thurai, Meta simply wants the community to help validate its models so it can improve them using the tons of data it has from Facebook and Instagram. But no one else will ever be able to access that data, he said.

“I don’t think Meta will ever release a fully open-source model that meets the OSI’s definition, as that would enable its rivals to use its models for free,” the analyst continued. “All it wants is the attention, the wider usage and the free testing and validation.”

Rather than challenging the OSI and other critics directly, Meta appears content to “agree to disagree” on what constitutes open-source AI. A spokesperson for the company said that though Meta agrees with the OSI on many things, it doesn’t concur with today’s pronouncement.

“There is no single open source AI definition, and defining it is a challenge because previous open source definitions do not encompass the complexities of today’s rapidly advancing AI models,” the spokesperson said.

The problem for Meta is that most people are likely to accept the OSI’s definition, because it’s based on fairly straightforward logic, Rob Enderle, an analyst with the Enderle Group, told SiliconANGLE.

“The OSI is correct in its assessment because without transparency on training data you really don’t have an open platform,” Enderle said. “Training data isn’t a trivial thing, as it defines how the AI functions. Without access to it, the AI system cannot be open, because the very nature of how it works is closed.”

Most experts who don’t have a stake in the big technology companies pursuing AI are likely to agree with the OSI’s definition. The organization’s definition of open-source software is widely regarded as the bottom line on which software is free to use without fear of lawsuits and licensing traps. Moreover, it spent more than two years working closely with various academics, AI developers and researchers to refine its definition of open-source AI.

In addition, the OSI’s definition closely resembles an earlier attempt to clarify what makes AI open. Earlier this year, the Linux Foundation published its own definition, listing many of the same requirements.

Image: SiliconANGLE/Microsoft Designer
