UPDATED 11:20 EDT / MARCH 24 2026

Ai2 releases open-source visual AI agent that can take control of web browsers

The Allen Institute for AI, a prominent Seattle-based nonprofit research organization working on advancing artificial intelligence models and systems, today launched a new open-source AI agent that can take control of web browsers on a user’s behalf and automate tasks.

Web agents are the next step for so-called vision-language models, which extend large language models to work with images as well as text. Where earlier models captioned images and answered questions about them, web agents go further and take actions.

The new agent, MolmoWeb, is built on the Molmo 2 multimodal model family and comes in two sizes: 4 billion and 8 billion parameters. It will be available for free, along with the weights, training data and code (coming soon), as well as the evaluation tools used to build it. It’s designed to be self-hosted locally or in the cloud.

To take actions, an AI agent must interpret both human instructions and what it can see on screen: a task written in conversational language and a live web page. The model observes the page through a series of screenshots and then interacts with it directly through the interface, predicting the next action to take, such as clicking, typing characters into text fields, or scrolling up and down.
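The observe-then-act cycle described above can be sketched as a simple loop. This is a minimal illustration, not Ai2's implementation: `FakeBrowser`, `run_agent`, `scripted_policy` and the string format of the actions are all hypothetical names invented for this example, and the policy is a fixed script standing in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    screenshot: bytes  # what the agent saw at this step
    action: str        # what it decided to do


class FakeBrowser:
    """Hypothetical stand-in for a real browser, for illustration only."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        # A real browser would return image bytes of the current viewport.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: str) -> None:
        # A real browser would click, type or scroll here.
        self.log.append(action)


Policy = Callable[[str, bytes, List[Step]], str]


def run_agent(task: str, browser: FakeBrowser, policy: Policy,
              max_steps: int = 10) -> List[Step]:
    """Observe a screenshot, pick an action, execute it, repeat."""
    history: List[Step] = []
    for _ in range(max_steps):
        shot = browser.screenshot()
        action = policy(task, shot, history)
        history.append(Step(shot, action))
        if action.startswith("send_message"):
            break  # the agent reports back to the user and stops
        browser.execute(action)
    return history


def scripted_policy(task: str, shot: bytes, history: List[Step]) -> str:
    """Assumed placeholder for the model: replays a fixed script."""
    script = [
        "click(412, 96)",
        "type('noise cancelling headphones')",
        "send_message('done')",
    ]
    return script[min(len(history), len(script) - 1)]


browser = FakeBrowser()
history = run_agent("find headphones", browser, scripted_policy)
for step in history:
    print(step.screenshot.decode(), "->", step.action)
```

Because every step pairs a screenshot with the action chosen from it, the history doubles as a trace a person can replay to see why the agent did what it did.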

Ai2 said that, unlike other open-weight web agents, MolmoWeb was trained without distilling a proprietary vision-based agent. The training data comes from synthetic runs generated by text-only accessibility agents and from recordings of humans carrying out actual web browsing tasks.

The agent interface supports navigating URLs, clicking on screen coordinates, typing text into fields, scrolling through pages, opening and switching browser tabs and sending a message back to the user.

All of these actions work directly within the browser, with click locations represented as coordinates in pixels when executed.
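The action set listed above can be pictured as a small group of typed records. The sketch below is an illustration under stated assumptions, not MolmoWeb's actual schema: the class names and `to_pixels` are hypothetical, and it assumes the model emits click positions as fractions of the viewport that are converted to pixels only at execution time. The article itself says only that clicks are pixel coordinates when executed.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Goto:
    url: str


@dataclass(frozen=True)
class Click:
    x: float  # assumed: fraction of viewport width, 0.0 to 1.0
    y: float  # assumed: fraction of viewport height, 0.0 to 1.0


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int  # positive scrolls down, negative scrolls up


@dataclass(frozen=True)
class NewTab:
    url: str


@dataclass(frozen=True)
class SwitchTab:
    index: int


@dataclass(frozen=True)
class SendMessage:
    text: str


Action = Union[Goto, Click, TypeText, Scroll, NewTab, SwitchTab, SendMessage]


def to_pixels(click: Click, viewport: Tuple[int, int]) -> Tuple[int, int]:
    """Convert a normalized click into pixel coordinates for a viewport."""
    width, height = viewport
    if not (0.0 <= click.x <= 1.0 and 0.0 <= click.y <= 1.0):
        raise ValueError("click coordinates must lie in [0, 1]")
    # Clamp so that 1.0 maps to the last valid pixel, not one past it.
    px = min(int(click.x * width), width - 1)
    py = min(int(click.y * height), height - 1)
    return px, py


print(to_pixels(Click(0.5, 0.25), (1280, 720)))  # (640, 180)
```

Keeping the action independent of any page markup is what lets the same click survive changes to the underlying HTML: only the viewport size matters at execution time.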

Ai2 said the agent was designed this way so that it won’t break if the underlying webpage code or HTML changes on the fly. For example, some web pages obfuscate, or hide, how they operate under the hood in order to protect themselves. Some of them use specialized JavaScript engines in order to detect bots, stop ad blockers, display animations, track users and more.

Feeding a model the underlying code can also consume tens of thousands of tokens, the essential currency of AI operations. Visual interfaces also match how humans interact with web pages much more closely: what a person sees is what they act on. Because the model works from that same view, it is easier to debug why it did what it did.

Despite its compact size, Ai2 said MolmoWeb achieves state-of-the-art results among open-weight web agents. When tested on popular evaluation suites, the 8B model's scores included 78.2% on WebVoyager, 42.3% on DeepShop, and 49.5% on TailBench. It outperformed leading open-weight models such as Fara-7B across all four benchmarks in Ai2's evaluation.

Ai2 said that MolmoWeb can also outperform agents built on GPT-4 that rely on annotated and structured page data. Ai2 said that’s a particularly important result given that those models can “see” deeply into the very code of the webpage and also have substantially larger parameter counts, by orders of magnitude. It is like comparing a mouse to an elephant.

More access to open-weight browser AI agents will also help researchers and hobbyists develop their own web-using automations.

Closed-source large language model providers have already dipped their toes into the market with agentic web browsers capable of automating web tasks. They include OpenAI Group PBC with ChatGPT Atlas and Perplexity AI Inc. with Perplexity Comet.

Image: Allen Institute for AI
