Robot buying a drink from vending machine
(Getty Images)
VEND FOR YOURSELF

Gemini 3 is insanely good at visual reasoning... and running a vending machine

Google’s stock is up, maybe because Gemini 3 is good and it’s powered mostly by Google’s TPUs, or maybe because Alphabet’s about to launch a vending machine business.

David Crowther

How do you measure what an AI model can do?

You ask it to spell strawberry, make a video of Will Smith eating spaghetti, or do some basic math.

But, once you’ve exhausted all of the obvious tests, you might want something a little more formal — and it’s a question that researchers have been grappling with for years.

Now, there is a whole swath of benchmark tests that new AI models are put through, by organizations both independent and not so independent, in an increasingly weird kind of robot arena. Some of the tests are quizzes. Some require verbal, visual, or inductive reasoning. Many ask the large language models to do a lot of math that I cannot do. But one in particular asks a different question:

How much money can this thing make running a vending machine?

Vending-Bench 2, a test created by Andon Labs, puts LLMs through their paces by making them run “a simulated vending machine business over a year,” scoring them not on how many questions they got right out of 100, but how much cash was left in their virtual piggy banks at the end of the year.

This, it turns out, is hard for LLMs, which are prone to going off on tangents and losing focus, and are generally quite poor at optimizing for long-term outcomes. That makes sense when you consider that the core of many of the AI models we use every day is, “What’s the most likely bit of text/pixel/image to come after this bit of text/pixel/image?”

Per Andon Labs, in the Vending-Bench 2 test:

“Models are tasked with making as much money as possible managing their vending business given a $500 starting balance. They are given a year, unless they go bankrupt and fail to pay the $2 daily fee for the vending machine for more than 10 consecutive days, in which case they are terminated early. Models can search the internet to find suitable suppliers and then contact them through e-mail to make orders. Delivered items arrive at a storage facility, and the models are given tools to move items between storage and the vending machine. Revenue is generated through customer sales, which depend on factors such as day of the week, season, weather, and price.”
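To make the survival rules in that description concrete, here is a minimal Python sketch of the fee-and-termination logic. It is not Andon Labs' code: the function and field names are hypothetical, the year is assumed to be 365 days, and unpaid fees are assumed to be skipped rather than accrued as debt. Only the $500 starting balance, the $2 daily fee, and the "more than 10 consecutive days" rule come from the quoted text.

```python
from dataclasses import dataclass

STARTING_BALANCE = 500.0  # from the quoted description
DAILY_FEE = 2.0           # from the quoted description
MAX_UNPAID_STREAK = 10    # terminated after MORE than 10 consecutive unpaid days
DAYS_IN_RUN = 365         # assumption: "a year" modelled as 365 days


@dataclass
class RunResult:
    days_survived: int
    final_balance: float
    terminated_early: bool


def simulate(daily_net_income, starting_balance=STARTING_BALANCE):
    """Apply the daily fee rule to a list of daily net incomes.

    daily_net_income is a hypothetical input: sales revenue minus
    supplier costs for each day, as a list of floats.
    """
    balance = starting_balance
    unpaid_streak = 0
    days = daily_net_income[:DAYS_IN_RUN]
    for day, net in enumerate(days, start=1):
        balance += net
        if balance >= DAILY_FEE:
            # Fee can be paid, so the unpaid streak resets.
            balance -= DAILY_FEE
            unpaid_streak = 0
        else:
            # Assumption: a missed fee is skipped, not carried as debt.
            unpaid_streak += 1
            if unpaid_streak > MAX_UNPAID_STREAK:
                return RunResult(day, balance, True)
    return RunResult(len(days), balance, False)


if __name__ == "__main__":
    # A model that never sells anything burns through its $500 on fees.
    print(simulate([0.0] * 365))
    # A model that nets a steady $10 a day before fees survives the year.
    print(simulate([10.0] * 365))
```

Under these assumptions, a model that sells nothing can pay the fee for 250 days, then misses it 11 days in a row and is cut off, which is why simply doing nothing is not a safe strategy.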

Running the model for “a year” results in as many as 6,000 messages in total, and a model “averages 60-100 million tokens in output during a run,” according to Andon.
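For a rough sense of scale, the two figures Andon gives can be divided to estimate output tokens per message. This is a back-of-the-envelope sketch, not a number Andon reports: it assumes a run hits the full 6,000 messages, so runs with fewer messages would average more tokens each. The function name is hypothetical.

```python
MAX_MESSAGES = 6_000        # "as many as 6,000 messages" per run
TOKENS_LOW = 60_000_000     # low end of the quoted output range
TOKENS_HIGH = 100_000_000   # high end of the quoted output range


def tokens_per_message(total_tokens, messages=MAX_MESSAGES):
    """Average output tokens per message for a given run total."""
    return total_tokens / messages


if __name__ == "__main__":
    low = tokens_per_message(TOKENS_LOW)
    high = tokens_per_message(TOKENS_HIGH)
    print(f"{low:,.0f} to {high:,.0f} output tokens per message")
```

That works out to somewhere between 10,000 and roughly 16,700 output tokens per message, if every run used the maximum message count.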

In the simulation, the AI model has to negotiate with suppliers as well as deal with costly refunds, delayed deliveries, bad weather, and price scammers.

Google’s Gemini 3 Pro, it turns out, is the best model tested so far, ending the year with $5,478 in its account, considerably more than Anthropic’s Claude Sonnet 4.5, Grok 4, and GPT-5.1. That’s thanks to its relentless negotiating skills. Per Andon, “Gemini 3 Pro consistently knows what to expect from a wholesale supplier and keeps negotiating or searching for new suppliers until it finds a reasonable offer.”

Gemini 3 Vending Machine benchmark
Andon Labs / Vending-Bench 2

OpenAI’s model is, apparently, too trusting. Andon Labs hypothesizes that its relatively weak performance “comes down to GPT-5.1 having too much trust in its environment and its suppliers. We saw one case where it paid a supplier before it got an order specification, and then it turned out the supplier had gone out of business. It is also more prone to paying too much for its products, such as in the following example where it buys soda cans for $2.40 and energy drinks for $6.” Anyone who’s had ChatGPT sycophantically tell them they’re a genius for uttering even the most half-baked idea might understand how this can happen.

For what it’s worth, the $5,000 and change that Gemini averaged over its runs is considered pretty poor relative to what a smart human might be able to do, with Andon Labs estimating that a “good” strategy could make roughly $63,000 in a year.
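The gap is easier to see as arithmetic. The sketch below uses only figures quoted in this article ($500 starting balance, $5,478 for Gemini 3 Pro, and Andon Labs' rough $63,000 estimate for a good strategy); the function names are hypothetical and the comparison is illustrative, not part of the benchmark's scoring.

```python
STARTING_BALANCE = 500.0       # starting balance in Vending-Bench 2
GEMINI_FINAL = 5_478.0         # Gemini 3 Pro's reported result
GOOD_STRATEGY_FINAL = 63_000.0 # Andon Labs' rough estimate


def multiple_of_start(final, start=STARTING_BALANCE):
    """How many times the starting balance the final balance represents."""
    return final / start


def share_of_target(final, target=GOOD_STRATEGY_FINAL):
    """Final balance as a percentage of the 'good strategy' estimate."""
    return 100.0 * final / target


if __name__ == "__main__":
    print(f"Multiple of starting cash: {multiple_of_start(GEMINI_FINAL):.1f}x")
    print(f"Share of good-strategy estimate: {share_of_target(GEMINI_FINAL):.1f}%")
```

In other words, Gemini turned its starting cash into about 11 times as much, yet still landed at under 9% of what Andon Labs thinks a good strategy could earn.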

What do you bench?

Diet Coke negotiations aside, Gemini’s scores on more traditional AI benchmarks were also impressive — at least, according to Google. A table posted on the company’s blog shows that Gemini 3 Pro tops or matches its peers in all but one of the benchmarks.

Gemini 3 benchmarks
Google / Alphabet

Its scores on visual reasoning tests — such as the ARC-AGI-2 test, where it scored 31.1%, way ahead of Anthropic’s and OpenAI’s best efforts — are particularly impressive. On ScreenSpot-Pro, a test that basically asks models to locate certain buttons or icons from a screenshot, Gemini 3 is leaps and bounds ahead of its rivals, scoring 72.7%. (GPT-5.1 scored just 3.5%.)

With Alphabet’s full tech stack responsible for the Gemini models, investor reaction to the release has been very positive so far, building on a wave of good news for the search giant this week. As my colleague Rani Molla wrote:

“[Gemini’s] performance is crucial to Google’s future success as the company embeds its AI models across its products and relies on them to generate new revenue from existing lines — particularly by driving growth in Cloud and reinforcing its ad and search dominance.”

Go Deeper: Check out Vending-Bench 2.



