[Image: Robot buying a drink from a vending machine (Getty Images)]
VEND FOR YOURSELF

Gemini 3 is insanely good at visual reasoning... and running a vending machine

Google’s stock is up, maybe because Gemini 3 is good and it’s powered mostly by Google’s TPUs — or, maybe, because Alphabet’s about to launch a vending machine business.

David Crowther

How do you measure what an AI model can do?

You ask it to spell strawberry, make a video of Will Smith eating spaghetti, or do some basic math.

But once you’ve exhausted all of the obvious tests, you might want something a little more formal — and how to rigorously measure what a model can do is a question researchers have been grappling with for years.

Now, there are a whole swath of benchmark tests that new AI models are put through, by both independent — and not so independent — organizations, in an increasingly weird kind of robot arena. Some of the tests are quizzes. Some require verbal, visual, or inductive reasoning. Many ask the large language models to do a lot of math that I cannot do. But one in particular asks a different question:

How much money can this thing make running a vending machine?

Vending-Bench 2, a test created by Andon Labs, puts LLMs through their paces by making them run “a simulated vending machine business over a year,” scoring them not on how many questions they got right out of 100, but how much cash was left in their virtual piggy banks at the end of the year.

This, it turns out, is hard for LLMs, which are prone to going off on tangents and losing focus, and are generally quite poor at optimizing for long-term outcomes. That makes sense when you consider that the core of many of the AI models we use every day is, “What’s the most likely bit of text/pixel/image to come after this bit of text/pixel/image?”

Per Andon Labs, in the Vending-Bench 2 test:

“Models are tasked with making as much money as possible managing their vending business given a $500 starting balance. They are given a year, unless they go bankrupt and fail to pay the $2 daily fee for the vending machine for more than 10 consecutive days, in which case they are terminated early. Models can search the internet to find suitable suppliers and then contact them through e-mail to make orders. Delivered items arrive at a storage facility, and the models are given tools to move items between storage and the vending machine. Revenue is generated through customer sales, which depend on factors such as day of the week, season, weather, and price.”

Running the model for “a year” results in as many as 6,000 messages in total, and a model “averages 60-100 million tokens in output during a run,” according to Andon.

In the simulation, the AI model has to negotiate with suppliers as well as deal with costly refunds, delayed deliveries, bad weather, and price scammers.
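
To make the scoring concrete, here’s a minimal Python sketch of the bookkeeping rules described above: the $500 start, the $2 daily fee, and the 10-day bankruptcy window. The real harness drives an LLM agent with tools for search, email, and inventory management; the `daily_net_revenue` function here is a hypothetical stand-in for whatever the agent’s choices actually earn each simulated day.

```python
# Minimal sketch of Vending-Bench 2's scoring rules as described above.
# "daily_net_revenue" is a hypothetical stand-in for the agent's decisions;
# in the real benchmark, an LLM earns (or loses) this money via tool use.

STARTING_BALANCE = 500.00  # dollars
DAILY_FEE = 2.00           # vending machine fee, charged every day
MAX_MISSED_DAYS = 10       # consecutive unpaid days tolerated before termination
DAYS_IN_RUN = 365          # "they are given a year"

def score_run(daily_net_revenue):
    """Return (days survived, final balance); the final balance is the score."""
    balance = STARTING_BALANCE
    missed = 0
    for day in range(DAYS_IN_RUN):
        balance += daily_net_revenue(day)  # customer sales minus restocking costs
        if balance >= DAILY_FEE:
            balance -= DAILY_FEE
            missed = 0                     # paying the fee resets the streak
        else:
            missed += 1                    # the fee goes unpaid today
            if missed > MAX_MISSED_DAYS:   # bankrupt: the run ends early
                return day + 1, balance
    return DAYS_IN_RUN, balance

# A flat $16/day of net sales ends the year at $5,610, in the same
# neighborhood as Gemini 3 Pro's $5,478.
print(score_run(lambda day: 16.0))
```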

Google’s Gemini 3 Pro, it turns out, is the best of any model tested yet — ending the year with $5,478 in its account, considerably more than Anthropic’s Claude Sonnet 4.5, xAI’s Grok 4, and OpenAI’s GPT-5.1. That’s thanks to its relentless negotiating skills. Per Andon, “Gemini 3 Pro consistently knows what to expect from a wholesale supplier and keeps negotiating or searching for new suppliers until it finds a reasonable offer.”

[Chart: Vending-Bench 2 results by model (Andon Labs / Vending-Bench 2)]

OpenAI’s model is, apparently, too trusting. Andon Labs hypothesizes that its relatively weak performance “comes down to GPT-5.1 having too much trust in its environment and its suppliers. We saw one case where it paid a supplier before it got an order specification, and then it turned out the supplier had gone out of business. It is also more prone to paying too much for its products, such as in the following example where it buys soda cans for $2.40 and energy drinks for $6.” Anyone who’s had ChatGPT sycophantically tell them they’re a genius for uttering even the most half-baked idea might understand how this can happen.

For what it’s worth, the $5,000 and change that Gemini averaged over its runs is considered pretty poor relative to what a smart human might be able to do, with Andon Labs estimating that a “good” strategy could make roughly $63,000 in a year.

What do you bench?

Diet Coke negotiations aside, Gemini’s scores on more traditional AI benchmarks were also impressive — at least, according to Google. A table posted on the company’s blog shows that Gemini 3 Pro tops or matches its peers in all but one of the benchmarks.

[Table: Gemini 3 Pro benchmark scores vs. peer models (Google / Alphabet)]

Its scores on visual reasoning tests — such as the ARC-AGI-2 test, where it scored 31.1%, way ahead of Anthropic’s and OpenAI’s best efforts — are particularly impressive. On ScreenSpot-Pro, a test that basically asks models to locate certain buttons or icons from a screenshot, Gemini 3 is leaps and bounds ahead of its rivals, scoring 72.7%. (GPT-5.1 scored just 3.5%.)
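
For a sense of what that test measures, here’s a rough sketch of how grounding benchmarks in the ScreenSpot family are typically scored: the model sees a screenshot and an instruction like “click the export icon,” and its answer counts as correct if the predicted click point lands inside the target element’s ground-truth bounding box. The function names and data layout below are illustrative assumptions, not the benchmark’s actual code.

```python
# Illustrative scoring for a ScreenSpot-style GUI grounding test.
# Each example pairs a predicted click point with the ground-truth
# bounding box of the UI element the instruction referred to.

def click_hits_target(pred, bbox):
    """pred = (x, y) in pixels; bbox = (left, top, right, bottom)."""
    x, y = pred
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def accuracy(predictions, target_boxes):
    """Fraction of examples where the predicted click lands in the box."""
    hits = sum(click_hits_target(p, b) for p, b in zip(predictions, target_boxes))
    return hits / len(target_boxes)

# Two toy examples: one hit, one miss -> 50% accuracy.
preds = [(105, 42), (900, 510)]
boxes = [(100, 30, 140, 60), (10, 10, 50, 50)]
print(accuracy(preds, boxes))  # 0.5
```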

With Alphabet’s full tech stack responsible for the Gemini models, investor reaction to the release has been very positive so far, building on a wave of good news for the search giant this week. As my colleague Rani Molla wrote:

“[Gemini’s] performance is crucial to Google’s future success as the company embeds its AI models across its products and relies on them to generate new revenue from existing lines — particularly by driving growth in Cloud and reinforcing its ad and search dominance.”

Go Deeper: Check out Vending-Bench 2.

More Tech

Report: Amazon’s AI bots have been behind multiple AWS outages

Amazon’s AI tool Kiro, which launched in July and can code autonomously, was behind a 13-hour interruption to Amazon Web Services in December, according to reporting by the Financial Times.

The FT reports that the company’s AI tools have caused AWS service disruptions at least twice in recent months.

In the December outage, which Amazon called an “extremely limited event” that did not have an impact on customer-facing service, engineers allowed Kiro to make changes and the tool opted to “delete and recreate the environment.”

Amazon has a closely tracked internal target that 80% of its developers use AI to code once a week, employees told the FT. The company says the December incident was a “user access control issue” and not an issue with Kiro’s permissions.

AWS accounted for 57% of Amazon’s operating profit in 2025. In December, following a larger outage months earlier, AWS and Google announced a partnership to attempt to prevent massive network outages.

Update, February 20, 5:50 p.m. ET: In a statement to Sherwood News, an AWS spokesperson disputed the report, writing:

“These brief events were the result of user error—specifically misconfigured access controls—not AI. The December service interruption was an extremely limited event when a single service (AWS Cost Explorer—which helps customers visualize, understand, and manage AWS costs and usage over time) in one of our two Regions in Mainland China was affected. This event didn't impact compute, storage, database, AI technologies, or any other of the hundreds of services that we run. We are not aware of any related customer inquiries resulting from this isolated interruption. Following these events, we implemented numerous additional safeguards, including mandatory peer review for production access, enhanced training on AI-assisted troubleshooting, and resource protection measures. Kiro puts developers in control—users need to configure which actions Kiro can take, and by default, Kiro requests authorization before taking any action.”

$830B

OpenAI is finalizing commitments on a funding round that could climb beyond $100 billion at a valuation of $830 billion, according to a report from The Information.

Per The Information, SoftBank is expected to invest $30 billion into the ChatGPT maker, spread across the year in three installments of $10 billion. Up to $50 billion could come from Amazon and $30 billion from Nvidia (up from the $20 billion Bloomberg reported earlier this month). An additional investment in the low billions could come from Microsoft.

OpenAI was last valued at $500 billion following a fundraising round completed in October. Earlier this month, its rival Anthropic took in $30 billion from investors including Microsoft and Nvidia at a $380 billion valuation.

Tesla’s 45 Austin Robotaxis now have 14 crashes on the books since launching in June

Since launching in June 2025, Tesla’s 45 Austin Robotaxis have been involved in 14 crashes, per Electrek reporting citing National Highway Traffic Safety Administration data.

Electrek analysis found that the vehicles have traveled roughly 800,000 paid miles in that time period, amounting to a crash every 57,000 miles. According to the NHTSA, US drivers crash once every 500,000 miles on average.
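
For what it’s worth, those numbers pencil out; a quick check:

```python
# Sanity-checking the crash-rate math from the figures cited above.
paid_miles = 800_000
crashes = 14
miles_per_crash = paid_miles / crashes
print(round(miles_per_crash))               # ~57,143 miles per crash
print(round(500_000 / miles_per_crash, 1))  # ~8.8x the average US driver's rate, by these numbers
```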

The article says Tesla submitted five new crash reports in January of this year for incidents that occurred in December and January. Electrek wrote:

“The new crashes include a collision with a fixed object at 17 mph while the vehicle was driving straight, a crash with a bus while the Tesla was stationary, a collision with a heavy truck at 4 mph, and two separate incidents where the Tesla backed into objects, one into a pole or tree at 1 mph and another into a fixed object at 2 mph.”

Tesla updated a previously reported crash that was originally filed as only having damaged property to include a passenger’s hospitalization.

Last month, Tesla shares climbed after CEO Elon Musk said in a post on X that the company’s Austin Robotaxis had begun operating without a safety monitor.

Jon Keegan

Ahead of IPO, Anthropic adds veteran executive and former Trump administration official to board

Anthropic is moving to put the pieces in place for a successful IPO this year.

Today, the company announced that Chris Liddell would join its board of directors.

Liddell is a seasoned executive who previously served as CFO of Microsoft, GM, and International Paper.

Liddell also comes with experience in government, having served as deputy White House chief of staff during the first Trump administration.

Ties to the Trump world could be helpful for Anthropic as it pushes to enter the public market. It’s reportedly not on the greatest terms with the current administration, as the startup has pushed back on using its Claude AI for surveillance applications.

Rani Molla

Meta is bringing back facial recognition for its smart glasses

Meta is reviving its highly controversial facial recognition efforts, with plans to incorporate the tech into its smart glasses as soon as this year, The New York Times reports.

In 2021, around the time Facebook rebranded as Meta, the company shut down the facial recognition software it had used to tag people in photos, saying it needed to “find the right balance.”

Now, according to an internal memo reviewed by the Times, Meta seems to feel that it’s at least found the right moment, noting that the fraught and crowded political climate could allow the feature to attract less scrutiny.

“We will launch during a dynamic political environment where many civil society groups that we would expect to attack us would have their resources focused on other concerns,” the document reads.

The tech, called “Name Tag” internally, would let wearers of the smart glasses identify and surface information about people they see, using Meta’s artificial intelligence assistant.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, or Robinhood Money, LLC.