How well can top AI models do these jobs?

An OpenAI benchmark tests how well AI models can perform “economically valuable” jobs.

One of the biggest fears fueling public apprehension toward AI is that the technology will eventually take people’s jobs.

We’ve already seen evidence that some roles like entry-level software development, customer service, and marketing are feeling the effects of automation powered by generative AI. Being able to track the real-world work capabilities of AI models will become increasingly important as models get more and more powerful.

To that end, OpenAI has created a new AI benchmark called “GDPval” that aims to measure just how well leading AI models can do realistic tasks for a variety of “economically valuable” jobs.

OpenAI describes the benchmark as an evolutionary step away from the first wave of benchmarks that followed a more academic, exam-style model:

“[GDPval] measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across a wide range of occupations and sectors, providing a clearer picture on how models perform on economically valuable tasks. Evaluating models on realistic occupational tasks helps us understand not just how well they perform in the lab, but how they might support people in the work they do every day.”

Working with experienced industry professionals, the researchers created a dataset of 220 realistic tasks across 44 occupations, each of them something a person in that role might do in the course of their work.

Here’s an example of one of the tasks in the benchmark’s training data for a real estate broker:

Sample task for a real estate broker from the GDPval benchmark’s training dataset (Huggingface.co)

We went through the data and picked a few common jobs from the benchmark’s results. Unsurprisingly, software developers were the most affected: Anthropic’s Claude model averaged a 70% win rate when its work on the test was compared with that of a human in that role. For reference, a score of 50% would put the model on par with a human expert. Audio and video technicians should feel that their jobs are secure (for now), as the models scored very low on those tasks.
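To make the scoring concrete, here is a minimal sketch of how a win rate could be computed from head-to-head comparisons between a model's output and a human expert's. The function name, the verdict labels, and the choice to count a tie as half a win are all assumptions for illustration, not GDPval's actual grading code, and the sample verdicts are made up.

```python
from collections import Counter


def win_rate(verdicts, tie_weight=0.5):
    """Share of head-to-head comparisons credited to the model.

    verdicts: iterable of "model", "human", or "tie" (hypothetical labels).
    tie_weight: credit the model gets for a tie. This is an assumption,
    not necessarily how GDPval treats ties.
    """
    counts = Counter(verdicts)
    total = counts["model"] + counts["human"] + counts["tie"]
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["model"] + tie_weight * counts["tie"]) / total


# Made-up verdicts for illustration only; this is not GDPval data.
sample = ["model", "human", "model", "tie", "human",
          "model", "model", "human", "model", "model"]
print(f"{win_rate(sample):.0%}")
```

Under these assumptions, a model that wins exactly as often as it loses scores 50%, which is the parity line described above.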

OpenAI acknowledges that the benchmark has limitations. For instance, each task currently comes with background materials that are required to do the task, but generating those materials itself requires complex work, and the benchmark doesn't assess current models' ability to complete those preparatory tasks. Instead, that work is done by the humans testing the AI. The paper also notes that this is a small dataset, and that the jobs tested so far are mainly those of “knowledge workers” whose work can be performed on a computer.

Maybe a future version will be used to test how well a robot can scrub your toilet.

More Tech


Snap jumps on new revenue stream, continued social media buzz

Snap jumped as high as 5% Monday after the social media company announced that it would be charging users for its Memories features after they reach 5 gigabytes of storage. Snapchat, which has clocked more than 1 trillion saved Memories on its platform, told TechCrunch the Memory Storage plans would range from $1.99 a month for 100 gigabytes of storage to $15.99 for 5-terabyte plans. The fees will be a new revenue stream for the company, whose ad revenue isn’t growing as fast as its peers’.

Snap rose more than 20% this month amid positive r/WallStreetBets chatter, buyout speculation, and increased investment by Saudi investor Prince Al Waleed bin Talal Al Saud. And the US spin-off of TikTok doesn’t seem to be taking the wind out of Snap’s sales.

Alibaba jumps as Macquarie and Jefferies up price targets on AI cloud demand

Alibaba is up about 4% this morning after Macquarie analyst Ellie Jiang raised her price target on the stock to a Street high of $235.60, up from $177.90, and Jefferies analyst Thomas Chong upped his price target to $230 from $178, based on a strong cloud outlook and synergies in its rapid-delivery model of e-commerce. The duo is among a string of analysts lately, including those at Morgan Stanley, Baird, and Bank of America, to raise their price targets on the stock.

The Jefferies analyst cited the company’s “remarkable progress made in multiple areas,” including foundation models, AI infrastructure, and agents. Alibaba also jumped up last week on news of an AI spending hike, a new model launch, and a partnership with Nvidia.

Separately, Bloomberg Intelligence analysts Robert Lea and Jasmine Lyu highlighted the e-commerce and cloud giant as a key beneficiary of Huawei’s reported plan to double output of its top AI chip next year.

“The doubling of production of Huawei’s marque AI accelerator chip in 2026 could help ease the semiconductor bottleneck at Alibaba, Tencent and Baidu,” they wrote.

Rani Molla

Apple has built an app like ChatGPT to test AI Siri

Back in 2024, Apple previewed a new AI Siri that the iPhone maker has since mostly failed to deliver, with the overhaul now slated for the spring of 2026. But Bloomberg’s Mark Gurman says Apple is making moves.

Apple has built an internal ChatGPT-like app to test the new Siri, Bloomberg reports. Workers are using the app, code-named Veritas, to test Siri’s ability to search through personal data like emails and perform in-app actions like editing photos — stuff its competitor Google is already offering.

“The app essentially takes the still-in-progress technology from the new Siri and puts it in a form employees can test out more efficiently,” Gurman wrote. “Even without a public launch, the internal tool marks a new phase in Apple’s preparations for Siri’s overhaul, a high-stakes release that could reshape perceptions of its AI efforts.”

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, or Robinhood Money, LLC.