(CSA Images)

How well can top AI models do these jobs?

An OpenAI benchmark tests how well AI models can perform “economically valuable” jobs.

9/29/25 11:59AM

One of the biggest fears fueling the public’s apprehension toward AI is that the technology will eventually take their jobs.

We’ve already seen evidence that some roles like entry-level software development, customer service, and marketing are feeling the effects of automation powered by generative AI. Being able to track the real-world work capabilities of AI models will become increasingly important as models get more and more powerful.

To that end, OpenAI has created a new AI benchmark called “GDPval” that aims to measure just how well leading AI models can do realistic tasks for a variety of “economically valuable” jobs.

OpenAI describes the benchmark as an evolutionary step away from the first wave of benchmarks that followed a more academic, exam-style model:

“[GDPval] measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across a wide range of occupations and sectors, providing a clearer picture on how models perform on economically valuable tasks. Evaluating models on realistic occupational tasks helps us understand not just how well they perform in the lab, but how they might support people in the work they do every day.”

Working with experienced industry professionals, the researchers created a dataset of 220 realistic tasks, spanning 44 occupations, of the kind someone might do in the course of their work in a particular role.

Here’s an example of one of the tasks in the benchmark’s training data for a real estate broker:

Sample task for a real estate broker from the GDPval benchmark’s training dataset (Huggingface.co)

We went through the data and picked a few common jobs from the benchmark’s results. Unsurprisingly, software developers were the most affected: Anthropic’s Claude model earned an average 70% win rate on those tasks when its output was compared with that of a human in the role. For reference, a score of 50% would put the model on par with a human expert. Audio and video technicians should feel that their jobs are secure (for now), as the models scored very low on those tasks.
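To make the win-rate figure concrete, here is a minimal sketch of how such a score could be tallied from head-to-head comparisons. It is an illustration, not OpenAI's scoring code: the `win_rate` function, the verdict labels, and the choice to count a tie as half a win are all assumptions, and the sample verdicts are made up.

```python
from collections import Counter

VALID_VERDICTS = {"model", "human", "tie"}


def win_rate(verdicts, tie_weight=0.5):
    """Share of comparisons the model wins.

    Each verdict names whose deliverable a grader preferred.
    Assumption: a tie counts as `tie_weight` of a win (0.5 by default),
    so a model that is indistinguishable from the human scores 50%.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    return (counts["model"] + tie_weight * counts["tie"]) / len(verdicts)


# Hypothetical grader verdicts for ten tasks, not real GDPval data.
sample = ["model"] * 6 + ["tie"] * 2 + ["human"] * 2
print(f"{win_rate(sample):.0%}")  # 6 wins + 2 half-credit ties over 10 tasks
```

Under this convention, a model that wins exactly as often as it loses lands at 50%, which matches the "on par with a human expert" reading above.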

OpenAI acknowledges there are limitations with this benchmark. For instance, each task currently comes with background materials required to do it, but generating those materials is itself complex work, and the benchmark doesn’t assess current models’ ability to complete those preparatory tasks. Instead, that work is done by the humans testing the AI. The paper also notes that this is a small dataset, and that the jobs currently tested are mainly those of “knowledge workers” whose work can be performed on a computer.

Maybe a future version will be used to test how well a robot can scrub your toilet.

🚀 $100B

Jon Keegan

8h

Alphabet’s 2015 investment in SpaceX is about to pay off handsomely with the company’s hotly anticipated IPO later this year, which is expected to be the largest in history.

Bloomberg reports that according to new financial filings, Alphabet’s investment could be worth up to $100 billion.

Google bought into SpaceX in 2015, when it and Fidelity together invested $1 billion in a round that valued SpaceX at $10 billion. At the end of 2025, Google owned just over 6% of SpaceX, per Bloomberg’s reporting on the more recent filings. That stake has likely been diluted due to SpaceX’s merger with xAI.

Jon Keegan

10h

OpenAI pulls out of Stargate Norway, hands data center off to Microsoft

This is the third piece of Project Stargate that OpenAI is retreating from since a flurry of announcements last year.

Rani Molla

13h

Barclays says autonomous couriers — think sidewalk robots and drones — could push delivery costs down to as little as $1 per order, from between $5 and $7 today and closer to $9 for traditional deliveries in high-labor-cost markets. If robots save $4 on every delivery, and enough companies start using them, the food delivery industry, including companies like DoorDash and Uber, could end up with $16 billion in extra profit every year, according to Barclays.
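As a rough back-of-envelope check, the two Barclays figures above imply a delivery volume. The sketch below assumes, for illustration only, that the entire $16 billion comes from the $4 per-order saving; the source does not say how Barclays built its estimate, so the result is an implied figure, not a reported one.

```python
# Figures quoted from the Barclays estimate above.
SAVING_PER_ORDER_USD = 4
EXTRA_PROFIT_USD = 16_000_000_000

# Assumption (ours, not Barclays'): all of the extra profit
# comes from the per-order saving on robot-handled deliveries.
implied_robot_orders = EXTRA_PROFIT_USD / SAVING_PER_ORDER_USD

print(f"{implied_robot_orders / 1e9:.0f} billion robot deliveries per year")
```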

The catch: we’re nowhere near that world yet. Robots and drones handle less than 1% of deliveries today. Even by 2035, Barclays only sees penetration hitting around 10%.

Google’s Wing and Amazon have also been trying to crack last-mile product delivery — a reminder that this is part of a broader race to automate the most expensive leg of e-commerce.

Rani Molla

15h

Lyft is building the infrastructure robotaxis can’t avoid

The ride-hailing company is building an 80,000-square-foot Nashville warehouse where humans will help robotaxis with everything but driving.


$10B

Rani Molla

15h

Uber has long had an asset-light business model: it provided the ride-hailing platform, and its contract workers brought their own vehicles. That’s changing as Uber positions itself at the center of the robotaxi era.

The Financial Times estimates that Uber has committed more than $10 billion to buying robotaxi fleets ($7.5 billion) and investing in the companies that make them ($2.5 billion). That includes yesterday’s announcement that it’s expanding its investment in Lucid, a deal worth about $2 billion, with plans to buy 35,000 vehicles.

This shift pits Uber against industry leaders like Google’s Waymo and Tesla, whose models involve company-owned vehicles running on proprietary platforms. While these autonomous fleets eliminate the need for drivers, they introduce new capital-intensive requirements for charging, cleaning, storage, and repair.
