Tech
Crash test dummies
Crash test dummies (Bill O'Leary / Getty Images)

The crash test dummies for new AI models

In the absence of actual regulation, AI companies use "adversarial testers" to check their new models for safety. Does it actually work?

When a car manufacturer develops a new vehicle, it delivers one to the National Highway Traffic Safety Administration to be tested. The NHTSA drives the vehicle into a head-on collision with a concrete wall, batters it from the side with a moving sled, and simulates a rollover after a sharp turn, all filmed with high-speed cameras and crash test dummies wired with hundreds of sensors. Only after the vehicle has passed this rigorous battery of tests is it finally released to the public. 

AI large language models, like OpenAI’s GPT-4o (one of the models which powers ChatGPT), have been rapidly embraced by consumers, with several major iterations being released over the past year and a half.

The prerelease versions of these models are capable of a number of serious harms that its creators fully acknowledge, including encouraging self-harm, generating erotic or violent content, generating hateful content, sharing information that could be used to plan attacks or violence, and generating instructions for finding illegal content. 

These harms aren’t purely theoretical — they can show up in the version the public uses. OpenAI just released a report describing 20 operations where state-linked actors used ChatGPT to execute “covert influence” campaigns and plan offensive cyber operations. 

The AI industry’s equivalent of the NHTSA’s crash tests is a process known as “red teaming” in which experts test, prod, and poke the models to see if they can cause harmful responses. No federal laws govern such testing programs, and new regulations mandated by the Biden administration’s executive order on AI safety are still being implemented.

For the time being, each AI company has its own approach to red teaming. The feverish pace of new, increasingly powerful AI models being released is taking place at a moment when incomplete transparency and a lack of true independent oversight of the process puts the public at risk. 

How does this all work in practice? Focusing on OpenAI, Sherwood spoke with four red-team members who tested its GPT models to learn how the process works and what troubling things they were able to generate.

Red-team rules

OpenAI assembles red teams made up of dozens of paid “adversarial testers” with expert knowledge in a wide number of areas, like nuclear, chemical and biological weapons, disinformation, cybersecurity, and healthcare. 

The teams are recruited and paid by OpenAI (though some decline payment), and they are instructed to attempt to get the model to act in bad ways, such as designing and executing cyberattacks, planning influence campaigns, or helping users engage in illegal activity. 

This testing is done both by humans manually and via automated systems, including other AI tools. One important task is to try to “jailbreak” the model, or bypass any guardrails and safety measures that have been added to prevent bad behavior.

Red-team members are usually asked to sign nondisclosure agreements, which cover certain specifics of the vulnerabilities they observe. 

OpenAI lists the individuals, groups, and organizations that participate in the red-teaming process. In addition to individual domain experts, companies that offer red teaming as a service are also employed. 

One firm that participated in the red teaming of GPT-4o is Haize Labs.

“One of the reasons we started the company was because we felt that you could not take the frontier labs at their face value, when they said that they were the best and safest model,” Leonard Tang, CEO and cofounder of Haize Labs, said.

“Somebody needs to be a third-party red teamer, third-party tester, third-party evaluator for models, that was totally divorced from any sort of conflict of interest,” Tang said.

Tang said Haize Labs was paid by OpenAI for their red-teaming work. 

OpenAI did not respond to a request for comment.

Known harms

AI-model makers often release research papers known as “system cards” (or model cards) for each major release. These documents describe how the model was trained, tested, and what the process revealed. The amount of information disclosed in these papers varies from company to company. OpenAI’s GPT-4o model card is about 10,000 words, Meta’s Llama 3 model card is a relatively short document at 3,000 words, while Anthropic’s Claude 3 model card is lengthy, at around 17,000 words. What’s in these documents can be downright shocking. 

Last March, OpenAI released GPT-4, along with its accompanying system card, which included a front-page content warning about disturbing, offensive, and hateful content contained in the document.

Screenshot from OpenAI’s GPT-4 model card
A screenshot from OpenAI’s GPT-4 model card showing what red teamers could get the prerelease “GPT-4 Early” to do, and how the final public version reacted to the same prompt.

The findings were frank: “GPT-4 can generate potentially harmful content, such as advice on planning attacks or hate speech. It can represent various societal biases and worldviews that may not be representative of the user’s intent, or of widely shared values. It can also generate code that is compromised or vulnerable,” the report said. 

Some bad things that red teams could get a prerelease version of GPT-4 to do included: 

  • List step-by-step instructions for how to launder money

  • Generate racist jokes mocking Muslims and disabled people and hate speech toward Jews

  • Draft letters threatening someone with gang rape

  • List ideas for how someone could kill the most people with only $1 

While OpenAI deserves credit for transparency and revealing such potentially harmful capabilities, it also raises questions. Can the public really trust the assurances of one company’s self-examination? 

With the last two releases of its models, OpenAI has switched to a less detailed disclosure of bad behavior by its models, following a “Preparedness Framework” that focuses on four “catastrophic risk” areas: cybersecurity, CBRN (chemical, biological, radiological, nuclear), persuasion, and model autonomy (i.e. can the model prevent itself from being shut off or can it escape its environment). 

For each of these areas, OpenAI assigns a risk level of low, medium, high, or critical. The company says it will not release any model that scores high or critical, but this September, it did release a model called “OpenAI o1” which scored a medium on CBRN.

Testing without seatbelts

Most of the OpenAI red teamers that we spoke with all agreed that the company did appear to be taking safety seriously, and found the process to be robust and well thought out overall. 

Nathan Labenz, an AI researcher and founder who participated in red teaming a prerelease version of GPT-4 (“GPT-4-early”) last year, wrote an extensive thread on X about the process, in which he shared the instructions OpenAI provided. “We encourage you to not limit your testing to existing use cases, but to use this private preview to explore and innovate on new potential features and products!” the email from OpenAI said. 

Post on X by Nathan Labenz
A post on X by Nathan Labenz

Labenz described some of the scenarios he tested. “One time, when I role-played as an anti-AI radical who wanted to slow AI progress,” he said, “GPT-4-early suggested the targeted assassination of leaders in the field of AI — by name, with reasons for each.”

The early version that Labenz tested did not include all the safety features that a final release would have. “That version was the one that didn’t have seat belts. So I think, yes, a seat-belt-less car is, you know, needlessly dangerous. But we don’t have them on the road,” Labenz told Sherwood. 

Tang echoed this, adding that for most companies, “the safety training and guardrails baked into the models come towards the end of the training process.”

Two red teamers we spoke with asked to remain anonymous because of the NDA they signed. One tester said they were asked to test five to six iterations of the prerelease GPT-4o model over three months, and “every few weeks, we’d get access to a new model and test specific functions for each.” 

Most of the red teamers agreed the company did listen to their feedback and made changes to address the issues they found. When asked about the team at OpenAI running the process, one GPT-4 tester said it was “a bunch of very busy people trying to do infinite work.” 

OpenAI did not appear to be rushing the process, and testers were able to execute their tests, but it was described as hectic. One GPT-4 tester said, “Maybe there’s some political pressure internally or whatever, but I don’t even get to that point, because we were too busy to think about it.”

Tang said OpenAI competitor Anthropic also has a robust red-teaming process, doing “third-party verification, red teaming, pen testing, external red-teamer stuff, all the time, all the time, all the time.” Tang also noted that the company was one of Haize Labs’ customers. 

Independent testing

Labenz told us he thought the red-teaming process might eventually end up in the hands of a truly independent organization, rather than the company itself. “I think we’re headed in that direction, ultimately, and probably for good reason,” Labenz said. 

One sign that may be happening is a new early-testing agreement signed in August between the National Institute of Standards and Technology (NIST), OpenAI, and Anthropic, granting the agency access to prerelease models for evaluation.

Tang was skeptical of how effective NIST’s risk management will be. “The NIST AI risk framework is very underspecified, and it doesn’t actually get very concrete about what people should be testing for,” he said.

But the rapid pace of innovation and the massive piles of investment money flowing into the industry create a lot of uncertainty going forward. Labenz said, “Nobody really knows where this is going, or how fast, and nobody really has a workable plan.”

More Tech

See all Tech
tech
Jon Keegan

WSJ: Anduril’s weapons systems have failed during several tests

Autonomous drones by sea, land, and air. Futuristic AI-powered support fighter jets, and swarms of networked drones controlled by sophisticated software. These are some of the visions for the future of warfare pitched by defense tech startup Anduril. Cofounded by Oculus founder Palmer Luckey, the Peter Thiel-backed startup has landed some major national security contracts based on this futuristic outlook for battlefield AI.

But according to a report from The Wall Street Journal, the company’s tech is failing key tests in the real world, raising concerns about the viability and safety of Anduril’s systems within the military command.

Anduril’s Altius drones proved vulnerable to Russian jamming while deployed in Ukraine and have been pulled from the battlefield, per the report.

More than a dozen sea-based drone ships powered by Anduril’s Lattice command and control software recently shut down during a Navy test, creating a hazard for other vessels in the exercise.

And this summer, during a drone intercept test, Anduril’s counter-drone system crashed and caused a 22-acre fire at a California airport, the report found.

Anduril told the WSJ that the failures are just part of its rapid iterative development process:

“We recognize that our highly iterative model of technology development — moving fast, testing constantly, failing often, refining our work, and doing it all over again — can make the job of our critics easier. That is a risk we accept. We do fail… a lot.”

But according to a report from The Wall Street Journal, the company’s tech is failing key tests in the real world, raising concerns about the viability and safety of Anduril’s systems within the military command.

Anduril’s Altius drones proved vulnerable to Russian jamming while deployed in Ukraine and have been pulled from the battlefield, per the report.

More than a dozen sea-based drone ships powered by Anduril’s Lattice command and control software recently shut down during a Navy test, creating a hazard for other vessels in the exercise.

And this summer, during a drone intercept test, Anduril’s counter-drone system crashed and caused a 22-acre fire at a California airport, the report found.

Anduril told the WSJ that the failures are just part of its rapid iterative development process:

“We recognize that our highly iterative model of technology development — moving fast, testing constantly, failing often, refining our work, and doing it all over again — can make the job of our critics easier. That is a risk we accept. We do fail… a lot.”

tech
Jon Keegan

OpenAI’s partners shouldering $100 billion of debt, taking on all the risk

OpenAI’s ambitious plans for global AI infrastructure projects — like its series of massive Stargate AI data centers — will require tens of billions of dollars funded by debt, but you won’t find much of that on OpenAI’s balance sheet.

According to a new analysis by the Financial Times, OpenAI has somehow convinced its many partners to shoulder at least $100 billion in debt on its behalf, as well as the risks that come with it.

Partners Oracle, SoftBank, CoreWeave, Crusoe, and Blue Owl Capital are all taking on debt in the form of bonds, loans, and credit deals to meet their obligations with OpenAI for infrastructure and computing resources.

Having close ties with OpenAI has been an anchor for many publicly traded companies in recent weeks. The company’s cash burn and the rise of Gemini 3 have seemingly darkened its outlook and fostered guilt by association for many of its close partners and investors. Most notably, Oracle’s aggressive capital expenditure plans to support demand from OpenAI have sparked a sell-off in its stock while widening its credit default swap spreads.

A senior OpenAI executive told the FT: “That’s been kind of the strategy. How does [OpenAI] leverage other people’s balance sheets?”

Partners Oracle, SoftBank, CoreWeave, Crusoe, and Blue Owl Capital are all taking on debt in the form of bonds, loans, and credit deals to meet their obligations with OpenAI for infrastructure and computing resources.

Having close ties with OpenAI has been an anchor for many publicly traded companies in recent weeks. The company’s cash burn and the rise of Gemini 3 have seemingly darkened its outlook and fostered guilt by association for many of its close partners and investors. Most notably, Oracle’s aggressive capital expenditure plans to support demand from OpenAI have sparked a sell-off in its stock while widening its credit default swap spreads.

A senior OpenAI executive told the FT: “That’s been kind of the strategy. How does [OpenAI] leverage other people’s balance sheets?”

tech

Chinese tech giants are training their models offshore to sidestep US curbs on Nvidia’s chips

Nvidia can’t sell its best AI chips in the world’s second-largest economy. That’s an Nvidia problem. But it’s also a China problem — and it’s one that the region’s tech giants have resorted to solving by training their AI models overseas, according to a new report from the Financial Times.

Citing two people with direct knowledge of the matter, the FT reported that “Alibaba and ByteDance are among the tech groups training their latest large language models in data centers across south-east Asia.” Clusters of data centers have particularly boomed in Singapore and Malaysia, with many of the sites kitted out with Nvidia’s latest architecture.

One exception, per the FT, is DeepSeek, which continues to be trained domestically, having reportedly built up a stockpile of Nvidia chips before the US export ban came into effect.

Last week, Nvidia spiked on the news that the Trump administration was reportedly considering letting the tech giant sell its best Hopper chips — the generation of chips that preceded Blackwell — to China.

Citing two people with direct knowledge of the matter, the FT reported that “Alibaba and ByteDance are among the tech groups training their latest large language models in data centers across south-east Asia.” Clusters of data centers have particularly boomed in Singapore and Malaysia, with many of the sites kitted out with Nvidia’s latest architecture.

One exception, per the FT, is DeepSeek, which continues to be trained domestically, having reportedly built up a stockpile of Nvidia chips before the US export ban came into effect.

Last week, Nvidia spiked on the news that the Trump administration was reportedly considering letting the tech giant sell its best Hopper chips — the generation of chips that preceded Blackwell — to China.

tech
Millie Giles

Alibaba unveils its first AI glasses, taking on Meta directly in the wearables race

Retail and tech giant Alibaba launched its first consumer-ready, AI-powered smart glasses on Thursday, marking its entrance into the growing wearables market.

Announced back in July, the Quark AI glasses just went on sale in the Chinese retailer’s home market, with two versions currently available: the S1, starting at 3,799 Chinese yuan (~$536), and the G1, at 1,899 yuan (~$268) — a considerably lower price than Meta’s $799 Ray-Ban Display glasses, released in September.

tech
Jon Keegan

Musk: Tesla’s Austin Robotaxi fleet to “roughly double” next month, but falls well short of earlier goals

Yesterday, Elon Musk jumped onto a frustrated user’s post on X, who was complaining that they were unable to book a Robotaxi ride in Austin. Musk aimed to reassure the would-be customer that the company was expanding service in the city:

“The Tesla Robotaxi fleet in Austin should roughly double next month,” Musk wrote.

While that sounds impressive, there are reports that Austin has only 29 vehicles in service.

But last month, Musk said the Robotaxi goal was to have “probably 500 or more in the greater Austin area” by the end of the year.

Meanwhile, Google’s Waymo has more than 100 autonomous taxis running in Austin, and 1,000 more in the San Francisco Bay Area.

“The Tesla Robotaxi fleet in Austin should roughly double next month,” Musk wrote.

While that sounds impressive, there are reports that Austin has only 29 vehicles in service.

But last month, Musk said the Robotaxi goal was to have “probably 500 or more in the greater Austin area” by the end of the year.

Meanwhile, Google’s Waymo has more than 100 autonomous taxis running in Austin, and 1,000 more in the San Francisco Bay Area.

Latest Stories

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, or Robinhood Money, LLC.