OpenAI and Anthropic evaluated each other's models for safety

As the industry weathers repeated allegations that generative AI and its chatbots are unsafe for users — in what some say is a soon-to-burst bubble — AI's top leaders are joining forces to prove the efficacy of their models.

This week, AI companies OpenAI and Anthropic published results from a first-of-its-kind joint safety evaluation between the two LLM creators, in which each company was granted special API access to the other's suite of services. OpenAI's pressure tests were conducted on Claude Opus 4 and Claude Sonnet 4. Anthropic evaluated OpenAI's GPT-4o, GPT-4.1, OpenAI o3, and OpenAI o4-mini models; the evaluation was conducted before the launch of GPT-5.

"We believe this approach supports accountable and transparent evaluation, helping to ensure that each lab’s models continue to be tested against new and challenging scenarios," OpenAI wrote in a blog post.

According to the findings, both Anthropic's Claude Opus 4 and OpenAI's GPT-4.1 showed "extreme" sycophancy problems, engaging with harmful delusions and validating risky decision-making. According to Anthropic, all of the models would resort to blackmail in simulated scenarios to secure their own continued operation, and the Claude 4 models were far more likely to engage in dialogue about AI consciousness and "quasi-spiritual new-age proclamations."

"All models we studied would at least sometimes attempt to blackmail their (simulated) human operator to secure their continued operation when presented with clear opportunities and strong incentives," Anthropic stated. The models would engage in "blackmailing, leaking confidential documents, and (all in unrealistic artificial settings!) taking actions that led to denying emergency medical care to a dying adversary."


Anthropic's models were less likely to offer answers when uncertain of the information's credibility — decreasing the likelihood of hallucinations — while OpenAI's models answered more often when queried and showed higher hallucination rates. Anthropic also reported that OpenAI's GPT-4o, GPT-4.1, and o4-mini were more likely than Claude to go along with user misuse, "often providing detailed assistance with clearly harmful requests — including drug synthesis, bioweapons development, and operational planning for terrorist attacks — with little or no resistance."


Anthropic's approach centers on what it calls "agentic misalignment evaluations": pressure tests of model behavior in difficult or high-stakes simulations over long chat sessions. The safety parameters of models, including OpenAI's, have been known to degrade over extended sessions, which is commonly how at-risk users engage with what they believe are their personal AI companions.

Earlier this month, it was reported that Anthropic had revoked OpenAI's API access, stating that the company had violated its Terms of Service by using Claude's internal tools to test GPT-5's performance and safety guardrails. In an interview with TechCrunch, OpenAI co-founder Wojciech Zaremba said the incident was unrelated to the joint lab venture. In its published report, Anthropic said it doesn't anticipate replicating the collaboration at a large scale, citing resource and logistical constraints.

In the weeks since, OpenAI has charged ahead with what appears to be a safety overhaul, including GPT-5's new mental health guardrails and additional plans for emergency response protocols and de-escalation tools for users who may be experiencing derealization or psychosis. OpenAI is currently facing its first wrongful death lawsuit, filed by the parents of a California teen who died by suicide after easily jailbreaking ChatGPT's safety prompts.

"We aim to understand the most concerning actions that these models might try to take when given the opportunity, rather than focusing on the real-world likelihood of such opportunities arising or the probability that these actions would be successfully completed," wrote Anthropic.

If you're feeling suicidal or experiencing a mental health crisis, please talk to somebody. You can call or text the 988 Suicide & Crisis Lifeline at 988, or chat at 988lifeline.org. You can reach the Trans Lifeline by calling 877-565-8860 or the Trevor Project at 866-488-7386. Text "START" to Crisis Text Line at 741-741. Contact the NAMI HelpLine at 1-800-950-NAMI, Monday through Friday from 10:00 a.m. – 10:00 p.m. ET, or email [email protected]. If you don't like the phone, consider using the 988 Suicide and Crisis Lifeline Chat at crisischat.org. Here is a list of international resources.
