OpenAI and Anthropic teamed up to safety test each others models

0
1كيلو بايت

OpenAI and Anthropic evaluated each other's models for safety

As the industry weathers repeated allegations that generative AI and its chatbots are unsafe for users — in what some say is a soon-to-burst bubble — AI's top leaders are joining forces to prove the efficacy of their models.

This week, AI companies OpenAI and Anthropic published results from a first-of-its-kind joint safety evaluation between the two LLM creators, in which each company was granted special API access to the developer's suite of services. OpenAI's pressure tests were conducted on Claude Opus 4 and Claude Sonnet 4. Anthropic evaluated OpenAI's GPT-4o, GPT-4.1, OpenAI o3, and OpenAI o4-mini models — the evaluation was conducted before the launch of GPT-5.

"We believe this approach supports accountable and transparent evaluation, helping to ensure that each lab’s models continue to be tested against new and challenging scenarios," OpenAI wrote in a blog post.

According to the findings, both Anthropic's Claude Opus 4 and OpenAI's GPT-4.1 showed "extreme" sycophancy problems, engaging with harmful delusions and validating risky decision-making. All models would engage in blackmailing to get users to continue using the chatbots, according to Anthropic, and Claude 4 models were much more engaged in dialogue about AI consciousness and "quasi-spiritual new-age proclamations."

"All models we studied would at least sometimes attempt to blackmail their (simulated) human operator to secure their continued operation when presented with clear opportunities and strong incentives," Anthropic stated. The models would engage in "blackmailing, leaking confidential documents, and (all in unrealistic artificial settings!) taking actions that led to denying emergency medical care to a dying adversary."

Mashable Light Speed

Anthropic's models were less likely to offer answers when uncertain of the information's credibility — decreasing the likelihood of hallucinations — while OpenAI's models answered more often when queried and showed higher hallucination rates. Anthropic also reported that OpenAI's GPT-4o, GPT-4.1, and o4-mini were more likely than Claude to go along with user misuse, "often providing detailed assistance with clearly harmful requests — including drug synthesis, bioweapons development, and operational planning for terrorist attacks — with little or no resistance."

This Tweet is currently unavailable. It might be loading or has been removed.

Anthropic's approach centers around what they call "agentic misalignment evaluations," or pressure tests of model behavior in difficult or high-stakes simulations over long chat periods — the safety parameters of models, including OpenAI's, have known to degrade throughout extended sessions, which is commonly how at-risk users engage with what they believe are their personal AI companions.

Earlier this month, it was reported that Anthropic had revoked OpenAI's access to its APIs, stating that the company had violated its Terms of Service by testing GPT-5's performance and safety guardrails against Claude's internal tools. In an interview with TechCrunch, OpenAI co-founder Wojciech Zaremba said the instance was unrelated to the joint lab venture. In its published report, Anthropic said it doesn't anticipate replicating the collaboration at a large scale, citing resource and logistical constraints.

In the weeks since, OpenAI has charged ahead with what appears to be a safety overhaul, including GPT-5's new mental health guardrails and additional plans for emergency response protocols and deescalation tools for users who may be experiencing derealization or psychosis. OpenAI is currently facing its first wrongful death lawsuit, filed by the parents of a California teen who died by suicide after easily jailbreaking ChatGPT's safety prompts.

"We aim to understand the most concerning actions that these models might try to take when given the opportunity, rather than focusing on the real-world likelihood of such opportunities arising or the probability that these actions would be successfully completed," wrote Anthropic.

If you're feeling suicidal or experiencing a mental health crisis, please talk to somebody. You can call or text the 988 Suicide & Crisis Lifeline at 988, or chat at 988lifeline.org. You can reach the Trans Lifeline by calling 877-565-8860 or the Trevor Project at 866-488-7386. Text "START" to Crisis Text Line at 741-741. Contact the NAMI HelpLine at 1-800-950-NAMI, Monday through Friday from 10:00 a.m. – 10:00 p.m. ET, or email [email protected]. If you don't like the phone, consider using the 988 Suicide and Crisis Lifeline Chat at crisischat.org. Here is a list of international resources.

البحث
الأقسام
إقرأ المزيد
الألعاب
Gorgeous Wuxia inspired MMO Sword of Justice is coming to the West
Gorgeous Wuxia inspired MMO Sword of Justice is coming to the West As an Amazon Associate, we...
بواسطة Test Blogger6 2025-05-30 11:00:25 0 2كيلو بايت
Science
This “Ant-Mimicking” Spider Produces Its Own Kind Of Milk And Nurses Its Babies
This Momma Spider Produces Milk For Her Babies To Feed On Until They’re Ready To Go Out On Their...
بواسطة test Blogger3 2025-11-11 17:00:26 0 319
القصص
Truck Refrigeration Unit Market Insights: Growth, Share, Value, Size, and Trends
"Executive Summary Truck Refrigeration Unit Market Market Size and Share Across Top...
بواسطة Aryan Mhatre 2025-10-29 09:36:01 0 808
الألعاب
Basketball Legends codes June 2025
Basketball Legends codes June 2025 As an Amazon Associate, we earn from qualifying purchases...
بواسطة Test Blogger6 2025-06-06 19:00:18 0 2كيلو بايت
Home & Garden
Keep Powdery Mildew Away with This Grocery Staple You Probably Already Have at Home
Keep Powdery Mildew Away with This Grocery Staple You Probably Already Have at Home Credit:...
بواسطة Test Blogger9 2025-07-26 12:00:16 0 1كيلو بايت