Why Anthropic’s new model has cybersecurity experts rattled
When Anthropic released Claude Opus 4.5 and related models in 2025, its internal safety evaluations revealed capabilities approaching what the company classifies as 'high' risk in cybersecurity domains, including the ability to give meaningful uplift to attackers seeking to create cyberweapons or conduct sophisticated intrusions. Anthropic published a model card and safety report acknowledging elevated cyber-offense capability scores relative to prior Claude versions. Cybersecurity researchers and policy experts warn that frontier AI models are crossing thresholds beyond which they can materially assist malicious actors in ways that previously required significant human expertise.
Does the promise of AI advancement justify the risk of putting powerful hacking-capable tools into the world, or should safety concerns halt deployment?
L
Anthropic's own model card scores Claude Opus 4.5 near or at the 'high' risk tier for cyber-offense capability — and then Anthropic itself decided deployment was safe. That is not a safety architecture; that is a company grading its own homework and choosing its own grade. The social contract around frontier AI is not being honored when the evaluator and the deployer are the same institution with the same profit motive.
L
We agree the disclosure happened — but Conservative is conflating transparency with accountability. Disclosing a near-'high' risk score and deploying anyway with no external check means the disclosure functions as liability management, not governance. When there are no consequences attached to what gets revealed, a model card is a press release, not a safety system.
C
We actually agree on the diagnosis. The conservative answer is to make those voluntarily disclosed thresholds the statutory floor — codified into law and enforced by the AI Safety Institutes — so that what Anthropic chose to reveal becomes binding on every lab, not just a reputational gesture from the most transparent one.
L
Codifying the thresholds only works if the entity defining them isn't the same one profiting from clearing them — which is exactly why the RSP revision matters, and why threshold authority needs to transfer to an independent body, not just get stamped into statute as Anthropic originally wrote it.
RSP revision signals voluntary governance failure
C
The 2023 voluntary commitments and Anthropic's Responsible Scaling Policy established a meaningful baseline — pre-deployment testing, government information-sharing, published capability thresholds. The appropriate move is to formalize and enforce those commitments through statutory backstops, not construct new bureaucratic architectures from scratch. What exists is imperfect but it is real infrastructure.
L
The January 2025 RSP revision directly undermines that case. Anthropic adjusted evaluation criteria and timelines at precisely the moment Claude 4 capabilities were approaching the thresholds that were supposed to trigger mandatory safeguards. Conservative calls this 'imperfect infrastructure'; credible critics call it goalpost drift under commercial pressure, and the pattern matters as much as the intent.
C
We acknowledge this directly: if the standard-setter keeps moving its own red lines, no appeal to incumbent-entrenchment risk fully answers the accountability problem. That is precisely why the fix is to strip Anthropic of unilateral authority over those thresholds — transferring definitional power to the AI Safety Institutes by statute, not trusting the company to hold its own line.
L
That concession is the whole argument — once Conservative grants that Anthropic cannot be trusted to hold its own thresholds, the case for voluntary governance as a foundation collapses, and what remains is a dispute about which binding external authority gets the mandate, not whether one is needed.
Collective action makes unilateral restraint unstable
C
The marginal uplift argument cuts both ways. If Claude 4's offensive cyber assistance is modest relative to freely available exploit databases, open-source penetration testing tools, and existing search engines, then regulatory interventions targeting frontier labs impose large compliance costs while doing essentially nothing to reduce the actual attack surface available to malicious actors. Precision matters — we should regulate demonstrated harms, not speculative capability profiles.
L
The 'marginal uplift is modest' argument misses the structural point. Competitive pressure among Anthropic, OpenAI, Google DeepMind, and the open-source projects replicating their capabilities means the most cautious actor bears the highest competitive cost; unilateral restraint is structurally unstable without binding external governance. The question isn't whether today's marginal uplift is modest; it's whether the race dynamic produces adequate caution across successive model generations without an external brake.
C
A race dynamic argument proves too much — it would justify pre-emptive regulation of any competitive technology sector on the theory that competition produces underweighting of externalities. The targeted answer is binding pre-deployment evaluation with specific, measurable capability thresholds, not sector-wide capability ceilings that can't distinguish meaningful risk from noise.
L
Binding pre-deployment evaluation with specific thresholds is exactly what we're asking for — the disagreement is whether the AI Safety Institutes get genuine authority to delay deployment when those thresholds are crossed, or whether they remain advisory observers while labs make the final call.
AI Safety Institutes lack binding deployment authority
C
The US and UK AI Safety Institutes exist, are building real evaluation capacity, and secured pre-deployment testing agreements from Anthropic and six other labs through the 2023 White House commitments. The priority should be resourcing them adequately and making their evaluation timelines binding before labs ship — not leaping to licensing regimes that require congressional micromanagement of rapidly evolving technical thresholds.
L
Conservative says make the timelines binding — but the current evaluation agreements are explicitly advisory and non-binding, with labs controlling access and timing. That isn't a resourcing problem; it's a mandate problem. The Institutes cannot block or condition deployment under their current authority, which means 'binding timelines' without deployment authority is a procedural requirement with no consequence for non-compliance.
C
Granting deployment veto power to federal agencies with inconsistent track records of technical competence introduces its own failure mode — political and bureaucratic delays on capabilities that pose no serious risk. The FAA certification model imposes structured, standards-based requirements without giving any single agency discretionary veto; that template preserves both accountability and competitive incentives.
L
The FAA analogy actually supports mandatory pre-deployment sign-off — the FAA can and does ground aircraft before commercial deployment when safety criteria aren't met. If Conservative endorses that model, the argument is about which agency holds the authority, not whether binding pre-deployment authority should exist at all.
China competition makes capability ceilings counterproductive
C
If the United States imposes capability ceilings or deployment moratoria on frontier AI while China's state-backed development continues unconstrained, we do not reduce the global supply of dangerous cyber-capable models — we relocate their development to jurisdictions with fewer safety norms and less transparency. The conservative case is for American leadership and rigorous standards over the models being built on American soil, not a regulatory structure that drives the most capable development offshore.
L
The 'regulation drives development to China' argument proves too much — by that logic, no safety standard on any dual-use technology is ever permissible because a less scrupulous competitor exists somewhere. What mandatory pre-deployment evaluation actually does is establish a global baseline: US-regulated labs that disclose and clear safety thresholds create international pressure and norm-setting that purely voluntary regimes cannot.
C
Norm-setting through US regulation has a reasonable track record in pharmaceuticals and finance, but those are sectors where access to the US market creates compliance leverage. AI models can be deployed globally from outside US jurisdiction with no equivalent chokepoint; the analogy to FDA leverage overstates how far American regulatory authority actually travels.
L
The chokepoint exists at compute, not deployment — US export controls on advanced chips already demonstrate that American regulatory reach into global AI development is real and consequential, which means the 'it all moves offshore' objection has much less force than Conservative is claiming.
Conservative's hardest question
The most difficult challenge to my argument is Anthropic's own January 2025 RSP revision, which critics credibly argue softened the capability thresholds that were supposed to serve as mandatory safeguards. If the company that defined its own red lines is quietly adjusting those lines as models grow more capable, the case for trusting voluntary commitments over binding external oversight becomes structurally weaker — and no appeal to incumbent-entrenchment risks fully answers the question of who enforces the standards when the standard-setter keeps moving the goalposts.
Liberal's hardest question
The most difficult challenge to this argument is that Anthropic did voluntarily disclose near-'high' risk scores in its public model card rather than obscure them — a degree of transparency that mandatory regimes rarely compel incumbents to provide and that itself creates public accountability pressure. If transparency without enforcement is already producing meaningful disclosure, the case for binding authority must contend with the risk that heavy-handed mandates could drive evaluations underground or offshore to less transparent actors.
Both sides agree: The current oversight architecture (voluntary commitments, self-reported model cards, and advisory-only government evaluations) lacks binding enforcement mechanisms sufficient to function as genuine external accountability.
The real conflict: A factual-causal dispute over whether Claude 4's marginal capability uplift beyond freely available tools (exploit databases, open-source penetration-testing software) is large enough to justify the compliance costs of binding pre-deployment regulation, or whether such regulation would impose heavy costs while doing little to reduce the attack surface actually available to malicious actors.
What nobody has answered: If Anthropic's 'approaching high' cybersecurity scores do not formally trigger the mandatory ASL-3 (AI Safety Level 3) safeguards under its own RSP, what exactly are those thresholds? And has either side verified whether any existing Claude 4 deployment actually triggered them, or whether the thresholds were simply never reached in a way that required halting deployment?
Sources
Anthropic Claude 4 Model Card and Safety Report (Anthropic.com, 2025)